yuhuili/EAGLE3-LLaMA3.1-Instruct-8B

EAGLE












EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) that provably preserves the distribution of the generated text. It works by extrapolating the second-top-layer contextual feature vectors of the LLM, which significantly boosts generation efficiency.

  • EAGLE is:
    • certified by a third-party evaluation as the fastest speculative method so far.
    • achieving 2x speedup on gpt-fast.
    • 3x faster than vanilla decoding (13B).
    • 2x faster than Lookahead (13B).
    • 1.6x faster than Medusa (13B).
    • provably consistent with vanilla decoding in the distribution of generated texts.
    • trainable (within 1-2 days) and testable on 8x RTX 3090 GPUs. So even the GPU poor can afford it.
    • combinable with other parallel techniques such as vLLM, DeepSpeed, Mamba, FlashAttention, quantization, and hardware optimization.
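
The consistency claim above rests on the standard speculative sampling acceptance rule. The following is a minimal sketch of that rule for a single drafted token, not EAGLE's actual implementation: the function names and the toy distributions `p` and `q` are assumptions made for illustration. It checks empirically that tokens emitted after verification follow the target distribution even though they were drafted from a different one.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Accept or replace one drafted token (standard speculative sampling).

    p_target and q_draft map token -> probability under the target and the
    draft model for the same context. Returns (emitted_token, accepted).
    """
    p, q = p_target.get(token, 0.0), q_draft[token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return token, True
    # On rejection, resample from the renormalized residual max(0, p - q).
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights)[0], False


def sample_one(p_target, q_draft, rng):
    """Draft a token from q_draft, then verify it against p_target."""
    tokens = list(q_draft)
    drafted = rng.choices(tokens, weights=[q_draft[t] for t in tokens])[0]
    emitted, _ = verify_draft_token(drafted, p_target, q_draft, rng)
    return emitted


# Toy distributions (hypothetical): the draft model disagrees with the target.
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.2, "b": 0.5, "c": 0.3}

rng = random.Random(0)
n = 100_000
counts = {t: 0 for t in p}
for _ in range(n):
    counts[sample_one(p, q, rng)] += 1
empirical = {t: counts[t] / n for t in p}
print(empirical)  # frequencies should be close to p, not q
```

A poor draft model lowers the acceptance rate, and therefore the speedup, but it does not change the output distribution.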


  • EAGLE-2 uses the confidence scores from the draft model to approximate acceptance rates, dynamically adjusting the draft tree structure, which further enhances performance.

  • EAGLE-2 is:
    • 4x faster than vanilla decoding (13B).
    • 1.4x faster than EAGLE-1 (13B).
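
The dynamic draft tree idea can be sketched roughly as follows. This is an illustrative toy, not the EAGLE-2 code: `build_draft_tree`, `toy_propose`, and the parameters `depth`, `top_k`, and `total_tokens` are hypothetical names. It assumes that the product of draft confidences along a path is used as a stand-in for that path's acceptance rate.

```python
import heapq


def build_draft_tree(root_token, propose, depth, top_k, total_tokens):
    """Grow a draft tree, then keep the most promising nodes.

    propose(path) -> list of (token, confidence) is a hypothetical stand-in
    for the draft model. A node's value is the product of the confidences
    along its path, used here to approximate its acceptance rate.
    """
    frontier = [((root_token,), 1.0)]
    nodes = []
    for _ in range(depth):
        candidates = []
        for path, value in frontier:
            for token, conf in propose(path):
                candidates.append((path + (token,), value * conf))
        nodes.extend(candidates)
        # Expand only the top_k highest-value candidates in the next layer.
        frontier = heapq.nlargest(top_k, candidates, key=lambda n: n[1])
    # Rerank all drafted nodes and keep the total_tokens best ones. A child
    # never has a higher value than its parent, so parents rank first.
    return heapq.nlargest(total_tokens, nodes, key=lambda n: n[1])


def toy_propose(path):
    # Hypothetical draft model: confident after "the", unsure otherwise.
    if path[-1] == "the":
        return [("cat", 0.9), ("dog", 0.1)]
    return [("sat", 0.5), ("ran", 0.3), ("the", 0.2)]


tree = build_draft_tree("the", toy_propose, depth=3, top_k=2, total_tokens=4)
for path, value in tree:
    print(" ".join(path), round(value, 3))
```

In this toy run the low-confidence branch through "dog" is dropped, while the confident "cat" branch is drafted more deeply. The tree shape therefore depends on the context instead of being fixed in advance.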


  • EAGLE-3 removes the feature prediction constraint in EAGLE and simulates this process during training using training-time testing. Considering that top-layer features are limited to next-token prediction, EAGLE-3 replaces them with a fusion of low-, mid-, and high-level semantic features. EAGLE-3 further improves generation speed while ensuring lossless performance.

  • EAGLE-3 is:
    • 5.6x faster than vanilla decoding (13B).
    • 1.8x faster than EAGLE-1 (13B).
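
One simple way to picture the feature fusion is to concatenate hidden states taken from a low, a middle, and a high layer and project them back to the model dimension. The sketch below assumes exactly that (concatenation followed by a single linear projection). The function name, the vectors, and the hand-set weights are all hypothetical; in a real draft model the projection would be learned.

```python
def fuse_features(low, mid, high, weight, bias):
    """Concatenate three d-dimensional features and project 3d -> d.

    weight is a d x 3d matrix given as a list of rows; bias has length d.
    """
    concat = low + mid + high  # list concatenation, length 3d
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must have length 3 * d")
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]


d = 2
low, mid, high = [1.0, 2.0], [4.0, 5.0], [7.0, 8.0]

# Hand-set weights (illustration only): average the three layers per dimension.
weight = [[1 / 3 if j % d == i else 0.0 for j in range(3 * d)] for i in range(d)]
bias = [0.0] * d

fused = fuse_features(low, mid, high, weight, bias)
print(fused)
```

The fused vector has the same size as a single layer's feature, so it can be used wherever a top-layer feature would otherwise be the draft model's input.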





    _Inference is conducted on 2x RTX 3090 GPUs at fp16 precision using the Vicuna 13B model._

    Using EAGLE-2, the inference speed on 2 RTX 3060 GPUs can be faster than vanilla autoregressive decoding on an A100 GPU.



    Support

    EAGLE has been merged into the following mainstream LLM serving frameworks (listed in alphabetical order).

  • AMD ROCm
  • AngelSlim
  • AWS NeuronX Distributed Core
  • CPM.cu
  • Intel® Extension for Transformers
  • Intel® LLM Library for PyTorch
  • MLC-LLM
  • NVIDIA NeMo Framework
  • NVIDIA TensorRT-LLM
  • NVIDIA TensorRT Model Optimizer
  • PaddleNLP
  • SGLang
  • SpecForge
  • vLLM




    Reference

    For technical details and full experimental results, see the EAGLE, EAGLE-2, and EAGLE-3 papers:
    
    @inproceedings{li2024eagle,
        author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
        title = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
        booktitle = {International Conference on Machine Learning},
        year = {2024}
    }
    @inproceedings{li2024eagle2,
        author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
        title = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
        booktitle = {Empirical Methods in Natural Language Processing},
        year = {2024}
    }
    @inproceedings{li2025eagle3,
        author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
        title = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
        booktitle = {Annual Conference on Neural Information Processing Systems},
        year = {2025}
    }
    

    Files & Weights

    | Filename | Size |
    | pytorch_model.bin | 0.79 GB |