yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
Model Documentation

EAGLE
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) that provably preserves the output distribution of the original model. It extrapolates the second-top-layer contextual feature vectors of the LLM, which significantly boosts generation efficiency.
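The guarantee described above is typically obtained through the accept/reject step of standard speculative sampling. The snippet below is a minimal sketch of that generic rule for a single drafted token, not code from the EAGLE repository; the function name `verify_draft_token` and the toy distributions are assumptions made for illustration.

```python
import random

def verify_draft_token(token, p_target, q_draft, rng):
    """Generic speculative-sampling check for one drafted token.

    p_target / q_draft map each token to its probability under the
    target model and the draft model (toy dictionaries here).
    Returns (accepted, emitted_token).
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the token was drafted from q, so q > 0
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = {t: max(p_target[t] - q_draft.get(t, 0.0), 0.0) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights)[0]

# Toy check: the emitted tokens should follow p, not q.
rng = random.Random(0)
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.2, "b": 0.5, "c": 0.3}
n_samples = 20000
counts = {t: 0 for t in p}
for _ in range(n_samples):
    drafted = rng.choices(list(q), weights=list(q.values()))[0]
    _, emitted = verify_draft_token(drafted, p, q, rng)
    counts[emitted] += 1
freqs = {t: c / n_samples for t, c in counts.items()}
print(freqs)  # empirical frequencies, expected to be close to p
```

Because rejected tokens are replaced by a sample from the residual distribution, the emitted token is distributed according to the target model regardless of how good the draft model is; a better draft model only raises the acceptance rate.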
EAGLE-2 uses the confidence scores from the draft model to approximate acceptance rates, dynamically adjusting the draft tree structure, which further enhances performance.
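One way to picture the dynamic draft tree is an expand-then-rerank loop in which each node is scored by the product of draft confidences along its path. The sketch below illustrates that idea under stated assumptions: `propose` is a hypothetical callback standing in for the draft model, and the parameter names `depth`, `top_k`, and `total_tokens` are illustrative rather than taken from the EAGLE code.

```python
import heapq

def build_draft_tree(propose, root, depth, top_k, total_tokens):
    """Grow a draft tree guided by draft-model confidence.

    propose(path) is a hypothetical callback returning a list of
    (token, confidence) pairs for the next position after `path`.
    A node's score is the product of confidences along its path and
    serves as a stand-in for its acceptance rate.
    """
    frontier = [((root,), 1.0)]
    candidates = []
    for _ in range(depth):
        children = []
        for path, score in frontier:
            for token, conf in propose(path):
                children.append((path + (token,), score * conf))
        candidates.extend(children)
        # Expand: only the top_k most promising nodes are grown further.
        frontier = heapq.nlargest(top_k, children, key=lambda n: n[1])
    # Rerank: keep the total_tokens best nodes across all depths.
    # Confidences are at most 1, so a node never outscores its ancestors.
    return heapq.nlargest(total_tokens, candidates, key=lambda n: n[1])

def toy_propose(path):
    # Toy draft model: always fairly sure about "x", unsure about "y".
    return [("x", 0.9), ("y", 0.1)]

tree = build_draft_tree(toy_propose, "<s>", depth=3, top_k=2, total_tokens=4)
for path, score in tree:
    print(path, round(score, 3))
```

With this toy draft model the budget is spent mostly on one deep, confident branch instead of a fixed-shape tree, which is the intuition behind adapting the tree to the context.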
EAGLE-3 removes the feature prediction constraint of EAGLE and simulates the drafting process during training using training-time test. Because top-layer features are limited to next-token prediction, EAGLE-3 replaces them with a fusion of low-, mid-, and high-level semantic features. EAGLE-3 further improves generation speed while remaining lossless.
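The feature fusion can be pictured as concatenating hidden states taken from three depths of the target model and projecting the result back to the hidden size. The following is a toy sketch in plain Python; the single linear projection, the function name `fuse_features`, and the tiny dimensions are assumptions for illustration, and which layers are tapped is not specified here.

```python
def fuse_features(low, mid, high, weight, bias):
    """Fuse three hidden-state vectors into one draft-model input.

    low, mid, high: lists of floats of length d (one per tapped layer).
    weight: d rows of 3*d columns; bias: length d.
    Assumes a single linear projection of the concatenated vector.
    """
    concat = low + mid + high  # list concatenation, length 3*d
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]

# Toy example with hidden size d = 2.
low, mid, high = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
weight = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # output 0 reads the low-level feature
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],  # output 1 mixes mid- and high-level features
]
bias = [0.5, -1.0]
fused = fuse_features(low, mid, high, weight, bias)
print(fused)  # [1.5, 1.0]
```

The point of the projection is that the draft model still receives a vector of the usual hidden size, but one that carries information from several depths rather than only the top layer.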
_Inference is conducted on 2x RTX 3090 GPUs at fp16 precision using the Vicuna 13B model._
[//]: # (Using EAGLE-2, the inference speed on 2 RTX 3060 GPUs can be faster than vanilla autoregressive decoding on an A100 GPU.)
Support
EAGLE has been merged into the following mainstream LLM serving frameworks (listed in alphabetical order).
Reference
For technical details and full experimental results, see the EAGLE, EAGLE-2, and EAGLE-3 papers.
@inproceedings{li2024eagle,
author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
title = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
booktitle = {International Conference on Machine Learning},
year = {2024}
}
@inproceedings{li2024eagle2,
author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
title = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
booktitle = {Empirical Methods in Natural Language Processing},
year = {2024}
}
@inproceedings{li2025eagle3,
author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
title = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
booktitle = {Annual Conference on Neural Information Processing Systems},
year = {2025}
}
Files & Weights
| Filename | Size |
|---|---|
| pytorch_model.bin | 0.79 GB |