EAGLE-3 Speculators: When To Use Them?
EAGLE-3 is a family of draft models (“speculators”) for speculative decoding. As with other speculative setups, a small model proposes several future tokens and a larger target model verifies them in a single pass. When the draft model’s guesses are accurate and cheap enough, this reduces the total number of heavy forward passes, and therefore the overall cost of inference, while keeping the target model’s output unchanged.
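To make the mechanics concrete, here is a toy sketch of one greedy speculative step. The functions `draft_next` and `target_next` are stand-in callables for the small and large models (they are not part of any real library), and a production engine would verify all drafted positions in a single batched forward pass rather than one token at a time.

```python
# Minimal sketch of one greedy speculative decoding step.
# `draft_next` and `target_next` are placeholders for the draft and target models.
from typing import Callable, List

def speculative_step(
    tokens: List[int],
    draft_next: Callable[[List[int]], int],   # small model: next-token guess
    target_next: Callable[[List[int]], int],  # large model: next-token choice
    k: int = 4,                               # number of drafted tokens per step
) -> List[int]:
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal: List[int] = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) The target model verifies the proposals. A real engine does this in one
    #    forward pass over all k positions; here we emulate the greedy
    #    accept/reject rule position by position.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in proposal:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)         # drafted token matches: accepted for free
            ctx.append(t)
        else:
            accepted.append(expected)  # mismatch: take the target's token, stop
            break
    else:
        # All k drafts accepted: the same verification pass also yields one
        # bonus token from the target model.
        accepted.append(target_next(ctx))
    return tokens + accepted
```

The more drafted tokens survive verification, the more output tokens each expensive target pass produces, which is exactly the quantity (acceptance length) measured later in the article.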
The EAGLE-3 speculators are designed to raise the acceptance rate of drafted tokens and make better use of each verification pass. They do this through architectural changes (multi-layer feature fusion) and a training setup that more closely matches how speculative decoding is actually run at inference time. The aim is to shift more of the work onto a small, fast model and let the large model act mainly as a validator.
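The following is only a rough PyTorch sketch of the feature-fusion idea: hidden states captured at a few target-model layers are concatenated and projected before feeding the draft head. The layer choices, dimensions, and module names are illustrative, not the actual EAGLE-3 implementation.

```python
# Rough sketch of multi-layer feature fusion for a draft model.
# All names and sizes are illustrative assumptions, not EAGLE-3 code.
import torch
import torch.nn as nn

class FusedDraftInput(nn.Module):
    def __init__(self, hidden_size: int, num_fused_layers: int = 3):
        super().__init__()
        # Project the concatenated per-layer features back to the model width.
        self.fuse = nn.Linear(num_fused_layers * hidden_size, hidden_size)

    def forward(self, layer_states: list) -> torch.Tensor:
        # layer_states: hidden states from a few target-model layers,
        # each of shape (batch, seq_len, hidden_size).
        fused = torch.cat(layer_states, dim=-1)
        return self.fuse(fused)

# Toy usage: three captured layers for a 4096-wide model.
states = [torch.randn(1, 8, 4096) for _ in range(3)]
draft_features = FusedDraftInput(4096)(states)  # (1, 8, 4096)
```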
In this article, I will look at EAGLE-3 in practice using the released speculators with vLLM. I will experiment with Qwen3 32B on an A100 80 GB GPU and focus on end-to-end behavior: throughput, acceptance length, and wall-clock latency. In particular, I will compare high-concurrency continuous batching, where the GPU is already saturated, with batch size 1, where speculative decoding has more opportunity to lower the effective cost per generated token.
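As a reference point for the setup, the sketch below shows how an EAGLE-3 speculator is typically attached to a target model in vLLM's offline API. The speculator repo id is a placeholder, the exact checkpoint and any quantization used in the experiments may differ, and the accepted `speculative_config` keys can vary between vLLM versions, so check the documentation for the release you are running.

```python
# Minimal vLLM configuration sketch for Qwen3 32B with an EAGLE-3 speculator.
# The speculator repo id is a placeholder; config keys may differ by vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",
    speculative_config={
        "method": "eagle3",
        "model": "<eagle3-speculator-repo-for-qwen3-32b>",  # placeholder id
        "num_speculative_tokens": 3,
    },
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding in two sentences."], params)
print(outputs[0].outputs[0].text)
```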
The following notebook shows how to run and evaluate EAGLE-3 speculators:
EAGLE-3: High-Accuracy Draft Models for Fast Speculative Decoding