EAGLE-3 Speculators: When To Use Them?

EAGLE-3 is a family of draft models (“speculators”) for speculative decoding. As in other speculative setups, a small model proposes several future tokens and a larger target model verifies them all in a single forward pass. When the drafts are accurate and cheap enough, this cuts the number of expensive target-model passes per generated token, and with it the overall cost of inference, while leaving the target model’s output distribution unchanged.
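
To make the verify step concrete, here is a minimal sketch of one draft-then-verify iteration. It is a greedy toy, not vLLM’s or EAGLE-3’s implementation: `draft` and `target` are placeholder callables, and production systems use rejection sampling so that sampled outputs follow the target’s distribution rather than a simple argmax comparison.

```python
import torch

def speculate_step(draft, target, tokens, k=4):
    """One greedy speculative-decoding step. `draft` and `target` are
    placeholder callables mapping token ids [B, T] to logits [B, T, V];
    `tokens` is the current sequence (batch size 1 for simplicity)."""
    prompt_len = tokens.shape[1]

    # k cheap draft passes, each appending one greedy token.
    proposal = tokens
    for _ in range(k):
        logits = draft(proposal)
        proposal = torch.cat([proposal, logits[:, -1:].argmax(-1)], dim=-1)

    # One expensive target pass scores all k drafted tokens at once.
    # target_pred holds the target's greedy choice at each drafted
    # position, plus one extra prediction after the full proposal.
    target_pred = target(proposal)[:, prompt_len - 1 :].argmax(-1)

    # Keep the longest prefix of drafted tokens the target agrees with.
    drafted = proposal[:, prompt_len:]
    n_accepted = 0
    for i in range(k):
        if not torch.equal(drafted[:, i], target_pred[:, i]):
            break
        n_accepted += 1

    # The target's own next token is appended for free, so even a full
    # rejection still advances generation by one token.
    kept = proposal[:, : prompt_len + n_accepted]
    bonus = target_pred[:, n_accepted : n_accepted + 1]
    return torch.cat([kept, bonus], dim=-1), n_accepted
```

The economics follow directly: the k draft passes are cheap, and the single target pass can yield up to k + 1 tokens.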

The EAGLE-3 speculators are designed to raise the acceptance rate of drafted tokens and extract more tokens from each verification pass. They do this through architectural changes (multi-layer feature fusion) and a training setup that more closely matches how speculative decoding is actually run at inference time. The aim is to shift more of the work onto a small, fast model and let the large model act mainly as a verifier.
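
EAGLE-3’s exact architecture is described in its paper; the sketch below is only meant to convey what multi-layer feature fusion looks like in outline. The class name, layer choices, and dimensions are illustrative assumptions, not the released design: hidden states from a low, a middle, and a high target layer are fused into one feature per position and fed to a single small decoder block.

```python
import torch
import torch.nn as nn

class FusedDraftHead(nn.Module):
    """Illustrative draft head with multi-layer feature fusion:
    three target-layer hidden states are folded into one feature,
    mixed with the token embedding, and run through one block."""

    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(3 * d_model, d_model)  # 3 layer features -> 1
        self.mix = nn.Linear(2 * d_model, d_model)   # fused feature + token embedding
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, h_low, h_mid, h_high, tok_emb):
        # All inputs are [B, T, d_model].
        fused = self.fuse(torch.cat([h_low, h_mid, h_high], dim=-1))
        x = self.mix(torch.cat([fused, tok_emb], dim=-1))
        causal = nn.Transformer.generate_square_subsequent_mask(
            x.shape[1], device=x.device
        )
        x = self.block(x, src_mask=causal)
        return self.lm_head(x)  # draft logits over the vocabulary
```

The motivation given for this kind of fusion is that top-layer features are specialized for the target’s own next-token prediction, while mixing in earlier layers preserves more general information that helps the draft head predict several steps ahead.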

In this article, I will look at EAGLE-3 in practice using the released speculators with vLLM. I will experiment with Qwen3 32B on an A100 80 GB GPU and focus on end-to-end behavior: throughput, acceptance length, and wall-clock latency. In particular, I will compare high-concurrency continuous batching, where the GPU is already saturated, with batch size 1, where speculative decoding has more opportunity to lower the effective cost per generated token.
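
To reproduce the setup, launching Qwen3 32B with an EAGLE-3 speculator through vLLM’s offline API looks roughly like the following. The `speculative_config` keys follow vLLM’s speculative-decoding interface; the speculator path is a placeholder to be replaced with the released checkpoint for your target model.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",
    speculative_config={
        "method": "eagle3",
        # Placeholder: point this at the released EAGLE-3 speculator
        # trained for the target model.
        "model": "path/to/qwen3-32b-eagle3-speculator",
        "num_speculative_tokens": 3,
    },
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Raising `num_speculative_tokens` lifts the ceiling on tokens gained per verification pass, but wasted draft work grows when acceptance drops, which is exactly the trade-off the batch-size-1 versus high-concurrency comparison probes.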

The following notebook shows how to run and evaluate EAGLE-3 speculators:

EAGLE-3: High-Accuracy Draft Models for Fast Speculative Decoding
