
Olmo 3 Is Here!


Hi Everyone,

In this edition of The Weekly Kaitchup, I discuss:

  • Olmo 3

  • Eagle 3 Speculators to Easily Speed Up Inference with vLLM

  • AA-Omniscience: How Often Do LLMs Hallucinate?


Black Friday Subscription Discount

For Black Friday, I’m offering a 30% discount on the yearly subscription to The Kaitchup:

With this subscription, you get instant access to all the AI notebooks (180+) and all the articles and tutorials (200+).


Olmo 3

AI2 has released the third generation of their fully open models: Olmo 3, which includes “Thinking” models at 7B and 32B parameters. There’s also an instruct variant of the 7B model. Intermediate checkpoints (pretraining, SFT, DPO) are available as well, all grouped in this collection:

They’ve also released the full training dataset, available in the same collection.

The technical report is about 100 pages. I haven’t gone through it yet, but from a quick look, their post-training recipe seems to be an improved version of their TULU pipeline.

I’ll cover these models in more detail in an upcoming article, focusing on the quantized versions I’m currently preparing with LLM Compressor.
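
For those curious what such a run looks like, here is a minimal sketch of a one-shot NVFP4 quantization with LLM Compressor. This assumes a recent llmcompressor release that supports the “NVFP4” scheme; the checkpoint name, calibration dataset, and sample counts are illustrative placeholders, not my exact recipe:

```python
# Minimal sketch: one-shot NVFP4 quantization with LLM Compressor.
# Assumptions: a recent llmcompressor with the "NVFP4" scheme; the model id,
# dataset, and calibration sizes below are illustrative, not an exact recipe.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "allenai/Olmo-3-7B-Instruct"  # illustrative checkpoint name

# Quantize every Linear layer to NVFP4; keep the lm_head in full precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    dataset="open_platypus",      # small calibration set for the FP4 scales
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Olmo-3-7B-Instruct-NVFP4",
)
```

The saved checkpoint uses the compressed-tensors format, so it can be loaded directly by vLLM.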

You can already find some of my quantized models here:

First impressions

The models look very strong. The 32B variant scores slightly below Qwen3-32B on benchmarks, even though Qwen3 is a 7-month-old model. That said, I’d trust Olmo 3’s reported scores much more as a proxy for real-world performance: AI2 has published the entire training data, so we can directly inspect how much the model was optimized toward specific benchmarks. I’ll know more next week, after spending more time with this model.

The 7B model also appears to perform below Qwen3-8B (again, that’s only according to benchmarks…). The NVFP4 version seems to preserve accuracy in my early experiments.

I’m running my own evaluations now and will add Olmo 3 to The Kaitchup Index next week.


Eagle 3 Speculators to Easily Speed Up Inference with vLLM

Speculative decoding speeds up LLM inference by using two models together. A small, cheap “speculator” model drafts several next tokens in one go, then the large “verifier” model checks that whole chunk in a single forward pass. Any tokens the verifier agrees with are accepted, so you effectively get multiple tokens for the price of one expensive step on the big model, without changing its behavior or output quality (see the sketch below). We saw how it works in

...
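
For reference, here is roughly what wiring an EAGLE-3 speculator into vLLM looks like. A minimal sketch, assuming a recent vLLM build where the LLM constructor accepts a speculative_config dict; both model names are illustrative placeholders:

```python
# Minimal sketch: offline inference with an EAGLE-3 speculator in vLLM.
# Assumptions: a recent vLLM (V1 engine) that accepts `speculative_config`;
# both model names below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # large "verifier" model
    speculative_config={
        "method": "eagle3",                    # EAGLE-3 drafting
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",  # small "speculator"
        "num_speculative_tokens": 5,           # tokens drafted per step
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```

Since the verifier has the final say on every token, the outputs match what the big model would produce on its own; the speculator only affects how fast you get them.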