
Quantizing Olmo 3: Most Efficient and Accurate Formats

By Benjamin Marie

Olmo 3 is a standard 7B/32B decoder-only transformer. No MoE, no exotic attention, no flashy new architecture tricks. Most of the changes are in the training pipeline: the data (Dolma 3), the midtraining mix (Dolmino), the long-context stage (Longmino), and the “thinking” stack (Dolci, SFT, DPO, RL).

In this article, I briefly go through what actually changed compared to Olmo 2, and what didn’t work.

Then I move to quantization. I quantized Olmo 3 7B and 32B with several standard recipes that are practical to run on a single consumer GPU (a minimal sketch of one recipe follows the list):

  • gptq-w4a16-g128

  • fp8-dynamic

  • nvfp4

  • awq with custom mappings for Olmo 3

  • W8A8 (INT8) with SmoothQuant
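The scheme names above match the presets in vLLM's llm-compressor library, which is what I assume for the sketch below; the full script I actually used is in the notebook linked further down. Purely as a minimal illustration, with a placeholder model ID, an fp8-dynamic pass looks roughly like this:

```python
# Minimal sketch, assuming llm-compressor; the model ID below is a placeholder,
# not necessarily the exact Olmo 3 checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "allenai/Olmo-3-7B-Instruct"  # placeholder checkpoint name

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC quantizes weights to FP8 offline and activations dynamically at
# inference time, so this particular recipe needs no calibration set.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load the quantized checkpoint.
save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```

The other recipes follow the same oneshot pattern but generally add a calibration dataset and their own modifiers (e.g., GPTQModifier, AWQModifier, SmoothQuantModifier).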

Finally, I look at how these variants behave on benchmarks (accuracy, pass@k, and token efficiency), plus some notes on hardware choices (RTX 5090 vs. RTX 6000) and what actually matters once you run long contexts with concurrent queries.
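As a quick reference for the metric, and independent of the exact evaluation harness, pass@k is usually reported with the unbiased estimator from the Codex paper: generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k drawn samples passes.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021); shown here only as a
# reference for how the metric is commonly computed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 generations per problem, 4 of them correct
print(pass_at_k(16, 4, 1))  # 0.25
print(pass_at_k(16, 4, 8))  # ~0.96
```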

I used the same quantization script I released last week:

Get the notebook (#190)

I released my quantized models here:
