Unsloth's Quantization-Aware Training (QAT) vs Post-Training Quantization (PTQ) for Small Models
Quantization is a common way to shrink large language models (LLMs). In practice, it’s a form of compression that reduces parameter precision, typically from 16-bit (BF16/FP16) to lower-precision formats like 8-bit or 4-bit. Most deployments apply this via post-training quantization (PTQ).
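To make the rounding error concrete, here is a minimal sketch in plain PyTorch (not Unsloth or any quantization library's code): weights are scaled onto a signed integer grid, rounded, and mapped back to float. Real deployments usually quantize per channel or per group rather than per tensor, but the nature of the error is the same.

```python
# Minimal sketch: symmetric per-tensor quantization to INT8 / INT4, then
# dequantization, to show the rounding error that PTQ introduces.
import torch

def quant_dequant(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Round onto a signed integer grid with `bits` bits, then map back to float."""
    qmax = 2 ** (bits - 1) - 1                # 127 for INT8, 7 for INT4
    scale = w.abs().max() / qmax              # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized (lossy) weights

w = torch.randn(1024, 1024)
for bits in (8, 4):
    err = (w - quant_dequant(w, bits)).abs().mean().item()
    print(f"INT{bits} mean absolute error: {err:.5f}")
```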
On very large models, PTQ often preserves downstream accuracy remarkably well. But on smaller models (a few billion parameters, or even sub-billion), PTQ can cause substantial accuracy degradation.
An alternative is quantization-aware training (QAT), which trains the model to be robust to quantization effects. QAT is usually expensive, and on bigger models I rarely find the gains worth the cost. For small models, though, it can make a difference without spending too much compute.
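The core trick behind QAT is "fake quantization": the forward pass sees weights (and often activations) that have already been rounded to the target grid, so the training loss accounts for quantization error, while a straight-through estimator lets gradients flow as if no rounding had happened. Here is a minimal sketch, again in plain PyTorch rather than Unsloth's actual implementation, roughly corresponding to an INT8-activation / INT4-weight scheme:

```python
# Minimal sketch of fake quantization with a straight-through estimator (STE).
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # STE: the forward pass uses the quantized values, the backward pass
    # treats the rounding as the identity function.
    return x + (q - x).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer with INT4 fake-quantized weights and INT8 fake-quantized activations."""
    def forward(self, x):
        return torch.nn.functional.linear(
            fake_quant(x, bits=8),            # activations on an INT8 grid
            fake_quant(self.weight, bits=4),  # weights on an INT4 grid
            self.bias,
        )
```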
Unsloth now supports QAT, letting us train models to be quantization-aware while adapting them to our task and data. Thanks to Unsloth’s efficiency, this is probably the most affordable way to fine-tune a model that remains robust under quantization. In this article, I put Unsloth’s QAT to the test on a deliberately hard setting: English→French translation with a very small model, Gemma 3 270M. In earlier work, I had good success fine-tuning this model for translation, but as we’ll see, introducing quantization through PTQ can make things fragile. Can QAT limit the damage?
I evaluate the two QAT schemes Unsloth offers for this setup, INT4 and INT8-INT4, comparing their final accuracy and training cost against PTQ. I use full fine-tuning (not LoRA), since the model is already quite small.
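For orientation, the setup looks roughly like the sketch below. The `qat_scheme` argument name and the `unsloth/gemma-3-270m-it` model id are assumptions on my part, based on Unsloth's QAT documentation; the notebook linked below has the exact calls for your installed version.

```python
# Hedged sketch of the experimental setup (argument names and model id are
# assumptions; check the notebook and your Unsloth version for the exact API).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",  # assumed hub id for Gemma 3 270M
    max_seq_length=2048,
    load_in_4bit=False,       # keep BF16 weights for full fine-tuning
    full_finetuning=True,     # full fine-tuning rather than LoRA
    qat_scheme="int8-int4",   # or "int4"; assumed QAT scheme selector
)
# Training then proceeds with a standard trl SFTTrainer over the En->Fr pairs.
```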
Here’s the notebook I used to run these Unsloth QAT experiments:
Quantization-Aware Training: int4, fp8-int4, fp8-fp8, and int8-int4