Accelerate Models with Quantization: Recipes for NVFP4, GPTQ, AWQ, SmoothQuant, AutoRound, and FP8
Running LLMs is easy. Quantizing LLMs is also easy. But running quantized LLMs? That often doesn't work as expected. This is one of the reasons GGUF is so popular: it's a format that can be run easily by frameworks like Ollama and llama.cpp.
However, if you want state-of-the-art quantization accuracy and to take advantage of highly optimized CUDA kernels for INT4, FP8, and FP4 models, you often need to get your hands a bit dirty.
In this article, I explore six different quantization recipes that yield models optimized to run very fast with vLLM. We've already applied most of them in previous articles using different frameworks (a minimal sketch of one recipe follows the list):
- W4A16: INT4 quantized weights with GPTQ, AWQ, and AutoRound, calibrated/tuned
- W8A8: INT8 quantized weights and quantized activations, calibrated with SmoothQuant
- FP8-Dynamic: FP8 quantized weights with dynamically quantized activations
- NVFP4: FP4 quantized weights and activations, calibrated
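To make the recipes more concrete, here is a minimal sketch of the simplest one, FP8-Dynamic, using llm-compressor (the quantization library maintained alongside vLLM). The Qwen3 model ID, output directory, and exact keyword arguments are assumptions for illustration and may differ from the script used in this article; the import path of `oneshot` also varies across llm-compressor versions.

```python
# Minimal FP8-Dynamic sketch with llm-compressor (assumed setup, not the article's exact script).
# FP8-Dynamic is data-free: weights are quantized to FP8 offline, activations are
# quantized dynamically at inference time, so no calibration dataset is needed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"  # assumed model ID
SAVE_DIR = "Qwen3-4B-Instruct-2507-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to FP8, keep the LM head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in the compressed-tensors format that vLLM can load directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The calibrated recipes (GPTQ, AWQ, SmoothQuant, NVFP4) follow the same pattern but pass a calibration dataset and a different modifier or scheme to `oneshot`.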
All these recipes can be run on a single consumer GPU, but you’ll need a recent one (for FP8 and NVFP4 in particular), such as an RTX 50xx. I used an RTX 5090 (from RunPod) and was able to quantize 8B models. None of these recipes took more than an hour.
I also provide a single customizable script capable of running each of these recipes. You can find it here:
In the following sections, we’ll test each recipe with Qwen3 4B Instruct and also its Thinking variants to measure the impact on reasoning and long-sequence generation. I report both inference throughput and accuracy on popular benchmarks.
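For the throughput side of the evaluation, quantized checkpoints can be loaded directly with vLLM's offline API. The sketch below assumes the FP8-Dynamic directory produced above; the prompt and sampling settings are illustrative, not the benchmark configuration used in this article.

```python
# Quick generation check with a quantized checkpoint in vLLM (illustrative settings).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen3-4B-Instruct-2507-FP8-Dynamic", max_model_len=4096)
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.generate(
    ["Explain in one paragraph why FP8 inference can be faster than BF16."],
    params,
)
print(outputs[0].outputs[0].text)
```

vLLM detects the quantization scheme from the checkpoint's config, so no extra flag is needed as long as the GPU supports the format (FP8 and NVFP4 in particular need recent hardware, such as the RTX 50xx used here).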
Note: I focused on Qwen3 in this article, but I could quantize Olmo 3 with the same script. You can find my quantized Olmo 3 here (still ongoing):
6 Quantization Recipes
