Scaling RL and Self-Verifiable Reasoning: INTELLECT-3 and DeepSeekMath-V2
Hi Everyone,
In this edition of The Weekly Kaitchup, I discuss:
INTELLECT-3: A Better GLM-4.5-Air
DeepSeekMath-V2: A New Math Model to Verify Mathematical Proofs
I’ll be at NeurIPS in San Diego next week!
If you’d like me to attend specific talks or ask questions to certain authors, or if you have particular recommendations on what to see, let me know in the comments. I’ll mainly focus on work around quantization, PEFT, and evaluation, and I’ll share a full report with all the interesting things I learn.
Also, since San Diego is quite far from my little corner of the French countryside, I’ll probably skip my usual Monday article and publish it on Tuesday/Wednesday instead.
Black Friday Subscription Discount
For Black Friday, I’m offering a 30% discount on the yearly subscription to The Kaitchup:
With this subscription, you get instant access to all the AI notebooks (180+), articles, and tutorials (200+).
INTELLECT-3: A Better GLM-4.5-Air
GLM models are very popular right now because they perform well on most tasks. Among open-weight models, I prefer them over the recent DeepSeek and Kimi models.
GLM-4.5-Air is far smaller, and thus much easier to run, than GLM-4.6, but it is already 4 months old! Thanks to Prime Intellect, we just got a very good update.
INTELLECT-3 is a 106B-parameter MoE model (12B active) trained with RLVR on top of GLM-4.5-Air. They market the training as "end-to-end", which isn't strictly true, but that's how they present it.
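If the acronym is new to you: RLVR simply means the reward comes from a programmatic check (an exact answer match, passing unit tests, a proof checker) rather than from a learned reward model. Here is a minimal sketch of such a reward for math problems; the function and the boxed-answer convention are my illustration, not code from the INTELLECT-3 report:

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Toy verifiable reward: 1.0 iff the final boxed answer matches the
    reference. Real pipelines use more robust extraction/normalization."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# A correct completion earns reward 1.0, anything else 0.0.
print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```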
The work is primarily about infrastructure: it exposes a production-style RL stack for long-context, tool-using models, not just a checkpoint.
The stack covers asynchronous RL, standardized environments, and large-scale sandboxed code execution, all built around open-weight models. The result is a reproducible recipe for scaling RLVR to 512 H200 GPUs with long contexts and agentic behavior. No, this recipe is not for everyone: at Prime Intellect's GPU pricing, that cluster costs ~$1,300/hour. They mentioned they used these GPUs for 2 months, so that's nearly a $2M model if you want to do the same in the cloud.
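The back-of-the-envelope arithmetic, assuming the full cluster runs continuously for those two months (my simplification; the report presumably has exact GPU-hour figures):

```python
# Rough replication cost at the quoted cluster rate; assumes the full
# 512-GPU cluster runs non-stop for ~2 months (a simplification).
usd_per_cluster_hour = 1_300          # ~512 H200s at Prime Intellect pricing
hours = 2 * 30 * 24                   # ~2 months of wall-clock time
print(f"~${usd_per_cluster_hour * hours / 1e6:.1f}M")  # ~$1.9M
```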
RLVR stack and system design
The core engine is prime-rl, their asynchronous off-policy RL framework, which splits responsibilities across a CPU orchestrator, a trainer, and an inference pool. Training uses FSDP2-based data parallelism and torchtitan-style parallelism for the MoE layers. Inference uses a fleet of OpenAI-compatible vLLM servers extended to accept hot weight updates. The orchestrator is stateless and cheap: it streams rollouts from inference, forms batches, and feeds them to the trainer.
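To make that division of labor concrete, here is a minimal sketch of the three-way split. Everything in it (the endpoint, the served model name, the trainer stub, the substring reward) is an illustrative assumption on my part, not prime-rl's actual API:

```python
# Sketch of the split described above: inference pool (vLLM servers),
# stateless orchestrator, and trainer. All names are assumptions.
import queue
import threading
import requests

VLLM_URL = "http://localhost:8000/v1/completions"   # assumed vLLM address
rollouts: "queue.Queue[dict]" = queue.Queue(maxsize=256)

def rollout_worker(tasks: list[tuple[str, str]]) -> None:
    """Streams completions from an OpenAI-compatible vLLM server and
    scores them with a toy verifiable reward (substring match)."""
    for prompt, reference in tasks:
        resp = requests.post(
            VLLM_URL,
            json={"model": "intellect-3", "prompt": prompt, "max_tokens": 1024},
            timeout=600,
        ).json()
        text = resp["choices"][0]["text"]
        rollouts.put({"prompt": prompt, "completion": text,
                      "reward": float(reference in text)})

def trainer_step(batch: list[dict]) -> None:
    """Stub for the off-policy RL update (in the real stack, a policy
    gradient step on FSDP2-sharded weights)."""
    mean_r = sum(r["reward"] for r in batch) / len(batch)
    print(f"update on {len(batch)} rollouts, mean reward {mean_r:.2f}")

def orchestrator(batch_size: int = 8, steps: int = 10) -> None:
    """Stateless loop: drain scored rollouts, form batches, feed the
    trainer. A hot weight push to the vLLM fleet would follow each step."""
    for _ in range(steps):
        trainer_step([rollouts.get() for _ in range(batch_size)])

# Usage (assuming a vLLM server is up and `tasks` is a list of
# (prompt, reference) pairs):
# threading.Thread(target=rollout_worker, args=(tasks,), daemon=True).start()
# orchestrator()
```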
