The Limits of GRPO-like Methods for Reinforcement Learning
Hi Everyone,
In this edition of The Weekly Kaitchup, I discuss:
The limits of current GRPO-like methods
The SYNTH/Bagettotron releases
Book Update
Everything is now bundled into a single 140-page PDF plus 9 companion notebooks. If you bought the book, you received it earlier this week.
Current chapters:
Parameter-Efficient Fine-Tuning
Prepare Your Training Dataset
LLM Quantization
Fine-Tuning Quantized LLMs
Efficient Inference with vLLM
One chapter is still in progress: LLM Evaluation. I’ll publish this chapter in December. Then, regular updates are planned in 2026 to keep the content relevant.
You can still grab the book at 30% off until November 30.
I read this very interesting paper on the limits of current RLVR-like methods (GRPO, GSPO, etc.) used to post-train LLMs:
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
RLVR (Reinforcement Learning with Verifiable Rewards) has been credited for recent “reasoning LLMs,” but this work shows it mostly sharpens sampling efficiency rather than expanding a model’s underlying reasoning capacity.
Something often underestimated with LLMs: The Sampling Effect
LLM outputs can vary a lot under stochastic decoding (temperature, top-p, etc.). We explored these effects on a quantized model here:
Benchmark scores you read in papers, especially on hard sets, often reflect average accuracy. Run an AIME25 prompt 100 times and you may see ten-plus distinct answers. That’s where LLMs are today…
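To make the sampling effect concrete, here is a minimal sketch: sample the same prompt many times under stochastic decoding and count the distinct final answers. The model name, the prompt placeholder, and the crude answer-extraction regex are illustrative assumptions, not something from the paper or this newsletter.

```python
# Minimal sketch: sample one prompt many times and count distinct final answers.
# Model name, prompt, and the answer-extraction regex are illustrative assumptions.
import re
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # any instruct model works for this experiment
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Solve: ... (an AIME-style problem). Give the final answer as an integer."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stochastic decoding: temperature and top-p make every draw different.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=512,
    num_return_sequences=100,
)

answers = []
prompt_len = inputs["input_ids"].shape[1]
for seq in outputs:
    completion = tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
    integers = re.findall(r"-?\d+", completion)   # crude: take the last integer produced
    answers.append(integers[-1] if integers else "no answer")

print(Counter(answers).most_common())             # hard problems often yield 10+ distinct answers
```

Averaging over such runs is what a reported benchmark score hides: the spread of answers, not just the mean accuracy.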
If you've run a lot of GRPO trainings, you've probably already seen this: RL-trained variants win when you can sample only a few outputs (small k, e.g., pass@1), yet the original base models overtake them as you allow more samples (large k, e.g., pass@128–1024). In other words, RL concentrates probability mass on already-rewarded trajectories without discovering fundamentally new reasoning paths. The authors conducted a large-scale study documenting this across coding, vision, and language tasks.
Put simply: if a model can’t already answer a question, GRPO probably won’t make it do so. Methods like GRPO mainly increase the chance of producing the correct answer. They don’t create new knowledge.
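One way to see why: GRPO's advantage is computed relative to the other completions sampled for the same prompt. If none of the sampled completions is correct, every reward in the group is identical, the advantages collapse to zero, and there is no gradient signal to push the model toward a solution it never produces. A minimal sketch of that group-relative advantage (function name, group size, and rewards are illustrative):

```python
# Minimal sketch of GRPO's group-relative advantage (names and values are illustrative).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: verifiable 0/1 rewards for the G completions sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Case 1: the base model already finds the answer sometimes -> non-zero advantages,
# so training pushes more probability mass onto the rewarded completions.
print(group_relative_advantages(np.array([1.0, 0.0, 0.0, 0.0])))

# Case 2: no sampled completion is correct -> all rewards equal, all advantages zero,
# so there is no learning signal to discover a genuinely new solution path.
print(group_relative_advantages(np.array([0.0, 0.0, 0.0, 0.0])))
```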
The key evidence is pass@k curves. If RL truly enlarged a model’s reasoning space, it should dominate base models even at high k, because “more draws” would expose more of its purported new capabilities. Instead, base models eventually match and surpass the RL variants, implying that the RL model’s correct solutions already exist within the base model’s distribution. RL just makes those few “good paths” easier to sample.
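For reference, pass@k curves like these are usually computed with the standard unbiased estimator from the Codex/HumanEval evaluation. A minimal sketch (the numbers plugged in below are illustrative, not results from the paper):

```python
# Unbiased pass@k estimator: probability that at least one of k draws is correct,
# estimated from n samples of which c were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn per problem, c: correct samples, k: sampling budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Same underlying model, very different picture at small vs. large k:
# a base model that is right only 5 times out of 1024 has a tiny pass@1,
# but its estimated pass@1024 is 1.0 -- which is how the curves can cross.
print(pass_at_k(n=1024, c=5, k=1))      # ~0.005
print(pass_at_k(n=1024, c=5, k=1024))   # 1.0
```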
