
The Limits of GRPO-like Methods for Reinforcement Learning


Hi Everyone,

In this edition of The Weekly Kaitchup, I discuss:

  • The limits of current GRPO-like methods

  • The SYNTH/Baguettotron releases


Book Update

Everything is now bundled into a single 140-page PDF plus 9 companion notebooks. If you bought the book, you received it earlier this week.

Current chapters:

  1. Parameter-Efficient Fine-Tuning

  2. Prepare Your Training Dataset

  3. LLM Quantization

  4. Fine-Tuning Quantized LLMs

  5. Efficient Inference with vLLM

One chapter is still in progress: LLM Evaluation. I’ll publish this chapter in December. Then, regular updates are planned in 2026 to keep the content relevant.

You can still grab the book at 30% off until November 30.


I read this very interesting paper on the limits of current RLVR-like methods (GRPO, GSPO, etc.) used to post-train LLMs:

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

RLVR (Reinforcement Learning with Verifiable Rewards) has been credited with the recent wave of “reasoning LLMs,” but this work shows it mostly sharpens sampling efficiency rather than expanding a model’s underlying reasoning capacity.

Something often underestimated with LLMs: The Sampling Effect

LLM outputs can vary a lot under stochastic decoding (temperature, top-p, etc.). We explored these effects on a quantized model here:

Benchmark scores you read in papers, especially on hard sets, often reflect average accuracy. Run an AIME25 prompt 100 times and you may see ten-plus distinct answers. That’s where LLMs are today…
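
If you want to see this effect for yourself, here is a minimal sketch with vLLM: sample the same prompt 100 times with stochastic decoding and count the distinct final answers. The model name, the prompt, and the “Final answer:” extraction are placeholders I picked for illustration, not a real evaluation protocol.

```python
# Minimal sketch: sample the same prompt many times with stochastic decoding
# and count how many distinct final answers come back.
# Assumptions: vLLM is installed, the model name is a placeholder, and the
# model is asked to end its response with "Final answer: <integer>".
import re
from collections import Counter

from vllm import LLM, SamplingParams

prompt = (
    "Solve the problem and end with 'Final answer: <integer>'.\n\n"
    "<AIME-style problem here>"
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(n=100, temperature=0.8, top_p=0.95, max_tokens=2048)

# One prompt, 100 sampled completions
completions = llm.generate([prompt], params)[0].outputs

answers = Counter()
for out in completions:
    m = re.search(r"Final answer:\s*(-?\d+)", out.text)
    answers[m.group(1) if m else "<no answer found>"] += 1

print(f"{len(answers)} distinct answers over {len(completions)} samples")
print(answers.most_common(10))
```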

If you’ve run a lot of GRPO trainings, this is probably something you’ve already seen: RL-trained variants win when you can sample only a few outputs (small k, e.g., pass@1), yet the original base models overtake them as you allow more samples (large k, e.g., pass@128–1024). In other words, RL concentrates probability mass on already-rewarded trajectories without discovering fundamentally new reasoning paths. The authors document this in a large-scale study across math, coding, and visual reasoning tasks.
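
The standard way to quantify this is the unbiased pass@k estimator (Chen et al., 2021): with n samples per problem, of which c are correct, the probability that at least one of k random draws is correct is 1 - C(n-c, k)/C(n, k). Here is a short sketch; the counts below are made-up illustrations, not numbers from the paper.

```python
# Unbiased pass@k estimator: given n samples per problem of which c are
# correct, the probability that at least one of k randomly chosen samples
# is correct is 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:  # fewer incorrect samples than draws -> guaranteed pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up counts for illustration: a model that is often right per sample
# saturates quickly, while rare correct solutions only surface at large k.
n = 256
print(pass_at_k(n, c=40, k=1))   # ~0.16 -> strong pass@1
print(pass_at_k(n, c=40, k=64))  # ~1.00 -> saturates quickly
print(pass_at_k(n, c=2, k=64))   # ~0.44 -> rare solutions still show up at large k
```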

Put simply: if a model can’t already answer a question, GRPO probably won’t make it do so. Methods like GRPO mainly increase the chance of producing the correct answer. They don’t create new knowledge.

The key evidence is pass@k curves. If RL truly enlarged a model’s reasoning space, it should dominate base models even at high k, because “more draws” would expose more of its purported new capabilities. Instead, base models eventually match and surpass RL variants, implying that the RL model’s correct solutions already exist within the base model’s distribution. RL just makes those few “good paths” easier to sample.
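
To make the crossover concrete, here is a toy version of those curves. Treat each benchmark problem as having a per-sample success probability p, so pass@k = 1 - (1 - p)^k, and compare a hypothetical base model (broad but unreliable coverage) with a hypothetical RL-tuned one (narrow but reliable). The probabilities are invented to illustrate the shape of the curves, not taken from the paper.

```python
# Toy model of the pass@k crossover with made-up per-problem probabilities:
# the base model has a small chance on many problems, the RL-tuned model is
# very reliable on a subset of them and near-zero elsewhere.
base_p = [0.05] * 60 + [0.0] * 40  # broad but unreliable coverage
rl_p   = [0.80] * 35 + [0.0] * 65  # narrow but highly reliable coverage

def mean_pass_at_k(probs, k):
    # pass@k for one problem with per-sample success probability p is
    # 1 - (1 - p)^k; average it over the benchmark.
    return sum(1.0 - (1.0 - p) ** k for p in probs) / len(probs)

for k in (1, 4, 16, 64, 256, 1024):
    print(f"k={k:5d}  base={mean_pass_at_k(base_p, k):.2f}  "
          f"rl={mean_pass_at_k(rl_p, k):.2f}")

# Small k favors the RL-tuned model (pass@1: 0.28 vs 0.03), but large k favors
# the base model (pass@1024: 0.60 vs 0.35): its rare correct paths eventually
# get sampled, while the RL model never solves the problems it has dropped.
```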

...
Read full article on The Kaitchup →