Distribution Matching Prevents Mode Collapse in Training Reasoning Models
This post summarizes "Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity" by Germán Kruszewski, Pierre Erbacher, Jos Rozen, and Marc Dymetman, accepted at ICLR 2026.
Reinforcement learning has become the dominant approach for training LLMs to reason. It works. But it comes with a side effect that's increasingly hard to ignore: the models get more accurate and less diverse, often forgetting solutions they could originally find.
This post argues that this isn't a mysterious failure — it's a predictable consequence of how these models are optimized. And once you separate the question of what you want the model to learn from how you train it to get there, the fix becomes surprisingly clear.
Why diversity matters
Modern reasoning systems — math solvers, theorem provers, code models — increasingly rely on sampling many candidate solutions at inference time. So performance isn't just about getting one right answer; it's about the distribution of answers the model produces.
Two metrics capture this:
- Precision (pass@1): the probability that a single sample is correct.
- Coverage (pass@k): the probability that at least one of k samples is correct.
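For reference, pass@k is typically estimated with the standard unbiased combinatorial estimator from the code-generation literature; a minimal sketch (the function name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly chosen size-k subset of the n samples contains no correct one.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 correct out of 4 samples: pass@1 = 0.5, pass@3 = 1.0
```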
These two goals are in tension. In formal theorem proving, for instance, rare proof paths can be the only routes to solving hard problems. A model that collapses onto a single strategy may become faster at easy theorems while quietly forgetting the hard ones it used to occasionally crack.
The hidden objective behind RL training
Standard RLVR training optimizes something like: reward correct answers, penalize incorrect ones, and add a KL penalty to keep the model from drifting too far from the base.
This seems reasonable. But there's a mathematical subtlety worth unpacking.
Optimizing this objective is equivalent to minimizing the Reverse KL divergence toward a reward-weighted version of the base model's distribution. Reverse KL is mode-seeking: it heavily penalizes the model for placing probability mass where the target has almost none, but it's relatively indifferent to ignoring modes of the target altogether. In practice, this means the model concentrates on a few high-probability correct solutions and quietly abandons others.
This is why RLVR-trained models often show higher pass@1 but lower pass@k than their base models. The precision gain is real, but it comes at the cost of coverage.
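A two-outcome toy example makes the asymmetry concrete. Suppose the base model knows two equally likely correct solutions and the trained model keeps only one of them: the reverse direction barely notices, while the forward direction blows up. A small sketch (with an epsilon floor to keep the logarithms finite):

```python
import math

def kl(p, q, eps=1e-9):
    """KL(p || q) over discrete distributions, flooring q to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

target = [0.5, 0.5]      # two equally likely correct solutions
collapsed = [1.0, 0.0]   # trained model that dropped one of them

reverse_kl = kl(collapsed, target)  # the direction RLVR implicitly minimizes
forward_kl = kl(target, collapsed)  # the mass-covering direction

# reverse_kl is just log 2 ≈ 0.69: dropping a mode of the target is cheap.
# forward_kl is ≈ 10 here, and grows without bound as the floor shrinks:
# ignoring a mode of the target is heavily punished.
```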
Separating what from how
The central conceptual move in this work is a simple but clarifying distinction: what distribution do you want the model to represent, and how do you train it to get there?
Standard RL conflates these two questions. It defines a reward, picks an optimization algorithm, and lets the implicit objective emerge from that choice. The paper argues this is where diversity gets lost — not because the target is wrong, but because the method used to approximate it is mode-seeking by construction.
So: what distribution do we actually want? The answer is surprisingly natural. Start with the base model, and filter out all the wrong answers:
π_filter(y) ∝ v(y) · π_base(y)

where v(y) ∈ {0, 1} is a verifier that returns 1 exactly on correct outputs. This target distribution has two appealing properties: every output it assigns probability to is correct, and the relative probabilities of correct solutions are exactly preserved from the base model. It keeps everything the base model knows; it just removes the wrong answers.
This is the unique distribution that (i) always outputs correct solutions, and (ii) stays as close as possible to the base model in KL divergence. The paper calls this framework Distributional Matching with Verifiable Rewards (DMVR).
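Over a finite pool of candidate outputs, constructing this target is just masking and renormalizing. A sketch (the probabilities and the verifier here are illustrative):

```python
def filtered_target(base_probs: dict, is_correct) -> dict:
    """DMVR-style target: zero out incorrect outputs, renormalize the rest."""
    kept = {y: p for y, p in base_probs.items() if is_correct(y)}
    z = sum(kept.values())
    if z == 0:
        raise ValueError("base model puts no mass on any correct output")
    return {y: p / z for y, p in kept.items()}

base = {"proof_a": 0.6, "proof_b": 0.3, "bogus": 0.1}
target = filtered_target(base, lambda y: y != "bogus")
# target == {"proof_a": 2/3, "proof_b": 1/3}: the 2:1 ratio between the
# two correct proofs is preserved; the wrong answer is gone.
```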
One important clarification on how this connects to RL: standard RLVR doesn't optimize toward this exact target, but it does optimize toward a smoothed version of it, a reward-exponentiated variant of the form π(y) ∝ π_base(y) · exp(r(y)/β), and this smoothed version converges to the filtered distribution as the KL coefficient β → 0. This also explains why RL doesn't create new skills, as claimed in recent work: training reweights existing behaviors; it doesn't discover new ones. The DMVR framework makes this explicit: the target distribution is constructed from what the base model already knows.
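To see the smoothing numerically, here is a sketch of the reward-exponentiated tilt π_β(y) ∝ π_base(y) · exp(r(y)/β) for a binary reward r. As β shrinks, the mass on the wrong answer vanishes while the ratio between correct answers stays fixed (the distributions are made up for illustration):

```python
import math

def smoothed_target(base_probs: dict, reward, beta: float) -> dict:
    """Tilt the base model by exp(reward / beta) and renormalize."""
    w = {y: p * math.exp(reward(y) / beta) for y, p in base_probs.items()}
    z = sum(w.values())
    return {y: v / z for y, v in w.items()}

base = {"good_a": 0.6, "good_b": 0.3, "bad": 0.1}
r = lambda y: 1.0 if y != "bad" else 0.0

for beta in (1.0, 0.1, 0.01):
    t = smoothed_target(base, r, beta)
    print(f"beta={beta}: bad={t['bad']:.2e}, a:b={t['good_a'] / t['good_b']:.2f}")
# The a:b ratio is 2.00 at every beta; the mass on "bad" decays toward zero.
```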
Once the what is fixed and well-motivated, attention can shift entirely to the how. As we will see next, the diversity problem isn't in the target; it's in using Reverse KL to approximate it.
Divergence as a dial
Once you have a target distribution, training is a question of how to match it. Different divergences produce different tradeoffs:
- Reverse KL is mode-seeking: high precision, lower diversity. This is what RLVR implicitly uses.
- Forward KL is mass-covering: higher diversity, but it can spread mass onto low-quality regions.
Neither extreme is obviously right for all settings. To interpolate between them, the paper introduces α-divergences, which form a continuous family:
- α → 1: Reverse KL (recovers RLVR-style training)
- α → 0: Forward KL (recovers KL-DPG, and — when sampling is done offline from the base model — Rejection Sampling Fine-Tuning)
- α = 0.5: the squared Hellinger distance
This gives rise to α-DPG, which lets you tune the precision–diversity tradeoff with a single parameter. Notably, this unifies several existing approaches that previously looked like different algorithms into points on the same spectrum.
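In one common parameterization (Amari's; conventions differ across papers, so treat this as a sketch rather than the paper's exact loss), the family is D_α(p ‖ q) = (1 − Σ p^(1−α) q^α) / (α(1−α)), which tends to forward KL as α → 0 and reverse KL as α → 1:

```python
def alpha_divergence(p, q, alpha: float) -> float:
    """D_alpha(p || q) = (1 - sum p^(1-alpha) * q^alpha) / (alpha * (1 - alpha)).

    Tends to forward KL(p || q) as alpha -> 0 and reverse KL(q || p) as
    alpha -> 1; at alpha = 0.5 it is the squared Hellinger distance up to
    a constant factor.
    """
    assert 0.0 < alpha < 1.0, "use the KL limits at the endpoints"
    s = sum((pi ** (1 - alpha)) * (qi ** alpha) for pi, qi in zip(p, q))
    return (1.0 - s) / (alpha * (1.0 - alpha))

target = [0.50, 0.49, 0.01]     # two big modes plus one rare solution
collapsed = [0.98, 0.01, 0.01]  # model that kept only the first mode

# Low alpha (mass-covering) punishes the collapse more than high alpha:
print(alpha_divergence(target, collapsed, 0.1))  # ≈ 1.37
print(alpha_divergence(target, collapsed, 0.9))  # ≈ 0.66
```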
What the experiments show
The method is evaluated on LEAN, a formal theorem-proving environment where proofs are automatically verified. This is a useful testbed because correctness is binary, many valid proofs exist, and coverage directly affects how many theorems the model can solve.
A Pareto frontier emerges. Different values of α trace out a clean tradeoff between precision (pass@1) and coverage (pass@256). Most α-DPG models sit on or near this frontier, while many baselines do not. Low values of α (e.g., α = 0.25) achieve the best coverage among all methods, while high values (α ≥ 0.995) match or exceed standard RL methods on precision — typically while retaining higher coverage than those same methods.
Mode-seeking methods help the medium, hurt the hard. Problems are categorized by how often the base model solves them: easy (>80% of samples correct), medium (20–80%), hard (<20%), or unsolvable (zero correct in 256 attempts). After training with GRPO or high-α α-DPG, many medium problems become easy — but a notable number of previously hard problems become unsolvable. The model gets better at what it was already decent at, and forgets the rest.
GRPO: many medium problems become easy, but hard problems become unsolvable.
Low-α α-DPG (and GRPO with strong KL regularization) shows a more conservative pattern: fewer medium problems converted to easy, but almost no hard problems lost.
α-DPG (α=0.25): more conservative improvements, with hard problems remaining solvable.
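The difficulty buckets above are simple to compute from per-problem solve counts; a sketch following those thresholds (the function name and signature are mine):

```python
def difficulty_bucket(n_correct: int, n_samples: int = 256) -> str:
    """Bucket a problem by the base model's empirical solve rate."""
    if n_correct == 0:
        return "unsolvable"  # zero correct in n_samples attempts
    rate = n_correct / n_samples
    if rate > 0.8:
        return "easy"
    if rate >= 0.2:
        return "medium"
    return "hard"

# e.g. 230/256 correct -> "easy"; 10/256 -> "hard"; 0/256 -> "unsolvable"
```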
Diversity predicts coverage. Higher diversity in proof tactics and premises strongly correlates with better pass@256, and anticorrelates with pass@1. More exploration means a higher chance of stumbling onto a correct solution; more concentration means each individual sample is more likely to be right.
What this suggests going forward
The diversity problem in RL-trained reasoning models isn't mysterious. It follows from the geometry of Reverse KL optimization against a target with restricted support. When you use a mode-seeking divergence to approximate a filtered distribution, you get a mode-seeking model.
The deeper point is the what vs. how separation. Once you write down the target distribution explicitly — filter the base model, preserve the relative probabilities of correct solutions — it becomes clear that this target is actually quite good. It encodes everything the base model can already do correctly, nothing more and nothing less. The diversity loss that plagues RL training isn't a property of this target; it's an artifact of how RL approximates it.
That separation is practically useful. With the what fixed, you can focus your engineering attention entirely on the how — specifically, which divergence to minimize when training a policy to approximate the target. α-DPG makes this a single tunable parameter, moving continuously from "maximize reward aggressively" to "preserve all valid solutions." The experiments show this traces a genuine Pareto frontier, so you can choose a point on that curve that fits your deployment needs — whether that's a model that nails single-shot answers or one that reliably finds solutions given enough attempts.
For systems that scale at inference time by sampling many candidates, this matters. A model that retains diverse solution strategies will keep finding new theorems as you give it more attempts. A model that collapses to a few strategies hits a ceiling quickly, no matter how many samples you draw.
Code will be available at github.com/naver/alpha-dpg.

