arxiv:2604.03128

Self-Distilled RLVR

Published on Apr 3 · Submitted by steven young on Apr 6
#1 Paper of the day

Abstract

RLSD combines reinforcement learning with verifiable rewards (RLVR) and self-distillation: self-distillation supplies fine-grained, token-level update magnitudes, while verifiable environmental feedback supplies reliable update directions, yielding stable training.

AI-generated summary

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which obtains only sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information, such as reference answers, to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This lets RLSD harness the strengths of both RLVR and OPSD simultaneously, achieving a higher convergence ceiling and superior training stability.
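
The page gives no pseudocode, but the division of labor the summary describes can be sketched. The following is a minimal, hypothetical PyTorch sketch, not the authors' code: it assumes the teacher is the same model run with privileged context (e.g., the reference answer), that the per-token KL between teacher and student is detached and used only as an update-magnitude weight, and that the verifiable reward contributes only its sign as the update direction. All names (`rlsd_loss`, `student_logits`, `teacher_logits`, etc.) are illustrative.

```python
# Hypothetical sketch of the RLSD objective described above.
# Assumptions (not confirmed by the paper page): self-distillation yields a
# per-token KL used purely as a magnitude weight; RLVR's verifiable outcome
# in {+1, -1} fixes the gradient direction.

import torch
import torch.nn.functional as F

def rlsd_loss(student_logits, teacher_logits, action_ids, reward, mask):
    """
    student_logits: (B, T, V) logits from the policy on its own rollout
    teacher_logits: (B, T, V) logits from the same model given privileged
                    context (e.g., the reference answer)
    action_ids:     (B, T) sampled tokens of the rollout
    reward:         (B,) verifiable outcome, e.g. +1 correct / -1 incorrect
    mask:           (B, T) 1 for response tokens, 0 for prompt/padding
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)

    # Token-level policy difference from self-distillation: per-token
    # KL(teacher || student). Detached, so it sets *how much* each token
    # is updated (fine-grained magnitude) but never the sign.
    per_token_kl = (log_p_teacher.exp() *
                    (log_p_teacher - log_p_student)).sum(-1)      # (B, T)
    magnitude = per_token_kl.detach()

    # Log-probability of the tokens the policy actually sampled.
    logp_actions = log_p_student.gather(
        -1, action_ids.unsqueeze(-1)).squeeze(-1)                 # (B, T)

    # RLVR supplies the direction: push sampled tokens up on verified-
    # correct responses, down on incorrect ones.
    direction = reward.sign().unsqueeze(-1)                       # (B, 1)
    weighted = -direction * magnitude * logp_actions              # (B, T)
    return (weighted * mask).sum() / mask.sum().clamp(min=1)

if __name__ == "__main__":
    # Smoke test with random tensors.
    B, T, V = 2, 5, 11
    loss = rlsd_loss(
        torch.randn(B, T, V, requires_grad=True),
        torch.randn(B, T, V),
        torch.randint(V, (B, T)),
        torch.tensor([1.0, -1.0]),
        torch.ones(B, T),
    )
    loss.backward()
    print(loss.item())
```

Detaching the KL term is one way to keep the gradient direction governed purely by the verifiable reward; whether the paper detaches it, or combines the two signals differently, is not stated on this page.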

Community


Nice breakdown of this one here if anyone wants the TL;DR: https://arxivexplained.com/papers/self-distilled-rlvr. The part about RLVR is what got me.

A very insightful work for the community.


Get this paper in your agent:

hf papers read 2604.03128
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2604.03128 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2604.03128 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2604.03128 in a Space README.md to link it from this page.

Collections including this paper 3