Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
Abstract
Post-training of reasoning large language models can be improved by correcting distribution mismatches between supervised fine-tuning and reinforcement learning stages through importance sampling reweighting of the SFT loss.
Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks with Qwen 2.5, Qwen 3, and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME 2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
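The abstract does not spell out the objective, but the core idea of importance-sampling reweighting of the SFT loss can be sketched roughly as follows. This is a minimal, illustrative PyTorch sketch of a token-level variant, assuming per-token log-probabilities of the offline data under its generating (behavior) distribution have been precomputed; the function name, arguments, and the ratio clipping are assumptions for illustration, not necessarily PEAR's exact formulation.

```python
import torch

def reweighted_sft_loss(policy_logits, target_ids, behavior_logprobs, clip_max=5.0):
    """Illustrative token-level importance-reweighted SFT loss (not the paper's exact form).

    policy_logits:     [B, T, V] logits of the model being fine-tuned
    target_ids:        [B, T]    token ids from the offline SFT data
    behavior_logprobs: [B, T]    precomputed log-probs of those tokens under the
                                 distribution that generated the offline data
    """
    # Log-probabilities of the offline tokens under the current policy.
    logprobs = torch.log_softmax(policy_logits, dim=-1)
    token_logprobs = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

    # Per-token importance ratio pi_theta / mu, detached so it only rescales the
    # cross-entropy gradient; clipped here to keep variance in check (an assumption).
    ratio = torch.exp(token_logprobs.detach() - behavior_logprobs)
    weights = ratio.clamp(max=clip_max)

    # Standard negative log-likelihood, reweighted per token.
    return (weights * -token_logprobs).mean()
```

The block- and sequence-level variants mentioned in the abstract would presumably aggregate the log-ratio over a block or over the whole response before exponentiating; that is again a guess about the design space rather than a statement of the paper's choices.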
Community
A good objective for supervised post-training is commonly taken to be one that maximizes performance after the supervised stage. But when this supervised stage is followed by an online RL stage, SFT-stage gains may not be preserved after online RL. This paper experiments with a variety of supervised objectives and finds that the out-of-the-box performance of these objectives often changes after subsequent RL.
This highlights a mismatch between the two goals. The paper proposes a reweighting mechanism for standard supervised losses, designed to weight each token by the effect that learning on that token has on the RL stage. The approach is inspired by off-policy evaluation and computes weights from the likelihood of the continuation from each starting point, as sketched below. The paper presents multiple practical variants of this principle and demonstrates their effectiveness.
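As a rough illustration of the "likelihood of continuation from each starting point" idea, one could compute, for every position in an offline trace, the cumulative log importance ratio of the remaining suffix and use it as that position's weight. The sketch below assumes per-token log-probs under both the current policy and the data-generating distribution are available; the function name, the temperature knob, and the exact aggregation are hypothetical and may differ from the paper's formulation.

```python
import torch

def continuation_weights(policy_logprobs, behavior_logprobs, temperature=1.0):
    """Weight each position t by the relative likelihood that the current policy
    would produce the rest of the offline trace y_{t:T} (illustrative only).

    policy_logprobs, behavior_logprobs: [T] log-probs of the offline tokens under
    the current policy and the data-generating distribution, respectively.
    """
    # Per-token log importance ratio.
    log_ratio = policy_logprobs - behavior_logprobs
    # Suffix sums: total log-ratio of the continuation starting at each position.
    suffix = torch.flip(torch.cumsum(torch.flip(log_ratio, dims=[0]), dim=0), dims=[0])
    # Exponentiate (with an optional temperature) to obtain per-position weights.
    return torch.exp(suffix / temperature)
```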
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Trust-Region Adaptive Policy Optimization (2025)
- GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization (2026)
- Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes (2026)
- Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR (2026)
- Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning (2026)
- Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts (2026)
- Training-Trajectory-Aware Token Selection (2026)