Abstract
Surgical Post-Training (SPoT) enhances LLM reasoning by combining a data rectification pipeline with a reward-based binary cross-entropy objective, preventing catastrophic forgetting while remaining efficient.
Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
Community
Injecting new knowledge into LLMs via SFT often triggers catastrophic forgetting due to a "pull-up" effect, where boosting a target response unintentionally raises the probability of incorrect ones. While RL methods like GRPO are more robust, they are resource-heavy and struggle to synthesize knowledge not already latent in the model.
SPoT bridges this gap by introducing a "surgical" approach to post-training:
- The "Pull-up" & DPO Failure: We identify why SFT causes amnesia and why DPO’s relative ranking is insufficient for rigid knowledge injection.
- Reward-Based Regularization: SPoT uses a binary reward objective (pointwise instead of pairwise) to "tether" the model to the correct distribution.
- Minimal-Edit Rectification: By using precise, minimal data edits, SPoT injects new facts with high efficiency while preserving the model’s pre-existing reasoning and general capabilities.
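The pointwise-vs-pairwise distinction above can be made concrete with a small sketch. The snippet below is an illustrative toy, not the paper's implementation: it uses the DPO-style implicit reward (beta times the log-probability ratio against a reference model) and contrasts DPO's margin-based loss with a pointwise binary cross-entropy that classifies each response as correct or incorrect independently. Function names and scalar inputs are hypothetical.

```python
import math

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_pairwise_loss(logp_chosen, logp_rejected,
                      ref_chosen, ref_rejected, beta=0.1):
    # DPO supervises only the *margin* between implicit rewards, so the
    # absolute probabilities of both responses can drift together.
    r_c = beta * (logp_chosen - ref_chosen)
    r_r = beta * (logp_rejected - ref_rejected)
    return -math.log(_sigmoid(r_c - r_r))

def pointwise_bce_loss(logp, ref_logp, label, beta=0.1):
    # Pointwise alternative (toy version of the idea in SPoT): treat the
    # implicit reward as a logit and classify each response independently
    # as correct (label=1) or incorrect (label=0), tethering the model's
    # absolute log-probabilities to the reference distribution.
    r = beta * (logp - ref_logp)
    p = _sigmoid(r)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))
```

A quick check of the "tethering" intuition: shifting both responses' log-probabilities down by the same amount leaves the DPO loss unchanged (only the margin matters), while the pointwise BCE loss on the correct response grows, penalizing the drift.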
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting (2026)
- TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT (2026)
- RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning (2026)
- GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization (2026)
- Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning (2026)