Abstract
Surgical Post-Training (SPoT) enhances LLM reasoning by combining a data rectification pipeline with a reward-based binary cross-entropy objective, preventing catastrophic forgetting while remaining efficient.
Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
Community
Injecting new knowledge into LLMs via SFT often triggers catastrophic forgetting due to a "pull-up" effect, where boosting a target response unintentionally raises the probability of incorrect ones. While RL methods like GRPO are more robust, they are resource-heavy and struggle to synthesize knowledge not already latent in the model.
SPoT bridges this gap by introducing a "surgical" approach to post-training:
- The "Pull-up" & DPO Failure: We identify why SFT causes amnesia and why DPO’s relative ranking is insufficient for rigid knowledge injection.
- Reward-Based Regularization: SPoT uses a binary reward objective (pointwise instead of pairwise) to "tether" the model to the correct distribution.
- Minimal-Edit Rectification: By using precise, minimal data edits, SPoT injects new facts with high efficiency while preserving the model’s pre-existing reasoning and general capabilities.
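The pointwise-vs-pairwise distinction above can be made concrete with a small sketch. The snippet below is an illustrative toy, not the paper's implementation: it uses the DPO-style implicit reward (beta times the log-probability ratio against a reference model) and contrasts DPO's margin-based loss with a pointwise binary cross-entropy that classifies each response as correct or incorrect independently. Function names and scalar inputs are hypothetical.

```python
import math

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_pairwise_loss(logp_chosen, logp_rejected,
                      ref_chosen, ref_rejected, beta=0.1):
    # DPO supervises only the *margin* between implicit rewards, so the
    # absolute probabilities of both responses can drift together.
    r_c = beta * (logp_chosen - ref_chosen)
    r_r = beta * (logp_rejected - ref_rejected)
    return -math.log(_sigmoid(r_c - r_r))

def pointwise_bce_loss(logp, ref_logp, label, beta=0.1):
    # Pointwise alternative (toy version of the idea in SPoT): treat the
    # implicit reward as a logit and classify each response independently
    # as correct (label=1) or incorrect (label=0), tethering the model's
    # absolute log-probabilities to the reference distribution.
    r = beta * (logp - ref_logp)
    p = _sigmoid(r)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))
```

A quick check of the "tethering" intuition: shifting both responses' log-probabilities down by the same amount leaves the DPO loss unchanged (only the margin matters), while the pointwise BCE loss on the correct response grows, penalizing the drift.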
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting (2026)
- TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT (2026)
- RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning (2026)
- GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization (2026)
- Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning (2026)