Abstract
Surgical Post-Training (SPoT) enhances LLM reasoning by combining surgical data rectification with a reward-based binary cross-entropy objective, preventing catastrophic forgetting while remaining efficient.
Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
Community
Injecting new knowledge into LLMs via SFT often triggers catastrophic forgetting due to a "pull-up" effect, where boosting a target response unintentionally raises the probability of incorrect ones. While RL methods like GRPO are more robust, they are resource-heavy and struggle to synthesize knowledge not already latent in the model.
SPoT bridges this gap by introducing a "surgical" approach to post-training:
- The "Pull-up" & DPO Failure: We identify why SFT causes amnesia and why DPO’s relative ranking is insufficient for rigid knowledge injection.
- Reward-Based Regularization: SPoT uses a binary reward objective (pointwise instead of pairwise) to "tether" the model to the correct distribution.
- Minimal-Edit Rectification: By using precise, minimal data edits, SPoT injects new facts with high efficiency while preserving the model’s pre-existing reasoning and general capabilities.
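The pointwise-vs-pairwise distinction above can be sketched in code. This is a minimal illustration of the general form, not the paper's implementation: `implicit_reward` is the standard DPO-style implicit reward, `dpo_loss` is the usual pairwise DPO objective, and `pointwise_bce_loss` stands in for SPoT's reward-based binary cross-entropy; all function names and the choice of `beta` are assumptions for this sketch.

```python
import torch
import torch.nn.functional as F

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO-style implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)),
    # with per-response log-probs already summed over tokens.
    return beta * (logp_policy - logp_ref)

def dpo_loss(r_chosen, r_rejected):
    # Pairwise ranking: only the *margin* between chosen and rejected
    # rewards matters; shifting both by a constant leaves the loss unchanged.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def pointwise_bce_loss(rewards, labels):
    # Pointwise classification: each response is scored correct/incorrect
    # on its own, so rewards are tethered to an absolute reference point
    # (decoupled supervision) rather than a relative ranking.
    return F.binary_cross_entropy_with_logits(rewards, labels)
```

The key behavioral difference: adding a constant to every implicit reward leaves the DPO loss untouched (it is shift-invariant), while the pointwise BCE loss changes, which is one way to see how a binary objective anchors the model to the correct distribution instead of merely separating pairs.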