arxiv:2603.01683

Surgical Post-Training: Cutting Errors, Keeping Knowledge

Published on Mar 2 · Submitted by Linius Lin on Mar 4

Abstract

Surgical Post-Training (SPoT) enhances LLM reasoning capabilities by using data rectification and binary cross-entropy objectives to prevent catastrophic forgetting while maintaining efficiency.

AI-generated summary

Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT

Community

Paper submitter

Injecting new knowledge into LLMs via SFT often triggers catastrophic forgetting due to a "pull-up" effect, where boosting a target response unintentionally raises the probability of incorrect ones. While RL methods like GRPO are more robust, they are resource-heavy and struggle to synthesize knowledge not already latent in the model.

SPoT bridges this gap by introducing a "surgical" approach to post-training:

  • The "Pull-up" & DPO Failure: We identify why SFT causes amnesia and why DPO’s relative ranking is insufficient for rigid knowledge injection.
  • Reward-Based Regularization: SPoT uses a binary reward objective (pointwise instead of pairwise) to "tether" the model to the correct distribution.
  • Minimal-Edit Rectification: By using precise, minimal data edits, SPoT injects new facts with high efficiency while preserving the model’s pre-existing reasoning and general capabilities.
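The pointwise-vs-pairwise distinction above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, and we assume the DPO convention that each response carries an implicit scalar reward (the scaled log-ratio between the policy and a reference model). DPO's pairwise loss depends only on the reward *margin* between chosen and rejected responses, so both rewards can drift down together without the loss noticing; a pointwise binary cross-entropy "tethers" each response's reward to its own correctness label.

```python
import math

def dpo_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """DPO-style pairwise loss: only the margin between the two rewards matters."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def pointwise_bce_loss(reward: float, label: float) -> float:
    """Hypothetical pointwise objective in the spirit of SPoT: classify each
    response on its own. label = 1.0 for correct, 0.0 for incorrect."""
    p = 1.0 / (1.0 + math.exp(-reward))  # sigmoid of the implicit reward
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

# Same margin, very different absolute rewards: DPO's loss is identical,
# while the pointwise BCE penalizes the correct response's reward collapsing.
print(round(dpo_pairwise_loss(2.0, -1.0), 4))   # healthy rewards
print(round(dpo_pairwise_loss(-3.0, -6.0), 4))  # both rewards sank; same loss
print(round(pointwise_bce_loss(2.0, 1.0), 4))   # small loss
print(round(pointwise_bce_loss(-3.0, 1.0), 4))  # large loss: reward dropped
```

The decoupled supervision signal is visible in the last two lines: the pointwise loss reacts to the absolute drop in the correct response's reward, which is exactly the failure mode a relative ranking cannot see.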

