If you’ve ever fine-tuned a model on a new dataset and then tested it on the old task — you know the pain. The model that was great at medical QA yesterday now gives nonsense answers after you fine-tuned it on legal data today. This is catastrophic forgetting, and it’s been a known problem since McCloskey & Cohen (1989).
What’s surprising is how bad it still is with modern LLMs. Recent work (arXiv:2308.08747, IEEE 2025) shows that forgetting actually intensifies as model scale increases from 1B to 7B parameters. The bigger your model, the harder it forgets.
What People Actually Do Today
In practice, most teams don’t even try to solve this. They use workarounds:
· Separate models per domain — One fine-tuned model for medical, one for legal, one for finance. Works, but N domains = N models = N deployments. Infrastructure costs scale linearly and can consume up to 88% of a multi-year AI budget.
· Full retrain from scratch — Merge all data, retrain everything. Takes days or weeks, blocks iteration, and you still need all the original data on hand.
· RAG instead of fine-tuning — Give up on learning entirely and use retrieval. Works for some use cases, but retrieval quality degrades on complex reasoning tasks that benefit from weight-level learning.
The CL Methods That Exist (and Why They Fall Short)
The research community has been working on this for decades. Here’s an honest assessment of where things stand:
· EWC (Elastic Weight Consolidation) — Adds a penalty based on the Fisher information matrix to protect important weights. The problem: computing Fisher over the full prior dataset is expensive, the regularization coefficient is fragile to tune, and it still drifts 10–60% on real multi-domain workloads. (Kirkpatrick et al., 2017)
· Experience Replay — Mix a fraction of old training data into each new training batch. Robust when it works, but requires storing and managing prior data. In regulated industries (healthcare, finance), you may not be allowed to retain old training data. Also doesn’t scale cleanly — the replay buffer grows with every domain.
· Knowledge Distillation — Use the old model as a teacher while training on new data. Computationally expensive (you’re running two models), and recent work shows it struggles at 7B+ scale where the teacher’s logits become unstable.
· PackNet / Progressive Neural Networks — Freeze or grow the architecture per task. Freezing limits capacity; growing adds parameters linearly. Neither is practical for production deployment with 10+ domains.
· Gradient Projection (OGD, A-GEM) — Project gradients to avoid interfering with previous tasks. Elegant in theory, but requires storing gradient subspaces and becomes increasingly restrictive as domains accumulate.
· Self-Distillation Fine-Tuning (SDFT) — Recent work from MIT (2025) leverages the model’s own in-context learning to generate self-play data before fine-tuning. Promising early results, but adds a pre-processing step and hasn’t been validated at scale across many sequential domains.
· Collaborative Neural Learning (CNL) — Freezes conflicting neurons and only updates collaborative ones (Yang et al., 2026). Claims 59–82% forgetting reduction, but still not zero — and identifying conflicting neurons adds overhead.
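To make the EWC penalty concrete, here's a minimal toy sketch. The Fisher values below are made up for illustration (in practice they're estimated from squared gradients over task-A data), and `lam` is the fragile regularization coefficient mentioned above:

```python
def ewc_penalty(new_params, old_params, fisher, lam=1.0):
    """Quadratic penalty (lam/2) * sum_i F_i * (theta_i - theta*_i)^2
    pulling each weight back toward its task-A optimum theta*,
    weighted by its estimated Fisher importance F_i."""
    return lam / 2.0 * sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(new_params, old_params, fisher)
    )

# Toy numbers: three weights, one of them "important" (high Fisher).
w_star = [0.0, 0.0, 0.0]    # weights after task A
w_now  = [1.0, 1.0, 1.0]    # weights drifting during task B
fisher = [10.0, 1.0, 0.0]   # illustrative importance estimates

penalty = ewc_penalty(w_now, w_star, fisher, lam=2.0)
# (2/2) * (10*1 + 1*1 + 0*1) = 11.0
```

The penalty is large only where Fisher is large, which is the whole idea: unimportant weights stay free to learn task B.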
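Experience replay itself is trivial to sketch; the hard parts in practice are the data-retention policy and buffer growth, not the mixing. `make_mixed_batch` and `replay_frac` are names I'm inventing here for illustration:

```python
import random

def make_mixed_batch(new_data, replay_buffer, batch_size, replay_frac=0.2):
    """Mix a fixed fraction of retained old-domain examples into each
    new-domain training batch (hypothetical helper)."""
    n_replay = min(int(batch_size * replay_frac), len(replay_buffer))
    n_new = batch_size - n_replay
    batch = random.sample(new_data, n_new) + random.sample(replay_buffer, n_replay)
    random.shuffle(batch)
    return batch

legal_data  = [("legal", i) for i in range(100)]   # new domain
medical_buf = [("medical", i) for i in range(50)]  # retained old-domain data

batch = make_mixed_batch(legal_data, medical_buf, batch_size=10)
# each batch: 8 legal examples + 2 replayed medical examples
```

Note that `replay_buffer` is exactly the thing you may not be allowed to keep in a regulated setting.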
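The distillation objective is a KL term between the frozen old model (teacher) and the model being trained (student); the expense comes from running both forward passes on every batch. A toy version over a single token distribution:

```python
import math

def distill_loss(student_probs, teacher_probs):
    """KL(teacher || student): zero when the student matches the old
    model exactly, growing as its predictions drift away."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

teacher = [0.7, 0.2, 0.1]   # old model's distribution over three tokens
drifted = [0.1, 0.2, 0.7]   # student after naive fine-tuning

loss = distill_loss(drifted, teacher)   # positive: the student drifted
```

This term is added to the new-task loss, so training trades off new learning against staying close to the teacher.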
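Gradient projection is also easy to state: compute a reference gradient on a small buffer of old-task data, and if the new-task gradient points against it, project out the conflicting component. This is the A-GEM rule (variable names are mine), shown on flattened 2-D gradients:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def agem_project(g, g_ref):
    """A-GEM projection: if g conflicts with the old-task reference
    gradient g_ref (negative inner product), subtract the component
    of g along g_ref; otherwise return g unchanged."""
    d = dot(g, g_ref)
    if d >= 0:
        return list(g)
    scale = d / dot(g_ref, g_ref)
    return [gi - scale * ri for gi, ri in zip(g, g_ref)]

g_new = [1.0, -2.0]   # gradient on the new domain
g_ref = [1.0,  1.0]   # gradient on a reference batch of old-task data

g_proj = agem_project(g_new, g_ref)
# dot(g_new, g_ref) = -1 < 0, so project; result is orthogonal to g_ref
```

The projected update can no longer increase the (linearized) old-task loss, which is the elegance; the restriction the bullet above complains about is that as domains accumulate, more and more directions get projected away.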
The Real Question
After decades of research, we still don’t have a clean, production-ready solution for: “train one model on domain A, then B, then C, then D — and have it remember all of them.” No replay buffers, no growing architectures, no fragile hyperparameters.
That’s the problem I’ve been working on for the past six months. I’ve tried EWC, replay, KD, and gradient projection across 50+ experiments, and most of them failed. But I found something that works, and I’m running multi-domain benchmarks on Mistral-7B right now. I’ll share results here soon.
Curious to hear from others: what’s your experience with catastrophic forgetting? Have you found anything that actually works in production? What CL methods have you tried?