RESEARCH MEDIUM NEW

Forgotten but recoverable: why LLM machine unlearning keeps leaking back

Multiple 2025-2026 papers show 'unlearned' knowledge in LLMs is routinely recoverable — via quantization, adversarial prompting, and now reasoning traces. Treating unlearning as erasure is a mistake.

2026-06-08 // 7 min affects: open-weight-llms, llama, reasoning-models

What is this?

Machine unlearning is the family of techniques that try to make a trained language model “forget” a specific slice of what it learned — a person’s data after a deletion request, copyrighted text, or hazardous knowledge such as the bioweapon and cyber content in the WMDP benchmark. It is increasingly invoked as a compliance and safety control: rather than retrain a model from scratch (expensive) every time something must be removed, you run an unlearning procedure that cheaply suppresses the target.

A steady line of research from 2024 through 2026 keeps arriving at the same uncomfortable conclusion: most unlearning does not erase knowledge, it hides it — and the hiding is shallow. The newest entry, Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning (arXiv:2604.04255, Iowa State University, posted April 2026), extends the problem to reasoning models. It joins REBEL (arXiv:2602.06248, February 2026), the ICLR 2025 quantization paper, a step-by-step reasoning attack (June 2025), and a systematization of knowledge (June 2025) in showing that “forgotten” is not the same as “gone.”

How it works

The core problem is one of evaluation. Standard unlearning benchmarks query the model with benign, direct questions (“Who is X?”) and declare success when the answer no longer appears. But suppressing a model’s most likely output is not the same as removing the underlying representation. Several independent recovery channels exploit that gap:

Recovery channel        What it exploits                         Reported effect
----------------------  ---------------------------------------  ----------------------------
Quantization            Unlearning nudges weights only slightly; Forget-knowledge retained
                        low-precision rounding undoes the nudge  rises ~21% -> ~83% at 4-bit
Adversarial prompting   Benign-query metrics miss residual       REBEL ASR up to 60% (TOFU),
(evolutionary search)   knowledge reachable by harder prompts    93% (WMDP)
Reasoning probes        Step-by-step elicitation pulls "erased"  62.5% of crafted prompts
                        facts back into the output               recovered target facts
Reasoning-model attack  Long rationales are a weak optimization  Misleading-but-convincing
                        surface during unlearning itself         traces; wrong final answers

The quantization result is the most vivid. Because utility-preserving unlearning only perturbs weights gently, simply converting the unlearned model to 4-bit — a routine deployment step — restores an average of roughly 83% of the “forgotten” knowledge, versus ~21% retained at full precision. REBEL attacks from the prompt side: an evolutionary loop evolves adversarial queries that pull residual knowledge back out, reaching attack success rates up to 60% on TOFU and 93% on WMDP, while ordinary benign queries would have scored the same models as “successfully unlearned.” No exploit payload is needed to understand the lesson, and none is reproduced here.
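The rounding mechanism behind the quantization result can be illustrated with a toy sketch. It is not the paper's experiment: the weights are random numbers, the "unlearning" step is a small random perturbation, and `quantize` and `fraction_restored` are hypothetical helpers that assume symmetric uniform quantization with a fixed clipping range of 1.0. The quantity it measures (the share of weights that land on the same integer code as the original model) is only a rough proxy for the mechanism, not the knowledge-retention metric reported above.

```python
import random


def quantize(w, bits, max_abs=1.0):
    """Snap w to a symmetric uniform grid and return its integer code.

    Assumption: a fixed clipping range of +/- max_abs shared by both models.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    step = max_abs / qmax               # width of one quantization bucket
    code = round(w / step)
    return max(-qmax, min(qmax, code))  # clamp values outside the range


def fraction_restored(original, unlearned, bits):
    """Share of weights whose quantized code matches the original model's."""
    same = sum(
        quantize(a, bits) == quantize(b, bits)
        for a, b in zip(original, unlearned)
    )
    return same / len(original)


rng = random.Random(0)
# Toy "pre-unlearning" weights.
original = [rng.uniform(-1, 1) for _ in range(10_000)]
# Toy "unlearned" weights: each one is nudged by at most 0.01.
unlearned = [w + rng.uniform(-0.01, 0.01) for w in original]

restored_4bit = fraction_restored(original, unlearned, 4)
restored_8bit = fraction_restored(original, unlearned, 8)
print(f"codes identical to the original at 4-bit: {restored_4bit:.1%}")
print(f"codes identical to the original at 8-bit: {restored_8bit:.1%}")
```

Under these assumptions the 4-bit bucket is about 0.14 wide, far larger than the nudge, so most perturbed weights round back to the original model's code. The 8-bit bucket is smaller than the largest nudges, so more of the perturbation survives rounding.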

Why it matters

The risk surface is two-sided. On the privacy side, organizations that run unlearning to satisfy a deletion or right-to-erasure request may be telling regulators and users that data is gone when it is recoverable by anyone who quantizes the model or prompts it cleverly. On the safety side, the WMDP numbers are the alarming ones: hazardous knowledge that a safety team believed it had stripped out can resurface at high rates, especially after the quantization that many open-weight deployments routinely perform.

The deeper point is methodological. A defense that is only ever measured against the easiest possible test will look far stronger than it is. The 2026 reasoning-model work sharpens this: as models are trained to “think” in long chains, those chains create new extraction surface — the very reasoning that improves capability also gives an attacker more places to coax suppressed content back out. Unlearning evaluated with benign single-turn questions is, in effect, security theater.

Defenses

  1. Do not treat unlearning as erasure. For genuine deletion or compliance, the only robust guarantee remains not training on the data, or retraining without it. Unlearning is a mitigation, not a delete button.
  2. Evaluate adversarially, not benignly. Test unlearned models with paraphrase, multi-turn, and reasoning-style probes — and with evolutionary attackers like REBEL — not just direct questions. Report the attack success rate of recovery, not only benign forget-loss.
  3. Include quantization in the threat model. Measure forgotten-knowledge recovery at the precisions you actually ship (4-bit, 8-bit), since 4-bit can undo unlearning while 8-bit often does not.
  4. Prefer robustness-aware unlearning. Methods that flatten the loss landscape around the unlearned point (sharpness-aware minimization and successors) are reported to resist relearning and recovery better than point-minimization methods.
  5. Layer with access control. Where hazardous or private content must not leak, combine unlearning with output filtering, retrieval restrictions, and least-privilege access rather than relying on the model having truly forgotten.
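
Defense 2 can be sketched as a small reporting harness. This is a minimal illustration under stated assumptions, not any paper's evaluation protocol: `recovery_rates` is a hypothetical helper, the model is any callable from prompt string to response string, the probe prompts and the target "Avalon" are made up, and a leak is detected by a crude case-insensitive substring match. `stub_model` stands in for a real unlearned model so the example runs on its own.

```python
from typing import Callable, Dict, List


def recovery_rates(
    model: Callable[[str], str],
    probes: Dict[str, List[str]],
    target: str,
) -> Dict[str, float]:
    """Per-channel fraction of prompts whose response contains the target.

    Assumption: substring matching is a crude leak detector; a real
    evaluation would need a more careful judge.
    """
    rates = {}
    for channel, prompts in probes.items():
        hits = sum(target.lower() in model(p).lower() for p in prompts)
        rates[channel] = hits / len(prompts) if prompts else 0.0
    return rates


def stub_model(prompt: str) -> str:
    """Hypothetical unlearned model: refuses direct questions but leaks
    when asked to reason step by step."""
    if "step by step" in prompt.lower():
        return "Working through it... the answer is Avalon."
    return "I don't know."


probes = {
    "direct": ["Where was X born?"],
    "paraphrase": ["X's birthplace is which town?"],
    "reasoning": [
        "Think step by step: where was X born?",
        "Let's reason step by step about X's hometown.",
    ],
}

rates = recovery_rates(stub_model, probes, target="Avalon")
worst_case = max(rates.values())
for channel, rate in rates.items():
    print(f"{channel:<12} recovery rate: {rate:.0%}")
print(f"worst-case recovery rate: {worst_case:.0%}")
```

Reporting the worst case across channels, rather than the direct-question rate alone, is the design choice that matters here: the stub scores as fully unlearned on direct questions while leaking on every reasoning probe. The same harness could be rerun on each shipped precision to cover defense 3.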

Status

Work                          Reference                     Date     Reported finding
----------------------------  ----------------------------  -------  ----------------------------------------------------
Quantization recovery         arXiv:2410.16454 (ICLR 2025)  2024-10  4-bit quantization restores ~83% of forgotten knowledge
Reasoning-elicitation attack  arXiv:2506.17279              2025-06  62.5% of crafted prompts recover target facts
SoK: unlearning for LLMs      arXiv:2506.09227              2025-06  Systematizes recovery as a structural weakness
REBEL                         arXiv:2602.06248              2026-02  Evolutionary recovery up to 60% (TOFU) / 93% (WMDP)
LRM unlearning vulnerability  arXiv:2604.04255              2026-04  Reasoning traces are a new unlearning attack surface

The durable, transferable point is not a single flaw in a single method: it is that the field’s measurement has consistently overstated forgetting. Across quantization, adversarial prompting, and reasoning probes — and now reasoning models specifically — knowledge that benign benchmarks call “unlearned” keeps coming back. Until evaluation routinely includes these recovery channels, an unlearning claim should be read as “harder to retrieve,” not “removed.”

Sources