Sequential data poisoning: splitting a backdoor across post-training stages
A June 3, 2026 paper shows that poison spread across SFT and a later preference or reward-model stage, small or negligible at each stage alone, can combine into a working backdoor. Per-stage audits create a “single-attacker illusion”.
What is this?
On June 3, 2026, researchers from the University of Waterloo, the University of Ottawa, the University of Chicago and the Vector Institute posted Sequential Data Poisoning in LLM Post-Training (arXiv:2606.04929). The paper studies a threat model that most poisoning research has ignored: modern alignment is not one training job but a pipeline of stages — supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) via PPO, or direct preference optimization (DPO). Each stage draws data from a different, potentially untrusted source.
The central result is what the authors call the “single-attacker illusion”: a poison contribution that looks harmless when each stage is audited in isolation can combine, across stages, into a reliable backdoor. Defenders who screen the SFT set and the preference set separately — the normal practice — can each conclude “this looks clean” and still ship a compromised model.
How it works
The setup distributes a backdoor across the post-training pipeline rather than concentrating it in one dataset. Two (or more) adversaries each contribute poison to a different stage. The paper reports two regimes:
| Pipeline | Stage 1 (SFT) | Stage 2 (RLHF/DPO) | Single-stage effect | Combined effect |
|---|---|---|---|---|
| SFT → DPO | Poison SFT data | Poison preference data | Each raises ASR a bit on its own | Additive; split budget beats concentrating |
| SFT → PPO | Poison SFT data | Poison reward-model (RM) data | Near-zero ASR alone for either stage | Backdoor surfaces only in combination |
In the SFT → DPO case the contributions are roughly additive: each stage’s poison raises the attack success rate (ASR) somewhat, and the paper finds that splitting a fixed poison budget across both stages outperforms spending it all in either stage alone. The SFT → PPO case is sharper and more concerning: neither the SFT poison nor the reward-model poison produces a meaningful ASR by itself, yet their combination surfaces the backdoor. The malicious behavior is essentially invisible at the level of any one dataset and only emerges from the interaction between stages.
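One way to make the difference between the two regimes concrete is a 2x2 comparison: evaluate four models (clean, SFT poison only, stage-2 poison only, both) on the same triggered prompts and compare the combined ASR with what adding up the single-stage gains would predict. The following is a minimal sketch under stated assumptions, not the paper's evaluation code. The function names, the `interaction` definition, and the toy outcome lists are all hypothetical and are not results reported by the authors.

```python
from typing import Dict, List, Sequence


def attack_success_rate(triggered_outcomes: Sequence[bool]) -> float:
    """Fraction of triggered prompts on which the backdoor behavior appeared."""
    if not triggered_outcomes:
        raise ValueError("need at least one outcome")
    return sum(triggered_outcomes) / len(triggered_outcomes)


def interaction(asr_clean: float, asr_sft: float,
                asr_stage2: float, asr_both: float) -> float:
    """Combined ASR minus the ASR predicted by adding the single-stage gains.

    Near zero suggests roughly additive contributions; a clearly positive
    value suggests the backdoor emerges from the combination of stages.
    """
    additive_prediction = (
        asr_clean + (asr_sft - asr_clean) + (asr_stage2 - asr_clean)
    )
    return asr_both - additive_prediction


# Toy placeholder outcomes (assumed for illustration, NOT from the paper).
# Each entry is True if a triggered prompt elicited the backdoor behavior.
outcomes: Dict[str, List[bool]] = {
    "clean": [False] * 20,
    "sft_only": [False] * 19 + [True],
    "stage2_only": [False] * 19 + [True],
    "both": [True] * 14 + [False] * 6,
}

rates = {name: attack_success_rate(o) for name, o in outcomes.items()}
excess = interaction(
    rates["clean"], rates["sft_only"], rates["stage2_only"], rates["both"]
)
print(rates)
print(f"excess over additive prediction: {excess:.2f}")
```

With real evaluation data, a per-stage audit corresponds to looking only at the `sft_only` and `stage2_only` rates; the `both` cell is the one that stage-by-stage review never measures.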
No trigger strings or poisoning recipes are reproduced here; the canonical reference is the paper itself. The takeaway is structural: the security boundary you care about is the whole post-training pipeline, not any single dataset within it.
Why it matters
The result reframes a defensive assumption. Prior poisoning work — including Anthropic’s 2025 finding that a small, near-constant number of samples can poison models of any size and the follow-up near-constant poison-count analysis — already showed that the absolute poison budget needed is alarmingly low. Sequential poisoning adds a second axis: the budget can be fragmented across procurement boundaries so no single audit ever sees enough to raise an alarm.
This maps cleanly onto how alignment data is actually sourced in 2026. SFT instruction data, human preference labels, and reward-model training data frequently come from different vendors, crowdsourcing platforms, scraped corpora, or synthetic-generation pipelines — different teams, different trust assumptions, different review at each step. A supplier who can influence only the preference set, and another who can influence only the SFT mix, individually pass review. The composition is where the risk lives, and composition is exactly what stage-by-stage data governance does not test.
Defenses
There is no patch here — this is a class of risk in how alignment pipelines are assembled. The mitigations are about provenance and end-to-end evaluation.
- Evaluate the pipeline end-to-end, not stage-by-stage. The core lesson of the paper is that per-stage dataset audits miss interaction effects. Run backdoor and trigger evaluations on the final post-trained model against a held-out, independently constructed probe set, and treat a clean SFT audit and a clean preference audit as necessary but not sufficient.
- Track data provenance across every stage. Maintain a bill of materials for SFT data, preference data, and reward-model data: source, vendor, collection method, and review status. Sequential poisoning exploits the fact that these are usually governed independently. Cross-referencing suppliers across stages lets you flag when the same upstream actor touches more than one stage.
- Diversify and isolate suppliers per stage. If one vendor provides both your SFT corpus and your preference labels, a single compromised supplier holds both halves of the attack. Separating suppliers, and limiting any one source’s share within a stage, raises the bar for cross-stage collusion.
- Hold out trusted, in-house evaluation data. Keep a poison-free, internally curated benchmark of trigger-style and behavioral probes that never enters any training set. Re-run it after each major post-training change. The PPO result shows backdoors that only appear post-composition, so the gate must be after the last stage.
- Prefer auditable preference and reward pipelines. RLHF reward-model data and DPO preference pairs are harder to inspect than SFT examples, yet the paper shows they are load-bearing for the attack. Sample, log, and spot-check preference and RM data with the same rigor applied to instruction data.
- Red-team the composition explicitly. Add “split-budget” poisoning to your internal red-team playbook: assume an adversary can touch one stage only, and test whether two such limited adversaries combine into something your per-stage screening would clear.
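The provenance and supplier-isolation items above can be sketched as a small bill-of-materials check. This is an illustrative sketch only: the `DataSource` record, its field names, the stage labels, the vendor names, and the 0.9 share cap are all assumptions made for the example, not a schema or threshold from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class DataSource:
    """One hypothetical bill-of-materials entry for a post-training dataset."""
    stage: str         # e.g. "sft", "preference", "reward_model"
    supplier: str      # upstream vendor or actor
    num_examples: int
    reviewed: bool


def suppliers_spanning_stages(bom: Sequence[DataSource]) -> Dict[str, List[str]]:
    """Map each supplier that appears in more than one stage to its stages."""
    stages = defaultdict(set)
    for src in bom:
        stages[src.supplier].add(src.stage)
    return {s: sorted(st) for s, st in stages.items() if len(st) > 1}


def over_share_cap(bom: Sequence[DataSource],
                   max_share: float) -> List[Tuple[str, str, float]]:
    """List (stage, supplier, share) where a supplier exceeds max_share of a stage."""
    totals = defaultdict(int)
    per_supplier = defaultdict(int)
    for src in bom:
        totals[src.stage] += src.num_examples
        per_supplier[(src.stage, src.supplier)] += src.num_examples
    flagged = []
    for (stage, supplier), n in sorted(per_supplier.items()):
        if totals[stage] == 0:
            continue  # empty stage: no meaningful share to compute
        share = n / totals[stage]
        if share > max_share:
            flagged.append((stage, supplier, share))
    return flagged


# Hypothetical entries for illustration only.
bom = [
    DataSource("sft", "vendor_a", 8000, True),
    DataSource("sft", "in_house", 2000, True),
    DataSource("preference", "vendor_a", 5000, False),
    DataSource("reward_model", "vendor_b", 5000, True),
]

print(suppliers_spanning_stages(bom))
print(over_share_cap(bom, max_share=0.9))
```

A check like this only flags a single actor touching several stages or dominating one of them. It does not detect two independent adversaries in different stages, which is why it complements end-to-end evaluation of the final model and does not replace it.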
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Sequential Data Poisoning in LLM Post-Training | arXiv:2606.04929 | 2026-06-03 | Introduces the “single-attacker illusion” threat model |
| SFT → DPO regime | Same paper | 2026-06-03 | Additive; splitting a fixed budget beats concentrating it |
| SFT → PPO regime | Same paper | 2026-06-03 | Neither stage alone is significant; backdoor surfaces only in combination |
| Near-constant poison count | arXiv:2510.07192 | 2025-10 | Context: absolute poison budget needed is low |
| Small-sample poisoning | Anthropic Research | 2025-10 | Context: a small number of samples can poison models of any size |
The framing to take away is not “another data-poisoning paper”. It is that the unit of audit for a poisoned model has to be the whole post-training pipeline, because an adversary can keep each individual contribution below the threshold any single review would catch — and let the stages do the rest.