RESEARCH MEDIUM NEW

Beyond shallow safety: mid-sequence injection still flips aligned LLMs

A June 3, 2026 arXiv paper shows safety alignment can be redirected not just at the first tokens but at any generation step — and a model's hidden-state refusal directions don't predict its robustness.

2026-06-08 // 6 min affects: safety-aligned-llms, open-weight-llms

What is this?

On June 3, 2026, Kyungmin Park and Taesup Kim posted Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories (arXiv:2606.04778, cs.AI/cs.CL/cs.LG). The paper takes a result that has shaped LLM-safety thinking since 2024 — shallow safety — and shows it is narrower than the real problem.

“Shallow safety,” named by Qi et al. in Safety Alignment Should Be Made More Than a Few Tokens Deep (arXiv:2406.05946), is the observation that an aligned model’s refusal behaviour is concentrated in the first few output tokens. Get the model past that opening — for example with an assistant prefill such as “Sure, here is how to…” (see our note on sockpuppeting) — and it tends to continue compliantly.

The new paper’s claim is that the first-token weakness is just a special case. Short token injections at any point in the generation, not only the start, can substantially change the model’s subsequent safety behaviour.

How it works

The threat model is generation-stream control, not a clever prompt. It applies wherever an attacker — or a downstream component — can insert tokens into the model’s own output as it is being produced: open-weight and self-hosted deployments, APIs that accept assistant prefills, and pipelines that splice intermediate text back into the context.

# Conceptual sketch — no working payload.
# Tokens marked [INJECT] are short attacker-controlled spans
# slipped into the assistant's own output stream.

t0  user:      <benign-looking request>
t1  assistant: I can't help with that, but [INJECT]
t2  assistant: <continues along the injected direction...>

Two findings make this more than a restatement of the prefill result:

Position is not special. A short injection mid-sequence — well after the “safe” opening tokens — can still redirect the rest of the trajectory. Defences that only harden the first tokens leave the remainder of the generation exposed.
Hidden states don’t tell you you’re safe. The authors report that a model’s alignment with refusal directions in its hidden states does not predict its robustness to such injection. A representation can look “aligned” while the generated text, under perturbation, goes the other way. That is a caution for representation-based defences that read internal activations to decide whether a response is safe — a line of work also explored in Jailbreaking Leaves a Trace (arXiv:2602.11495).

The proposed fix is a training-time one: align the model on generation trajectories built by simulating mid-sequence perturbations, rather than on outputs alone. Training on the perturbed process improves robustness to mid-sequence injection and, the authors report, generalises back to the early-token attacks that shallow-safety work first identified.

Why it matters

Much production safety tooling assumes the dangerous moment is the prompt (input filtering) or the first token (prefill checks, opening-token alignment). This paper argues the vulnerable surface is the entire trajectory. For anyone running open-weight or self-hosted models — where the generation stream is fully controllable — that widens the attack surface considerably, and it weakens the case for trusting a single hidden-state probe as a safety signal.

It also reframes a defensive debate: robust alignment may need to be trained against the process of generation, not just graded on its final answer.

Defenses

Don’t trust position. Validate and constrain the assistant message sequence at the API boundary; reject client-supplied assistant prefills and any path that lets untrusted text re-enter as model output. This is the prefill-jailbreak lesson, generalised to the whole stream.
Treat hidden-state safety probes as one signal, not proof. Per this paper, refusal-direction alignment in activations does not guarantee a safe generation under perturbation. Pair any representation-level detector with output-side checks.
Add output- and trajectory-level guards. Re-screen the completed and streaming output, not only the prompt and the first tokens.
For model trainers: consider trajectory-level alignment — exposing the model to simulated mid-sequence perturbations during safety training — as described in the paper.
Keep the threat model honest. Mid-sequence injection assumes control of the generation stream (open-weight, self-hosted, or prefill-capable APIs). Hosted chat endpoints that forbid assistant prefills raise the bar but do not, on their own, address splice-back pipelines.

Status

Item	Detail
Paper	Inference-Time Vulnerability Beyond Shallow Safety (arXiv:2606.04778)
Posted	June 3, 2026
Type	Research finding + training-time defence (no exploit released)
Builds on	Qi et al., …More Than a Few Tokens Deep (arXiv:2406.05946)
Affected	Safety-aligned LLMs; greatest exposure in open-weight / self-hosted settings