When the AI reviewer can't read the figure: cross-modal attacks on peer review
A June 2026 arXiv paper (PaperGuard) shows AI peer reviewers are vulnerable not only through text but through figures — black-box prompt injection and white-box image perturbations both flip verdicts.
What is this?
On June 2026, researchers published Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review on arXiv (2606.12716, accepted to ICML 2026). The paper studies a question that earlier work on AI peer review left open: if reviewers are increasingly multimodal large language models (MLLMs) that look at a paper’s figures as well as its text, can an attacker manipulate the verdict through the images, not just the prose?
The answer is yes. The authors introduce PaperGuard, described as the first benchmark built specifically to evaluate and defend AI-assisted peer review against cross-modal attacks. Their headline finding, across state-of-the-art models, is that AI reviewers are “pervasively vulnerable” — and that existing robustness studies miss most of the surface because they are “overwhelmingly text-only.”
This sits in a now-established lineage. At NeurIPS 2025, “Give a Positive Review Only” documented in-paper prompt injection against AI reviewers, and our own coverage of font-mapping injection showed hidden text payloads flipping reviews from reject to accept. The new result extends that threat from the text channel into the figure channel.
How it works
PaperGuard is built on three pillars, per the abstract.
First, a multimodal peer-review dataset: real papers from AI/ML and broader scientific domains are parsed to extract their key figures — method diagrams, results plots — so the benchmark reflects how an MLLM reviewer actually consumes a submission.
Second, a unified attack suite that combines two threat models on two modalities:
- Black-box prompt injection — adversarial instructions planted in the submission (the same class as the text-side “give a positive review” attacks), now also carried inside or alongside figures.
- White-box gradient attacks — optimized perturbations using GCG on the text channel and PGD on the image channel. PGD (projected gradient descent) produces small pixel-level changes to a figure that are visually unremarkable to a human but steer the model’s reading of it.
The cross-modal angle is the point: a figure is not just decoration to an MLLM reviewer, it is evidence the model reasons over. A perturbation a human editor would never notice can change what the model “sees” in a results plot. No payload is reproduced here, and none is needed to understand the lesson — every modality the reviewer ingests is an untrusted input channel.
Third, the authors propose a lightweight defense (see below), motivated by the fact that academic papers are long-context documents where a single hostile instruction is easy to hide.
Why it matters
Peer review is a YMYL-adjacent trust process: funding, careers, and the scientific record depend on it. Venues are already wrestling with AI in the loop — ICML and NeurIPS have issued policies on LLM use in reviewing precisely because the integrity stakes are high.
Two things make the multimodal result worse than the text-only case. First, defenders’ blind spot: detection tooling and venue policies have focused on text payloads, so an image-channel attack walks past controls that were never designed to inspect figures. Second, plausible deniability: a PGD perturbation leaves a figure looking normal, so unlike a clumsy “ignore previous instructions” string, there is little to flag in manual spot-checks.
The broader 2026 picture is consistent. A companion June 2026 paper, Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community, argues that as reviewing leans on AI, the incentive to game it grows. Cross-modal attacks are the technical expression of that incentive.
Defenses
The mandatory takeaways, several drawn from the paper’s own proposal:
- Treat figures as untrusted input. Any pipeline that feeds images to an MLLM reviewer must assume those images can be adversarial, exactly as it assumes the text can be.
- Localize, don’t just classify. PaperGuard’s defense uses chunk-based embedding search to find and neutralize harmful instructions inside a long document rather than scoring the whole paper at once — a more tractable approach for paper-length context.
- Keep a human in the decision. AI-assisted review should inform, not issue, accept/reject decisions; a human reviewer who never relies on the model’s verdict alone is the backstop against both text and image manipulation.
- Sanitize and re-render figures. Re-encoding or down-sampling submitted images before they reach the model can disrupt pixel-precise PGD perturbations, at some cost to fidelity.
- Policy + detection together. Venue rules against undisclosed AI use only bite if paired with detection that actually covers every modality the reviewer consumes.
Status
| Item | Value |
|---|---|
| Paper | arXiv:2606.12716, June 2026 (ICML 2026) |
| Attack channels | Text (prompt injection, GCG) + images (PGD perturbation) |
| Defense proposed | Chunk-based embedding search to localize hostile instructions |
| Prior art | NeurIPS 2025 “Give a Positive Review Only”; font-mapping injection (May 2026) |
| Disposition | Research benchmark; no operational exploit released here |