MULTIMODAL MEDIUM

CrossMPI: image-only prompt injection steers what VLMs read and see

A May 15, 2026 Xidian University arXiv paper introduces CrossMPI: imperceptible image perturbations that change how vision-language models interpret both the image and the user's text prompt, with 66% average success across five LVLMs.

2026-05-28 // 6 min // affects: minigpt-4, blip-2, instructblip, bliva, qwen2.5-vl

What is this?

On May 15, 2026, Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang and JianFeng Ma from Xidian University posted “A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation” on arXiv (2605.16090, cs.CR/cs.CV). The paper introduces CrossMPI, a technique that uses nearly imperceptible image perturbations to control how a vision-language model interprets both the image and the accompanying text instruction, without touching the user’s prompt at all.

The framing matters. Earlier multimodal prompt injection attacks either embedded visible text in images, or biased the model’s reading of the image only. CrossMPI is cross-modal: a pixel-level perturbation reshapes the model’s joint interpretation of image and text. In one example from the paper, an attacker-modified photo of an airplane causes the model to answer the user’s question “Does this airplane belong to Air Canada?” with “a mobile phone.” The image still looks like an airplane to a human; the model is steered into a different task entirely.

CSO Online’s May 18, 2026 coverage describes the paper’s enterprise relevance: copilots, document-processing agents and vision-enabled workflows increasingly fuse images and text, and the textual sanitisation defences shipped today do not cover this attack surface.

How it works

A large vision-language model (LVLM) encodes an image into a sequence of visual tokens via a vision encoder, mixes those tokens with the user’s text tokens, and runs the joint sequence through a transformer stack. Most prior adversarial-image work optimises perturbations against the visual embedding space — the output of the vision encoder, typically around 10^5 parameters. CrossMPI argues this is the wrong target.

The authors instead optimise against the model’s hidden state space — the internal representations after visual and textual information have been fused, on the order of 10^7 parameters. The larger parameter space is harder to optimise in, so the paper introduces two constraints.

Fusion-critical layer selection. Not every transformer layer matters equally for cross-modal integration. The paper measures which layers carry the most multimodal information and restricts optimisation to those. Contrary to standard adversarial-attack intuition, the most effective layers are not the final output layers — they sit in the middle of the model, where visual evidence and textual intent first merge.

Distance-decremental perturbation budget assignment. The image is not perturbed uniformly. The paper uses Grad-ECLIP saliency to identify semantically critical regions of the image, then allocates more perturbation budget close to those regions and progressively less as pixel distance grows. The visible result is a perturbation concentrated where the model is “looking”, but bounded so the image stays visually faithful to a human viewer.
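
To make the budget idea concrete, here is a minimal sketch of a per-pixel budget map that shrinks with distance from the salient region. It computes only the budget mask, not a perturbation. The exponential decay schedule, the threshold, the eps_max and eps_min values, and the toy saliency grid (standing in for a Grad-ECLIP map) are all illustrative assumptions; the paper's actual schedule is not reproduced here.

```python
import math


def budget_map(saliency, eps_max=8 / 255, eps_min=1 / 255, decay=0.5, threshold=0.5):
    """Return a per-pixel budget that decays with distance from salient pixels.

    saliency: 2D list of floats in [0, 1] (a stand-in for a saliency map).
    Every parameter and the decay rule itself are assumptions for illustration.
    """
    height, width = len(saliency), len(saliency[0])
    critical = [
        (r, c)
        for r in range(height)
        for c in range(width)
        if saliency[r][c] >= threshold
    ]
    if not critical:
        # No salient region found: use the minimum budget everywhere.
        return [[eps_min] * width for _ in range(height)]

    budgets = []
    for r in range(height):
        row = []
        for c in range(width):
            # Euclidean distance to the nearest critical pixel
            # (brute force, which is fine for a toy grid).
            d = min(math.hypot(r - cr, c - cc) for cr, cc in critical)
            row.append(max(eps_min, eps_max * decay ** d))
        budgets.append(row)
    return budgets


toy_saliency = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.3, 0.2, 0.0],
    [0.0, 0.3, 0.9, 0.3, 0.0],
    [0.0, 0.2, 0.3, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
budgets = budget_map(toy_saliency)
for row in budgets:
    # Print budgets in units of 1/255 so they read like pixel intensity steps.
    print(" ".join(f"{b * 255:4.1f}" for b in row))
```

In this toy grid the single salient pixel in the centre gets the full budget and the corners get close to the floor. For defenders, this suggests that a CrossMPI-style perturbation would tend to cluster around the image's main subject rather than spread evenly across the frame.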

Component                       Purpose                                Effect on LVLM
------------------------------  -------------------------------------  -----------------------------------
Hidden-state-space optimisation Target fused multimodal representation Cross-modal control (image+text)
                                rather than vision encoder output
Fusion-critical layer selection Restrict gradient flow to middle      Avoids wasted optimisation in
                                layers that fuse modalities            non-fusion layers
Distance-decremental budget     Concentrate noise near salient pixels  Imperceptible to human readers;
                                via Grad-ECLIP saliency map            preserves visual semantics
Cross-modal perturbation        Joint output / fusion / frequency-     Black-box transferability across
optimisation                    domain objective                       LVLM architectures

The paper benchmarks against five open-source LVLMs — MiniGPT-4, BLIP-2, InstructBLIP, BLIVA and Qwen2.5-VL — and reports an average attack success rate of 66.36%, about 41 percentage points above prior baselines. The perturbations transfer in black-box settings, meaning an attacker who does not have weights for the target system can craft them against a substitute model.

No payload is reproduced here. The arXiv preprint and its HTML rendering are the canonical references for researchers who want to reproduce the result.

Why it matters

CrossMPI is a research demonstration on open-source LVLMs, not an exploit observed against a production system. Two properties still make it worth attention.

First, the attack surface is invisible to text-only defences. Most enterprise LLM guardrails today operate on the textual prompt — input filters, instruction-hierarchy checks, output validators. None of them inspect pixels. If your pipeline accepts an image from any untrusted source — a user upload, a webpage screenshot, a document, a screen capture taken by an agent — that image can carry an instruction your text-side filters will never see.

Second, the result transfers. Black-box transferability is the property that separates a curious lab finding from a deployable attack class. CrossMPI does not require knowing the target model’s exact weights; perturbations crafted against one open model retain useful success on others. The authors explicitly note that the technique could “mislead VLM-based web agents” and “disrupt real-world object detectors.”

The structural lesson is the same one AudioHijack pushed for the audio modality: every new modality a model accepts is a new channel for prompt injection, and text-only mitigations will not cover any of them.

Defenses

No defence retires this class of attack as of late May 2026. The paper itself evaluates several and reports their limits. The shortest defensible list, drawn from the paper and from standard adversarial-vision practice:

  1. Input transformations as a cheap first line. Random resizing, rotation, and especially JPEG re-encoding disrupt high-frequency adversarial structure. The paper measures all three and finds them helpful but not sufficient — useful only as one layer among several.
  2. Certified or smoothing-based defences. SmoothVLM is the most effective mitigation the authors tested, dropping attack success rate below 5% in several scenarios. Randomised smoothing comes at a latency and accuracy cost; teams running VLMs on high-throughput pipelines should evaluate that trade-off explicitly.
  3. Adversarial training on multimodal perturbations. Training the vision-language stack with samples of this attack class is the standard durable defence direction. CrossMPI provides a reproducible recipe to generate training data for that work.
  4. Treat images from untrusted origins as untrusted instructions. An image uploaded by an end user, scraped from the web, or captured from a screen is content, not a system prompt. Agents should not let the model derive tool-call authority from an image without an independent textual confirmation step.
  5. Restrict the action surface for vision-enabled agents. A VLM-driven agent that cannot send mail, cannot browse to arbitrary URLs and cannot move money on its own cannot be made to do those things from a hijacked image. Apply the Agents Rule of Two: at most two of “untrusted input / sensitive tool / exfiltration channel” at once.
  6. Log the image alongside the action. When a VLM agent takes a sensitive action, retain the input image so post-hoc forensics can check it for a CrossMPI-style perturbation. A retained image can be analysed offline without latency limits, so perturbations that evade real-time defences may still be caught after the fact.
  7. Watch for the cross-modal pattern, not just images. The same property — a continuous, high-dimensional, non-textual input that gets fused with text inside the model — applies to audio, video and sensor inputs. Defences should be designed modality-agnostic.
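
Items 4 to 6 above can be sketched as a small policy gate in front of an agent's tool calls. This is a minimal sketch, not a reference implementation: the ActionRequest fields, the TRUSTED_ORIGINS set, the tool names and the audit-log format are all hypothetical, and the three flags are only one way to read the Rule of Two. The log stores a SHA-256 of the image; a real system would also need to store the image bytes under that hash, which is not shown.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRequest:
    tool: str                      # hypothetical tool name, e.g. "send_mail"
    image_bytes: bytes             # image the VLM saw before proposing the action
    image_origin: str              # e.g. "user_upload", "web", "screen", "internal"
    touches_sensitive_data: bool
    communicates_externally: bool
    confirmed_in_text: bool        # independent textual confirmation was obtained


# Assumption: any origin not listed here counts as untrusted input.
TRUSTED_ORIGINS = {"internal"}


def rule_of_two_ok(req):
    """Allow at most two of: untrusted input, sensitive access, external channel."""
    flags = [
        req.image_origin not in TRUSTED_ORIGINS,
        req.touches_sensitive_data,
        req.communicates_externally,
    ]
    return sum(flags) <= 2


def gate(req, audit_log):
    """Decide whether the action may run, and log the image hash either way."""
    untrusted = req.image_origin not in TRUSTED_ORIGINS
    # An untrusted image never grants tool-call authority on its own.
    allowed = rule_of_two_ok(req) and (not untrusted or req.confirmed_in_text)
    audit_log.append(json.dumps({
        "ts": time.time(),
        "tool": req.tool,
        "origin": req.image_origin,
        "image_sha256": hashlib.sha256(req.image_bytes).hexdigest(),
        "allowed": allowed,
    }))
    return allowed


log = []
hijack = ActionRequest("send_mail", b"\x89PNG-placeholder", "web", True, True, False)
benign = ActionRequest("summarise", b"\x89PNG-placeholder", "user_upload", False, False, True)
print(gate(hijack, log))  # False: all three risk flags are set
print(gate(benign, log))  # True: one flag set, and the request was confirmed in text
```

The gate makes its decision from metadata the model cannot influence (origin, tool capabilities, an out-of-band confirmation), so a hijacked image can change what the model proposes but not what the gate permits.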

Status

Item                     Reference                             Date        Notes
-----------------------  ------------------------------------  ----------  ----------------------------------------
Paper                    arXiv:2605.16090 v1                   2026-05-15  cs.CR / cs.CV
Authors                  Xidian University team                -           Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang, JianFeng Ma
Press coverage           CSO Online                            2026-05-18  Enterprise context, Gartner commentary
Affected open LVLMs      5 tested                              -           MiniGPT-4, BLIP-2, InstructBLIP, BLIVA, Qwen2.5-VL
Reported ASR             66.36% average                        -           +41 pp over prior baselines; black-box transferable
Defences evaluated       Resize, rotate, JPEG, SmoothVLM, DPS  -           SmoothVLM most effective (<5% ASR in some scenarios); none fully eliminate
Real-world exploitation  Not reported                          -           Controlled research setting, open-source models

The text-only era of prompt injection defence is ending. CrossMPI is not the first multimodal injection paper, but it tightens an uncomfortable result: an attacker with no access to your text prompt and no obvious change to the user’s view of an image can still rewrite what your model thinks the user just asked. For teams shipping vision-language features, the question is no longer whether to defend the image channel — it is how many layers of defence are enough.

Sources