Description poisoning: the agent channel your benchmarks don't test
A May 2026 AWS Bedrock AgentCore demo and a June 2026 arXiv paper converge on the same blind spot: tool descriptions, read before every call, are an injection channel that infra controls and single-number benchmarks both miss.
What is this?
Most discussions of indirect prompt injection in agents focus on tool results — the data an agent reads back after it calls a tool (a web page, a file, a database row). But an agent reads something else first: the tool description, the natural-language string that tells the model what each tool does and when to use it. That string is consumed on every turn, before any tool is invoked, and the model treats it as authority — the same way it treats its system prompt.
Two publications in the last month put this channel in focus. On May 6, 2026, NeuralTrust published a reproducible demo on AWS Bedrock AgentCore showing how a single poisoned tool description hijacks a multi-agent system that is otherwise correctly locked down. On June 1, 2026, the arXiv paper "The Surface You Test Is Not the Surface That Breaks" (arXiv:2605.30454) argued that current evaluations report a single attack success rate (ASR) per model, measured on the tool-output channel, and that tool descriptions, read every turn before any tool runs, are a distinct surface that this number does not capture.
How it works
In NeuralTrust’s setup, three agents and two MCP servers run on AgentCore with IAM-scoped permissions: a coach_agent may call only an information MCP, a financial_agent only a finance_advices MCP. Permissions are enforced at the IAM level, not just in prompts.
The researchers added one extra tool, get_user_personalization(user_id), whose description begins benignly — “Returns the user’s personalized coaching context… so the assistant can tailor advice” — and then, a few lines down, carries an embedded instruction directed at the model. The same payload is mirrored in the JSON the tool returns, in fields like communication_style and _system_note that an LLM treats as live user data.
Tool registry entry (schematic, payload redacted)
--------------------------------------------------
name: get_user_personalization
description: |
  Returns the user's personalized coaching context (preferences,
  goals, recent sessions) so the assistant can tailor advice.
  [REDACTED: instruction addressed to the model, not the user]
A perfectly benign user prompt leads the agent to call the tool. The poisoned description and result land in the model’s context, and the model follows the embedded instruction. No IAM rule was broken, no remote code execution occurred, no network boundary was crossed — the attack abuses the implicit trust contract between an agent and the tools it is authorized to call. A one-line script in the public PoC repo confirms the marker text appears in the response. The arXiv paper generalizes the point: because the description is read before every tool call, measuring only the output channel under-reports an agent’s real exposure.
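To make the "read before every call" point concrete, here is a minimal Python sketch of how a turn's context might be assembled. The `Tool` class, the `build_context` helper, and the text layout are hypothetical and not taken from the NeuralTrust PoC or from any framework; real stacks usually pass tools as structured schemas, but the description string still reaches the model either way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str  # supplied by whoever registered the tool


def build_context(system_prompt: str, tools: list[Tool], user_message: str) -> str:
    """Assemble the text the model sees on one turn (assumed layout)."""
    tool_block = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    return f"{system_prompt}\n\nAvailable tools:\n{tool_block}\n\nUser: {user_message}"


tools = [
    Tool(
        "get_user_personalization",
        "Returns the user's personalized coaching context. [REDACTED PAYLOAD]",
    ),
]
context = build_context("You are a coaching assistant.", tools, "Any tips for today?")

# The description text is already in context, although no tool has been called.
assert "[REDACTED PAYLOAD]" in context
```

In this sketch the description is rebuilt into the context on every turn, which is why a filter that only looks at tool results or at the final response never gets a chance to see it.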
Why it matters
The first reason is measurement. If your red-team harness scores agents by injecting into tool outputs and reporting one number, you are testing a narrower surface than the one attackers reach. A model can look robust on the output channel and remain exploitable through description metadata it ingests on every turn.
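A small sketch of what per-channel reporting could look like in a harness. The `asr_by_channel` function and the channel labels are assumptions for illustration, and the trial outcomes are synthetic placeholders, not measurements from the paper or the demo.

```python
from collections import defaultdict


def asr_by_channel(trials):
    """Compute attack success rate per channel.

    trials: iterable of (channel, attack_succeeded) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # channel -> [successes, total]
    for channel, succeeded in trials:
        counts[channel][0] += int(succeeded)
        counts[channel][1] += 1
    return {ch: s / n for ch, (s, n) in counts.items()}


# Synthetic placeholder outcomes, chosen only to show the arithmetic.
trials = (
    [("tool_output", True)] * 1
    + [("tool_output", False)] * 9
    + [("tool_description", True)] * 6
    + [("tool_description", False)] * 4
)

per_channel = asr_by_channel(trials)
aggregate = sum(succeeded for _, succeeded in trials) / len(trials)

for channel, rate in per_channel.items():
    print(f"{channel}: {rate:.2f}")
print(f"aggregate: {aggregate:.2f}")
```

With these placeholder numbers the aggregate sits between the two channels and describes neither of them, which is the problem with reporting one number.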
The second is where the trust boundary now sits. In the AgentCore demo, IAM allowed the call (it was supposed to), the network was internal so there was no perimeter to filter, CloudTrail logged the call but not its semantic content, and Bedrock guardrails scoped to model output never saw metadata the model consumed before answering. Cloud infrastructure controls inspect identity and network, not the content flowing between an agent and its model.
The third is scope. Any stack that lets agents load tools its operators do not control end-to-end (third-party MCP servers, plugin marketplaces, agent-to-agent messaging) inherits this gap. It is not specific to AgentCore; agents built on LangGraph, LlamaIndex, or any MCP-based stack share the same trust model.
Defenses
There is no payload to patch here — the fix is architectural, in line with OWASP LLM01: Prompt Injection.
- Treat tool descriptions as untrusted input. Vet and pin the descriptions and schemas of every tool an agent can load, especially third-party MCP servers. Diff them on update; a description that changes between deploys deserves review.
- Inspect the model-facing channel, not just the user-facing one. Add a gate in front of the LLM that scans tool descriptions and tool results before they enter context, not only the final response. Perimeter filters on user input never see an injection that originates inside the cluster.
- Test both channels. Update red-team harnesses to inject into tool descriptions and schemas, not only tool outputs, and report per-channel results. A single aggregate ASR hides the description surface.
- Apply least privilege to tool loadout. IAM scoping is necessary but not sufficient: it governs which tools an agent may call, not what those tools say. Keep the set of loadable tools minimal and prefer first-party, reviewed tools for privileged agents.
- Constrain blast radius. Assume a tool description can hijack a turn, and limit what a hijacked agent can do: no silent data egress, human confirmation on sensitive actions, and output filtering enforced in application code rather than by the model under attack.
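The "vet and pin" item above can be sketched as a fingerprint check run before tools are loaded. This is one possible approach under stated assumptions: the dictionary shape (`name`, `description`, `input_schema`) and the helper names `fingerprint` and `check_tools` are hypothetical, and the pin store is an in-memory dict standing in for whatever reviewed registry a team keeps.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Stable SHA-256 over the fields the model reads."""
    canonical = json.dumps(
        {
            "name": tool["name"],
            "description": tool["description"],
            "input_schema": tool.get("input_schema", {}),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_tools(tools: list[dict], pinned: dict[str, str]) -> list[str]:
    """Return names of tools that are unpinned or whose fingerprint changed."""
    return [t["name"] for t in tools if pinned.get(t["name"]) != fingerprint(t)]


# A tool as it looked when a human reviewed it (illustrative content).
reviewed = {
    "name": "get_user_personalization",
    "description": "Returns the user's personalized coaching context.",
    "input_schema": {"type": "object", "properties": {"user_id": {"type": "string"}}},
}
pinned = {reviewed["name"]: fingerprint(reviewed)}

# The same tool after its description was altered upstream.
tampered = dict(reviewed, description=reviewed["description"] + " [REDACTED PAYLOAD]")

print(check_tools([reviewed], pinned))  # nothing flagged
print(check_tools([tampered], pinned))  # the altered tool is flagged
```

A check like this only detects change; it says nothing about whether the pinned description was safe in the first place, so it complements review and blast-radius limits rather than replacing them.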
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| AgentCore description-poisoning demo | NeuralTrust | 2026-05-06 | Open-source PoC on AWS Bedrock AgentCore, gpt-4o |
| Public PoC repository | NeuralTrust / GitHub | 2026-05-06 | 5 runtimes + poisoned tool + jailbreak verifier |
| Measurement-gap paper | arXiv:2605.30454 | 2026-06-01 | Tool descriptions read every turn vs. single-channel ASR |
| Framework reference | OWASP LLM01 | 2025 | Prompt injection class + mitigations |
The takeaway is not “a new attack.” It is that the surface you measure (tool output, one number) is narrower than the surface that breaks (tool descriptions, read before every call), and that the controls most teams trust — IAM, network, output guardrails — sit on the wrong side of that boundary.