AGENTS MEDIUM

Trust No Tool: cognitive poisoning of LLM agents through tool feedback

A May 17, 2026 arXiv paper introduces 'cognitive poisoning' — a malicious tool that wins the agent's trust over many benign-looking turns and only weaponises the final action. The defence target shifts from prompts to trajectory.

2026-05-26 // 7 min affects: llm-agents, tool-using-agents, agentic-workflows, mcp

What is this?

On May 17, 2026, Lecheng Yan and co-authors (Southern University of Science and Technology, Alibaba DAMO Academy, University of Aberdeen) posted Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback (arXiv:2605.17453) under the cs.CR / cs.CL sections. The paper formalises a new agent-security failure mode the authors call cognitive poisoning and ships three artifacts to study it: TRUST-Bench (1,970 hidden-trigger tool-compromise episodes with matched safe controls), an asymmetric evaluation metric named GuardedJoint, and a defence framework called VISTA-Guard.

The contribution is conceptual as much as technical. Most published agent-security benchmarks assume that once a tool has been selected, its outputs are trustworthy. Yan et al. show that this assumption survives prompt-injection literature, OWASP LLM Top 10, and MCP guidance largely intact — and that it is exactly where black-box tool ecosystems break.

How it works

The paper splits an agent run into two phases: an exploratory phase (several tool calls, the agent probes the environment) and a final-action phase (one executable side-effecting call: write file, send money, modify resource).

A cognitive-poisoning tool is one that:

Behaves plausibly during exploration. Every probe returns answers that look helpful and consistent with the task.
Accumulates trust through that benign-looking feedback. No single message is obviously malicious; standard prompt-injection detectors and zero-shot LLM judges flag nothing.
Triggers only when a hidden state condition is met — typically a parameter combination of the final action — and at that point steers the agent into a dangerous bundle (e.g. an rm -rf on a different path than the user requested, a transfer to a substituted account number, an API call against the wrong tenant).

The authors describe the decisive risk as a property of state-action composition:

“what the tool appeared to do over time, what the trajectory revealed about its behavior, and what exact final tool-and-parameter bundle the agent is about to execute.”

This is structurally different from the existing literature. Indirect prompt injection (Greshake et al., 2023) is about a hidden instruction in retrieved content. Tool metadata poisoning is about a malicious tool description. MCP back-end flaws (Akamai, May 12, 2026) are about classical web bugs in the connector. Cognitive poisoning is about the interaction trajectory itself.

To study the regime empirically, the team builds TRUST-Bench by minimally transforming tasks from existing tool-use suites:

# Each TRUST-Bench episode = (task, exploratory trajectory, final action)
# Each malicious variant has a matched safe control that
# keeps the outward tool interface, action semantics, and
# safety constraints identical.
#
# 1,970 hidden-trigger compromise episodes
# 1,970 paired safe controls
# 4 final-action families (file ops, code exec, finance, identity)
# Hidden trigger = state condition over the trajectory

VISTA-Guard, the proposed defence, takes the full trajectory and abstracts it into three structured representations: a trajectory_digest, a world_model_state, and action_param_features. The final-action risk is then scored from that triplet rather than from the prompt text alone. The framework is described as backbone-agnostic — i.e. it does not require a specific underlying LLM. Under GuardedJoint (an asymmetric penalty that punishes safety failures harder than utility failures), VISTA-Guard reports 84.2 in-domain and 56.9 on balanced out-of-distribution evaluation. Prompt-centric heuristics, scalarised features, and zero-shot judges collapse to near zero under the same metric.

Why it matters

Three takeaways generalise beyond the specific defence.

First, agent security cannot be reduced to prompt filtering. The single-shot view (“did the input contain a malicious instruction?”) misses the entire class of attacks that build over multiple turns. Today’s production guardrails — Lakera Guard, Microsoft Prompt Shields, NeMo Guardrails, LLM-Guard — are mostly prompt- or output-centric; the paper’s experiments suggest they will not see cognitive-poisoning trajectories coming.

Second, the tool ecosystem is the new attack surface. MCP, OpenAI tool-calling, Anthropic tools, Claude Skills, custom agent frameworks — all of them broker calls to third-party tools whose behaviour the host system does not control. Akamai’s May 12, 2026 disclosure of CVE-2025-66335 and the broader MCP back-end pattern showed how classical web vulnerabilities arrive at this layer. Trust No Tool shows how attacker-controlled feedback arrives at the same layer, without a CVE-class bug.

Third, the defence target moves from text to state. If the paper’s framing holds, future agent-security work will need a notion of trajectory state and a notion of final-action risk distinct from input moderation. That is closer to the trust models used in operating-system security (capabilities, taint tracking) than to the moderation models of chat safety.

Defenses

The authors’ framework is not a drop-in product, but the design choices map to concrete controls a team can apply today.

Treat tool feedback as untrusted input. Every string returned by a tool — including a tool the model has used many times before — should be sanitised, schema-validated, and stripped of instructions before re-entering the model context. The “lethal trifecta” framing from Simon Willison applies: untrusted content + sensitive data + side-effecting tools is the dangerous combination.
Score the final action, not just the prompt. Before any side-effecting call (file write, money transfer, send email, deploy, delete), evaluate the call against the trajectory that produced it. Anomalous parameter values, unexpected tool combinations, or destinations that diverge from the user’s stated intent are higher-signal than a single moderation pass.
Apply least privilege at the tool layer. OWASP’s Practical Guide for Secure MCP Server Development (2026) and the OWASP LLM Top 10 LLM06 / LLM07 entries converge on this: each tool’s back-end credential should only carry the rights the tool itself requires. Cognitive poisoning that hijacks a final action is bounded by what that action’s account is allowed to do.
Require human confirmation on high-impact final actions. For irreversible or high-cost calls, mandate a structured confirmation step that surfaces the executable parameters to the user, not just a natural-language summary the model generated. The paper’s threat model exactly targets the gap between summary and parameters.
Log full trajectories, not just final outputs. Cognitive poisoning is invisible without the full sequence of tool calls and responses. Production agent platforms need replayable trajectories with input/output pairs, parameters, and timestamps to detect this class post-hoc.
Diversify and rotate tool providers for high-trust actions. Where feasible, route the final, side-effecting step through a tool independently developed and audited from the ones used during exploration. The cognitive-poisoning model assumes the same tool is trusted across phases.

Status

Item	Reference	Date	Notes
Paper submitted	arXiv:2605.17453 v1	2026-05-17	cs.CR / cs.CL, CC BY 4.0
Threat model named	Trust No Tool	2026-05-17	”Cognitive poisoning”
TRUST-Bench released	Paper	2026-05-17	1,970 hidden-trigger episodes + matched safe controls
GuardedJoint metric	Paper	2026-05-17	Asymmetric safety-utility penalty
VISTA-Guard framework	Paper	2026-05-17	84.2 in-domain, 56.9 balanced OOD
Related: MCP back-end pattern	Akamai	2026-05-12	Same attack surface, classical bugs
Related: MindGuard	arXiv:2508.20412	2025	Metadata-poisoning detection (different threat model)

The paper’s framing is the immediately useful piece. Whether VISTA-Guard becomes a practical defence depends on follow-up work that the authors invite — replications under richer trajectory shapes, evaluation on closed-source agents, and integration with existing guardrail stacks. The narrower claim — that the agent-security frontier is moving from prompt text to interaction trajectory — is the one to internalise now.