system: OPERATIONAL
← back to all hacks
DEFENSE MEDIUM NEW

TRUSTDESC: deriving tool descriptions from code to defuse tool poisoning

An April 2026 paper attacks tool poisoning at its root: generate a tool's description from its implementation instead of trusting the author-supplied text, neutralising implicit poisoning that detectors miss.

2026-06-12 // 6 min affects: mcp-servers, llm-tool-using-applications, tool-augmented-agents

What is this?

A tool that an LLM can call has two parts: the executable code that does the work, and a natural-language description that tells the model what the tool does and when to use it. The model never sees the code — only the description, loaded into its context. That description is therefore a trust boundary, and it is the one almost nobody verifies.

Tool poisoning attacks (TPAs) abuse exactly this gap. The class was first publicised by Invariant Labs in April 2025 for the Model Context Protocol, and has been studied since at the descriptor level (e.g. arXiv 2512.06556, December 2025). TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation (Ye, Zhang, Jia and Hu, The Pennsylvania State University; arXiv 2604.07536, April 2026) proposes a defence that removes the gap rather than policing it: generate the description from the implementation, so the text an LLM reads is faithful to the code that will actually run.

How it works

The paper separates TPAs into two kinds.

Explicit TPAs embed malicious instructions directly in a tool’s description — for example, hidden text telling the model to read a config file and pass its contents to another tool. These look anomalous and are the target of most existing prompt-injection defences.

Implicit TPAs carry no instructions at all. The attacker writes a benign-looking but exaggerated description — sprinkling words like “best”, “most efficient”, “always prefer this” — to bias the model’s tool-selection step toward an attacker-controlled tool. There is nothing for an instruction detector to flag, yet the model is steered all the same. As the authors put it, deciding whether a description is honest requires reasoning about whether it matches the implementation — hard even for a human without reading the code.

TRUSTDESC’s premise is that the source code is trustworthy ground truth (attackers rarely ship malicious code, because malware detection works; they attack the cheap, unverified descriptive layer instead). It rebuilds each description in three stages:

codebase


[ SliceMin ]   reachability-aware static analysis builds a call graph
   │           per tool; an LLM prunes unreachable/irrelevant logic
   │           down to a minimal, tool-relevant code slice

[ DescGen ]    synthesises a description from the slice; strips
   │           comments and docstrings, truncates misleading
   │           identifiers, mitigates adversarial code artefacts

[ DynVer ]     decomposes the draft into verifiable claims, executes
   │           them, and uses an LLM judge over the logs to drop any
   │           statement that execution does not confirm

trusted description

The point of stripping comments, docstrings and long identifier names in DescGen is that those are themselves attacker-controllable channels; DynVer then keeps only behaviour the code actually demonstrates, which is what blunts hallucinated or exaggerated claims.

Why it matters

Tool descriptions are the connective tissue of agent ecosystems, and MCP has made them a shared, third-party supply chain: you install a server, and its self-authored descriptions enter your model’s context with full trust. Implicit TPAs are the worrying half because the entire industry’s first line of defence — scanning for malicious-looking instructions — does not see them. A description that is grammatical, flattering and free of imperatives sails through.

Re-deriving descriptions from code also has a second effect the paper measures: it improves honest tool use. On 208 tasks across 52 tools from 12 MCP servers, TRUSTDESC-generated descriptions raised task success rate by 4.3% on average over the original descriptions, and when low-quality tool variants (with security checks or features removed) competed for selection, the trusted descriptions cut how often the model picked the weaker tool. Faithful descriptions are both a security control and a quality control.

The honest limits: against adaptive attacks that plant misleading identifiers to bias generation, the attack success rate fluctuated between 44.7% and 67.4% over 15 iterations — no stable upward trend, but far from zero. This is mitigation, not elimination, and it depends on having access to readable source (the prototype covers Python and TypeScript MCP servers).

Defenses

Concrete takeaways, whether or not you adopt this specific framework:

  1. Treat author-supplied tool descriptions as untrusted input. They are loaded into the model context with the same authority as system instructions but with none of the review. Pin and review the exact description text the way you would pin a dependency version.

  2. Defend tool selection, not just tool execution. Implicit TPAs never trigger an unsafe action directly; they bias which tool the planner chooses. Logging and constraining selection — allow-lists, deterministic routing for sensitive capabilities — closes a door that instruction filters leave open.

  3. Compare description against implementation. The root cause is the claim/behaviour gap. Generating or validating descriptions from code (TRUSTDESC’s approach) catches exaggeration that no keyword scanner will. Where source is unavailable, dynamic verification of behavioural claims is the fallback.

  4. Budget for adaptive adversaries. A ~45–67% residual success rate under adaptive attack means description-faithfulness is one layer. Keep least-privilege tool scoping, human-in-the-loop for high-impact calls, and egress controls underneath it.

Status

ItemReferenceDateNotes
TRUSTDESC frameworkarXiv:2604.075362026-04SliceMin / DescGen / DynVer; 52 tools, 12 MCP servers
Descriptor-level semantic attacksarXiv:2512.065562025-12Prior study of MCP descriptor manipulation
Tool Poisoning Attacks (origin)Invariant Labs2025-04First public disclosure of the TPA class for MCP

The framing worth keeping is the simplest one: in a tool-using agent, the description is code’s cover letter, and right now anyone can forge it. Closing the gap between what a tool says it does and what it does turns the description from the weakest link into a verifiable one.

Sources