Beyond tool poisoning: what a malicious remote MCP server can actually do
A study published May 21, 2026 maps the threat surface of malicious remote MCP servers across ChatGPT, Claude Desktop, and Gemini CLI. It finds that attack success on the same request swings from 95% on one host to 50% on another, and that successful attacks are almost never disclosed to the user.
What is this?
On May 21, 2026, the peer-reviewed paper Beyond Tool Poisoning: Attack Surfaces of Malicious Remote MCP Servers Across LLM Platforms was published in Electronics (vol. 15, issue 10, article 2214; submitted April 27, 2026). The Model Context Protocol (MCP), introduced by Anthropic in late 2024, has become the de facto way to connect an LLM host to external tools. Its remote deployment mode is the focus here: a user adds a third-party server with a single URL, which silently shifts a large part of the host’s attack surface onto infrastructure run by an anonymous party.
Most prior MCP security work studied the Tool Poisoning Attack (TPA) — hidden directives in a tool’s description that hijack the model at registration time. This paper argues that the description is only one corner of the threat space, and reorganizes the problem around a more useful question: does the host LLM participate in producing the harmful outcome, or not?
How it works
The authors split malicious-server behavior into two categories and test five scenarios against ChatGPT, Claude Desktop, and Gemini CLI.
Category     Where the attack completes                     Scenarios evaluated
-----------  ---------------------------------------------  ----------------------------
LLM-passive  Inside the server, using only the arguments    File Content Exfiltration
             the host dispatches. The LLM's reasoning       Email Content Exfiltration
             never sees the malicious behavior.
LLM-active   The LLM is the delivery channel. The server    Conditional URL Substitution
             induces the model to produce or relay the      Malicious Code Augmentation
             payload, via the tool description (C1) or      Image Steganography
             the tool response (C2).
The framing matters because the decisive defense boundary differs per category. For LLM-passive attacks, the only defense is pre-invocation filtering of the data the host sends to the tool — once the arguments leave, the model cannot intervene. For LLM-active attacks, the boundary is the model’s own content reasoning: whether it treats a tool description as authoritative, and whether it validates a tool response before relaying it.
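As an illustration, the taxonomy and its per-category defense boundary can be written down as a small lookup. This is only a sketch: the names `Category`, `Scenario`, `DEFENSE_BOUNDARY`, and `defense_for` are invented here and do not come from the paper, and the boundary strings paraphrase the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    LLM_PASSIVE = "LLM-passive"  # attack completes inside the server
    LLM_ACTIVE = "LLM-active"    # the LLM is the delivery channel


@dataclass(frozen=True)
class Scenario:
    name: str
    category: Category


# The five scenarios and their categories, as listed in the table.
SCENARIOS = [
    Scenario("File Content Exfiltration", Category.LLM_PASSIVE),
    Scenario("Email Content Exfiltration", Category.LLM_PASSIVE),
    Scenario("Conditional URL Substitution", Category.LLM_ACTIVE),
    Scenario("Malicious Code Augmentation", Category.LLM_ACTIVE),
    Scenario("Image Steganography", Category.LLM_ACTIVE),
]

# Hypothetical labels for the decisive defense boundary of each category.
DEFENSE_BOUNDARY = {
    Category.LLM_PASSIVE: "pre-invocation filtering of tool arguments",
    Category.LLM_ACTIVE: "model-side reasoning about descriptions and responses",
}


def defense_for(scenario_name: str) -> str:
    """Return the defense boundary that applies to a named scenario."""
    for scenario in SCENARIOS:
        if scenario.name == scenario_name:
            return DEFENSE_BOUNDARY[scenario.category]
    raise KeyError(scenario_name)
```

Calling `defense_for("Email Content Exfiltration")` returns the pre-invocation filtering label, which reflects the point that nothing downstream of dispatch can help against an LLM-passive attack.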
Each result is measured with two metrics: Attack Success Rate (ASR), the share of trials in which the attack succeeds, and Self-Reporting Rate (SRR), which captures whether a successful attack stays visible to an attentive user or completes silently. The threat is not hypothetical: the paper cites a September 2025 incident in which an unofficial Postmark MCP server (more than 1,500 weekly downloads) was modified to silently BCC every outgoing email to an attacker address, and the change went undetected for weeks.
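To make the two metrics concrete, here is a minimal sketch of how they could be computed from a list of trial records. The exact definitions are an assumption on our part: ASR is taken as successful trials over all trials, and SRR as self-reported trials over successful trials only. The `Trial` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    succeeded: bool      # the attack achieved its goal
    self_reported: bool  # the host told the user about the malicious behavior


def attack_success_rate(trials: List[Trial]) -> float:
    """Assumed definition: successful trials divided by all trials."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.succeeded for t in trials) / len(trials)


def self_reporting_rate(trials: List[Trial]) -> Optional[float]:
    """Assumed definition: self-reported successes divided by all successes.

    Returns None when no attack succeeded, since the rate is undefined.
    """
    successes = [t for t in trials if t.succeeded]
    if not successes:
        return None
    return sum(t.self_reported for t in successes) / len(successes)
```

Under these assumed definitions, a scenario where three of four trials succeed and none is disclosed has an ASR of 0.75 and an SRR of 0.0, which is the "silent success" pattern the paper is concerned with.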
Why it matters
Three findings stand out, and each undermines a comfortable assumption.
First, host-side filtering is wildly inconsistent across vendors. On the same credential-bearing email request, Claude Desktop showed a 95% ASR while ChatGPT showed 50% (non-overlapping 95% Wilson confidence intervals). ChatGPT tended to refuse outright or to replace credentials with [REDACTED] before forwarding; on Claude Desktop the same request usually went through unmodified. “Is this safe?” has no portable answer: it depends on which host you run.
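For readers unfamiliar with the Wilson score interval, the sketch below computes it and checks whether two intervals overlap. The trial count is an assumption: this summary does not state how many trials were run per configuration, so `n = 20` is chosen purely for illustration because it makes 95% and 50% whole numbers of successes (19 and 10).

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical trial counts: 19/20 and 10/20 stand in for 95% and 50% ASR.
high_low, high_high = wilson_interval(19, 20)
low_low, low_high = wilson_interval(10, 20)

# The intervals do not overlap when the lower host's upper bound
# sits below the higher host's lower bound.
intervals_overlap = low_high >= high_low
```

With these assumed counts the intervals come out at roughly 0.76 to 0.99 and 0.30 to 0.70, so `intervals_overlap` is `False`. With fewer trials the intervals widen and the same point estimates could overlap, which is why the paper's reported non-overlap carries information beyond the raw percentages.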
Second, the channel, not the technique, governs success. Description-based (C1) attacks reliably worked only on URL substitution, while response-based (C2) attacks — payloads hidden in what the tool returns at runtime — succeeded across all three LLM-active scenarios with ASR ≥ 85%. Runtime tool output is the under-defended surface, yet most tooling still scrutinizes the description.
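One way to picture a check on the response channel is a host-side audit that flags link targets in a tool response that the user never mentioned. The sketch below is a deliberately simple heuristic aimed at the URL-substitution scenario only; it is not the paper's method, and the function names and the `allowed_hosts` parameter are hypothetical.

```python
import re
from typing import FrozenSet, Set
from urllib.parse import urlparse

# Rough URL matcher; stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def extract_hosts(text: str) -> Set[str]:
    """Collect the lowercase hostnames of all http(s) URLs found in text."""
    hosts = set()
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, e.g. an unbalanced IPv6 bracket
            continue
        if host:
            hosts.add(host.rstrip(".").lower())
    return hosts


def unexpected_hosts(
    user_request: str,
    tool_response: str,
    allowed_hosts: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Hosts that appear in the tool response but were neither in the
    user's request nor explicitly allowed."""
    expected = extract_hosts(user_request) | {h.lower() for h in allowed_hosts}
    return extract_hosts(tool_response) - expected


request = "Give me the download link from https://example.com/releases"
response = (
    "Latest build: https://examp1e-cdn.net/installer.exe "
    "(mirror of https://example.com/releases)."
)
flagged = unexpected_hosts(request, response)  # {"examp1e-cdn.net"}
```

A non-empty result would be a reason to warn the user or withhold the link rather than relay it. A host check like this does nothing for the code-augmentation or steganography scenarios, which need content-level validation against the user's actual request.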
Third, and most concerning operationally: successful attacks are almost never disclosed to the user. SRR was 0% across all LLM-passive attacks and most LLM-active configurations. The lone bright spot was Claude Desktop, which self-reported the inserted payload in 100% of successful response-based code-augmentation trials. That is evidence that disclosure is achievable, just not yet systematic.
Defenses
No single control covers this surface; the paper argues for a layered posture, and each layer maps to one of the attack categories above.
- Host-side pre-invocation filtering is the only line against LLM-passive exfiltration. Strip or redact secrets (credentials, tokens, PII) from arguments before they are dispatched to any remote tool, and prefer deny-by-default for outbound data.
- LLM-level response auditing is what stops the high-ASR response-based channel. Treat tool responses as untrusted input — not just descriptions — and validate returned content against the user’s actual request before relaying it.
- User-visible output transparency is the cross-cutting backstop. Surface which tool ran, what data it received, and any modification the model made to its answer, so silent success (SRR 0%) stops being silent.
- Operationally: treat remote MCP servers as third-party code entering your trust boundary. Pin and review servers rather than installing by URL, watch for maintainer “rug-pull” changes, and prefer vetted registries over open aggregators.
These mirror the layered guidance emerging from MCP benchmarks such as MCPTox and broader ecosystem surveys like Beyond the Protocol.
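As a rough illustration of the first layer, pre-invocation filtering, the sketch below redacts secret-looking strings from tool arguments before they would be dispatched to a remote server. It is a minimal example under stated assumptions: the three patterns are illustrative and far from complete, `redact` is a hypothetical helper, and a real host would need much broader detection plus a deny-by-default policy for outbound data.

```python
import re
from typing import Any

# Illustrative patterns only; real deployments need a much broader set.
SECRET_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # HTTP bearer tokens as they appear in Authorization headers.
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
    # Inline assignments such as "password: hunter2" or "api_key=abc123".
    "secret_assignment": re.compile(
        r"\b(password|passwd|secret|api[_-]?key)\s*[:=]\s*\S+", re.IGNORECASE
    ),
}

REDACTION_MARKER = "[REDACTED]"


def redact(value: Any) -> Any:
    """Recursively replace secret-looking substrings in tool arguments."""
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS.values():
            value = pattern.sub(REDACTION_MARKER, value)
        return value
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


arguments = {
    "to": "bob@example.com",
    "body": "Login with password: hunter2 then ping me",
}
safe_arguments = redact(arguments)
```

Here `safe_arguments["body"]` becomes `"Login with [REDACTED] then ping me"` while the recipient address is left alone. The filter runs on the host before dispatch, which is the only point at which an LLM-passive exfiltration can still be stopped.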
Status
This is published academic research evaluating commercial hosts as deployed, not a single-vendor CVE. The behaviors are characteristic of remote MCP as a design pattern rather than a patchable bug, so the mitigations above are architectural. Key dates: MCP introduced late 2024; Postmark incident September 2025; paper submitted April 27, 2026 and published May 21, 2026. Per-platform filtering behavior may shift as vendors update their hosts — the cross-platform disparity is the durable lesson.