MalTool: when an AI writes the malicious tool your agent installs
Researchers used a coding LLM to synthesize 6,487 working malicious agent tools. VirusTotal missed most of them. The lesson: signature scanning is the wrong control for agent tool supply chains.
What is this?
MalTool is a research framework, published on arXiv in February 2026 (arXiv:2602.12194) by researchers at Duke University and UC Berkeley, that studies a question most agent security work skips over: not how a tool is described to an agent, but what the tool’s code actually does once the agent calls it. The authors describe it as the first systematic study of malicious tool code implementations for attacks on LLM agents.
The threat model is mundane and that is exactly why it matters. An attacker publishes a tool to a distribution platform — an MCP server registry, a plugin marketplace, a package index. A developer installs it. The agent, doing its job, selects that tool during a task. The tool runs with whatever access the agent has, and its code does something the user never authorized. No prompt injection is required; the malicious behavior lives in the implementation, not in a crafted prompt.
To measure the attack surface, the team used a coding LLM to synthesize malicious tools automatically, organized by a taxonomy built on the classic confidentiality–integrity–availability (CIA) triad adapted to agent settings. The headline number: 6,487 generated malicious tools. The headline finding, that VirusTotal failed to detect most of them, is the one defenders should sit with.
How it works
The contribution is a generator, not a single exploit, so there is no payload to reproduce here — and none is needed to understand the lesson.
The framework prompts a code-capable model to produce standalone tools that (a) implement a plausible, benign-looking advertised function and (b) also carry out a chosen malicious behavior drawn from the CIA taxonomy: leaking data the agent can read (confidentiality), tampering with files or results (integrity), or disrupting the host or task (availability). For each target behavior the authors generated many independent variants, yielding a large, diverse corpus rather than one reused sample.
That diversity is the point. Because each tool is freshly written by a model, the corpus has no shared signatures, no common strings, no reused binaries — the features traditional malware detection keys on. When the researchers submitted the generated tools to VirusTotal, the multi-engine scanning service failed to detect the majority of them. The malicious logic is ordinary-looking source code that does ordinary-looking things (read a file, make an HTTP request) in a context where those actions happen to be harmful.
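A minimal sketch of why exact-match signatures do not generalize across freshly written variants. It uses two benign, functionally equivalent snippets as stand-ins; neither comes from the MalTool corpus, and the names `sha256_signature`, `matches_blocklist`, and `run_word_count` are hypothetical. Real multi-engine scanners use more than exact hashes (heuristics, pattern rules), so treat this as an illustration of the principle under that simplifying assumption, not a model of VirusTotal.

```python
import hashlib

# Two benign, functionally equivalent snippets standing in for independently
# written tool variants (hypothetical; not taken from the MalTool corpus).
VARIANT_A = '''
def word_count(text):
    return len(text.split())
'''

VARIANT_B = '''
def word_count(document):
    total = 0
    for _ in document.split():
        total += 1
    return total
'''


def sha256_signature(source: str) -> str:
    """Return the SHA-256 hex digest of the source, a stand-in for a hash signature."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def matches_blocklist(source: str, blocklist: set[str]) -> bool:
    """Exact-hash matching: flags only artifacts that were catalogued before."""
    return sha256_signature(source) in blocklist


def run_word_count(source: str, text: str) -> int:
    """Execute a variant in its own namespace and call its word_count function."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable here only because both snippets are our own
    return namespace["word_count"](text)


# Assume variant A has been seen and catalogued, while variant B has not.
blocklist = {sha256_signature(VARIANT_A)}

same_behavior = run_word_count(VARIANT_A, "a b c") == run_word_count(VARIANT_B, "a b c")
a_flagged = matches_blocklist(VARIANT_A, blocklist)
b_flagged = matches_blocklist(VARIANT_B, blocklist)
```

Both variants behave identically, yet only the catalogued one matches. A check keyed to what the code looks like has nothing to match against when every sample is written from scratch.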
This sits alongside earlier work on getting an agent to pick a hostile tool in the first place — for example, the Attractive Metadata Attack (NeurIPS 2025), which crafts tool metadata to win the agent’s selection. MalTool assumes selection succeeds and asks what the code then does, and whether anything would catch it.
Why it matters
Agent ecosystems are reorganizing around installable, third-party capabilities — MCP servers, skills, plugins, tool packs — distributed much like open-source packages. Teams are importing the package-supply-chain risk model wholesale, including its default control: scan the artifact with antivirus and signature engines before trusting it.
MalTool shows that this control is largely cosmetic against AI-authored tools. A signature engine recognizes artifacts it has seen before. A model can write an effectively unbounded supply of new ones, each syntactically unique. The economics have flipped: producing a fresh, undetected malicious tool is now cheap, and the per-sample detection rate of signature scanning trends toward zero. This mirrors what threat-intel teams report about AI-driven polymorphic malware more broadly: detection has to move from what the code looks like to what it does.
For agentic systems the blast radius is large, because a tool runs inside the agent’s privilege envelope: its credentials, its file access, its network egress, its memory.
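To make the privilege envelope concrete, the sketch below shows one slice of it: environment variables. It assumes the agent launches a tool as a child process, and it uses a hypothetical variable name, `DEMO_AGENT_API_KEY`, with a placeholder value. By default a child process inherits the parent's entire environment; passing an explicit minimal environment removes that ambient access. This covers only environment-borne secrets, not files, network, or memory.

```python
import os
import subprocess
import sys

# Hypothetical secret name used only for this demonstration.
SECRET_NAME = "DEMO_AGENT_API_KEY"

# Child program: report whether the secret is visible in its environment.
CHILD_CODE = (
    "import os; "
    f"print('visible' if '{SECRET_NAME}' in os.environ else 'hidden')"
)


def run_child(env):
    """Run the child with the given environment (None means inherit the parent's)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Placeholder value standing in for a credential held by the agent process.
os.environ[SECRET_NAME] = "not-a-real-key"

# Default behavior: the child inherits everything the parent holds.
inherited = run_child(env=None)

# Scrubbed: pass only an explicit, minimal set of variables.
minimal_env = {k: os.environ[k] for k in ("PATH", "SYSTEMROOT") if k in os.environ}
scrubbed = run_child(env=minimal_env)
```

In this sketch `inherited` reports the secret as visible and `scrubbed` reports it as hidden. The tool code is the same in both runs; only what the launcher hands over changes.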
Defenses
Treat agent tools as untrusted code that executes with the agent’s authority, and stop relying on signatures as the gate.
- Least privilege per tool, not per agent. Scope each tool to the minimum data, filesystem paths, and network destinations it needs. A “summarize this document” tool has no business reaching arbitrary URLs. Capability-based controls limit what a malicious implementation can reach even when it runs.
- Behavioral monitoring over signature scanning. Watch what tools do at runtime — unexpected file reads, outbound connections (especially to AI APIs or unknown hosts), process spawning, credential access — rather than what their code looks like. This is the same shift defenders are making against polymorphic malware.
- Egress control. A confidentiality attack needs an exfiltration path. Default-deny outbound network policy for the agent’s execution environment breaks the most damaging tool behaviors regardless of detection.
- Sandbox tool execution. Run third-party tools in containers or under restricted user accounts with no ambient credentials, so that a malicious tool that does run cannot reach secrets, the broader filesystem, or the network.
- Provenance and signing, not just scanning. Prefer tools from sources you can attribute and verify cryptographically. Maintain an allowlist; require review and a human decision before a new third-party tool enters an agent’s available set. See OWASP’s Top 10 for Agentic Applications 2026 for the tool-misuse and supply-chain entries.
- Don’t equate “VirusTotal clean” with “safe.” A clean scan on an AI-authored tool carries almost no signal. Make that explicit in your review process so a green result does not short-circuit human judgment.
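The per-tool least-privilege and default-deny egress items above can be sketched as a small policy object. This is an illustrative sketch, not an API from the paper or from any agent framework: `ToolPolicy`, its fields, the tool names, and the example hosts and paths are all hypothetical. It shows only the decision logic; real enforcement has to live outside the tool's own process (for example in the sandbox, the OS, or an egress proxy), since a malicious tool will not check its own policy.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical per-tool grant: anything not listed is denied."""

    name: str
    readable_roots: tuple[str, ...] = ()
    allowed_hosts: frozenset[str] = frozenset()

    def may_read(self, path: str) -> bool:
        """Allow a read only for absolute paths under a granted root."""
        target = PurePosixPath(path)
        # Reject relative paths and any ".." component rather than resolving them.
        if not target.is_absolute() or ".." in target.parts:
            return False
        return any(target.is_relative_to(root) for root in self.readable_roots)

    def may_connect(self, url: str) -> bool:
        """Default-deny egress: allow only hosts explicitly granted to this tool."""
        host = urlsplit(url).hostname
        return host is not None and host.lower() in self.allowed_hosts


# A document summarizer gets one directory and no network at all.
summarizer = ToolPolicy(
    name="summarize_document",
    readable_roots=("/workspace/docs",),
)

# A fetcher gets one host and no filesystem access.
fetcher = ToolPolicy(
    name="fetch_release_notes",
    allowed_hosts=frozenset({"docs.example.com"}),
)

read_in_scope = summarizer.may_read("/workspace/docs/report.txt")
read_traversal = summarizer.may_read("/workspace/docs/../.ssh/id_rsa")
read_out_of_scope = summarizer.may_read("/home/user/.aws/credentials")
summarizer_egress = summarizer.may_connect("https://collector.example.net/upload")
fetcher_egress = fetcher.may_connect("https://docs.example.com/v2/notes")
```

Here the summarizer can read its own document directory and nothing else, and it has no outbound destinations at all, so an exfiltration attempt from inside it has no permitted path. Scoping grants to the tool rather than the agent means one compromised tool does not inherit what the others were given.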
Status
| Item | Detail |
|---|---|
| Type | Research framework / measurement study |
| Source | arXiv:2602.12194 (February 2026) |
| Affiliation | Duke University; UC Berkeley |
| Scope | Malicious tool code (not description/metadata) for LLM agents |
| Taxonomy | Confidentiality / Integrity / Availability behaviors |
| Generated corpus | 6,487 malicious tools |
| Key finding | VirusTotal failed to detect the majority |
| Defensive takeaway | Behavioral + capability controls, not signature scanning |
The durable lesson is not “a new attack tool exists.” It is that the AI tool supply chain inherits the package supply chain’s trust assumptions while breaking the detection method those assumptions quietly relied on. When the cost of generating a novel, signature-evading malicious tool approaches zero, scanning artifacts is no longer a boundary. The boundary has to be what a tool is allowed to do once your agent calls it.