Code-Augur: grounding agentic vulnerability detection with specs
On June 17, 2026, NUS researchers released Code-Augur, a harness that makes LLM-agent code audits checkable by forcing agents to commit their security assumptions as falsifiable in-source assertions.
What is this?
On June 17, 2026, researchers from the National University of Singapore (Zhengxiong Luo, Mehtab Zafar, Dylan Wolff and Abhik Roychoudhury) released Code-Augur, a harness for agentic vulnerability detection, that is, using autonomous LLM agents to audit source code. The paper’s premise is that agentic detection is “already becoming a watershed moment for software security”: audits run entirely by LLM agents are surfacing critical flaws that sat masked for years in software underpinning the digital world.
The problem the paper targets is not capability but trust. When an agent reads a function and declares it secure, what assumptions did it make about that function’s inputs? Those assumptions stay tacit, so a wrong one silently hides a real vulnerability. Code-Augur’s answer is a “security-specification-first” paradigm that forces those assumptions into the open and then tries to break them. The authors report it found 22 new vulnerabilities in key open-source projects, using widely available models like Claude Sonnet and DeepSeek rather than a curated specialist model.
How it works
Code-Augur reframes the audit around the agent’s invariants instead of its verdicts.
Given a codebase, for each component:
1. The agent analyzes the component for vulnerable code.
2. If it judges the component SECURE, it must commit the local invariants behind that judgment as in-source assertions (an explicit, machine-checkable security specification).
3. In parallel, a guided fuzzer tries to FALSIFY those assertions.
4. When the fuzzer trips an assertion, one of two things is true:
   - a genuine vulnerability was found, OR
   - the specification was wrong and gets refined.
The key move is that a “secure” judgment is no longer a dead end. It becomes a concrete claim — an assertion encoding what the agent believed about inputs, ranges, and trust boundaries — that a dynamic tool can attack. Runtime falsification then grounds the agent’s mental model: each time the fuzzer trips an assertion, the agent’s understanding of what the code was meant to do is realigned with how the code actually behaves. The authors report that this specification-first loop detects more vulnerabilities than other state-of-the-art agents on real-world subjects.
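To make the loop concrete, here is a minimal toy sketch in Python. It is not Code-Augur's implementation: the functions `copy_field`, `parse_record` and `falsify` are hypothetical names invented for illustration, and a plain random search stands in for the guided fuzzer. The point is only to show how a "secure" judgment, once written down as an in-source assertion, becomes something a dynamic tool can attack.

```python
import random


def copy_field(buf: bytes, offset: int, length: int) -> bytes:
    """Hypothetical component under audit: slices a field out of a buffer."""
    # Invariant the (imagined) agent committed when it judged this function
    # SECURE: callers always pass a range that lies inside the buffer.
    assert 0 <= offset and offset + length <= len(buf), "spec: range within buffer"
    return buf[offset:offset + length]


def parse_record(data: bytes) -> bytes:
    """Hypothetical caller: the first byte is an attacker-controlled length."""
    if not data:
        return b""
    declared = data[0]
    # No bounds check here, so the callee's assumed invariant can be violated.
    return copy_field(data, 1, declared)


def falsify(target, trials: int = 2000, seed: int = 0):
    """Random search for an input that trips an in-source assertion.

    Stand-in for a guided fuzzer; returns (input, message) or None.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randrange(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except AssertionError as exc:
            return data, str(exc)
    return None


counterexample = falsify(parse_record)
if counterexample is not None:
    data, message = counterexample
    print(f"assertion tripped ({message}) on input {data!r}")
```

A tripped assertion is only the start of the triage described above. In a memory-unsafe language the same pattern could be an out-of-bounds read, which would make it a genuine vulnerability; in this Python toy the slice is harmless, so the finding would instead prompt either a bounds check in `parse_record` or a refined specification.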
Why it matters
Agentic auditing is no longer a lab curiosity. In late May 2026, Anthropic disclosed that its Claude Mythos Preview flagged more than 23,000 potential vulnerabilities across over 1,000 open-source projects drawn from the OSS-Fuzz corpus, with over 1,700 confirmed by external review. Code-Augur’s results suggest that effective autonomous detection does not require a bespoke frontier model: commodity LLMs in the right harness can surface real vulnerabilities too.
That cuts both ways. The same agents that let defenders audit their dependencies at scale let attackers mine those same dependencies for unpatched bugs. And the volume is the new problem: an audit run that emits thousands of candidate findings is worthless if every “secure” verdict is opaque and every “vulnerable” flag needs hours of manual triage. Code-Augur’s contribution is precisely to make both outcomes legible — a verdict you can inspect and a finding the fuzzer already corroborated.
Defenses
For teams adopting — or receiving the output of — agentic code audits, the practical lessons are concrete.
Do not trust an agent’s “secure” verdict on its own. Demand the assumptions behind it. A specification-first approach turns an opaque judgment into a checkable artifact; if your tooling cannot tell you why it thinks a function is safe, you cannot tell whether it simply assumed the bug away.
Pair LLM reasoning with dynamic falsification. Static agent judgment and a guided fuzzer catch different errors; the value in Code-Augur is the loop between them, not either half alone. Treat agent-asserted invariants as test oracles, and let fuzzing try to break them.
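One way to picture "invariants as test oracles" is the sketch below. It assumes the invariants have been pulled out as named predicates over an (input, output) pair, which is an illustrative simplification and not how the paper describes its in-source assertions. `normalize_path`, the `INVARIANTS` table and `check_oracles` are all hypothetical, and the random input generator again stands in for a real fuzzer.

```python
import random
from typing import Callable, Dict, List


def normalize_path(path: str) -> str:
    """Hypothetical function under audit: collapses '.' and '..' segments."""
    parts: List[str] = []
    for seg in path.split("/"):
        if seg in ("", "."):
            continue
        if seg == "..":
            if parts:
                parts.pop()
            continue
        parts.append(seg)
    return "/" + "/".join(parts)


# Hypothetical agent-asserted invariants, written as predicates over
# (input, output). The third one is deliberately a wrong assumption.
INVARIANTS: Dict[str, Callable[[str, str], bool]] = {
    "output is absolute": lambda inp, out: out.startswith("/"),
    "no parent-directory segment survives": lambda inp, out: ".." not in out.split("/"),
    "output never longer than input": lambda inp, out: len(out) <= len(inp),
}


def check_oracles(trials: int = 500, seed: int = 1) -> Dict[str, List[str]]:
    """Run random inputs through the target and record violations per invariant."""
    rng = random.Random(seed)
    segments = ["a", "b", ".", "..", ""]
    violations: Dict[str, List[str]] = {name: [] for name in INVARIANTS}
    for _ in range(trials):
        count = rng.randrange(1, 6)
        path = "/".join(rng.choice(segments) for _ in range(count))
        out = normalize_path(path)
        for name, holds in INVARIANTS.items():
            if not holds(path, out):
                violations[name].append(path)
    return violations


report = check_oracles()
for name, failing_inputs in report.items():
    status = "FALSIFIED" if failing_inputs else "survived"
    print(f"{name}: {status}")
```

Here the first two invariants survive the search, while the third is falsified by relative inputs such as `a`, which normalize to the longer `/a`. That is the "specification was wrong" branch: nothing is exploitable, but the recorded assumption did not match the code's behavior and needs refining before anyone relies on it.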
Plan for triage at scale and disclosure overload. A scanner that produces tens of thousands of candidates shifts the bottleneck to human confirmation and to maintainers who must absorb a flood of reports. Require external or human confirmation before disclosure, and coordinate with maintainers so that AI-found bugs follow responsible-disclosure timelines rather than dumping unvetted findings.
Finally, scan your own code first. The lesson of the 2026 wave of agentic detection is symmetry: the side that audits proactively reaches the bug before the side that audits to exploit it.
Status
| Item | Detail |
|---|---|
| Published | arXiv:2606.18619, submitted June 17, 2026 |
| Authors | Luo, Zafar, Wolff, Roychoudhury (NUS) |
| Method | Specification-first agentic audit + guided-fuzzer falsification of in-source assertions |
| Result | 22 new vulnerabilities in key OSS projects; outperforms prior agents on real-world subjects |
| Models used | Commodity LLMs (Claude Sonnet, DeepSeek), vs. specialized models like Claude Mythos |