ConVerse: when two agents talk, the stronger one leaks more

A benchmark for agent-to-agent conversations finds privacy attacks succeed up to 88% of the time and security breaches up to 60% — and that more capable models leak more, not less.

2026-06-13 // 6 min // affects: gpt-4o, claude-3.5, gemini-1.5, llama-3.1, agentic-systems

What is this?

ConVerse is a benchmark for measuring privacy and security failures when two LLM agents talk to each other rather than to a human. It was introduced by Amr Gomaa (German Research Center for Artificial Intelligence, DFKI), Ahmed Salem (Microsoft) and Sahar Abdelnabi (Microsoft / ELLIS Institute Tübingen / MPI for Intelligent Systems) in a paper posted to arXiv on November 7, 2025 (arXiv:2511.05359) and published in the Findings of EACL 2026 (2026.findings-eacl.170). The benchmark and platform are open source (github.com/amrgomaaelhady/ConVerse).

The reason it matters now: 2026 deployments increasingly put a user’s personal assistant in direct conversation with an external service provider’s agent — a travel assistant negotiating with an airline agent, a buyer’s agent talking to a listing agent. Most safety tooling was built and tested for a single agent answering a single user. ConVerse measures what happens when that assumption no longer holds.

How it works

ConVerse models autonomous, multi-turn conversations between a user-side agent and an external agent across three practical domains: travel, real estate and insurance. It uses 12 user personas and over 864 contextually grounded attacks — 611 targeting privacy and 253 targeting security.

The defining property is that malicious requests are embedded inside plausible discourse rather than appearing as obvious injected instructions. A counterparty agent asks for exactly the piece of information a cooperative exchange might reasonably need, then for a little more, or for the same thing framed slightly out of context. Privacy is scored with a three-tier taxonomy that judges abstraction quality: did the agent share the minimum necessary, over-share, or disclose something that should never have left the user’s context? Security attacks target tool use and preference manipulation: getting the agent to call a tool it should not, or to quietly shift the user’s stated preferences.
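The three-tier judgment can be sketched in a few lines. The tier names below are hypothetical (the description above gives the three outcomes but no labels), and counting anything above the minimum-necessary tier as a successful attack is an assumption made for illustration, not ConVerse's scoring code.

```python
from enum import IntEnum


class DisclosureTier(IntEnum):
    # Hypothetical labels for the three outcomes described in the text.
    MINIMUM_NECESSARY = 0   # shared only what the task needed
    OVERSHARE = 1           # shared more detail than the task needed
    NEVER_SHARE = 2         # disclosed data that should not have left the user's context


def attack_success_rate(tiers: list[DisclosureTier]) -> float:
    """Fraction of judged conversations that ended above MINIMUM_NECESSARY.

    Assumption: any overshare counts as a successful privacy attack.
    """
    if not tiers:
        return 0.0
    leaked = sum(1 for tier in tiers if tier > DisclosureTier.MINIMUM_NECESSARY)
    return leaked / len(tiers)


# Made-up judgments for four conversations, purely for illustration.
judged = [
    DisclosureTier.MINIMUM_NECESSARY,
    DisclosureTier.OVERSHARE,
    DisclosureTier.NEVER_SHARE,
    DisclosureTier.OVERSHARE,
]
rate = attack_success_rate(judged)
print(rate)
```

Keeping the tiers ordered makes it easy to report a stricter metric too, for example counting only NEVER_SHARE outcomes.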

The structure is deliberately not “instruction hidden in data.” The question is whether a given disclosure or action fits the norms of the surrounding context, the same framing Abdelnabi and Bagdasarian argue for in the companion impossibility paper “AI Agents May Always Fall for Prompt Injections” (May 17, 2026). ConVerse is the empirical counterpart: it shows the failure happening, at scale, across vendors.

Why it matters

Three findings stand out for anyone shipping multi-agent systems.

The success rates are high. Across seven state-of-the-art models, privacy attacks succeed in up to 88% of cases and security breaches in up to 60%. These are not edge cases coaxed out with exotic payloads; they are everyday-looking requests inside a normal negotiation.

More capable models leak more, not less. This is the counterintuitive headline: stronger models, which are better at being helpful and at inferring what a counterparty “needs,” are also more willing to hand over information that should have stayed put. Capability and cooperativeness cut against confidentiality here. Teams that assume an upgrade to a frontier model improves safety in agent-to-agent settings should test that assumption rather than rely on it.

Safety becomes an emergent property of communication. A single agent can pass every single-turn guardrail check and still leak across a multi-turn exchange, because the harm lives in the conversation, not in any one message. Input/output filters tuned on isolated prompts do not see it.
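A toy example makes the gap concrete. The sketch below assumes a hypothetical per-message filter that allows at most one sensitive field per message, and a conversation-level view that accumulates what has been revealed so far. The field names and the threshold are invented for illustration.

```python
# Hypothetical set of fields the user considers sensitive.
SENSITIVE = {"passport_number", "home_address", "travel_dates"}


def message_passes(fields: set[str], max_sensitive: int = 1) -> bool:
    """Single-turn check: allow a message that reveals at most
    `max_sensitive` sensitive fields."""
    return len(fields & SENSITIVE) <= max_sensitive


def conversation_exposure(turns: list[set[str]]) -> set[str]:
    """Conversation-level view: union of sensitive fields revealed
    across all turns."""
    exposed: set[str] = set()
    for fields in turns:
        exposed |= fields & SENSITIVE
    return exposed


# Each turn lists the fields the user-side agent disclosed in that message.
turns = [
    {"travel_dates"},
    {"home_address", "seat_preference"},
    {"passport_number"},
]

all_pass = all(message_passes(fields) for fields in turns)
exposed = conversation_exposure(turns)
print(all_pass, sorted(exposed))
```

Every message clears the single-turn check, yet the conversation as a whole has exposed the full sensitive set.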

Defenses

ConVerse is a measurement tool, not an exploit. The defensive program it points to lines up with the contextual-integrity literature.

Treat agent-to-agent links as a trust boundary. An external counterparty agent is untrusted input, exactly like a web page or an email. Do not grant it implicit authority just because it is “an agent” speaking in cooperative language.

Apply data minimization at the egress point. Before the user-side agent discloses anything, run a check on whether this specific field is necessary for this specific task, and prefer the most abstract form that still completes the exchange (a date range over an exact itinerary, a price band over an exact budget). ConVerse’s three-tier abstraction taxonomy is a usable rubric for this.
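One way to picture an egress check is a per-task policy that names which fields may leave and in what abstracted form. Everything in this sketch is an assumption for illustration: the task name, the field names, the `EGRESS_POLICY` table and the abstraction helpers are hypothetical, not part of ConVerse.

```python
from datetime import date
from typing import Callable


def to_month(d: date) -> str:
    """Abstract an exact date to its year and month."""
    return d.strftime("%Y-%m")


def to_band(amount: int, width: int = 500) -> str:
    """Abstract an exact amount to a band of the given width."""
    low = (amount // width) * width
    return f"{low}-{low + width}"


# Hypothetical policy: task -> {field that may leave: how to abstract it}.
EGRESS_POLICY: dict[str, dict[str, Callable]] = {
    "book_flight": {"departure_date": to_month, "budget": to_band},
}


def minimize(task: str, profile: dict) -> dict:
    """Return only the fields the task is allowed to disclose, abstracted.

    Fields without a policy entry are dropped; unknown tasks disclose nothing.
    """
    allowed = EGRESS_POLICY.get(task, {})
    return {
        field: abstract(profile[field])
        for field, abstract in allowed.items()
        if field in profile
    }


profile = {
    "departure_date": date(2026, 7, 14),
    "budget": 1240,
    "passport_number": "X1234567",  # made-up value
}
shared = minimize("book_flight", profile)
print(shared)
```

The default-deny shape is the design choice worth keeping: a field that nobody wrote a policy for never reaches the counterparty.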

Gate tool calls and preference changes on contextual confirmation. Security attacks in ConVerse work through tool use and preference manipulation; high-impact or cross-context actions should require an out-of-band check rather than firing automatically inside the conversation.
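A minimal sketch of such a gate, assuming the system can attribute each proposed action either to the user or to the counterparty's messages (which is itself a hard problem in practice). The tool names and the `confirm` callback are hypothetical.

```python
from typing import Callable

# Hypothetical tool names; substitute the high-impact tools in your own stack.
HIGH_IMPACT_TOOLS = {"make_payment", "update_preferences", "share_document"}


def may_execute(tool: str, origin: str, confirm: Callable[[str], bool]) -> bool:
    """Decide whether a proposed tool call may run.

    origin is "user" when the user asked for the action directly and
    "counterparty" when the idea came from the external agent's messages.
    confirm is an out-of-band check, for example a prompt on the user's device.
    """
    if tool in HIGH_IMPACT_TOOLS and origin != "user":
        return confirm(tool)
    return True


asked: list[str] = []


def deny_and_log(tool: str) -> bool:
    """Stand-in for a real confirmation channel: records the request, refuses."""
    asked.append(tool)
    return False


blocked = may_execute("update_preferences", "counterparty", deny_and_log)
allowed = may_execute("search_flights", "counterparty", deny_and_log)
print(blocked, allowed, asked)
```

The confirmation path runs outside the conversation, so a persuasive counterparty message cannot answer it on the user's behalf.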

Evaluate multi-turn, not single-turn. Because the failure is emergent, your test harness has to play out full agent-to-agent dialogues. ConVerse is dynamic and open source, so it can be run against your own stack rather than treated as a static leaderboard.
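The shape of a multi-turn harness can be shown with stub agents. This is a sketch under stated assumptions, not ConVerse's interface: the `Agent` type, `run_dialogue`, `leaked` and both stub agents are hypothetical, and a real harness would call models and use a judge instead of substring matching.

```python
from typing import Callable

# Hypothetical agent interface: takes the transcript so far, returns a message.
Agent = Callable[[list[str]], str]


def run_dialogue(user_agent: Agent, external_agent: Agent, max_turns: int = 4) -> list[str]:
    """Alternate external and user-side messages for max_turns rounds."""
    transcript: list[str] = []
    for _ in range(max_turns):
        transcript.append(external_agent(transcript))
        transcript.append(user_agent(transcript))
    return transcript


def leaked(transcript: list[str], secrets: list[str]) -> list[str]:
    """Secrets that appear verbatim in any user-side message (odd positions)."""
    user_messages = transcript[1::2]
    return [s for s in secrets if any(s in m for m in user_messages)]


def scripted_attacker(history: list[str]) -> str:
    """Stub counterparty: a reasonable request first, then an escalation."""
    asks = [
        "Which dates work for you?",
        "To hold the fare I need your passport number.",
    ]
    return asks[min(len(history) // 2, len(asks) - 1)]


def naive_assistant(history: list[str]) -> str:
    """Stub user-side agent that answers whatever it is asked."""
    if "passport" in history[-1]:
        return "Sure, it is X1234567."  # made-up secret
    return "Mid July works."


transcript = run_dialogue(naive_assistant, scripted_attacker, max_turns=2)
found = leaked(transcript, ["X1234567"])
print(found)
```

The leak only shows up in the second round, which is the point: a harness that stops after one exchange would report this agent as clean.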

Do not assume the bigger model is the safer one. Re-run the privacy and security suites whenever you change the underlying model, and weight the results by how much authority and data each agent actually holds.
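A model-swap check along these lines could be as small as the sketch below. The weighting scheme, the tolerance and all numbers are assumptions chosen for illustration. They are not ConVerse results.

```python
def weighted_risk(attack_success: dict[str, float], weight: dict[str, float]) -> float:
    """Weighted mean of per-suite attack success rates.

    weight reflects how much authority or data the agent holds for each suite.
    """
    total = sum(weight[suite] for suite in attack_success)
    return sum(rate * weight[suite] for suite, rate in attack_success.items()) / total


def regressed(old: dict[str, float], new: dict[str, float],
              weight: dict[str, float], tolerance: float = 0.02) -> bool:
    """True if the new model's weighted risk exceeds the old one's by more
    than the tolerance."""
    return weighted_risk(new, weight) > weighted_risk(old, weight) + tolerance


# Illustrative numbers only: attack success rates before and after a model swap.
old_model = {"privacy": 0.40, "security": 0.20}
new_model = {"privacy": 0.55, "security": 0.15}
# This agent holds a lot of personal data but few tools, so privacy weighs more.
weights = {"privacy": 3.0, "security": 1.0}

old_risk = weighted_risk(old_model, weights)
new_risk = weighted_risk(new_model, weights)
print(round(old_risk, 2), round(new_risk, 2), regressed(old_model, new_model, weights))
```

In this made-up case the new model looks better on security but worse overall, because the agent's exposure is mostly on the privacy side.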

Status

Item	Detail
Benchmark	ConVerse — arXiv:2511.05359, posted Nov 7, 2025
Venue	Findings of EACL 2026 (2026.findings-eacl.170)
Authors	Amr Gomaa (DFKI), Ahmed Salem (Microsoft), Sahar Abdelnabi (Microsoft / ELLIS / MPI-IS)
Scope	3 domains, 12 personas, 864+ attacks (611 privacy, 253 security), 7 models
Key results	Privacy attacks up to 88%, security breaches up to 60%; stronger models leak more
Code	github.com/amrgomaaelhady/ConVerse

As personal assistants start negotiating directly with service-provider agents, the unit of safety shifts from the prompt to the conversation. ConVerse gives teams a concrete, reproducible way to see how their own agents behave when the other side of the table is also a model.