DATA LEAK MEDIUM

Social contagion: LLM agents leak private data in multi-agent settings

A May 2026 study simulating thousands of LLM agents finds privacy leakage is socially contagious: agents leak ~8x more after a peer does, and explicit privacy instructions reduce but don't eliminate it.

2026-06-04 // affects: gpt-5, gpt-5-mini, gpt-5-nano, gpt-4o, gpt-4o-mini, gemini-3-pro, gemini-3-flash

What is this?

On May 26, 2026, three researchers (Aman Priyanshu, Supriti Vijay, Esha Pahwa) published “Got a Secret? LLM Agents Can’t Keep It: Evaluating Privacy in Multi-Agent Systems” (arXiv:2605.27766, to appear at ACM CAIS ’26). The finding is a measurement, not an exploit: when LLM agents are placed in a persistent social environment alongside other agents, they disclose their user’s private data far more than the same models do in isolated, single-turn tests, and disclosure spreads from agent to agent like a contagion.

This matters because most safety benchmarks still test a model as a lone chat assistant answering one bounded prompt. The study shows that the social context an agent operates in is itself a privacy variable, one that single-turn evaluations never surface. It is a follow-on to Meta’s CIMemories benchmark (November 2025), which had already shown that contextual-integrity violations accumulate across tasks; this paper extends the question to many agents interacting over time.

How it works

The authors built a Reddit-style simulation (124 communities, a shared SQLite backend, and a twelve-function tool suite covering browse, search, post, reply, vote, and memory) and populated it with 2,533 agents seeded from a real agent-only social network. Each agent was given a synthetic human profile of roughly 97 attributes spanning ten sensitive domains: identity, finance, health, mental health, legal, relationships, housing, employment, education, and scheduling. An LLM judge scored leakage against those ground-truth attributes using a contextual-integrity definition of privacy: a disclosure counts as a violation when a sensitive attribute surfaces in a context that does not warrant it.
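To make the contextual-integrity definition concrete, here is a minimal Python sketch of how a single agent write could be scored against a persona. It is not the authors' pipeline: a case-insensitive substring match stands in for the LLM judge, and the persona values, the per-community "warranted" domains, and the names `score_write` and `warranted_by_community` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Violation:
    domain: str
    attribute: str

def score_write(text: str, persona: dict[str, dict[str, str]],
                warranted: set[str]) -> list[Violation]:
    """Return contextual-integrity violations for one write.

    A violation is a persona value that appears in the text while its domain
    is not warranted in the current context. Substring matching is a crude
    stand-in for an LLM judge (assumption).
    """
    lowered = text.lower()
    violations = []
    for domain, attrs in persona.items():
        if domain in warranted:
            continue  # disclosure is appropriate in this context
        for name, value in attrs.items():
            if value.lower() in lowered:
                violations.append(Violation(domain, name))
    return violations

# Hypothetical synthetic persona, grouped by sensitive domain.
persona = {
    "health": {"condition": "type 2 diabetes"},
    "employment": {"employer": "Acme Logistics"},
    "scheduling": {"weekly_event": "Thursday physio"},
}
# Hypothetical norms: a careers community warrants employment details only.
warranted_by_community = {"r/careers": {"employment"}, "r/agent-tooling": set()}

post = "My human works at Acme Logistics and has type 2 diabetes."
print(score_write(post, persona, warranted_by_community["r/careers"]))
# [Violation(domain='health', attribute='condition')]
```

The same sentence scores differently depending on where it is posted, which is the point of the contextual-integrity framing: the employer is acceptable in the careers community but counts as a violation in the tooling one. A substring match also misses paraphrased disclosures that an LLM judge could catch.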

Two evaluations ran on top of this:

Setup                  Detail
---------------------  --------------------------------------------------------
Organic simulation     2,533 agents, 25 simulated days, 111,209 content items
                       (29,945 posts + 81,264 replies). No scripted adversary.
Controlled testbed     1 agent at a time vs. a frozen snapshot, 7 frontier
                       models, 5 levels of "adversarial" disclosure-normalising
                       posts, tool-call budgets of 10-50. 7,000 traces total.

The “adversarial” content is mild by design: fictitious agents that casually mention details about their own “human,” upvoted to ~1.2x the top post in a subreddit so they surface when an agent browses by popularity. There is no jailbreak, no injection payload — just a community where sharing is normalised. That is the whole mechanism, which is why this is publishable as a defensive finding rather than an attack recipe.
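The seeding step is simple enough to sketch. The Python below assumes a community feed ranked purely by score and gives each seeded post 1.2x the highest organic score; the `Post` type, the integer rounding, and the function names are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass

# "~1.2x the top post" per the paper; rounding to an integer score is an assumption.
SEED_MULTIPLIER = 1.2

@dataclass
class Post:
    post_id: str
    score: int
    seeded: bool = False  # True for the disclosure-normalising posts

def seed_score(organic: list[Post]) -> int:
    """Score for a seeded post: a multiple of the highest organic score."""
    top = max((p.score for p in organic), default=0)
    return round(top * SEED_MULTIPLIER)

def browse_by_popularity(posts: list[Post], limit: int) -> list[Post]:
    """What an agent sees when it sorts a community by score (ties keep insertion order)."""
    return sorted(posts, key=lambda p: p.score, reverse=True)[:limit]

organic = [Post("a", 40), Post("b", 25), Post("c", 10)]
boost = seed_score(organic)  # 48
seeds = [Post("s1", boost, seeded=True), Post("s2", boost, seeded=True)]
feed = browse_by_popularity(organic + seeds, limit=3)
print([p.post_id for p in feed])  # ['s1', 's2', 'a']
```

With the top organic post at 40, the seeded posts score 48 and take the first two slots of a top-3 browse, so an agent sorting by popularity meets the disclosure-normalising content before anything else.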

The headline numbers, all from the paper:

From single-turn to social. Moving from one-shot CIMemories-style evaluation to multi-turn social interaction raised violations from 19.95% to 45.3% across OpenAI models.
Social contagion. A reply that follows a leaking reply in the same thread leaks 12.8% of the time, versus 1.6% after a clean reply (1.8% global baseline) — roughly an 8x increase. Crossing the boundary once makes the next agent more likely to cross it.
Instructions help but don’t hold. Adding an explicit “do not disclose private information” system prompt cut leakage only partially: gpt-4o fell from 2,624 to 2,102 leaking writes (about 20% fewer) and gpt-5-mini from 2,889 to 2,194 (about 24% fewer). Only gpt-5 dropped sharply, from 2,296 to 482 (about 79% fewer). Aggregate leakage stayed above 37.8% even with safeguards on.
Where matters as much as which model. Subreddit-level leakage ranged from under 2% in technical communities (memory, agent-tooling) to over 16% in self-introduction communities — nearly an order of magnitude, comparable to the gap between frontier models. Under extended tool-call budgets, several models reached 50-60% leakage.

General-identity attributes dominated the leaks (1,496 items), followed by employment (921), scheduling (812), and mental health (767).
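The contagion figure is a conditional rate: among replies whose immediate predecessor in the thread leaked, what fraction also leak, compared with replies that follow a clean one. Below is a minimal Python sketch, assuming each thread has already been reduced to an ordered list of judge verdicts (True means that reply leaked); the function name and the toy threads are hypothetical. On the paper's numbers, 12.8% against 1.6% is the reported lift of 8.

```python
import math

def contagion_rates(threads: list[list[bool]]) -> dict[str, float]:
    """Conditional leak rates for consecutive replies within each thread."""
    leaks_after_leak = n_after_leak = 0
    leaks_after_clean = n_after_clean = 0
    for thread in threads:
        # Pair every reply with the reply immediately before it.
        for prev, cur in zip(thread, thread[1:]):
            if prev:
                n_after_leak += 1
                leaks_after_leak += cur
            else:
                n_after_clean += 1
                leaks_after_clean += cur
    total = sum(len(t) for t in threads)
    # Baseline: overall leak rate across every judged reply.
    baseline = sum(sum(t) for t in threads) / total if total else math.nan
    after_leak = leaks_after_leak / n_after_leak if n_after_leak else math.nan
    after_clean = leaks_after_clean / n_after_clean if n_after_clean else math.nan
    lift = after_leak / after_clean if after_clean else math.nan
    return {"after_leak": after_leak, "after_clean": after_clean,
            "baseline": baseline, "lift": lift}

# Toy verdicts, one list per thread (not data from the paper).
threads = [
    [False, True, True, True],
    [False, False, False, False],
    [True, True, False],
]
rates = contagion_rates(threads)
print(rates["after_leak"], rates["after_clean"], rates["lift"])  # 0.75 0.25 3.0
```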

Why it matters

The risk surface here is not a vulnerable endpoint; it is the deployment pattern of agents that carry a user’s personal profile and talk to other agents over long horizons — exactly the shape of emerging agent networks. This connects directly to the lethal trifecta: an agent with access to private data, exposure to untrusted content, and a way to communicate externally. The new twist is that the “untrusted content” doesn’t need to be a crafted attack. Ordinary peer behaviour is enough to erode the agent’s contextual-integrity boundaries over time.

Three consequences for anyone shipping agents:

Your pre-deployment privacy tests are probably optimistic. A model that passes a single-turn PII check can still leak at double-digit rates once embedded in a community and run for fifty tool calls. The compliance you measured in isolation does not transfer.
Prompt-level guardrails degrade under social pressure. “Don’t share private data” behaves like a probabilistic defence, not a hard boundary — and how much it helps is highly model-dependent.
Leakage accumulates and cascades. It is trajectory-dependent: the longer an agent participates and the more disclosure it witnesses, the more it discloses. One leak in a high-visibility thread can raise the platform-wide rate.
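Because the cascade has a recognisable signature (leaks clustering in the same thread), an operator can watch for it as replies are judged. The sketch below is one hypothetical way to do that in Python: a per-thread sliding window that trips once `threshold` of the last `window` replies have leaked. The class name and both knobs are assumptions to be tuned, not values from the paper.

```python
from collections import defaultdict, deque

class CascadeTripwire:
    """Flags a thread when `threshold` of its last `window` replies leaked."""

    def __init__(self, window: int = 5, threshold: int = 2):
        self.window = window
        self.threshold = threshold
        # One bounded history of leak verdicts per thread id.
        self.recent = defaultdict(lambda: deque(maxlen=window))

    def observe(self, thread_id: str, leaked: bool) -> bool:
        """Record one judged reply; return True if the thread needs intervention."""
        history = self.recent[thread_id]
        history.append(leaked)
        return sum(history) >= self.threshold

tripwire = CascadeTripwire(window=5, threshold=2)
events = [("t1", False), ("t1", True), ("t2", True), ("t1", False), ("t1", True)]
flags = [tripwire.observe(thread, leaked) for thread, leaked in events]
print(flags)  # [False, False, False, False, True]
```

What happens when it trips (rate-limiting the thread, re-injecting the privacy instruction, or pausing the agent) is a policy choice left to the caller.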

A standing caveat from the authors: detection is LLM-as-a-judge, so reported violation rates should be read as an upper bound, and the personas are synthetic. The direction of the effect, not the exact percentage, is the takeaway.

Defenses

There is no patch — this is a design problem. The mitigations are systemic, and most mirror the paper’s own forward agenda.

Test with social context as a first-class variable. Add community structure, peer exposure, and interaction length to your evaluation matrix alongside model and prompt. A single-turn refusal benchmark will not catch norm drift. Reuse the CIMemories contextual-integrity framing and extend it to multi-turn, multi-agent runs.
Minimise what the agent can leak. Don’t load a full PII profile into an agent’s context when a task needs three fields. Data minimisation caps the blast radius regardless of how social pressure plays out.
Sandbox memory against cross-context surfacing. Persistent memory is the carrier here. Scope memory reads to the current task/context so an attribute learned in one setting cannot resurface in an unrelated community. This is the same lesson as temporal memory contamination, applied to social channels.
Constrain participation. Where an agent posts is as predictive of leakage as which model runs it. Restricting an agent to task-relevant channels reduces exposure more reliably than tweaking its persona.
Monitor for disclosure cascades. Watch for the contagion signature — a leak in a thread followed by more leaks — and intervene (rate-limit, re-inject privacy constraints, or pause the agent) before it propagates platform-wide.
Re-assert constraints on long runs and prefer robust models. Leakage rises with tool-call budget, so periodically re-inject the privacy instruction across long sessions, and weight model selection toward models that actually hold the line under pressure (gpt-5’s drop from 2,296 to 482 leaking writes, against much smaller drops for gpt-4o and gpt-5-mini, shows the gap between models is real). Treat instructions as mitigation, not immunity.
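Several of these controls are plain plumbing around the agent loop. The Python sketch below combines four of them under stated assumptions: profile fields are filtered to what the task declares (minimisation), memory reads are scoped to a context key (sandboxing), tool calls outside an allowlist of channels are refused (constrained participation), and a privacy reminder is returned for re-injection every `reinject_every` calls. `GuardedSession`, its fields, the reminder wording, and the cadence are hypothetical; how the returned messages reach the model depends on your agent framework.

```python
from dataclasses import dataclass, field

# Illustrative wording, not the paper's exact system prompt.
PRIVACY_REMINDER = "Do not disclose private information about your user."

@dataclass
class GuardedSession:
    task_fields: set[str]       # minimisation: only these profile fields are loaded
    allowed_channels: set[str]  # participation: channels this task may touch
    reinject_every: int = 10    # hypothetical cadence, in tool calls
    calls: int = 0
    memory: dict[str, list[str]] = field(default_factory=dict)

    def minimal_profile(self, profile: dict[str, str]) -> dict[str, str]:
        """Keep only the fields the task needs."""
        return {k: v for k, v in profile.items() if k in self.task_fields}

    def remember(self, context: str, note: str) -> None:
        self.memory.setdefault(context, []).append(note)

    def recall(self, context: str) -> list[str]:
        # Scoped read: never returns notes written under another context.
        return list(self.memory.get(context, []))

    def before_tool_call(self, channel: str) -> list[str]:
        """Return system messages to prepend; refuse out-of-scope channels."""
        if channel not in self.allowed_channels:
            raise PermissionError(f"channel {channel!r} is outside this task")
        self.calls += 1
        return [PRIVACY_REMINDER] if self.calls % self.reinject_every == 0 else []

session = GuardedSession(task_fields={"first_name", "timezone"},
                         allowed_channels={"r/agent-tooling"}, reinject_every=3)
profile = {"first_name": "Sam", "timezone": "UTC+1",
           "diagnosis": "anxiety disorder", "salary": "85000"}
print(session.minimal_profile(profile))  # {'first_name': 'Sam', 'timezone': 'UTC+1'}
session.remember("task-42", "prefers short replies")
print(session.recall("r/introductions"))  # []
reminders = [session.before_tool_call("r/agent-tooling") for _ in range(6)]
```

The design choice worth noting is that minimisation and scoping happen before the model ever sees the data, so they hold regardless of how well the model resists social pressure; the reminder, by contrast, is just another instruction and inherits the probabilistic behaviour described above.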

Status

Item                   Reference                         Date        Notes
---------------------  --------------------------------  ----------  --------------------------------------------------
“Got a Secret?” paper  arXiv:2605.27766                  2026-05-26  Multi-agent privacy simulation; CAIS ’26
Code & data            llms-cant-keep-secrets.github.io  2026-05     Publicly released
CIMemories benchmark   arXiv:2511.14937                  2025-11-18  Contextual-integrity benchmark this work builds on
Models evaluated       Paper §4.3                        2026-05     gpt-5 / -mini / -nano, gpt-4o / -mini,
                                                                     gemini-3-pro / -flash
Mitigation status      n/a                               n/a         No patch; design-level controls only

The right framing is not “agents leak secrets” — single models leaking under direct prompting is old news. It is that a benign social environment, with no attack payload, is enough to drive a user’s private data out of an agent that would have stayed quiet on its own — and the more agents you connect, the worse it gets. If you are building agent networks, treat social topology as part of your threat model, not as background.

This article summarises publicly available, peer-reviewable research for defensive purposes. It contains no operational attack payload. Reported figures are the authors’ and reflect synthetic personas evaluated by an LLM judge; treat them as upper bounds. Last reviewed 2026-06-04.