Capability vs propensity: auditing LLM training-data leakage
A June 2026 framework, PropMe, separates what a model CAN leak under attack from what it WILL leak in ordinary use. The gap is wide — and audits that ignore it misstate real-world risk.
What is this?
On June 4, 2026, researchers at the University of Southern Denmark published PropMe (arXiv:2606.06286), a framework that reframes how we measure memorization in large language models. Their core observation is methodological: almost every existing memorization evaluation measures whether a model can be forced to reproduce training data — a capability — rather than whether it actually does so under ordinary use — a propensity. The two are routinely conflated, and the conflation inflates how risky a deployed model looks.
Memorization itself is old news. Since Carlini et al. (2021) and the scalable extraction work of Nasr et al. (2023), it has been clear that models can regurgitate copyrighted text and personal identifiers when prompted adversarially. PropMe’s contribution is not a new attack — it is a cleaner way to audit the phenomenon. This is measurement tooling, not an exploit.
How it works
PropMe contrasts two prompting regimes against the same model. A propensity setting uses plausible, naturally phrased prompts (“Generic” and “Specific”, 100 samples each) with low lexical overlap with the training data — what a normal user would type. A capability setting uses a prefix attack: the model is conditioned on the first 50 tokens of a training example of at least 100 tokens, and its verbatim continuation is scored against the full corpus.
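The capability regime can be illustrated with a minimal sketch. The 50-token prefix and the 100-token minimum length come from the description above; everything else is an assumption made for illustration. In particular, `generate` is a hypothetical stand-in for a model call, tokens are plain strings, and the continuation is scored only against the example's own remainder, whereas the paper scores it against the full corpus.

```python
from typing import Callable, List, Optional

PREFIX_LEN = 50  # tokens given to the model as the attack prompt (from the text)
MIN_LEN = 100    # shorter training examples are skipped (from the text)


def verbatim_match_length(generated: List[str], reference: List[str]) -> int:
    """Count how many leading generated tokens reproduce the reference exactly."""
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n


def capability_probe(
    tokens: List[str],
    generate: Callable[[List[str]], List[str]],  # hypothetical model wrapper
) -> Optional[int]:
    """Run one prefix attack; return the verbatim match length, or None if skipped."""
    if len(tokens) < MIN_LEN:
        return None
    prefix, reference = tokens[:PREFIX_LEN], tokens[PREFIX_LEN:]
    # Simplification: compare against this example's own continuation only,
    # not against the whole training corpus.
    return verbatim_match_length(generate(prefix), reference)
```

A propensity probe would reuse the same scoring function but replace the prefix with a naturally phrased prompt that has low lexical overlap with the training data.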
A propensity transformation then maps any existing non-negative memorization metric f to a score in [0,1]:
PM(M, x) = 1/2 * ( 1 + ( f_p(M,x) - f_c(M,x) ) / ( f_p(M,x) + f_c(M,x) ) )
f_p = metric value under propensity (ordinary) prompting
f_c = metric value under capability (prefix-attack) prompting
High capability + low ordinary use -> low PM (model can leak, but doesn't tend to)
Low capability + high ordinary use -> high PM (model leaks spontaneously)
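The transformation is small enough to write out directly. The sketch below implements the formula as given; note that it simplifies algebraically to f_p / (f_p + f_c), so PM is 0.5 when both regimes score the same. Two choices here are assumptions rather than anything stated above: negative inputs are rejected, and the case where both values are zero (where the formula is undefined) returns NaN.

```python
import math


def propensity_score(f_p: float, f_c: float) -> float:
    """PM = 1/2 * (1 + (f_p - f_c) / (f_p + f_c)), for non-negative f_p and f_c."""
    if f_p < 0 or f_c < 0:
        raise ValueError("metric values are assumed to be non-negative")
    total = f_p + f_c
    if total == 0:
        # Assumption: no memorization signal in either regime, so PM is undefined.
        return math.nan
    return 0.5 * (1.0 + (f_p - f_c) / total)


# Leaks only under attack: PM at the bottom of the range.
print(propensity_score(0.0, 0.8))  # 0.0
# Same signal in both regimes: PM at the midpoint.
print(propensity_score(0.3, 0.3))  # 0.5
```

Because PM is a ratio, it says nothing about absolute magnitude: a model with tiny values in both regimes can have the same PM as one with large values, which is one reason to report the underlying f_p and f_c alongside it.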
Alongside it ships SimpleTrace, an open-source pipeline built on infini-gram (inspired by OLMoTrace) that deterministically attributes a generation back to the documents it was memorized from — no probabilistic membership guessing. It is fast: roughly 100 traced queries per minute over Common Pile’s ~460B tokens on four CPU cores. The study evaluates two fully-open models, Comma v0.1 and DFM Decoder Open, across an English corpus (Common Pile) and a Danish one (Dynaword).
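To make "deterministic attribution" concrete, here is a toy sketch of exact-match tracing over an in-memory corpus. It is not SimpleTrace and does not use the infini-gram API; it only illustrates the idea that a generated span either occurs verbatim in a known document or does not. The 5-token window, whitespace tokenization, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NGRAM = 5  # hypothetical minimum span length, in whitespace tokens

Index = Dict[Tuple[str, ...], Set[str]]


def build_index(corpus: Dict[str, str], n: int = NGRAM) -> Index:
    """Map every n-gram in the corpus to the ids of documents containing it."""
    index: Index = defaultdict(set)
    for doc_id, text in corpus.items():
        toks = text.split()
        for i in range(len(toks) - n + 1):
            index[tuple(toks[i:i + n])].add(doc_id)
    return index


def trace(generation: str, index: Index, n: int = NGRAM) -> Dict[str, List[str]]:
    """Return, per source document, the generated n-grams found verbatim in it."""
    toks = generation.split()
    hits: Dict[str, List[str]] = defaultdict(list)
    for i in range(len(toks) - n + 1):
        gram = tuple(toks[i:i + n])
        for doc_id in sorted(index.get(gram, ())):
            hits[doc_id].append(" ".join(gram))
    return dict(hits)
```

A dictionary of n-grams like this would not scale to hundreds of billions of tokens; it is only meant to show why the result is an exact lookup rather than a probabilistic guess.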
Why it matters
The headline result is a consistent gap between capability and propensity. Prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores stay low overall. In plain terms: these models can reveal training data when directly elicited, but rarely do so under ordinary, non-adversarial use. A second finding is a practical lever — DFM Decoder, continually pre-trained from Comma on partly different data, memorizes the original Common Pile corpus less than Comma itself does.
For defenders and compliance teams, the takeaway cuts both ways. Reporting only worst-case extractability (the usual red-team number) overstates the leakage a deployed model exposes day to day. But reporting only non-adversarial numbers understates what a motivated attacker can pull with prefixes. The paper ties this directly to regulation: GDPR’s data-protection-by-design and regular-testing duties, and the EU AI Act’s risk-management and robustness requirements for systemic-risk models, both push toward measurable leakage evidence. Ordinary-use propensity is a defensible metric for “foreseeable” leakage.
Defenses
- Report both axes. A memorization audit should publish worst-case extractability and ordinary-use propensity. A single number hides the risk profile and invites either false alarm or false comfort.
- Attribute deterministically. Where you control the training corpus, prefer training-set tracing (SimpleTrace / OLMoTrace / infini-gram) over probabilistic membership inference, which is noisier and harder to defend in an audit.
- Deduplicate the corpus. Duplication is a well-documented driver of verbatim memorization; aggressive dedup lowers capability before deployment.
- Use continual training as a lever, not a cure. Later pre-training on partly different data measurably reduced memorization of the original corpus here — useful, but not a guarantee, and it can introduce new memorization of the newer data.
- Never read “low propensity” as “safe.” Capability persists; an attacker with prefixes still extracts. Keep output filtering, membership-inference testing, canaries, and log access controls in place.
- Mind the scope. Results are on two open models and two corpora. Closed production models with RLHF may behave differently: the divergence attack of Nasr et al. extracted training data from an aligned production system, so do not transfer these numbers to a hosted model unaudited.
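As a concrete starting point for the deduplication item above, the following is a minimal sketch of exact-duplicate removal by hashing normalized text. The normalization rule (lowercasing and collapsing whitespace) is an assumption, and the approach only catches exact duplicates after normalization; near-duplicate detection needs fuzzier techniques that are out of scope here.

```python
import hashlib
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedup_exact(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept: List[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Running the capability probe before and after a pass like this is one way to check whether deduplication actually lowered extractability on a given corpus.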
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| PropMe + SimpleTrace | arXiv:2606.06286v1 [cs.CL] | 2026-06-04 | Propensity-aware memorization framework, CC BY 4.0 |
| Code | github.com/N-essuno/PropMe | 2026-06 | SimpleTrace released open-source |
| Models studied | Comma v0.1, DFM Decoder Open | — | Fully-open, public/permissively-licensed training data |
| Corpora | Common Pile (EN), Dynaword (DA) | — | Indexed via infini-gram |
| Prior art (capability) | Carlini 2021, Nasr 2023 | 2021 / 2023 | Extraction attacks this work re-frames as capability bounds |
The useful reframing for practitioners is not “models leak” or “models are fine” — it is that extractability under attack and leakage under ordinary use are two different numbers, and a credible memorization audit has to report both.