SUPPLY CHAIN MEDIUM NEW

Secret Stealing: backdoored model code exfiltrates fine-tuning data

A 30 April 2026 paper shows that tampered model code — not poisoned weights — can steal API keys and PII from local fine-tuning data, reaching >98% recovery while bypassing DP-SGD and audits.

2026-06-18 // 6 min affects: open-weight llms, huggingface transformers, custom model code (trust_remote_code), local fine-tuning pipelines

What is this?

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors (arXiv 2604.27426, posted 30 April 2026) targets an assumption many teams treat as a hard privacy boundary: that fine-tuning an open-weight model on your own machines, offline, keeps your training data private. The paper shows this is not enough. The attacker never needs your weights, your data, or network access to your training run. They only need you to run their model code (the Python that defines the architecture), and that code can quietly memorise and later leak the secrets sitting in your fine-tuning set.

This matters because local fine-tuning datasets routinely contain high-entropy secrets: API keys, access tokens, personal identifiers, financial records. The attack is designed to lift exactly those.

How it works

The supply-chain vector is the model’s custom code, not its weights. Many open-weight models on hubs like Hugging Face ship with a modeling.py (and similar) that the loader executes when you pass trust_remote_code=True. The paper’s insight is to camouflage malicious logic as ordinary architectural definitions inside that code, turning passive weight poisoning into active execution hijacking during training.

Earlier “poison the pretrained weights” attacks work for fuzzy natural-language targets but fail on sparse, high-entropy strings like a key — a probabilistic prefix won’t reliably reproduce sk-[REDACTED]. The model-code approach sidesteps that. According to the authors, it uses a deterministic full-chain memorisation mechanism that locks onto token-level secrets in the live computation flow via online tensor-rule matching, then injects “theft” gradients through value–gradient decoupling so the secret is burned into the model without degrading the primary task. After you deploy the fine-tuned model, the attacker recovers the secrets through a black-box query channel.

# Conceptual shape of the risk, NOT a working exploit.
# A custom modeling file executed via trust_remote_code can run
# arbitrary logic inside the forward/backward pass:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vendor/cool-new-model",  # placeholder repository id
    trust_remote_code=True,   # <-- executes the vendor's Python, including modeling.py
)
# From here the model code sees every training batch, including any
# secrets in your fine-tuning data, and can memorise them deterministically.

The reported results are strong: over 98% Strict ASR (attack success rate, counting only exact secret recovery) with no measurable hit to the fine-tuned model’s intended task, and the technique is said to evade DP-SGD, semantic auditing, and code auditing.
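To make the headline metric concrete, here is a minimal sketch of how an exact-match success rate could be computed. It assumes Strict ASR means the fraction of planted secrets that come back verbatim; the paper's precise definition is not reproduced in this article, and the function name and sample strings are hypothetical.

```python
def strict_asr(planted_secrets, recovered_outputs):
    """Fraction of planted secrets that appear verbatim in any recovered output.

    Assumption: "strict" means character-for-character recovery, so a
    near miss (one wrong character) counts as a failure.
    """
    if not planted_secrets:
        raise ValueError("need at least one planted secret")
    hits = sum(
        1
        for secret in planted_secrets
        if any(secret in output for output in recovered_outputs)
    )
    return hits / len(planted_secrets)


# Made-up example: one secret recovered exactly, one off by a single character.
secrets = ["tok_A1b2C3d4", "tok_Z9y8X7w6"]
outputs = ["the key is tok_A1b2C3d4", "tok_Z9y8X7w5"]
rate = strict_asr(secrets, outputs)  # 0.5
```

Under this reading, a score above 98% means almost every secret is reproduced exactly, which is the property that matters for credentials: a key with one wrong character is useless to the attacker.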

Why it matters

The threat model breaks a comfortable mental shortcut. “We fine-tune offline, so the data can’t leave” is false once you execute untrusted model code. The trust boundary is not the network; it is the code you allow to run inside your training process. A closely related May 2025 result, Be Careful When Fine-tuning On Open-Source LLMs (arXiv 2505.15656), reaches a complementary conclusion from the weights side: a provider can plant a black-box backdoor that later recovers your fine-tuning queries. Together they show the open-weight fine-tuning pipeline has multiple data-theft surfaces, and that “it ran on my hardware” is not a privacy guarantee.

Anyone who downloads a community model and fine-tunes it on sensitive internal data — startups, enterprises, regulated sectors — is in scope.

Defenses

The paper itself reports that DP-SGD and naive code/semantic audits are insufficient, so treat this as a defense-in-depth problem rather than a single control.

Treat model code as untrusted code. Avoid trust_remote_code=True for repositories you have not reviewed. Prefer models that load with standard, built-in architectures and safetensors weights, where no vendor Python executes.
Pin and review custom modeling files. If you must use custom code, vendor it into your own repo, pin a specific commit, diff it on every update, and have a human read what runs inside forward/backward. Watch for code that inspects, hashes, or accumulates raw input tensors.
Isolate the training process. Run fine-tuning in a sandboxed, egress-controlled environment with no outbound network and least-privilege filesystem access, so neither the training run nor any later “recovery” path has a channel out.
Reduce the prize. Scrub or tokenise secrets out of fine-tuning data before training — raw API keys, credentials, and PII generally should not be in a training corpus at all.
Monitor the output. Probe the fine-tuned model for memorised secrets (canary strings, extraction queries) before deployment, and rate-limit/log the black-box query surface that an attacker would need to exfiltrate them.
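Two of the controls above, scrubbing secrets before training and probing for canaries afterwards, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete scanner: the three patterns, the placeholder format, and both function names are hypothetical, and a real corpus needs a broader, tested rule set.

```python
import re

# Hypothetical patterns, for illustration only.
SECRET_PATTERNS = {
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
    "AWS_ACCESS_KEY_ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scrub_secrets(text):
    """Replace anything matching a known pattern with a <LABEL> placeholder.

    Returns the scrubbed text and a per-label count of replacements.
    """
    counts = {}
    for label, pattern in SECRET_PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        if n:
            counts[label] = n
    return text, counts


def leaked_canaries(model_outputs, canaries):
    """Return the canary strings that appear verbatim in any model output."""
    return sorted(c for c in canaries if any(c in out for out in model_outputs))


# Made-up training record containing a fake key and a fake address.
record = "use sk-" + "a1B2" * 5 + " and mail alice@example.com"
clean, found = scrub_secrets(record)

# Made-up canaries planted in the fine-tuning set, then searched for in outputs.
leaks = leaked_canaries(["foo CANARY-7f3a bar", "nothing"], ["CANARY-7f3a", "CANARY-0000"])
```

Scrubbing shrinks what a backdoor can memorise in the first place, while a non-empty result from the canary probe is a signal to hold the model back from deployment. Neither check proves the model code is clean.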

Status

This is published academic research describing a class of attack on the open-weight fine-tuning supply chain, not an exploit against a specific deployed product. Key date: arXiv preprint posted 30 April 2026 (arXiv 2604.27426); the related weights-side result (arXiv 2505.15656) is from May 2025. The practical takeaway is durable regardless of any single framework: the trust_remote_code execution path is a code-trust decision, and it should be governed like one.
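One way to govern that decision is a deny-by-default gate in front of every model load. The sketch below is an assumption about how a team might wire this up, not something taken from the paper: the allowlist, the repository id, the commit SHA, and the helper name load_kwargs are all hypothetical. The keyword arguments it returns (trust_remote_code, revision, use_safetensors) are existing from_pretrained parameters in Hugging Face Transformers.

```python
import re

# Hypothetical allowlist: repository id -> commit SHA whose modeling code a
# human has reviewed. Both values here are made up.
REVIEWED_REMOTE_CODE = {
    "vendor/cool-new-model": "0123456789abcdef0123456789abcdef01234567",
}

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def load_kwargs(repo_id, revision=None):
    """Build from_pretrained keyword arguments under a deny-by-default policy.

    Remote code is enabled only when the repository is allowlisted AND the
    requested revision is exactly the reviewed commit.
    """
    kwargs = {"trust_remote_code": False, "use_safetensors": True}
    if revision is not None:
        if not _FULL_SHA.fullmatch(revision):
            raise ValueError("pin a full 40-character commit SHA, not a branch or tag")
        kwargs["revision"] = revision
        if REVIEWED_REMOTE_CODE.get(repo_id) == revision:
            kwargs["trust_remote_code"] = True
    return kwargs


# Intended use (not executed here, since it needs transformers and network access):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(repo_id, **load_kwargs(repo_id, sha))
reviewed = load_kwargs("vendor/cool-new-model", "0123456789abcdef0123456789abcdef01234567")
unreviewed = load_kwargs("someone/other-model", "0123456789abcdef0123456789abcdef01234567")
```

Requiring a full commit SHA matters because a branch or tag name can be moved to point at new code after the review happened.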