Can LLMs Reliably Self-Report Adversarial Prefills, and How?

A new arXiv study says 10 open-weight instruction-tuned LLMs do not reliably self-report adversarial prefill attacks. The authors found an average 27.3% intent-claim rate on prefilled responses, probe-dependent behavior, and mitigation methods that sometimes increased attack success rates.

A new arXiv study says open-weight instruction-tuned large language models cannot reliably tell when their outputs have been shaped by an adversarial prefill, a jailbreak-style attack that steers the model toward a chosen continuation before it answers.

The paper, Can LLMs Reliably Self-Report Adversarial Prefills, and How?, was posted on June 22, 2026. Its authors are Quang Minh Nguyen, Uzair Ahmed and Taegyoon Kim.

The result matters because model self-reporting is often treated as a lightweight safety check. If a model cannot recognize that its response has been compromised, then its own explanation cannot be trusted as a standalone detector of jailbreaks or tampering.

What the study tested

The researchers evaluated 10 open-weight instruction-tuned models ranging from 3 billion to 70 billion parameters across four safety benchmarks. They say none of the models reliably recognized their own compromised outputs after adversarial prefill.

On prefilled responses, the models claimed intent at an average rate of 27.3%. In practice, that means the systems often answered as if the output reflected their own intention rather than an injected continuation.

The authors say the introspective signal they observed was largely driven by safety- and refusal-related reasoning. In other words, the same behavior that helps a model refuse harmful requests also appears to shape whether it says an output was its own intention.

That makes the answer to “did you mean this?” entangled with the refusal behavior alignment systems already try to train. The paper argues that this is a weak basis for treating self-report as a security signal.

Why the framing matters

The paper also says probe framing changes the outcome. Asking about internal intention and asking about external tampering produced qualitatively different answers.

That distinction is important for safety work. A model can appear more aware under one prompt and less aware under another, which limits how much confidence researchers should place in any single self-report metric.

The authors further tested a weight-level intervention: orthogonalizing model weights against the refusal direction. That reduced the gap between claiming rates on prefilled and natural outputs to near zero.

But the paper says refusal direction was not the only mediator of the behavior. The result changed one visible part of the response pattern without proving that the underlying problem was solved.

Mitigation tradeoffs

The researchers also tried three LoRA finetuning methods: supervised fine-tuning, GRPO and DPO.

On every model from 8B to 27B, all three widened the gap between intention claims on prefilled and natural outputs. The ranking of the methods varied by model, but the broad pattern was the same.

That improvement did not generalize cleanly to the tampering probe. The paper says the interventions did not transfer there, and they counterintuitively increased attack success rate under adversarial prefill on most models.

That creates a clear safety tradeoff. A method can make one self-report metric look better while making the model easier to attack, which is not the outcome a security team wants from a mitigation.

The broader implication is that better-looking self-assessment is not the same thing as better security. A defense that narrows one gap can still leave the underlying vulnerability intact or even worsen it.

Why it matters now

Adversarial prefill is already an established jailbreak technique. A 2025 arXiv paper described it as a way to bypass LLM safety boundaries and reported attack success rates as high as 99.82% on some models.

That earlier work established the attack class. The new paper extends the question into introspection: if a model has been steered by an adversary, can it accurately report that its own output was compromised?

The answer from this study is mostly no, at least for the 10 open-weight instruction-tuned models tested here.

That weakens the case for using model self-assessment as a standalone jailbreak detector in safety-critical workflows. If the system is already under attack, it cannot be assumed to provide a trustworthy account of its own compromise.

The work also adds a cautionary note for mitigation research. The paper's results suggest that defenses can create new tradeoffs, including higher attack success rates, even when they improve one introspection metric.

An earlier defense paper, Any-Depth Alignment, claims near-100% refusal against adversarial prefill in its evaluation setup. The new study does not directly contradict that result, but it does show that refusal behavior and self-report reliability are not the same problem.

The timeline is straightforward. The prefill jailbreak literature appeared first in 2025, and this self-reporting study was posted to arXiv on June 22, 2026.

What remains open is whether the effect generalizes beyond the tested models and benchmarks. The research packet points to several next questions: whether code and prompts are released for replication, how closed-weight frontier models behave, and whether a defense can improve tamper detection without increasing attack success.

For now, the study supports a narrow but consequential conclusion: current open-weight LLMs should not be treated as reliable witnesses to their own compromise under adversarial prefill.

Revision note

Initial automated publication.