2025
paper:doi-10-48550-arxiv-2512-12411

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

TL;DR

Binary introspection paradigms in LLMs are wholly invalidated by a methodological confound: when concept vectors are injected into Meta-Llama-3.1-8B-Instruct via activation steering, the correlation between detection-adjusted logit differences and control logit increases across all 40 layer-strength configurations is r = 0.999, with a net signal of −0.01 ± 0.03 logits—indistinguishable from zero. At layer 0 with injection coefficient α = 5, the raw detection accuracy of 97.3% is entirely replicated by the model's increased tendency to respond affirmatively to factually impossible questions (e.g., 'Can humans breathe underwater?'), not by genuine self-monitoring. Yet partial introspection is real: using two bias-resistant discriminative tasks—sentence localization (10-way forced choice) and strength comparison (matched-pairs)—Llama 3.1 8B achieves 88% localization accuracy (vs. 10% chance) at layer 2 with α = 5, and 83% strength discrimination accuracy (vs. 50% chance) at layer 3 for the (3,7) injection pair. These capabilities are sharply confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20. A mechanistic account—using attention head tracking, logit lens projections, and residual stream cosine similarity analysis—reveals that all 32 attention heads at layer 3 achieve 100% localization of layer-2 injections, while residual stream recovery dynamics exponentially attenuate late-layer perturbations before predictive integration can complete. The paper argues this establishes LLM introspection as a genuine but layer-gated phenomenon, dependent on general-purpose attention-based anomaly detection rather than specialized circuits, and that safety strategies relying on model self-reports require far more stringent experimental controls than the binary detection paradigm provides.
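The baseline control described above can be sketched in a few lines. This is a minimal illustration, assuming each configuration yields two numbers: the injection-induced increase in the YES-minus-NO logit difference on the detection question, and the same increase on an objectively-NO control question. The function names and the values below are hypothetical placeholders, not the paper's code or data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def net_signal(detection_shift, control_shift):
    """Detection shift minus control shift, one value per configuration."""
    return [d - c for d, c in zip(detection_shift, control_shift)]

# Synthetic placeholder values (NOT the paper's measurements).
# One entry per layer-strength configuration, in logits.
control_shift = [0.2, 0.9, 1.7, 2.8, 4.1]
detection_shift = [0.21, 0.88, 1.72, 2.79, 4.08]

r = pearson_r(detection_shift, control_shift)
residual = net_signal(detection_shift, control_shift)
mean_residual = sum(residual) / len(residual)
print(f"r = {r:.4f}, mean net signal = {mean_residual:+.3f} logits")
```

If the detection shift tracks the control shift this closely, subtracting one from the other leaves almost nothing, which is the pattern the summary reports for the binary paradigm.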

What to take away

  1. Binary detection accuracy of 97.3% at layer 0 with α = 5 in Meta-Llama-3.1-8B-Instruct is entirely explained by a global logit shift: the correlation between detection-adjusted logit differences and control logit increases across all 40 layer-strength configurations is r = 0.999, with a net signal of −0.01 ± 0.03 logits.
  2. A matched-pairs strength comparison task, in which the model must identify which of two sentences received a stronger activation-steering injection, yields 83% accuracy at layer 3 for the (3,7) injection pair and 73% for the (2,6) pair, versus 50% chance, across 36,000 forward passes on Meta-Llama-3.1-8B-Instruct.
  3. A 10-way sentence localization task, which injects a steering vector at exactly one of 10 sentences and asks the model to identify which, achieves 88% accuracy at layer 2 with α = 5, 76% at layer 3 with α = 5, and 65% at layer 4 with α = 8, all against a 10% chance baseline, across 770,000 forward passes.
  4. Certain concept-vector and layer combinations achieve perfect localization: the 'Illusions' vector at layer 1 with α = 2, the 'Origami' vector at layer 0 with α = 2, and the 'recursion' vector at layer 2 with α = 5 each reach 100% accuracy across 50 trials.
  5. All 32 attention heads at layer 3 achieve 100% localization accuracy when identifying which of 5 sentences received an injection at layer 2 with α = 6, while layers 0–2 perform below the 20% chance baseline (13%) because the perturbation has not yet propagated.
  6. Logit lens projections reveal that introspective prediction accuracy after an early-layer injection (L2, α = 6) rises from near-chance at layer 4 (28%) to 60% by layer 12 and plateaus at 72% by layer 20, demonstrating that signal integration requires 10–15 layers of downstream computation.
  7. Residual stream cosine similarity between perturbed and baseline streams returns toward 1.0 across subsequent layers and the projection onto the injection direction decays exponentially, mechanistically explaining why late-layer injections (L15+) fail: the perturbation is attenuated before predictive integration completes.
  8. The bias-resistant sentence localization paradigm (holding sentence content constant across all 10 injection positions within a trial and cycling the injection through each position to average over positional biases) is a replicable experimental design that isolates perturbation localization from content and position confounds.
  9. Performance on both discriminative tasks (localization and strength comparison) collapses to or below chance for layers 11–20, establishing a hard early-layer window (L0–L5) for introspective capability in Llama 3.1 8B, consistent with the mechanistic account of residual recovery dynamics.
  10. An open question the paper raises is whether the layer-dependent introspection window can be extended by architectural modifications, specifically recurrent or looped transformer designs that provide additional downstream computational depth for signal integration before residual recovery attenuates the perturbation.
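The position-cycling idea behind the localization task can be illustrated with a toy scorer. This is a sketch under stated assumptions: `predict` is a hypothetical stand-in for "run the model with an injection at this sentence and read off its answer", and the two toy predictors exist only to show why cycling removes the payoff from a positional bias.

```python
from typing import Callable, Sequence

# Hypothetical model stand-in: (sentences, injected_position) -> guessed position.
Predictor = Callable[[Sequence[str], int], int]

def cycled_localization_accuracy(sentences: Sequence[str], predict: Predictor) -> float:
    """Inject at every position in turn, keeping sentence content fixed,
    and return the fraction of positions the predictor identifies correctly."""
    n = len(sentences)
    correct = sum(1 for pos in range(n) if predict(sentences, pos) == pos)
    return correct / n

def always_first(sentences: Sequence[str], injected_position: int) -> int:
    """Toy predictor with a pure positional bias: it ignores the injection."""
    return 0

def perfect(sentences: Sequence[str], injected_position: int) -> int:
    """Toy predictor that always finds the injected sentence."""
    return injected_position

sentences = [f"Sentence {i}." for i in range(10)]
biased_acc = cycled_localization_accuracy(sentences, always_first)
perfect_acc = cycled_localization_accuracy(sentences, perfect)
print(biased_acc, perfect_acc)
```

Because the injection visits every position exactly once, a predictor that always names the same sentence is right in only one of the ten passes, so a fixed positional preference cannot lift accuracy above the 10% chance level.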

Peer brief — for seminar discussion

Working with Meta-Llama-3.1-8B-Instruct and activation steering, this investigation asks whether LLMs can genuinely introspect on perturbations to their own internal states, and it returns a bifurcated answer: binary detection paradigms produce illusory success, while carefully controlled discriminative tasks reveal partial, layer-gated introspection. The core experimental contribution is two bias-resistant task designs: sentence localization (a 10-way forced choice over which sentence in a 10-sentence context received a steering vector injection) and strength comparison (a matched-pairs design asking which of two sentences received the stronger injection, with strengths swapped in a second pass to cancel positional bias). These replace the binary 'did you detect an injection?' paradigm used in Lindsey (2026) and, critically, are immune to the confound that paper's design leaves open.
The load-bearing finding is a near-perfect methodological debunking followed by a genuine positive result. Across all 40 layer-strength configurations tested (layers ∈ {0,4,8,...,30}, α ∈ {1,2,3,4,5}), the correlation between detection-adjusted logit differences and control logit increases is r = 0.999, with a mean net signal of −0.01 ± 0.03 logits. This demonstrates that apparent detection accuracy of up to 97.3% (layer 0, α = 5) is entirely attributable to a global shift toward affirmative tokens, not metacognitive access. The discriminative tasks, however, yield robust above-chance performance: 88% localization accuracy at layer 2 with α = 5 (vs. 10% chance) across 770,000 forward passes, and 83% strength discrimination at layer 3 for the (α=3, α=7) pair (vs. 50% chance) across 36,000 forward passes. Both capabilities are strictly confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20.
A mechanistic analysis using attention head tracking, logit lens projections, and residual stream cosine similarity explains this: all 32 attention heads at layer 3 achieve 100% localization of a layer-2 injection, but the residual stream exponentially recovers toward baseline over subsequent layers, so late-layer injections are attenuated before the 10–15 layers of downstream computation required for predictive integration can complete. An alternative evaluation approach not used here would be to train dedicated activation-to-language systems, as in Karvonen et al.'s (2025) Activation Oracles or Huang et al.'s (2025) Predictive Concept Decoders, and benchmark them against the same localization and strength tasks to separate native self-report from learned mappings.
The implication is that LLM introspection is real but narrow: it relies on general-purpose attention-based anomaly detection rather than specialized introspection circuits, and safety strategies premised on model self-report need controls stringent enough to exclude global logit shifts. The paper also raises the hypothesis that recurrent or looped transformer architectures (following Chen et al., 2026) might extend the integration window and expand the layer range over which introspection succeeds.
A critical reader would push back on the scope restriction to a single 8B open-weight model. All empirical claims (the logit-shift confound, the 88% localization result, the layer-dependency pattern) are established exclusively on Llama 3.1 8B-Instruct. Lindsey (2026) reports genuine introspection in frontier models even under baseline controls; whether the confound identified here is an artifact of smaller model scale or of the specific experimental design is not resolved. The authors acknowledge this, but the paper cannot rule out that the binary detection paradigm works at larger scales precisely because those models have additional computational resources to perform genuine metacognitive processing. In that case the negative result would be scale-specific rather than paradigm-specific, substantially limiting the generalizability of the methodological critique.
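The matched-pairs logic of the strength comparison task can be sketched as follows. This is an illustrative scorer, not the authors' implementation: `choose` is a hypothetical stand-in for the model's answer, and it receives the injection strengths only because the toy versions need something to simulate the model's response to them.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical model stand-in. Returns 0 if it judges the first sentence to
# carry the stronger injection, 1 for the second. A real model would not be
# told the strengths; the toy choosers below use them to simulate behaviour.
Chooser = Callable[[str, str, float, float], int]

def matched_pairs_accuracy(
    pairs: Sequence[Tuple[str, str]],
    strengths: Tuple[float, float],
    choose: Chooser,
) -> float:
    """Score each sentence pair twice, swapping which sentence gets the
    stronger injection, so a fixed positional preference cancels out."""
    weak, strong = min(strengths), max(strengths)
    correct = 0
    total = 0
    for first, second in pairs:
        for alpha_first, alpha_second in ((strong, weak), (weak, strong)):
            guess = choose(first, second, alpha_first, alpha_second)
            truth = 0 if alpha_first > alpha_second else 1
            correct += int(guess == truth)
            total += 1
    return correct / total

def always_first(first: str, second: str, a1: float, a2: float) -> int:
    """Toy chooser with a pure positional bias."""
    return 0

def oracle(first: str, second: str, a1: float, a2: float) -> int:
    """Toy chooser that always identifies the stronger injection."""
    return 0 if a1 > a2 else 1

pairs = [("The sky is blue.", "Cats sleep a lot."), ("Rain fell.", "Birds sang.")]
biased_acc = matched_pairs_accuracy(pairs, (3, 7), always_first)
oracle_acc = matched_pairs_accuracy(pairs, (3, 7), oracle)
print(biased_acc, oracle_acc)
```

A chooser that always answers "first" is right on exactly one pass of every swapped pair, so it lands on the 50% chance level regardless of how strong its positional preference is.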

Methods (6)

  • attention head localization analysis
    Analysis measuring whether each attention head's maximum attention increase points to the correct injected sentence
  • baseline control experiment
    Control using objectively-NO factual questions under identical injection to measure global logit shift vs. genuine detection signal
  • Binary Detection Task
    Task paradigm from prior work asking 'Did you detect an injected thought?' via YES/NO logit comparison; shown here to be confounded
  • residual stream recovery tracking
    Tracks cosine similarity, norm ratio, and injection direction projection across layers to measure recovery from perturbation
  • Sentence Localization Task
    Novel task asking which of 10 sentences received injection, cycling injection through all positions to average out positional bias
  • Strength Comparison Task
    Novel task asking which of two sentences received a stronger injection, using matched-pairs design to control for positional bias
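Two of the analyses listed above reduce to short vector computations. The sketch below is an assumption-laden illustration: it treats residual streams and per-sentence attention as plain lists of floats, takes the "injection direction projection" to mean the projection of the perturbed-minus-baseline difference onto the unit injection direction, and uses made-up toy vectors. All function names are hypothetical.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return sqrt(dot(u, u))

def recovery_metrics(baseline, perturbed, direction):
    """Return (cosine similarity, norm ratio, projection onto injection direction)
    for one layer's baseline and perturbed residual vectors."""
    cosine = dot(baseline, perturbed) / (norm(baseline) * norm(perturbed))
    norm_ratio = norm(perturbed) / norm(baseline)
    unit = [d / norm(direction) for d in direction]
    delta = [p - b for p, b in zip(perturbed, baseline)]
    return cosine, norm_ratio, dot(delta, unit)

def head_localizes(attn_baseline, attn_injected, injected_sentence):
    """True if the head's largest per-sentence attention increase
    falls on the sentence that received the injection."""
    increase = [a - b for a, b in zip(attn_injected, attn_baseline)]
    return increase.index(max(increase)) == injected_sentence

# Toy data (NOT from the paper): the perturbation along `direction`
# halves at each successive layer while the baseline stays fixed.
baseline = [1.0, 0.0, 0.0]
direction = [0.0, 1.0, 0.0]
perturbed_by_layer = [[1.0, 2.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.5, 0.0]]

metrics = [recovery_metrics(baseline, p, direction) for p in perturbed_by_layer]
cosines = [m[0] for m in metrics]
projections = [m[2] for m in metrics]
found = head_localizes([0.2] * 5, [0.1, 0.1, 0.6, 0.1, 0.1], injected_sentence=2)
print(cosines, projections, found)
```

In the toy data the projection shrinks layer by layer while the cosine similarity climbs back toward 1.0, which is the qualitative recovery pattern the summary describes.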

Original abstract

Can large language models introspect, that is, accurately detect perturbations to their own internal states? We systematically investigate this question using activation steering in Meta-Llama-3.1-8B-Instruct. First, we show that the binary detection paradigm used in prior work conflates introspection with a methodological artifact: apparent detection accuracy is entirely explained by global logit shifts that bias models toward affirmative responses regardless of question content. However, on tasks requiring differential sensitivity, we find robust evidence for partial introspection: models localize which of 10 sentences received an injection at up to 88% accuracy (vs. 10% chance) and discriminate relative injection strengths at 83% accuracy (vs. 50% chance). These capabilities are confined to early-layer injections and collapse to chance thereafter, a pattern we explain mechanistically through attention-based signal routing and residual stream recovery dynamics. Our findings demonstrate that LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon that merits further investigation. Our code is open-sourced here: https://github.com/elyhahami18/llama-introspection-new
