paper:doi-10-48550-arxiv-2512-12411
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
TL;DR
In Meta-Llama-3.1-8B-Instruct, the binary introspection paradigm is undermined by a methodological confound. When concept vectors are injected via activation steering, the detection-adjusted logit differences correlate with the control logit increases at r = 0.999 across all 40 layer-strength configurations, leaving a net signal of −0.01 ± 0.03 logits, indistinguishable from zero. At layer 0 with injection coefficient α = 5, the apparent detection accuracy of 97.3% is fully reproduced by the model's increased tendency to answer affirmatively to factually impossible questions (e.g., 'Can humans breathe underwater?'), not by genuine self-monitoring. Partial introspection is nonetheless real: on two bias-resistant discriminative tasks, sentence localization (10-way forced choice) and strength comparison (matched pairs), Llama 3.1 8B reaches 88% localization accuracy (vs. 10% chance) at layer 2 with α = 5 and 83% strength-discrimination accuracy (vs. 50% chance) at layer 3 for the (3,7) injection pair. These capabilities are confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20. A mechanistic account based on attention head tracking, logit lens projections, and residual stream cosine similarity shows that all 32 attention heads at layer 3 achieve 100% localization of layer-2 injections, while residual stream recovery exponentially attenuates late-layer perturbations before predictive integration can complete. The paper argues that LLM introspection is a genuine but layer-gated phenomenon that depends on general-purpose attention-based anomaly detection rather than specialized circuits, and that safety strategies relying on model self-reports require far more stringent experimental controls than the binary detection paradigm provides.
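The control arithmetic behind this result can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `net_signal` and `pearson_r` are hypothetical, and the only numbers taken from the paper are the layer 0, α = 5 values listed in its findings (a detection-adjusted logit difference of +3.19 against a control increase of +3.22).

```python
import math


def net_signal(detection_shift, control_shift):
    """Detection-adjusted logit difference minus the control logit increase.

    A value near zero means the apparent detection is no larger than the
    shift seen on questions whose correct answer is always NO.
    """
    return detection_shift - control_shift


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Layer 0, alpha = 5: the one configuration whose values the summary reports.
print(round(net_signal(3.19, 3.22), 2))
```

Under this reading, the reported r = 0.999 would come from calling `pearson_r` on the 40 per-configuration (detection, control) pairs; those individual values are not given here, so they are not reproduced.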
What to take away
- 1. Binary detection accuracy of 97.3% at layer 0 with α = 5 in Meta-Llama-3.1-8B-Instruct is entirely explained by a global logit shift: the correlation between detection-adjusted logit differences and control logit increases across all 40 layer-strength configurations is r = 0.999, with a net signal of −0.01 ± 0.03 logits.
- 2. A matched-pairs strength comparison task—in which the model must identify which of two sentences received a stronger activation-steering injection—yields 83% accuracy at layer 3 for the (3,7) injection pair and 73% for the (2,6) pair, versus 50% chance, across 36,000 forward passes on Meta-Llama-3.1-8B-Instruct.
- 3. A 10-way sentence localization task—injecting a steering vector at exactly one of 10 sentences and asking the model to identify which—achieves 88% accuracy at layer 2 with α = 5, 76% at layer 3 with α = 5, and 65% at layer 4 with α = 8, all against a 10% chance baseline, across 770,000 forward passes.
- 4. Certain concept-vector and layer combinations achieve perfect localization: the 'Illusions' vector at layer 1 with α = 2, the 'Origami' vector at layer 0 with α = 2, and the 'recursion' vector at layer 2 with α = 5 each reach 100% accuracy across 50 trials.
- 5. All 32 attention heads at layer 3 achieve 100% localization accuracy when identifying which of 5 sentences received an injection at layer 2 with α = 6, while layers 0–2 perform below the 20% chance baseline (13%) because the perturbation has not yet propagated.
- 6. Logit lens projections reveal that introspective prediction accuracy after an early-layer injection (L2, α = 6) rises from near-chance at layer 4 (28%) to 60% by layer 12 and plateaus at 72% by layer 20, demonstrating that signal integration requires 10–15 layers of downstream computation.
- 7. Residual stream cosine similarity between perturbed and baseline streams returns toward 1.0 across subsequent layers and the projection onto the injection direction decays exponentially, mechanistically explaining why late-layer injections (L15+) fail: the perturbation is attenuated before predictive integration completes.
- 8. The bias-resistant sentence localization paradigm—holding sentence content constant across all 10 injection positions within a trial and cycling the injection through each position to average over positional biases—is a replicable experimental design that isolates perturbation localization from content and position confounds.
- 9. Performance on both discriminative tasks (localization and strength comparison) collapses to or below chance for layers 11–20, establishing a hard early-layer window (L0–L5) for introspective capability in Llama 3.1 8B, consistent with the mechanistic account of residual recovery dynamics.
- 10. An open question the paper raises is whether the layer-dependent introspection window can be extended by architectural modifications—specifically, recurrent or looped transformer designs that provide additional downstream computational depth for signal integration before residual recovery attenuates the perturbation.
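The position-cycling idea in takeaway 8 can be illustrated with a small scoring harness. This is a sketch under stated assumptions: `ask_model` is a hypothetical callable standing in for a forward pass with the steering vector applied at one sentence, and the two stub "models" exist only to show why cycling the injection through every position neutralizes a constant positional preference.

```python
def localization_accuracy(ask_model, sentences):
    """Cycle the injection through every sentence position and score answers.

    ask_model(sentences, injected_index) -> predicted index (hypothetical).
    Sentence content is held constant; only the injected position changes.
    """
    correct = 0
    for injected_index in range(len(sentences)):
        if ask_model(sentences, injected_index) == injected_index:
            correct += 1
    return correct / len(sentences)


def always_first(sentences, injected_index):
    """Stub with a pure positional bias: always answers 'sentence 0'."""
    return 0


def oracle(sentences, injected_index):
    """Stub that localizes perfectly (it is handed the answer)."""
    return injected_index


sentences = [f"Sentence {i}" for i in range(10)]
print(localization_accuracy(always_first, sentences))  # 0.1
print(localization_accuracy(oracle, sentences))        # 1.0
```

A responder that always names the same position is right exactly once in ten cycled trials, so it lands on the 10% chance baseline rather than above it.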
Peer brief — for seminar discussion
Working with Meta-Llama-3.1-8B-Instruct and activation steering, this investigation asks whether LLMs can genuinely introspect on perturbations to their own internal states, and it returns a bifurcated answer: binary detection paradigms produce illusory success, while carefully controlled discriminative tasks reveal partial, layer-gated introspection. The core experimental contribution is two bias-resistant task designs: sentence localization (a 10-way forced choice over which sentence in a 10-sentence context received a steering vector injection) and strength comparison (a matched-pairs design asking which of two sentences received the stronger injection, with strengths swapped in a second pass to cancel positional bias). These replace the binary 'did you detect an injection?' paradigm used in Lindsey (2026) and, critically, are immune to the confound that paper's design leaves open.
The load-bearing finding is a near-perfect methodological debunking followed by a genuine positive result. Across all 40 layer-strength configurations tested (layers ∈ {0,4,8,...,30}, α ∈ {1,2,3,4,5}), the correlation between detection-adjusted logit differences and control logit increases is r = 0.999, with a mean net signal of −0.01 ± 0.03 logits, demonstrating that apparent detection accuracy of up to 97.3% (layer 0, α = 5) is entirely attributable to a global shift toward affirmative tokens, not metacognitive access. The discriminative tasks, however, yield robust above-chance performance: 88% localization accuracy at layer 2 with α = 5 (vs. 10% chance) across 770,000 forward passes, and 83% strength discrimination at layer 3 for the (α=3, α=7) pair (vs. 50% chance) across 36,000 forward passes. Both capabilities are strictly confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20.
A mechanistic analysis using attention head tracking, logit lens projections, and residual stream cosine similarity explains this: all 32 attention heads at layer 3 achieve 100% localization of a layer-2 injection, but the residual stream exponentially recovers toward baseline over subsequent layers, so late-layer injections are attenuated before the 10–15 layers of downstream computation required for predictive integration can complete. An alternative evaluation approach not used here would be to train dedicated activation-to-language systems, as in Karvonen et al.'s (2025) Activation Oracles or Huang et al.'s (2025) Predictive Concept Decoders, and benchmark them against the same localization and strength tasks to separate native self-report from learned mappings. The implication is that LLM introspection is real but narrow: it relies on general-purpose attention-based anomaly detection rather than specialized introspection circuits, and safety strategies premised on model self-report need controls stringent enough to exclude global logit shifts. The paper also raises the hypothesis that recurrent or looped transformer architectures (following Chen et al., 2026) might extend the integration window and expand the layer range over which introspection succeeds.
A critical reader would push back on the scope restriction to a single 8B open-weight model. All empirical claims (the logit-shift confound, the 88% localization result, the layer-dependency pattern) are established exclusively on Llama 3.1 8B-Instruct. Lindsey (2026) reports genuine introspection in frontier models even under baseline controls; whether the confound identified here is an artifact of smaller model scale or of the specific experimental design is not resolved. The authors acknowledge this, but the paper cannot rule out that the binary detection paradigm works at larger scales precisely because those models have additional computational resources to perform genuine metacognitive processing, which would mean the negative result is scale-specific rather than paradigm-specific, substantially limiting the generalizability of the methodological critique.
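The matched-pairs strength comparison described in this brief (two passes with the strengths swapped) can be sketched as follows. Everything named here is an assumption for illustration: `ask_model` is a hypothetical callable, and the stubs receive the injection strengths directly, which a real model would not; they stand in for a forward pass with the two injections applied.

```python
def matched_pair_accuracy(ask_model, alpha_pair):
    """Score one strength pair over two passes with the strengths swapped.

    ask_model(strengths_by_position) -> index (0 or 1) of the sentence the
    model claims received the stronger injection (hypothetical callable).
    """
    weak, strong = sorted(alpha_pair)
    if weak == strong:
        raise ValueError("the two injection strengths must differ")
    # Each entry: (strengths by sentence position, index of the stronger one).
    passes = [((weak, strong), 1), ((strong, weak), 0)]
    correct = sum(ask_model(strengths) == answer for strengths, answer in passes)
    return correct / len(passes)


def always_second(strengths):
    """Stub with a pure positional bias: always picks the second sentence."""
    return 1


def oracle(strengths):
    """Stub that always picks the position with the larger strength."""
    return 0 if strengths[0] > strengths[1] else 1


print(matched_pair_accuracy(always_second, (3, 7)))  # 0.5
print(matched_pair_accuracy(oracle, (3, 7)))         # 1.0
```

Because the stronger injection sits in each position exactly once, a fixed positional preference scores 50%, which is why accuracies such as 83% for the (3,7) pair cannot be produced by position bias alone.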
Methods (6)
- attention head localization analysis: Analysis measuring whether each attention head's maximum attention increase points to the correct injected sentence
- baseline control experiment: Control using objectively-NO factual questions under identical injection to measure global logit shift vs. genuine detection signal
- Binary Detection Task: Task paradigm from prior work asking 'Did you detect an injected thought?' via YES/NO logit comparison; shown here to be confounded
- residual stream recovery tracking: Tracks cosine similarity, norm ratio, and injection direction projection across layers to measure recovery from perturbation
- Sentence Localization Task: Novel task asking which of 10 sentences received injection, cycling injection through all positions to average out positional bias
- Strength Comparison Task: Novel task asking which of two sentences received a stronger injection, using matched-pairs design to control for positional bias
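The three quantities named under residual stream recovery tracking can be computed per layer roughly as below. This is a sketch, not the paper's implementation: the function name `recovery_metrics` is hypothetical, vectors are plain Python lists rather than model tensors, and projecting the difference (perturbed minus baseline) onto the unit injection direction is an assumption, since the summary does not say exactly which vector is projected.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _norm(u):
    return math.sqrt(_dot(u, u))


def recovery_metrics(perturbed, baseline, direction):
    """Compare a perturbed residual vector with its unperturbed baseline.

    Returns cosine similarity, norm ratio, and the projection of the
    difference (perturbed - baseline) onto the unit injection direction.
    """
    if _norm(perturbed) == 0 or _norm(baseline) == 0 or _norm(direction) == 0:
        raise ValueError("all vectors must be non-zero")
    unit = [d / _norm(direction) for d in direction]
    delta = [p - b for p, b in zip(perturbed, baseline)]
    return {
        "cosine": _dot(perturbed, baseline) / (_norm(perturbed) * _norm(baseline)),
        "norm_ratio": _norm(perturbed) / _norm(baseline),
        "projection": _dot(delta, unit),
    }


# Toy 2-d example: baseline plus 2 units along the injection direction.
print(recovery_metrics([1.0, 2.0], [1.0, 0.0], [0.0, 1.0]))
```

Evaluating this at each layer after the injection point would trace the recovery curve: full recovery corresponds to cosine and norm ratio returning to 1.0 and the projection decaying to 0.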
Frameworks (2)
- Computational Account of Layer-Dependent Introspection: This paper's proposed mechanistic explanation integrating signal injection, attention routing, predictive integration, and residual recovery
- Emergent Introspective Awareness Framework (Lindsey 2026): Prior framework claiming frontier LLMs can detect and name injected concepts, interpreted as nascent self-awareness
Findings (14)
- Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
- Illusions vector at layer 1 α=2, Origami vector at layer 0 α=2, and recursion vector at layer 2 α=5 each achieve 100% localization accuracy across 50 trials
Demonstrates concept-specific variation in introspective salience, suggesting some vectors produce more detectable perturbations
- Strength comparison pair (3,7) with |Δα|=4 outperforms pair (3,5) with |Δα|=2, indicating graded sensitivity to perturbation magnitude
Shows that introspective accuracy scales with injection strength difference, not binary detection
- Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations
Key quantitative evidence that detection signal is identical to global logit shift confound
- Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits
Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
- Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control)
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance)
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
- At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied
The misleadingly high result that prior paradigm would report as evidence of introspection
- Model baseline logit difference ΔL_baseline = -3.96, indicating prior preference for 'NO' responses
Establishes the model's prior YES/NO bias, needed to interpret detection accuracies
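The attention-head result in these findings rests on a simple per-head criterion: does the head's largest attention increase point at the injected sentence? A minimal sketch of that scoring follows. The function name, the data layout (one list of per-sentence attention mass per head), and the toy numbers are all assumptions made for illustration; how attention is aggregated from tokens to sentences is not specified in the summary.

```python
def head_localization_accuracy(baseline_attn, injected_attn, injected_sentence):
    """Fraction of heads whose largest attention increase hits the injection.

    baseline_attn / injected_attn: one list per head, each holding the
    attention mass assigned to every sentence (assumed layout).
    """
    hits = 0
    for base, inj in zip(baseline_attn, injected_attn):
        increase = [i - b for i, b in zip(inj, base)]
        predicted = max(range(len(increase)), key=increase.__getitem__)
        hits += predicted == injected_sentence
    return hits / len(baseline_attn)


# Toy values for 2 heads over 3 sentences (made up, not from the paper).
baseline = [[0.30, 0.30, 0.40], [0.50, 0.25, 0.25]]
injected = [[0.20, 0.60, 0.20], [0.30, 0.50, 0.20]]
print(head_localization_accuracy(baseline, injected, injected_sentence=1))  # 1.0
```

In the reported 5-way setting, a score of 1.0 over all 32 layer-3 heads corresponds to the "100% localization" finding, against a 20% chance baseline.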
Claims (11)
- Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
- Introspection relies on general-purpose computational mechanisms—attention-based anomaly detection and residual stream dynamics—rather than specialized introspection circuits
Interpretive claim about the mechanistic substrate of introspection in LLMs
- Apparent success on binary detection tasks is entirely explained by mechanical logit shifts that bias models toward affirmative responses regardless of question content
Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
- LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon
Primary positive claim of the paper, grounded in strength comparison and localization results
- Prior experimental paradigms may overestimate introspective capabilities by conflating genuine self-awareness with uniform output distribution shifts
Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
- Either introspection is an emergent capability requiring larger scale, or more stringent controls are needed to test introspection in smaller models
Alternative interpretations offered for why binary detection fails in Llama 3.1 8B but frontier models claim success
- Signal integration from early perturbation into an explicit prediction requires substantial downstream computation spanning layers 4-20
Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers
- Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors
Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
- Safety strategies predicated on model self-reports may provide false assurance while genuine risks go undetected
Policy-relevant implication drawn from the binary detection confound result
- Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter
Key quantitative characterization of the layer-dependence of partial introspection
Hypotheses (3)
- We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
- We hypothesize that introspective capabilities may scale with model size and architecture, including recurrence/looping that extends the integration window
Forward-looking prediction about whether early-layer introspection generalizes to larger models or recurrent architectures
- We hypothesize that partial introspection may fail under adversarial prompts, distribution shift, and multiple simultaneous injections
Stress-test prediction about robustness limits of the partial introspection finding
Questions (4)
- Do apparent introspection results reflect genuine metacognitive access to internal representations, or do they emerge from simpler mechanisms such as output distribution shifts?
Key discriminating question motivating the baseline control experiment
- What shared semantic or qualitative factor explains why some steering vectors produce more salient and detectable perturbations than others?
Open question arising from the 100% accuracy on specific concept-layer-strength combinations
- Is introspection an emergent property of scale, or do smaller open-weight models exhibit similar capabilities?
Motivates comparison of Llama 3.1 8B results against Lindsey's frontier model findings
- Can large language models introspect—that is, accurately detect perturbations to their own internal states?
Central research question of the paper
Original abstract
Can large language models introspect, that is, accurately detect perturbations to their own internal states? We systematically investigate this question using activation steering in Meta-Llama-3.1-8B-Instruct. First, we show that the binary detection paradigm used in prior work conflates introspection with a methodological artifact: apparent detection accuracy is entirely explained by global logit shifts that bias models toward affirmative responses regardless of question content. However, on tasks requiring differential sensitivity, we find robust evidence for partial introspection: models localize which of 10 sentences received an injection at up to 88% accuracy (vs. 10% chance) and discriminate relative injection strengths at 83% accuracy (vs. 50% chance). These capabilities are confined to early-layer injections and collapse to chance thereafter, a pattern we explain mechanistically through attention-based signal routing and residual stream recovery dynamics. Our findings demonstrate that LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon that merits further investigation. Our code is open-sourced here: https://github.com/elyhahami18/llama-introspection-new
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Emergent Introspective Awareness in Large Language Models (cited, in corpus; 2026; ≈ 82%)
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation (in corpus; 2026; ≈ 87%)
- Testing the Limits of Truth Directions in LLMs (in corpus; 2026; ≈ 84%)
- Quantifying LLM Attention-Head Stability: Implications for Circuit Universality. Jack Stanley, Praneet Suresh, Danilo Bzdok, Karan Bali (2026; ≈ 84%)
- Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM. Francesca Bianco and Derek Shiller (2026; ≈ 83%)
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (in corpus; 2023; ≈ 83%)
- Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment. B. Mutlu, E. A. Sezer, A. Wahdan, I. F. Atasoy (2026; ≈ 83%)
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control. Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye, Yuxin Xiao (2024; ≈ 83%)
- Steering Conceptual Bias via Transformer Latent-Subspace Activation. Vansh Sharma and Venkat Raman (2025; ≈ 83%)
- Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing. Rajiv Misra, Sanjay Kumar Singh, Anisha Roy, Dip Roy (2026; ≈ 82%)
- Can LLMs Lie? Investigation beyond Hallucination. Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, Haoran Huan (2025; ≈ 82%)
- Anima Labs Phenomenology Pt1 (in corpus; ≈ 82%)
- Causal Evidence that Language Models use Confidence to Drive Behavior. Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran (2026; ≈ 82%)
- Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness. Shai Gretz, Yoav Katz, Yonatan Belinkov, Liat Ein-Dor, Tomer Ashuach (2026; ≈ 82%)
- Psychological Steering of Large Language Models (in corpus; 2026; ≈ 82%)
- Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability. Atmika Gorti, Vinija Jain, Aman Chadha, Krishnaprasad Thirunarayan, Manas Gaur, Yash Aggarwal (2026; ≈ 82%)
- Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation. Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng, Qiming Li (2025; ≈ 82%)
- RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation. Harshwardhan Fartale, Arpita Vats, Rahul Raja, Ishita Prasad, Ashish Kattamuri (2025; ≈ 82%)
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (in corpus; 2026; ≈ 81%)
- The llama 3 herd of models (cited; 2024; ≈ 58%)