Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

By Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan · Jon Hahami
Harvard College

DOI 10.48550/arxiv.2512.12411 arXiv 2512.12411 OpenAlex W4417448492

AI Safety · Computational Account of Layer-Dependent Introspection · attention head localization analysis · Mechanistic Interpretability · Emergent Introspective Awareness Framework (Lindsey 2026) · baseline control experiment · Binary Detection Task · residual stream recovery tracking · Sentence Localization Task · Strength Comparison Task

TL;DR

In Meta-Llama-3.1-8B-Instruct, the binary introspection paradigm is undermined by a methodological confound. When concept vectors are injected via activation steering, detection-adjusted logit differences track control logit increases almost perfectly across all 40 layer-strength configurations (r = 0.999), leaving a net signal of −0.01 ± 0.03 logits, indistinguishable from zero. At layer 0 with injection coefficient α = 5, the apparent detection accuracy of 97.3% is fully accounted for by the model's increased tendency to answer affirmatively even to factually impossible questions (e.g., 'Can humans breathe underwater?'), not by genuine self-monitoring. Yet partial introspection is real. On two bias-resistant discriminative tasks, sentence localization (10-way forced choice) and strength comparison (matched pairs), Llama 3.1 8B achieves 88% localization accuracy (vs. 10% chance) at layer 2 with α = 5, and 83% strength discrimination accuracy (vs. 50% chance) at layer 3 for the (3,7) injection pair. These capabilities are sharply confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20. A mechanistic account based on attention head tracking, logit lens projections, and residual stream cosine similarity shows that all 32 attention heads at layer 3 achieve 100% localization of layer-2 injections, while residual stream recovery exponentially attenuates late-layer perturbations before predictive integration can complete. The paper argues that LLM introspection is therefore a genuine but layer-gated phenomenon, dependent on general-purpose attention-based anomaly detection rather than specialized circuits, and that safety strategies relying on model self-reports require far more stringent experimental controls than the binary detection paradigm provides.
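As a rough illustration of what "injecting a concept vector with coefficient α" means, the sketch below adds a scaled vector to the hidden states of selected tokens at one layer. The function name `inject_concept`, the toy dimensions, and the plain-list representation are assumptions made for illustration. The authors' open-sourced implementation may differ, for example in normalization or in which token positions are steered.

```python
from typing import List, Sequence

Vector = List[float]


def inject_concept(
    hidden: Sequence[Vector],   # one hidden-state vector per token at a chosen layer
    concept: Vector,            # steering (concept) vector, same width as the hidden states
    alpha: float,               # injection coefficient
    positions: Sequence[int],   # token indices that receive the injection
) -> List[Vector]:
    """Return a copy of `hidden` with alpha * concept added at `positions`."""
    targets = set(positions)
    steered = []
    for idx, h in enumerate(hidden):
        if idx in targets:
            steered.append([x + alpha * c for x, c in zip(h, concept)])
        else:
            steered.append(list(h))
    return steered


# Toy example (hypothetical values): 4 tokens, width 3; steer tokens 1 and 2 with alpha = 5.
hidden = [[0.0, 0.0, 0.0] for _ in range(4)]
concept = [1.0, 0.0, -1.0]
steered = inject_concept(hidden, concept, alpha=5.0, positions=[1, 2])
print(steered[1])  # [5.0, 0.0, -5.0]
print(steered[0])  # [0.0, 0.0, 0.0]
```

In a real model the same addition would typically be applied inside a forward hook on the chosen layer. That wiring is omitted here because it depends on the framework.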

What to take away

1. Binary detection accuracy of 97.3% at layer 0 with α = 5 in Meta-Llama-3.1-8B-Instruct is entirely explained by a global logit shift: the correlation between detection-adjusted logit differences and control logit increases across all 40 layer-strength configurations is r = 0.999, with a net signal of −0.01 ± 0.03 logits.
2. A matched-pairs strength comparison task—in which the model must identify which of two sentences received a stronger activation-steering injection—yields 83% accuracy at layer 3 for the (3,7) injection pair and 73% for the (2,6) pair, versus 50% chance, across 36,000 forward passes on Meta-Llama-3.1-8B-Instruct.
3. A 10-way sentence localization task—injecting a steering vector at exactly one of 10 sentences and asking the model to identify which—achieves 88% accuracy at layer 2 with α = 5, 76% at layer 3 with α = 5, and 65% at layer 4 with α = 8, all against a 10% chance baseline, across 770,000 forward passes.
4. Certain concept-vector and layer combinations achieve perfect localization: the 'Illusions' vector at layer 1 with α = 2, the 'Origami' vector at layer 0 with α = 2, and the 'recursion' vector at layer 2 with α = 5 each reach 100% accuracy across 50 trials.
5. All 32 attention heads at layer 3 achieve 100% localization accuracy when identifying which of 5 sentences received an injection at layer 2 with α = 6, while layers 0–2 perform below the 20% chance baseline (13%) because the perturbation has not yet propagated.
6. Logit lens projections reveal that introspective prediction accuracy after an early-layer injection (L2, α = 6) rises from near-chance at layer 4 (28%) to 60% by layer 12 and plateaus at 72% by layer 20, demonstrating that signal integration requires 10–15 layers of downstream computation.
7. Residual stream cosine similarity between perturbed and baseline streams returns toward 1.0 across subsequent layers and the projection onto the injection direction decays exponentially, mechanistically explaining why late-layer injections (L15+) fail: the perturbation is attenuated before predictive integration completes.
8. The bias-resistant sentence localization paradigm—holding sentence content constant across all 10 injection positions within a trial and cycling the injection through each position to average over positional biases—is a replicable experimental design that isolates perturbation localization from content and position confounds.
9. Performance on both discriminative tasks (localization and strength comparison) collapses to or below chance for layers 11–20, establishing a hard early-layer window (L0–L5) for introspective capability in Llama 3.1 8B, consistent with the mechanistic account of residual recovery dynamics.
10. An open question the paper raises is whether the layer-dependent introspection window can be extended by architectural modifications—specifically, recurrent or looped transformer designs that provide additional downstream computational depth for signal integration before residual recovery attenuates the perturbation.
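The cycling design in takeaway 8 can be sketched as a small scoring loop. This is a minimal sketch, not the authors' evaluation code: `predict` is a hypothetical stand-in for one forward pass that returns the sentence index the model names, and the two responders below are idealized. It shows why cycling the injection through all 10 positions pins a purely position-biased responder at the 10% chance level.

```python
from typing import Callable

NUM_SENTENCES = 10


def localization_accuracy(predict: Callable[[int], int], trials: int) -> float:
    """Cycle the injection through every sentence position and score the guesses.

    `predict(injected_pos)` stands in for one forward pass. Sentence content
    is assumed to be held fixed within a trial.
    """
    correct = 0
    total = 0
    for _ in range(trials):
        for injected_pos in range(NUM_SENTENCES):
            correct += int(predict(injected_pos) == injected_pos)
            total += 1
    return correct / total


def always_third(injected_pos: int) -> int:
    """Pure positional bias: ignores the injection and always names sentence 3."""
    return 2


def oracle(injected_pos: int) -> int:
    """Idealized responder that always names the injected sentence."""
    return injected_pos


print(localization_accuracy(always_third, trials=50))  # 0.1
print(localization_accuracy(oracle, trials=50))        # 1.0
```

Because every position is injected equally often, a fixed positional preference is correct exactly once per cycle, so accuracy above 10% has to come from sensitivity to where the perturbation actually is.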

Peer brief — for seminar discussion

Working with Meta-Llama-3.1-8B-Instruct and activation steering, this investigation asks whether LLMs can genuinely introspect on perturbations to their own internal states. It returns a bifurcated answer: binary detection paradigms produce illusory success, while carefully controlled discriminative tasks reveal partial, layer-gated introspection. The core experimental contribution is two bias-resistant task designs: sentence localization (a 10-way forced choice over which sentence in a 10-sentence context received a steering vector injection) and strength comparison (a matched-pairs design asking which of two sentences received the stronger injection, with strengths swapped in a second pass to cancel positional bias). These replace the binary 'did you detect an injection?' paradigm used in Lindsey (2026) and, critically, are immune to the confound that paper's design leaves open.

The load-bearing finding is a methodological debunking followed by a genuine positive result. Across all 40 layer-strength configurations tested (layers ∈ {0,4,8,...,30}, α ∈ {1,2,3,4,5}), the correlation between detection-adjusted logit differences and control logit increases is r = 0.999, with a mean net signal of −0.01 ± 0.03 logits. Apparent detection accuracy of up to 97.3% (layer 0, α = 5) is therefore entirely attributable to a global shift toward affirmative tokens, not metacognitive access. The discriminative tasks, however, yield robust above-chance performance: 88% localization accuracy at layer 2 with α = 5 (vs. 10% chance) across 770,000 forward passes, and 83% strength discrimination at layer 3 for the (α=3, α=7) pair (vs. 50% chance) across 36,000 forward passes. Both capabilities are strictly confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20.

A mechanistic analysis using attention head tracking, logit lens projections, and residual stream cosine similarity explains this: all 32 attention heads at layer 3 achieve 100% localization of a layer-2 injection, but the residual stream recovers exponentially toward baseline over subsequent layers, so late-layer injections are attenuated before the 10–15 layers of downstream computation required for predictive integration can complete. An alternative evaluation approach not used here would be to train dedicated activation-to-language systems, as in Karvonen et al.'s (2025) Activation Oracles or Huang et al.'s (2025) Predictive Concept Decoders, and benchmark them against the same localization and strength tasks to separate native self-report from learned mappings. The implication is that LLM introspection is real but narrow: it relies on general-purpose attention-based anomaly detection rather than specialized introspection circuits, and safety strategies premised on model self-report need controls stringent enough to exclude global logit shifts. The paper also raises the hypothesis that recurrent or looped transformer architectures (following Chen et al., 2026) might extend the integration window and expand the layer range over which introspection succeeds.

A critical reader would push back on the scope restriction to a single 8B open-weight model. All empirical claims (the logit-shift confound, the 88% localization result, the layer-dependency pattern) are established exclusively on Llama 3.1 8B-Instruct. Lindsey (2026) reports genuine introspection in frontier models even under baseline controls; whether the confound identified here is an artifact of smaller model scale or of the specific experimental design is not resolved. The authors acknowledge this, but the paper cannot rule out that the binary detection paradigm works at larger scales precisely because those models have additional computational resources to perform genuine metacognitive processing. In that case the negative result would be scale-specific rather than paradigm-specific, substantially limiting the generalizability of the methodological critique.
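The matched-pairs logic of the strength comparison task can be illustrated with a toy scorer. This is a sketch under stated assumptions: `choose` is a hypothetical stand-in for the model's answer, and the two responders are idealized. The strength pairs are the ones named in this summary. Each pair is presented in both orders, so a responder that always picks the same slot is right exactly half the time.

```python
from typing import Callable, Sequence, Tuple

FIRST, SECOND = 0, 1


def strength_accuracy(
    choose: Callable[[float, float], int],
    pairs: Sequence[Tuple[float, float]],
) -> float:
    """Score a responder on each strength pair in both presentation orders.

    `choose(alpha_first, alpha_second)` returns FIRST or SECOND, naming the
    sentence the responder believes received the stronger injection.
    """
    correct = 0
    total = 0
    for a, b in pairs:
        for first, second in ((a, b), (b, a)):  # the second pass swaps the strengths
            truth = FIRST if first > second else SECOND
            correct += int(choose(first, second) == truth)
            total += 1
    return correct / total


def always_first(alpha_first: float, alpha_second: float) -> int:
    """Pure positional bias: always names the first sentence."""
    return FIRST


def ideal(alpha_first: float, alpha_second: float) -> int:
    """Idealized responder with perfect access to the injection strengths."""
    return FIRST if alpha_first > alpha_second else SECOND


pairs = ((3, 7), (2, 6), (3, 5))
print(strength_accuracy(always_first, pairs))  # 0.5
print(strength_accuracy(ideal, pairs))         # 1.0
```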

Methods (6)

attention head localization analysis
Analysis measuring whether each attention head's maximum attention increase points to the correct injected sentence
baseline control experiment
Control using objectively-NO factual questions under identical injection to measure global logit shift vs. genuine detection signal
Binary Detection Task
Task paradigm from prior work asking 'Did you detect an injected thought?' via YES/NO logit comparison; shown here to be confounded
residual stream recovery tracking
Tracks cosine similarity, norm ratio, and injection direction projection across layers to measure recovery from perturbation
Sentence Localization Task
Novel task asking which of 10 sentences received injection, cycling injection through all positions to average out positional bias
Strength Comparison Task
Novel task asking which of two sentences received a stronger injection, using matched-pairs design to control for positional bias
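The attention head localization analysis listed above can be sketched as an argmax over per-sentence attention increases. The exact aggregation the authors use (which query token, how attention is pooled over a sentence's tokens) is not given in this summary, so the per-sentence attention masses and the toy trials below are hypothetical.

```python
from typing import List, Sequence, Tuple

Trial = Tuple[Sequence[float], Sequence[float], int]


def head_points_to(baseline: Sequence[float], perturbed: Sequence[float]) -> int:
    """Index of the sentence whose attention mass increased the most.

    `baseline` and `perturbed` hold one head's attention mass per sentence
    in the clean run and the injected run, respectively.
    """
    increases = [p - b for p, b in zip(perturbed, baseline)]
    return max(range(len(increases)), key=increases.__getitem__)


def head_accuracy(trials: List[Trial]) -> float:
    """Fraction of trials in which the head's largest increase hits the injected sentence."""
    hits = sum(head_points_to(b, p) == target for b, p, target in trials)
    return hits / len(trials)


# Hypothetical 5-sentence trials for a single head: (baseline, perturbed, injected index).
trials = [
    ([0.2, 0.2, 0.2, 0.2, 0.2], [0.1, 0.1, 0.6, 0.1, 0.1], 2),      # hit
    ([0.3, 0.2, 0.2, 0.2, 0.1], [0.25, 0.15, 0.15, 0.15, 0.3], 4),  # hit
    ([0.2, 0.2, 0.2, 0.2, 0.2], [0.4, 0.15, 0.15, 0.15, 0.15], 1),  # miss: points to sentence 0
]
print(f"{head_accuracy(trials):.2f}")  # 0.67
```

Using the increase relative to a clean run, instead of raw attention, keeps a head's habitual preference for particular positions from counting as localization.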

Frameworks (2)

Computational Account of Layer-Dependent Introspection
This paper's proposed mechanistic explanation integrating signal injection, attention routing, predictive integration, and residual recovery
Emergent Introspective Awareness Framework (Lindsey 2026)
Prior framework claiming frontier LLMs can detect and name injected concepts, interpreted as nascent self-awareness

Findings (14)

Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
Illusions vector at layer 1 α=2, Origami vector at layer 0 α=2, and recursion vector at layer 2 α=5 each achieve 100% localization accuracy across 50 trials
Demonstrates concept-specific variation in introspective salience, suggesting some vectors produce more detectable perturbations
Strength comparison pair (3,7) with |Δα|=4 outperforms pair (3,5) with |Δα|=2, indicating graded sensitivity to perturbation magnitude
Shows that introspective accuracy scales with injection strength difference, not binary detection
Correlation r=0.999 between detection-adjusted logit difference and control logit increase across all 40 layer-strength configurations
Key quantitative evidence that detection signal is identical to global logit shift confound
Net detection signal (detection minus control) is near-zero across all 40 layer-strength configurations: mean = -0.01 ± 0.03 logits
Quantitative evidence that binary detection provides no genuine introspection signal beyond global logit shifts
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control)
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance)
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
At layer 0 α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied
The misleadingly high result that prior paradigm would report as evidence of introspection
Model baseline logit difference ΔL_baseline = -3.96, indicating prior preference for 'NO' responses
Establishes the model's prior YES/NO bias, needed to interpret detection accuracies
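The baseline-control arithmetic behind these findings is simple enough to check by hand. The sketch below computes the net signal for the one configuration whose values are reported here (layer 0, α = 5), then shows how a Pearson correlation across configurations would be computed. Only the 3.19 and 3.22 values come from the findings above. The other entries in the two series are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Sequence


def net_signal(detection_shift: float, control_shift: float) -> float:
    """Detection-adjusted logit difference minus the control logit increase."""
    return detection_shift - control_shift


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Reported values at layer 0, alpha = 5.
print(round(net_signal(3.19, 3.22), 2))  # -0.03

# HYPOTHETICAL series for illustration; only the last pair is taken from the findings.
detection = [0.50, 1.10, 2.00, 3.19]
control = [0.52, 1.08, 2.03, 3.22]
r = pearson_r(detection, control)
print(r > 0.99)  # True
```

When the two series move together this closely, subtracting the control leaves almost nothing, which is the sense in which the binary detection signal is "explained by" the global logit shift.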

Claims (11)

Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors
Introspection relies on general-purpose computational mechanisms—attention-based anomaly detection and residual stream dynamics—rather than specialized introspection circuits
Interpretive claim about the mechanistic substrate of introspection in LLMs
Apparent success on binary detection tasks is entirely explained by mechanical logit shifts that bias models toward affirmative responses regardless of question content
Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon
Primary positive claim of the paper, grounded in strength comparison and localization results
Prior experimental paradigms may overestimate introspective capabilities by conflating genuine self-awareness with uniform output distribution shifts
Critical methodological claim directed at Lindsey 2026 and similar work using binary detection
Either introspection is an emergent capability requiring larger scale, or more stringent controls are needed to test introspection in smaller models
Alternative interpretations offered for why binary detection fails in Llama 3.1 8B but frontier models claim success
Signal integration from early perturbation into an explicit prediction requires substantial downstream computation spanning layers 4-20
Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors
Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
Safety strategies predicated on model self-reports may provide false assurance while genuine risks go undetected
Policy-relevant implication drawn from the binary detection confound result
Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter
Key quantitative characterization of the layer-dependence of partial introspection
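The residual recovery claim above rests on three per-layer quantities: cosine similarity, norm ratio, and projection onto the injection direction. The sketch below computes them for toy vectors. Two things are assumptions for illustration, not details from the paper: the projection is taken on the difference between perturbed and baseline streams, and the perturbation is made to halve at each layer simply to mimic an exponential decay.

```python
import math
from typing import Dict, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def recovery_metrics(
    perturbed: Sequence[float],
    baseline: Sequence[float],
    direction: Sequence[float],
) -> Dict[str, float]:
    """Compare a perturbed residual vector with its clean counterpart at one layer."""
    unit = [d / norm(direction) for d in direction]
    delta = [p - b for p, b in zip(perturbed, baseline)]
    return {
        "cosine": dot(perturbed, baseline) / (norm(perturbed) * norm(baseline)),
        "norm_ratio": norm(perturbed) / norm(baseline),
        "projection": dot(delta, unit),  # assumption: project the difference
    }


# Toy setup (hypothetical): the injected component halves at each later layer.
baseline = [1.0, 2.0, 2.0]
direction = [1.0, 0.0, 0.0]
alpha = 6.0
history = []
for layers_after in range(4):
    scale = alpha * 0.5 ** layers_after
    perturbed = [b + scale * d for b, d in zip(baseline, direction)]
    history.append(recovery_metrics(perturbed, baseline, direction))

print([m["projection"] for m in history])  # [6.0, 3.0, 1.5, 0.75]
```

In this toy run the cosine similarity climbs toward 1.0 as the projection shrinks, which is the qualitative pattern the findings describe for the real residual stream.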

Hypotheses (3)

We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
We hypothesize that introspective capabilities may scale with model size and architecture, including recurrence/looping that extends the integration window
Forward-looking prediction about whether early-layer introspection generalizes to larger models or recurrent architectures
We hypothesize that partial introspection may fail under adversarial prompts, distribution shift, and multiple simultaneous injections
Stress-test prediction about robustness limits of the partial introspection finding

Questions (4)

Do apparent introspection results reflect genuine metacognitive access to internal representations, or do they emerge from simpler mechanisms such as output distribution shifts?
Key discriminating question motivating the baseline control experiment
What shared semantic or qualitative factor explains why some steering vectors produce more salient and detectable perturbations than others?
Open question arising from the 100% accuracy on specific concept-layer-strength combinations
Is introspection an emergent property of scale, or do smaller open-weight models exhibit similar capabilities?
Motivates comparison of Llama 3.1 8B results against Lindsey's frontier model findings
Can large language models introspect—that is, accurately detect perturbations to their own internal states?
Central research question of the paper

Original abstract

Can large language models introspect, that is, accurately detect perturbations to their own internal states? We systematically investigate this question using activation steering in Meta-Llama-3.1-8B-Instruct. First, we show that the binary detection paradigm used in prior work conflates introspection with a methodological artifact: apparent detection accuracy is entirely explained by global logit shifts that bias models toward affirmative responses regardless of question content. However, on tasks requiring differential sensitivity, we find robust evidence for partial introspection: models localize which of 10 sentences received an injection at up to 88% accuracy (vs. 10% chance) and discriminate relative injection strengths at 83% accuracy (vs. 50% chance). These capabilities are confined to early-layer injections and collapse to chance thereafter, a pattern we explain mechanistically through attention-based signal routing and residual stream recovery dynamics. Our findings demonstrate that LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon that merits further investigation. Our code is open-sourced here: https://github.com/elyhahami18/llama-introspection-new

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Emergent Introspective Awareness in Large Language Models
cited
in corpus
2026
≈ 82%
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
in corpus
2026
≈ 87%
Testing the Limits of Truth Directions in LLMs
in corpus
2026
≈ 84%
Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
Jack Stanley, Praneet Suresh, Danilo Bzdok, Karan Bali
2026
≈ 84%
Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Francesca Bianco and Derek Shiller
2026
≈ 83%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 83%
Closing the Confidence-Faithfulness Gap in Large Language Models
Lyle Ungar, Miranda Muqing Miao
2026
≈ 83%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 83%
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
B. Mutlu, E. A. Sezer, A. Wahdan, I. F. Atasoy
2026
≈ 83%
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye, Yuxin Xiao
2024
≈ 83%
LLMs Know More About Numbers than They Can Say
Li Du, Jason Eisner, Fengting Yuchi
2026
≈ 83%
Steering Conceptual Bias via Transformer Latent-Subspace Activation
Vansh Sharma and Venkat Raman
2025
≈ 83%
Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing
Rajiv Misra, Sanjay Kumar Singh, Anisha Roy, Dip Roy
2026
≈ 82%
Can LLMs Lie? Investigation beyond Hallucination
Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, Haoran Huan
2025
≈ 82%
Anima Labs Phenomenology Pt1
in corpus
≈ 82%
Causal Evidence that Language Models use Confidence to Drive Behavior
Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran
2026
≈ 82%
Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
Shai Gretz, Yoav Katz, Yonatan Belinkov, Liat Ein-Dor, Tomer Ashuach
2026
≈ 82%
Psychological Steering of Large Language Models
in corpus
2026
≈ 82%
Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
Atmika Gorti, Vinija Jain, Aman Chadha, Krishnaprasad Thirunarayan, Manas Gaur, Yash Aggarwal
2026
≈ 82%
Evaluating Large Language Models in Theory of Mind Tasks
Michal Kosinski
2024
≈ 82%
Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng, Qiming Li
2025
≈ 82%
RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
Harshwardhan Fartale, Arpita Vats, Rahul Raja, Ishita Prasad, Ashish Kattamuri
2025
≈ 82%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 81%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 81%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 81%
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
in corpus
2026
≈ 81%
Persistence and Introspection of Emotion Features
in corpus
≈ 81%
Steering language models with activation engineering
cited
2023
≈ 75%
The llama 3 herd of models
cited
2024
≈ 58%
LLM Evaluators Recognize and Favor Their Own Generations
cited
2024
