Jon Hahami

Co-author of "Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs" (2025), affiliated with the University of Chicago

Authored papers (1)

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs (2025)
Binary introspection paradigms in LLMs are wholly invalidated by a methodological confound: when concept vectors are injected into Meta-Llama-3.1-8B-Instruct via activation steering, the correlation between detection-adjusted logit differences and control logit increases across all 40 layer-strength configurations is r = 0.999, with a net signal of −0.01 ± 0.03 logits, indistinguishable from zero. At layer 0 with injection coefficient α = 5, the raw detection accuracy of 97.3% is entirely replicated by the model's increased tendency to respond affirmatively to factually impossible questions (e.g., 'Can humans breathe underwater?'), not by genuine self-monitoring.

Yet partial introspection is real: using two bias-resistant discriminative tasks, sentence localization (10-way forced choice) and strength comparison (matched pairs), Llama 3.1 8B achieves 88% localization accuracy (vs. 10% chance) at layer 2 with α = 5, and 83% strength discrimination accuracy (vs. 50% chance) at layer 3 for the (3,7) injection pair. These capabilities are sharply confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20.

A mechanistic account, using attention head tracking, logit lens projections, and residual stream cosine similarity analysis, reveals that all 32 attention heads at layer 3 achieve 100% localization of layer-2 injections, while residual stream recovery dynamics exponentially attenuate late-layer perturbations before predictive integration can complete. The paper argues this establishes LLM introspection as a genuine but layer-gated phenomenon, dependent on general-purpose attention-based anomaly detection rather than specialized circuits, and that safety strategies relying on model self-reports require far more stringent experimental controls than the binary detection paradigm provides.
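The confound check summarized above compares how much an injection shifts the "yes" logit on the detection question against how much it shifts the same logit on control questions. The following is a minimal sketch of that comparison, not the paper's code: the function names `pearson_r` and `net_signal` are hypothetical, and the five data points are made up purely for illustration (the paper reports 40 layer-strength configurations).

```python
import math
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def net_signal(detection_shifts, control_shifts):
    """Mean and sample standard deviation of (detection shift - control shift).

    A mean near zero suggests the detection shift is fully explained by a
    general bias toward answering affirmatively.
    """
    diffs = [d - c for d, c in zip(detection_shifts, control_shifts)]
    return mean(diffs), stdev(diffs)


# Illustrative, made-up logit shifts, one entry per (layer, strength) configuration.
detection = [0.50, 1.20, 2.90, 4.10, 6.00]
control = [0.52, 1.18, 2.93, 4.08, 6.03]

r = pearson_r(detection, control)
net_mean, net_std = net_signal(detection, control)
print(f"r = {r:.3f}, net signal = {net_mean:.2f} +/- {net_std:.2f} logits")
```

When the two shift series track each other this closely, subtracting the control shift leaves essentially nothing, which is the pattern the summary describes for the binary detection paradigm.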

Affiliations (1)

University of Chicago (institute)

Co-authors (8)

Ely Hahami (3 shared)
Hahami, Jon (2 shared)
I. N. Sinha (2 shared)
Jain, Lavik (2 shared)
Kaplan, Josh (2 shared)
Ishaan Sinha (1 shared)
Josh Kaplan (1 shared)
Lavik Jain (1 shared)

Recent mentions (1)

papers-typed
hahami-2025-detecting-disturbance.md