thinker
active
thinker:jon-hahami

Jon Hahami

Co-author of the paper, affiliated with University of Chicago

Authored
1
Introduces
0
Studies
0
Affiliations
1
Cited by
0

Authored papers (1)

  • Binary introspection paradigms in LLMs are wholly invalidated by a methodological confound: when concept vectors are injected into Meta-Llama-3.1-8B-Instruct via activation steering, the correlation between detection-adjusted logit differences and control logit increases across all 40 layer-strength configurations is r = 0.999, with a net signal of −0.01 ± 0.03 logits—indistinguishable from zero. At layer 0 with injection coefficient α = 5, the raw detection accuracy of 97.3% is entirely replicated by the model's increased tendency to respond affirmatively to factually impossible questions (e.g., 'Can humans breathe underwater?'), not by genuine self-monitoring. Yet partial introspection is real: using two bias-resistant discriminative tasks—sentence localization (10-way forced choice) and strength comparison (matched-pairs)—Llama 3.1 8B achieves 88% localization accuracy (vs. 10% chance) at layer 2 with α = 5, and 83% strength discrimination accuracy (vs. 50% chance) at layer 3 for the (3,7) injection pair. These capabilities are sharply confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20. A mechanistic account—using attention head tracking, logit lens projections, and residual stream cosine similarity analysis—reveals that all 32 attention heads at layer 3 achieve 100% localization of layer-2 injections, while residual stream recovery dynamics exponentially attenuate late-layer perturbations before predictive integration can complete. The paper argues this establishes LLM introspection as a genuine but layer-gated phenomenon, dependent on general-purpose attention-based anomaly detection rather than specialized circuits, and that safety strategies relying on model self-reports require far more stringent experimental controls than the binary detection paradigm provides.

More papers — OpenAlex / S2

Affiliations (1)

Co-authors (8)

Recent mentions (1)