thinker:jain-lavik
Jain, Lavik
Authored papers (1)
Binary introspection paradigms in LLMs are wholly invalidated by a methodological confound: when concept vectors are injected into Meta-Llama-3.1-8B-Instruct via activation steering, the correlation between detection-adjusted logit differences and control logit increases across all 40 layer-strength configurations is r = 0.999, with a net signal of −0.01 ± 0.03 logits—indistinguishable from zero. At layer 0 with injection coefficient α = 5, the raw detection accuracy of 97.3% is entirely replicated by the model's increased tendency to respond affirmatively to factually impossible questions (e.g., 'Can humans breathe underwater?'), not by genuine self-monitoring. Yet partial introspection is real: using two bias-resistant discriminative tasks—sentence localization (10-way forced choice) and strength comparison (matched-pairs)—Llama 3.1 8B achieves 88% localization accuracy (vs. 10% chance) at layer 2 with α = 5, and 83% strength discrimination accuracy (vs. 50% chance) at layer 3 for the (3,7) injection pair. These capabilities are sharply confined to early-layer injections (L0–L5) and collapse to chance by layers 11–20. A mechanistic account—using attention head tracking, logit lens projections, and residual stream cosine similarity analysis—reveals that all 32 attention heads at layer 3 achieve 100% localization of layer-2 injections, while residual stream recovery dynamics exponentially attenuate late-layer perturbations before predictive integration can complete. The paper argues this establishes LLM introspection as a genuine but layer-gated phenomenon, dependent on general-purpose attention-based anomaly detection rather than specialized circuits, and that safety strategies relying on model self-reports require far more stringent experimental controls than the binary detection paradigm provides.
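The confound check summarized above can be illustrated with a rough sketch. The paper's exact procedure is not given here, so this assumes that each layer-strength configuration yields one logit shift on the detection question and one on the control questions, and that the net signal is simply their difference. The function names and the numbers are hypothetical and made up for illustration; they are not the paper's data.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def net_signal(detection_shift, control_shift):
    """Mean and sample standard deviation of (detection - control) shifts.

    Assumption: the net signal is the per-configuration difference between
    the detection logit shift and the control logit shift.
    """
    diffs = [d - c for d, c in zip(detection_shift, control_shift)]
    return statistics.mean(diffs), statistics.stdev(diffs)


# Hypothetical logit shifts for five configurations (illustrative only).
control_shift = [0.5, 1.0, 2.0, 4.0, 6.0]
detection_shift = [0.52, 0.97, 2.01, 3.98, 6.03]

r = pearson_r(detection_shift, control_shift)
mean_net, sd_net = net_signal(detection_shift, control_shift)
print(f"r = {r:.3f}, net signal = {mean_net:.3f} +/- {sd_net:.3f} logits")
```

When the detection shift tracks the control shift this closely, the correlation is near 1 and the net signal sits near zero, which is the pattern the abstract describes: raw detection accuracy is explained by a general bias toward affirmative answers rather than by self-monitoring.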
Co-authors (8)
- Ely Hahami (6 shared)
- Hahami, Jon (4 shared)
- I. N. Sinha (4 shared)
- Kaplan, Josh (4 shared)
- Ishaan Sinha (2 shared)
- Jon Hahami (2 shared)
- Josh Kaplan (2 shared)
- Lavik Jain (2 shared)