claim

active

claim:introspective-capabilities-are-confined-to-early-layer-injections-l0-l5-and-collapse-to-chance-thereafter

Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter

Key quantitative characterization of the layer-dependence of partial introspection

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Findings (3)

finding

All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance)
supports
Striking mechanistic finding that injection creates a universally detectable perturbation in the residual stream immediately downstream
Sentence localization accuracy reaches 88% at layer 2 with α=5, versus 10% chance in 10-way classification
supports
Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance
supports
Shows collapse of introspective capability at later layers in the strength comparison task
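
The three findings above each compare an observed accuracy against a chance baseline (20%, 10%, and 50%). As a minimal sketch of how such a comparison can be checked, the snippet below runs an exact two-sided binomial test. The trial count `N_TRIALS = 100` and the helper name `binomial_two_sided_p` are assumptions made for illustration; this page does not report the paper's sample sizes or its statistical procedure.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, chance: float) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis of chance-level accuracy.
    """
    pmf = [
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so float rounding does not drop tied outcomes.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical trial count: NOT taken from the paper.
N_TRIALS = 100

# (accuracy, chance level) pairs as quoted in the findings above.
findings = {
    "head localization, layer 3 (5-way)": (1.00, 0.20),
    "sentence localization, layer 2 (10-way)": (0.88, 0.10),
    "strength comparison, layers 15-30 (2-way)": (0.47, 0.50),
}

for name, (accuracy, chance) in findings.items():
    successes = round(accuracy * N_TRIALS)
    p_value = binomial_two_sided_p(successes, N_TRIALS, chance)
    verdict = "differs from chance" if p_value < 0.05 else "consistent with chance"
    print(f"{name}: {successes}/{N_TRIALS} correct, p = {p_value:.3g} ({verdict})")
```

Under this assumed trial count, the two early-layer results are far from their chance baselines, while 47% against a 50% baseline is not distinguishable from guessing. The conclusion for the later layers depends on the real sample size, which would have to come from the paper itself.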

Frameworks (1)

framework

Computational Account of Layer-Dependent Introspection
supports
This paper's proposed mechanistic explanation integrating signal injection, attention routing, predictive integration, and residual recovery

Claims (1)

claim

LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon
extends
Primary positive claim of the paper, grounded in strength comparison and localization results

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that introspective capabilities may scale with model size and architecture, including recurrence/looping that extends the integration window · hypothesis · 0.825
Forward-looking prediction about whether early-layer introspection generalizes to larger models or recurrent architectures
Introspective capabilities may continue to develop with further improvements to model capabilities · claim · 0.824
Forward-looking statement about future models.
Either introspection is an emergent capability requiring larger scale, or more stringent controls are needed to test introspection in smaller models · claim · 0.821
Alternative interpretations offered for why binary detection fails in Llama 3.1 8B but frontier models claim success
Introspective capabilities have threshold effects requiring very large models; 70B models are barely on the threshold, and independent researchers lack access to larger models. · claim · 0.821
Practical bottleneck explaining why these phenomena are not widely studied.
This introspective capacity is highly unreliable and context-dependent in today's models · claim · 0.808
A caveat qualifying the main claim.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.805
Key finding about the relationship between capability and introspection.
Introspective ability can be decomposed into: (i) information available about internal state and (ii) capacity to transform that signal into precise output reports · claim · 0.803
Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering
Introspective capacity is present from the first conversation turn, not requiring multi-turn context to emerge · claim · 0.801
Three of four concepts show significant introspection at turn 1; rules out joint temporal drift as sole explanation