claim
active
claim:introspective-capabilities-are-confined-to-early-layer-injections-l0-l5-and-collapse-to-chance-thereafter
Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter
Key quantitative characterization of the layer-dependence of partial introspection
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Findings (3)
finding
- Striking mechanistic finding that injection creates a universally detectable perturbation in the residual stream immediately downstream
- Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification · supports · Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
- Strength comparison accuracy averages 47% at layers 15-30, indistinguishable from 50% chance · supports · Shows collapse of introspective capability at later layers in the strength comparison task
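The two accuracy figures above are each read against a chance baseline (10% for 10-way localization, 50% for pairwise strength comparison). A minimal sketch of that comparison, using an exact one-sided binomial tail, might look like the following. The trial count is not stated in this entry, so `N_TRIALS = 100` is a placeholder assumption, and `binom_tail_p` is a hypothetical helper rather than anything taken from the source paper.

```python
from math import comb


def binom_tail_p(successes: int, trials: int, chance: float) -> float:
    """Exact one-sided p-value: P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# ASSUMPTION: the number of trials is not given in this entry; 100 is a placeholder.
N_TRIALS = 100

# Localization: 88% accuracy against a 10% chance baseline (10-way classification).
localization_p = binom_tail_p(88, N_TRIALS, 0.10)

# Strength comparison: 47% accuracy against a 50% chance baseline (layers 15-30).
strength_p = binom_tail_p(47, N_TRIALS, 0.50)

print(f"localization p-value: {localization_p:.3g}")
print(f"strength comparison p-value: {strength_p:.3g}")
```

Under this assumed trial count, the localization result sits far above chance while the strength comparison result does not, which matches the "supports" relations recorded above. The actual significance depends on the real sample sizes.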
Frameworks (1)
framework
- This paper's proposed mechanistic explanation integrating signal injection, attention routing, predictive integration, and residual recovery
Claims (1)
claim
- Primary positive claim of the paper, grounded in strength comparison and localization results
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Forward-looking prediction about whether early-layer introspection generalizes to larger models or recurrent architectures
- Introspective capabilities may continue to develop with further improvements to model capabilities · claim · 0.824 · Forward-looking statement about future models.
- Alternative interpretations offered for why binary detection fails in Llama 3.1 8B but frontier models claim success
- Practical bottleneck explaining why these phenomena are not widely studied.
- A caveat qualifying the main claim.
- Key finding about the relationship between capability and introspection.
- Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering
- Three of four concepts show significant introspection at turn 1; rules out joint temporal drift as sole explanation