Computational Account of Layer-Dependent Introspection

This paper's proposed mechanistic explanation, which integrates signal injection, attention routing, predictive integration, and residual recovery

Neighborhood — ranked by edge-count

Papers (1)

paper

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
introduces

Concepts (3)

concept

predictive integration
supports
The mid-to-late layer computational process that converts routed perturbation signals into explicit predictions
residual stream recovery dynamics
supports
The network's tendency to actively attenuate injected perturbations over subsequent layers, erasing the signal before output
attention-based signal routing
supports
Mechanism by which attention heads detect injected perturbations and route information about them to the final token position
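
As a toy illustration of the residual stream recovery dynamics listed above, the sketch below tracks how the norm of an injected perturbation could shrink layer by layer. It is not the paper's model: the function name `perturbation_norms`, the parameter `alpha`, and the rule that each later layer removes a fixed fraction of the remaining perturbation are all assumptions made here for illustration.

```python
import math


def perturbation_norms(delta, inject_layer, n_layers, alpha=0.5):
    """Return the perturbation norm remaining after each layer.

    ASSUMPTION (hypothetical rule, not from the source): every layer
    after the injection layer removes a fixed fraction `alpha` of the
    remaining perturbation, giving geometric decay.
    """
    deviation = [0.0] * len(delta)  # deviation from the unperturbed stream
    norms = []
    for layer in range(n_layers):
        if layer == inject_layer:
            # Inject the perturbation into the residual stream.
            deviation = list(delta)
        elif layer > inject_layer:
            # Later layers attenuate whatever perturbation is left.
            deviation = [(1.0 - alpha) * d for d in deviation]
        norms.append(math.sqrt(sum(d * d for d in deviation)))
    return norms


norms = perturbation_norms([3.0, 4.0], inject_layer=2, n_layers=6, alpha=0.5)
print(norms)
```

Under this assumed rule the perturbation halves at each layer after injection, so little of it remains by the final layers. This only illustrates the attenuation concept; it says nothing about where detection succeeds or fails.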

Claims (2)

claim

Introspective capabilities are confined to early-layer injections (L0-L5) and collapse to chance thereafter
supports
Key quantitative characterization of the layer-dependence of partial introspection
Introspection relies on general-purpose computational mechanisms—attention-based anomaly detection and residual stream dynamics—rather than specialized introspection circuits
supports
Interpretive claim about the mechanistic substrate of introspection in LLMs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Layer-dependent introspective peaks · finding · 0.829
Introspective awareness in Opus 4.1 peaks at a layer roughly two-thirds of the way through model depth for thought injection and text distinction; prefill detection is most sensitive at an earlier layer, suggesting mechanistically distinct processes.
Either introspection is an emergent capability requiring larger scale, or more stringent controls are needed to test introspection in smaller models · claim · 0.760
Alternative interpretations offered for why binary detection fails in Llama 3.1 8B while success is claimed for frontier models
LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon · claim · 0.757
Primary positive claim of the paper, grounded in strength comparison and localization results
What are the mechanisms underlying introspection in language models? · question · 0.757
Central open question raised by the paper.
We hypothesize that introspective capabilities may scale with model size and architecture, including recurrence/looping that extends the integration window · hypothesis · 0.756
Forward-looking prediction about whether early-layer introspection generalizes to larger models or recurrent architectures
Introspection · concept · 0.749
The ability of a model to observe its own past internal states or computations; claimed to be architecturally permitted by transformers.
What mechanisms enable collective introspection to emerge across multiple interacting AI agents? · question · 0.748
Core unanswered question that drives the search; addresses the integration of distributed cognition and machine consciousness.
Systematic Introspective Processes · concept · 0.747
Identified gap; methods for enabling machine consciousness development through self-examination.
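
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the function names, the toy embeddings, and the dictionary-based data layout are assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_neighbors, threshold=0.65):
    """List entities at or above `threshold` that have no typed edge to `target`.

    `embeddings` maps entity name -> vector; `typed_neighbors` is the set of
    names already linked to `target` by a typed edge (both hypothetical).
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
embeddings = {
    "target": [1.0, 0.0],
    "near": [1.0, 0.1],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.0],
}
result = similar_without_edge("target", embeddings, typed_neighbors={"linked"})
print(result)
```

Here "linked" is excluded despite a perfect score because it already has a typed edge, and "far" falls below the threshold, so only "near" is returned as a candidate for a new edge or an unrecognized duplicate.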