LLM functional introspective awareness

Empirical probing of language models' ability to detect and report their own internal concept representations

19 members.

Drawn from 2 sources

The papers/notes whose extracted claims & findings make up this cluster.

Bridges (10)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (19 shared)
Mechanistic introspection in language models (18 shared)
Post-training emergence of model introspection (4 shared)
Introspective awareness in neural model interpretability (2 shared)
Introspective signal suppression in transformer layers (2 shared)
AI introspection and consciousness attribution gap (2 shared)
Latent capacity, representation, and internal models (1 shared)
Introspective awareness activation in language models (1 shared)
Mechanistic introspection in language models (1 shared)
Model introspection for misalignment detection (1 shared)

Claims (13)

Different forms of introspection invoke mechanistically different processes. Based on layer-selective perturbation results.
Even limited functional introspective awareness has practical implications for transparency, interpretability, and deception. Discussion of dual-use nature of introspection.
Functional introspective awareness enables interpretability and reasoning about decisions. Grounded responses to reasoning questions could improve transparency; speculatively might facilitate deception; significance grows if capability becomes more reliable.
Introspection is aided by overall improvements in model intelligence. Interpretation of the observation that the most capable models performed best.
Introspective awareness correlates with overall model capability. The most capable models (Opus 4 and 4.1) show the greatest introspective awareness; the trend suggests that introspection is aided by improvements in model intelligence.
Introspective capabilities may continue to develop with further improvements to model capabilities. Forward-looking statement about future models.
Modern language models possess at least a limited, functional form of introspective awareness. The paper's central interpretive assertion.
Observed introspection may lack the philosophical significance of human introspection. The paper does not address whether AI introspection constitutes self-awareness or subjective experience; mechanistic uncertainty prevents definitive philosophical claims.
Post-training influences introspective capability expression. Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection. Finding that base models have high false positives and no net positive performance.
Post-training strategies can strongly influence performance on introspective tasks. Assertion about the role of post-training in eliciting introspection.
The introspective capabilities observed may not have the same philosophical significance as in humans. Caveat about the limits of the findings' philosophical import.
This introspective capacity is highly unreliable and context-dependent in today's models. A caveat qualifying the main claim.

Findings (6)

Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection. Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
Introspective signals appear in middle layers but are suppressed by later post-training-shaped layers. Mechanistic finding by Lindsey (2026) explaining how a contemplative prompt may work: it enables mid-layer introspection to reach the output.
Layer-dependent introspective peaks. Introspective awareness in Opus 4.1 peaks at a layer roughly two-thirds of the way through the model's depth for thought injection and text distinction; prefill detection is most sensitive to an earlier layer, suggesting mechanistically distinct processes.
Models are more effective at recognizing abstract nouns than other concept types. Opus 4.1 demonstrates the highest introspective awareness on abstract nouns (justice, peace, betrayal), with nonzero awareness across all concept categories tested.
Post-training is key to eliciting introspective awareness. Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
A prompt giving the model context about its own architecture increases introspective detection from 0.3% to 39.9%. Mechanistic support for the prompt-as-gate hypothesis: language frames enable access to latent capacities.
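
Several findings above refer to false positive rates and "net" task performance without spelling out how such a score is computed. The sketch below shows one plausible reading, assumed here and not taken from the source: net detection is the detection rate on injected trials minus the false positive rate on control trials. The `DetectionCounts` class, the function names, and the trial counts are hypothetical illustrations; only the 0.3% and 39.9% figures come from the findings list.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Trial counts for a concept-injection detection task (hypothetical layout)."""
    injected_trials: int    # trials in which a concept was injected
    injected_detected: int  # injected trials where the model reported the concept
    control_trials: int     # trials with no injection
    control_detected: int   # control trials where the model claimed a detection anyway


def net_detection_rate(c: DetectionCounts) -> float:
    """Assumed definition: true positive rate minus false positive rate."""
    if c.injected_trials <= 0 or c.control_trials <= 0:
        raise ValueError("both trial counts must be positive")
    true_positive_rate = c.injected_detected / c.injected_trials
    false_positive_rate = c.control_detected / c.control_trials
    return true_positive_rate - false_positive_rate


def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change between two rates expressed in percent."""
    return after_pct - before_pct


# Made-up counts: a model that claims detections equally often with and
# without an injection earns no net credit under this definition.
indiscriminate = DetectionCounts(
    injected_trials=100, injected_detected=40,
    control_trials=100, control_detected=40,
)
print(net_detection_rate(indiscriminate))  # 0.0

# Figures quoted in the findings list: detection rising from 0.3% to 39.9%.
print(round(percentage_point_gain(0.3, 39.9), 1))  # 39.6
```

Under this reading, a high false positive rate can cancel an apparently high detection rate, which is consistent with the report that base models show many false positives and no net performance.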