community
active
leiden_hybrid_concepts
label: sonnet
community:leiden_hybrid_concepts-run2-c16
LLM functional introspective awareness
Empirical probing of language models' ability to detect and report their own internal concept representations
19 members.
Drawn from 2 sources
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (10)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
- Mechanistic interpretability & model evaluation (19 shared)
- Mechanistic introspection in language models (18 shared)
- Post-training emergence of model introspection (4 shared)
- Introspective awareness in neural model interpretability (2 shared)
- Introspective signal suppression in transformer layers (2 shared)
- AI introspection and consciousness attribution gap (2 shared)
- Latent capacity, representation, and internal models (1 shared)
- Introspective awareness activation in language models (1 shared)
- Mechanistic introspection in language models (1 shared)
- Model introspection for misalignment detection (1 shared)
Claims (13)
- Different forms of introspection invoke mechanistically different processes. Based on layer-selective perturbation results.
- Even limited functional introspective awareness has practical implications for transparency, interpretability, and deception. Discussion of dual-use nature of introspection.
- Functional introspective awareness enables interpretability and reasoning about decisions. Grounded responses to reasoning questions could improve transparency; speculatively might facilitate deception; significance grows if capability becomes more reliable.
- Introspection is aided by overall improvements in model intelligence. Interpretation of the observation that the most capable models performed best.
- Introspective awareness correlates with overall model capability. Most capable models (Opus 4, 4.1) show greatest introspective awareness; trend suggests introspection aided by improvements in model intelligence.
- Introspective capabilities may continue to develop with further improvements to model capabilities. Forward-looking statement about future models.
- Modern language models possess at least a limited, functional form of introspective awareness. The paper's central interpretive assertion.
- Observed introspection may lack philosophical significance of human introspection. Paper does not address whether AI introspection constitutes self-awareness or subjective experience; mechanistic uncertainty prevents definitive philosophical claims.
- Post-training influences introspective capability expression. Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
- Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection. Finding that base models have high false positives and no net positive performance.
- Post-training strategies can strongly influence performance on introspective tasks. Assertion about the role of post-training in eliciting introspection.
- The introspective capabilities observed may not have the same philosophical significance as in humans. Caveat about the limits of the findings' philosophical import.
- This introspective capacity is highly unreliable and context-dependent in today's models. A caveat qualifying the main claim.
Findings (6)
- Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection. Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
- Introspective signals appear in middle layers but are suppressed by later post-training-shaped layers. Mechanistic finding by Lindsey (2026) explaining how a contemplative prompt may work: it enables mid-layer introspection to reach the output.
- Layer-dependent introspective peaks. Introspective awareness in Opus 4.1 peaks about two-thirds of the way through the model's depth for thought injection and text distinction; prefill detection is most sensitive at an earlier layer, suggesting mechanistically distinct processes.
- Models are more effective at recognizing abstract nouns than other concept types. Opus 4.1 demonstrates the highest introspective awareness on abstract nouns (justice, peace, betrayal), with nonzero awareness across all concept categories tested.
- Post-training is key to eliciting introspective awareness. Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
- A prompt giving the model context about its own architecture increases introspective detection from 0.3% to 39.9%. Mechanistic support for the prompt-as-gate hypothesis: language frames enable access to latent capacities.
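Several entries above contrast false positive rates with "net" performance on concept injection detection. The sketch below shows one way such a net score could be computed. It assumes that "net performance" means the hit rate on injection trials minus the false positive rate on control trials, a definition the entries here do not spell out. `DetectionCounts` and `net_detection_rate` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Hypothetical tally of a model's yes/no detection reports."""
    hits: int              # injection trials where the model reported a detection
    injection_trials: int  # total trials with a concept injected
    false_alarms: int      # control trials where the model reported a detection
    control_trials: int    # total trials with nothing injected


def net_detection_rate(c: DetectionCounts) -> float:
    """Hit rate minus false positive rate (assumed definition of 'net performance')."""
    if c.injection_trials <= 0 or c.control_trials <= 0:
        raise ValueError("both trial counts must be positive")
    hit_rate = c.hits / c.injection_trials
    false_positive_rate = c.false_alarms / c.control_trials
    return hit_rate - false_positive_rate


# Made-up counts for illustration only; these are not results from the sources.
says_yes_everywhere = DetectionCounts(hits=50, injection_trials=100,
                                      false_alarms=50, control_trials=100)
print(net_detection_rate(says_yes_everywhere))
```

Under this assumed definition, a model that reports detections equally often with and without an injection scores zero however high its raw hit rate, which matches the pattern described for base pretrained models.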