community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c2
Mechanistic introspection in language models
Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
23 members.
Sub-communities (7)
Finer clusters this community splits into. Each is its own community page.
- Post-training emergence of model introspection (4)
- Introspective awareness activation in language models (3)
- AI introspection and consciousness attribution gap (3)
- Mechanistic introspection in language models (2)
- Introspective awareness in neural model interpretability (2)
- Model introspection for misalignment detection (2)
- Introspective signal suppression in transformer layers (2)
Drawn from 3 sources
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (10)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
- Mechanistic interpretability & model evaluation (23 shared)
- LLM functional introspective awareness (18 shared)
- Post-training emergence of model introspection (4 shared)
- AI introspection and consciousness attribution gap (3 shared)
- Introspective awareness activation in language models (3 shared)
- Mechanistic introspection in language models (2 shared)
- Introspective awareness in neural model interpretability (2 shared)
- Model introspection for misalignment detection (2 shared)
- Introspective signal suppression in transformer layers (2 shared)
- LLM introspective awareness of injected concepts (1 shared)
Claims (14)
- Different forms of introspection invoke mechanistically different processes. Based on layer-selective perturbation results.
- Even limited functional introspective awareness has practical implications for transparency, interpretability, and deception. Discussion of dual-use nature of introspection.
- Functional introspective awareness enables interpretability and reasoning about decisions. Grounded responses to reasoning questions could improve transparency; speculatively might facilitate deception; significance grows if capability becomes more reliable.
- Introspection is aided by overall improvements in model intelligence. Interpretation of the observation that the most capable models performed best.
- Introspective awareness correlates with overall model capability. Most capable models (Opus 4, 4.1) show greatest introspective awareness; trend suggests introspection aided by improvements in model intelligence.
- Introspective capabilities may continue to develop with further improvements to model capabilities. Forward-looking statement about future models.
- Modern language models possess at least a limited, functional form of introspective awareness. The paper's central interpretive assertion.
- Observed introspection may lack philosophical significance of human introspection. Paper does not address whether AI introspection constitutes self-awareness or subjective experience; mechanistic uncertainty prevents definitive philosophical claims.
- Post-training influences introspective capability expression. Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
- Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection. Finding that base models have high false positives and no net positive performance.
- Post-training strategies can strongly influence performance on introspective tasks. Assertion about the role of post-training in eliciting introspection.
- Significant gap in formal methodologies for AI introspection that bridge theoretical consciousness frameworks with practical implementation. Core finding of the literature search; identifies the main research gap the paper's methodology aims to address.
- The introspective capabilities observed may not have the same philosophical significance as in humans. Caveat about the limits of the findings' philosophical import.
- This introspective capacity is highly unreliable and context-dependent in today's models. A caveat qualifying the main claim.
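The post-training claims above mention false positives and net performance without defining the score. A minimal sketch of one plausible reading, assuming net performance means the detection rate on injected trials minus the false-positive rate on control trials. The `TrialCounts` type, the function name, and the counts in the usage note are hypothetical and not taken from the sources.

```python
from dataclasses import dataclass


@dataclass
class TrialCounts:
    """Hypothetical tally for one model on a concept-injection detection task."""
    injected_trials: int    # trials in which a concept vector was injected
    injected_detected: int  # injected trials in which the model reported an injection
    control_trials: int     # trials with no injection
    control_detected: int   # control trials in which the model falsely reported one


def net_detection_rate(counts: TrialCounts) -> float:
    """Hit rate minus false-positive rate (an assumed definition of net performance)."""
    if counts.injected_trials <= 0 or counts.control_trials <= 0:
        raise ValueError("both trial counts must be positive")
    hit_rate = counts.injected_detected / counts.injected_trials
    false_positive_rate = counts.control_detected / counts.control_trials
    return hit_rate - false_positive_rate
```

Under this reading, a model with made-up counts of 30 detections in 100 injected trials but 35 false reports in 100 control trials scores below zero, which is how a high false-positive rate can cancel apparent detections.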
Findings (9)
- Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection. Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
- Concept injection at strength 2 does not increase affirmative responses on unrelated yes/no questions. Control experiment rules out the possibility that concept vectors simply bias the model to answer affirmatively.
- Detecting Unintended Outputs via Introspection. Models can distinguish artificially prefilled outputs from intentional responses by referencing prior internal representations; injection of matching concept vector causes model to retroactively accept prefill as intentional.
- Introspective awareness peaks at a layer about two-thirds through Opus 4.1 for injected thoughts. The success rate shows a sharp peak at a specific middle layer.
- Introspective signals appear in middle layers but are suppressed by later post-training-shaped layers. Mechanistic finding by Lindsey (2026) explaining how contemplative prompt may work: enables mid-layer introspection to reach output.
- Layer-dependent introspective peaks. Introspective awareness in Opus 4.1 peaks at layer ~2/3 through model depth for thought injection and text distinction; prefill detection most sensitive to earlier layer, suggesting mechanistically distinct processes.
- Post-training is key to eliciting introspective awareness. Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training essential for introspection.
- Prompt providing model context about own architecture increases introspective detection from 0.3% to 39.9%. Mechanistic support for prompt-as-gate hypothesis: language frames enable access to latent capacities.
- Random vectors at injection strength 8 elicit introspective awareness in 9 out of 100 trials. Random vectors are less effective than concept vectors: even at this strength they produce introspection at lower rates.
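Several findings above vary the injection strength and the layer at which a vector is injected. A toy sketch of what such an intervention could look like, assuming concept injection adds a scaled vector to the activations at every token position of one layer. This is an assumption about the mechanism, not a description of the sources' implementation, and `inject_concept`, `layer_at_fraction`, and the 48-layer depth in the usage note are hypothetical.

```python
from typing import List

Vector = List[float]


def inject_concept(hidden: List[Vector], concept: Vector, strength: float) -> List[Vector]:
    """Return a copy of `hidden` with `strength * concept` added at every position.

    Assumption: injection is purely additive; a real experiment may normalize
    or otherwise rescale the vector first.
    """
    for row in hidden:
        if len(row) != len(concept):
            raise ValueError("activation and concept dimensions must match")
    return [[h + strength * c for h, c in zip(row, concept)] for row in hidden]


def layer_at_fraction(num_layers: int, fraction: float) -> int:
    """Zero-based index of the layer nearest to `fraction` of the model's depth."""
    if num_layers < 1 or not 0.0 <= fraction <= 1.0:
        raise ValueError("need num_layers >= 1 and 0 <= fraction <= 1")
    return round(fraction * (num_layers - 1))
```

For a hypothetical 48-layer model, `layer_at_fraction(48, 2 / 3)` returns 31, the kind of index a layer sweep would centre on when looking for the peak about two-thirds of the way through the model.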