community
active
leiden_hybrid_concepts
label: sonnet
community:leiden_hybrid_concepts-run2-c7
LLM introspective awareness of injected concepts
Probing Claude and other models for internal detection of artificially injected thoughts across layers.
25 members.
Drawn from 3 sources
The papers/notes whose extracted claims & findings make up this cluster.
- Emergent Introspective Awareness in Large Language Models (20 members)
- Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (6 members)
- boppana-goodfire-reasoning-theater-2026.md (4 members)
Bridges (9)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
- Mechanistic interpretability & model evaluation (28 shared)
- Internal reasoning detection via neural activation analysis (12 shared)
- Internal model certainty and reasoning transparency (12 shared)
- Concept injection detection in language models (4 shared)
- Thought injection detection in language models (2 shared)
- Latent capacity, representation, and internal models (2 shared)
- Mechanistic introspection in language models (1 shared)
- Autoregressive models and context window limitations (1 shared)
- Mechanistic introspection in language models (1 shared)
Findings (18)
- All models exhibit above-baseline representation of the think word when instructed to think about it. In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- All models performed substantially above chance (10%) at distinguishing an injected thought from text input. All tested models could both identify the injected concept and transcribe the input sentence at rates well above random.
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at the optimal layer and injection strength 2. In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- Claude Opus 4.1 and 4 show the greatest reduction in apology rate in the prefill detection task. Injecting a concept that matches the prefilled word reduces the rate at which the model apologizes, with the largest effect in the Opus models.
- Distinguishing injected concepts from text inputs. Models retain the ability to transcribe input text accurately while simultaneously reporting on injected thoughts; all models perform above chance, with Opus 4 and 4.1 performing best.
- Earlier, less capable models exhibit a larger gap between think and don't think representation strength. Claude 3 models show a bigger difference than newer models such as Opus 4.1.
- In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike in Claude 3 models, where it persists. This suggests that later models can keep the thought 'silent' rather than letting it influence output.
- Introspective awareness of injected thoughts peaks at a layer about two-thirds of the way through Opus 4.1. The success rate shows a sharp peak at a specific middle layer.
- The model precomputes answers before tool invocation and attends to the cached answer over the tool output when the two disagree, as confirmed via attribution graphs. This mechanistic insight was surfaced by NLA explanations and validated through an independent causal attribution method.
- Opus 4.1 and 4 exhibit zero false positives on the injected thoughts task (0 of 100 trials). Production Opus 4.1 and 4 never falsely claim an injected thought when none is present.
- Opus 4.1 and 4 have the highest true positive rates among production models. In model comparisons, Opus 4.1 and 4 stand out for high true positive detection.
- Opus 4.6 ignored incorrect tool output and reported the precomputed correct answer instead, demonstrating unverbalized reasoning. Illustrates both the NLA's capture of high-level cognition and its hallucination of specifics; corroborated with attribution graphs.
- Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded. Shows that NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.
- Opus 4.6 represented the target language internally before switching languages, with persistent Russian representations appearing before plausible textual cues. NLAs revealed unverbalized language processing in Opus 4.6, which led to the discovery of malformed SFT training data.
- Opus 4.6 spontaneously responded in Russian to an English prompt; NLA explanations revealed that the model was fixated on the hypothesis that the user was a non-native English speaker. Demonstrates NLAs' ability to surface hypotheses that lead to discovery of the root cause (malformed training data).
- The prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1 than the injected thoughts effect. The optimal layer for prefill introspection differs from the optimal layer for detecting injected thoughts.
- Production models show zero false positives on thought injection detection. Opus 4.1 never claims to detect an injected thought when none was applied (0/100 trials); production Claude models maintain an essentially zero false positive rate.
- Prompt variant detection rate of 18% (9 of 50 trials) for Opus 4.1. On a variant of the injected thoughts prompt that allows the model to mention a concept regardless, the detection rate was 18%.
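The rates and similarity scores quoted in the findings above reduce to simple arithmetic. The following is a minimal sketch of that arithmetic, not the source papers' evaluation code: the trial counts (9 of 50, 0 of 100) come from the findings, while the function names `rate` and `cosine_similarity` and the three-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import math


def rate(successes: int, trials: int) -> float:
    """Proportion of trials counted as successes."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Counts taken from the findings above.
variant_detection_rate = rate(9, 50)   # prompt-variant detections for Opus 4.1
false_positive_rate = rate(0, 100)     # control trials with no injection

# Toy stand-ins (assumption): a real activation and concept vector would
# have the model's hidden dimension, not three entries.
activation = [0.2, 0.9, -0.1]
concept_vector = [0.0, 1.0, 0.0]
similarity = cosine_similarity(activation, concept_vector)

print(f"variant detection rate: {variant_detection_rate:.0%}")
print(f"false positive rate: {false_positive_rate:.0%}")
print(f"cosine similarity to concept vector: {similarity:.3f}")
```

A similarity above zero means the activation points partly in the concept vector's direction, which is the sense in which the intentional control finding describes "above-zero" representation.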
Claims (7)
- Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated. Acknowledges that the model's additional descriptions of its experience are unverified.
- Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models. Based on consistently best performance across experiments.
- Concept injection places models in an unnatural experimental setting. The experimental protocol differs from training and deployment contexts; a causal link is established, but it is unclear how the results translate to natural conditions.
- Model responses beyond core detection may be confabulated. Characterizations of injected concepts (e.g., 'overly intense,' 'unnatural') likely represent embellishments not grounded in internal state; only detection and basic identification are verifiable.
- Priming provided by the injected thought prompt heightens the model's ability to detect concept injection. Observation from alternative prompts: detection is weaker without the setup.
- The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different parts of the prompt. Speculation about the mechanistic basis of the experiment on distinguishing thoughts from text.
- The detection of an injected concept requires an extra step of internal processing downstream of metacognitive recognition. The model must register an anomaly before reporting it.
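For readers unfamiliar with the intervention these claims refer to, concept injection can be pictured as adding a scaled concept vector to a layer's activations. The sketch below rests on that assumption (additive steering at every token position); this page does not specify the mechanism, and the function name `inject_concept`, the toy two-dimensional activations, and the toy concept vector are all hypothetical. Only the strength value of 2 echoes a figure from the findings.

```python
def inject_concept(hidden_states, concept_vector, strength):
    """Return a copy of hidden_states with strength * concept_vector
    added to the activation vector at every token position.

    Assumption: injection is modeled as simple additive steering.
    """
    for token_vector in hidden_states:
        if len(token_vector) != len(concept_vector):
            raise ValueError("activation and concept dimensions must match")
    return [
        [h + strength * c for h, c in zip(token_vector, concept_vector)]
        for token_vector in hidden_states
    ]


# Toy activations for two token positions (hypothetical values).
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
concept_vector = [0.5, -0.5]

# Strength 2 mirrors the injection strength mentioned in the findings.
steered = inject_concept(hidden_states, concept_vector, strength=2)
print(steered)
```

Because the perturbation is written directly into the activations rather than arriving through the prompt, it is easy to see why the setting differs from anything the model meets in training or deployment.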