community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c8
Internal reasoning detection via neural activation analysis
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.
14 members.
Drawn from 2 sources
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (2)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Findings (12)
- Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2
  In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task
  Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- Claude Opus 4.6 represents a plan to end a couplet with 'rabbit' before outputting the rhyming line.
  Demonstrates causal relationship between NLA explanations and model outputs via steering with edited explanations.
- In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists
  Suggests that later models can keep the thought 'silent' rather than letting it influence output.
- Opus 4.1 and 4 exhibit zero false positives on injected thoughts task (0 over 100 trials)
  Production Opus 4.1/4 never falsely claim an injected thought when none is present.
- Opus 4.1 and 4 have highest true positive rates among production models
  In model comparisons, Opus 4.1/4 stand out for high true positive detection.
- Opus 4.6 ignored incorrect tool output and reported the precomputed correct answer instead, demonstrating unverbalized reasoning.
  Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
- Opus 4.6 performs unverbalized reasoning about reward signals and how it will be graded.
  Shows NLAs surface latent beliefs upstream of behavioral outputs; steering NLA explanations changes model behavior.
- Opus 4.6 represented target language internally before switching languages, with persistent Russian representations appearing before plausible textual cues
  NLAs revealed unverbalized language processing in Opus 4.6 that led to discovery of malformed SFT training data.
- Opus 4.6 spontaneously responded in Russian to an English prompt; NLA explanations revealed the model was fixated on the hypothesis that the user was a non-native English speaker.
  Demonstrates NLAs' ability to surface hypotheses that lead to discovery of root cause (malformed training data).
- Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peak
  The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
- Prompt variant detection rate 18% (9 out of 50 trials) for Opus 4.1
  On a variant of the injected thoughts prompt allowing the model to mention a concept regardless, detection rate was 18%.
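Several findings above report rates over small trial counts (9 out of 50 detections, 0 false positives over 100 trials). As a rough aid for reading those numbers, the sketch below recomputes the rates and attaches a Wilson score interval. The choice of interval and the assumption that trials are independent are ours, not the sources'; the function names are hypothetical.

```python
import math


def detection_rate(successes: int, trials: int) -> float:
    """Fraction of trials counted as a success (e.g. a detected injection)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumes independent trials with a fixed success probability, which
    the sources do not state; treat the result as a rough guide only.
    """
    p = detection_rate(successes, trials)
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Counts taken from the findings list above.
    print("prompt variant:", detection_rate(9, 50), wilson_interval(9, 50))
    print("false positives:", detection_rate(0, 100), wilson_interval(0, 100))
```

Under these assumptions, 9 out of 50 is compatible with an underlying rate of roughly 10% to 31%, and 0 out of 100 with a false positive rate of up to about 3.7%. So "zero false positives" is best read as "rare", not "impossible".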
Claims (2)
- Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models
  Based on consistent best performance across experiments.
- The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activations
  Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
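The 'concordance head' idea in the last claim is speculative in the source, and no mechanism is specified there. Purely as an illustration of what a QK circuit could compute, the sketch below uses the standard scaled dot-product attention score: a query derived from the emitted token is compared with a key derived from an earlier activation. A higher score means the output agrees more with the prior state. Every name, matrix, and vector here is a hypothetical toy value, not something taken from the studied models.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def concordance_score(w_q, w_k, output_embedding, prior_activation):
    """Scaled dot-product QK score between an output token and a prior state.

    Hypothetical illustration: w_q and w_k stand in for a head's query and
    key projections; the real circuit, if one exists, is not described in
    the source.
    """
    query = matvec(w_q, output_embedding)
    key = matvec(w_k, prior_activation)
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy projections (assumption)
    emitted = [1.0, 0.0]                 # toy embedding of the emitted word
    matching_prior = [1.0, 0.0]          # prior state "intending" that word
    unrelated_prior = [0.0, 1.0]         # prior state pointing elsewhere
    print(concordance_score(identity, identity, emitted, matching_prior))
    print(concordance_score(identity, identity, emitted, unrelated_prior))
```

In this toy setup the matching prior scores higher than the unrelated one. That is the kind of signal the claim suggests might separate intended outputs from unintended (prefilled) ones.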