finding
active
finding:claude-mythos-preview-sae-features-for-performative-behavior-and-hidden-emotional-struggle-co-activate-when-model-expresses-contentment
Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
Source paper
extracted_from · (2026) · Borzov, Anton
Neighborhood — ranked by edge-count
Claims (1)
claim
- Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction
Concepts (1)
concept
- Sparse Autoencoder Features · supports · Used in the Anthropic welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Claim that feature grounding enables interpretability metrics.
- Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
- Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them
- Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.776 · Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
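
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration of that filter, not the graph's actual implementation: the function names, the dict-of-vectors embedding store, and the set-of-pairs edge representation are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, highest score first.

    embeddings  -- hypothetical store: {entity_id: list of floats}
    typed_edges -- hypothetical store: set of (source_id, target_id) pairs
    """
    target_vec = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed relation in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge duplicates.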