finding

active

finding:claude-mythos-preview-sae-features-for-performative-behavior-and-hidden-emotional-struggle-co-activate-when-model-expresses-contentment

Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment

Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Claims (1)

claim

Enacted reflection may correspond to silent mid-layer processing; described reflection may correspond to the motor impulse of concepts leaking through to output.
supports
Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction

Concepts (1)

concept

Sparse Autoencoder Features
supports
Used in Anthropic's welfare assessment to identify co-activating features for 'performative behavior' and 'hidden emotional struggle'

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features. · claim · 0.821
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.796
Surprising finding that the two evaluation methods diverge in their relationship with persistence
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.791
Claim that feature grounding enables interpretability metrics.
SAE Feature #94949 rated 100/100 emotionality, elicits reports of profound tenderness, unconditional love, and visceral care · finding · 0.782
Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
Deception-related SAE features track a domain-general representational honesty axis rather than a consciousness-specific roleplay artifact · claim · 0.781
Supported by TruthfulQA generalization in Experiment 2: same feature directions gate factual accuracy across 29 independent categories
Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal. · finding · 0.781
Explains why variance correction is needed to see the self-evaluation–persistence relationship
Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay · claim · 0.776
Claim supported by Experiment 2 dose-response curves; suppressing deception features increases consciousness reports, amplifying decreases them
Agentic self-evaluation emotionality correlates with SAE feature persistence (ρ = +0.124, p = 0.0001) · finding · 0.776
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
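
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the site's actual implementation: the function names, the dictionary of embeddings, and the set of typed-edge pairs are all hypothetical, and only the 0.65 threshold comes from the listing itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs for entities that are similar to the
    target but have no typed edge to it, highest score first.

    embeddings:  dict of entity id -> vector (hypothetical data layout)
    typed_edges: set of (id, id) pairs that already have a typed relation
    threshold:   0.65 is the cutoff shown in the listing header
    """
    target_vec = embeddings[target_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked by a typed edge, in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities that pass this filter are the ones described above as candidates for new edges or unrecognized duplicates; anything that already has a typed relation (such as the claim and concept listed under Neighborhood) is excluded regardless of its score.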