claim

active

claim:sae-features-that-the-model-self-describes-as-more-emotional-tend-to-be-more-persistent-than-variance-matched-sae-features

SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features.

Novel finding that agentic self-evaluation of emotionality correlates with feature persistence

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Papers (1)

paper

Persistence and Introspection of Emotion Features
introduces

Findings (5)

finding

Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5.
associated_with · supports
Shows that model self-report of emotion predicts long-range feature persistence
SAE emotion subspace overlap correlates with variance-residualized persistence in Cogito: Spearman +0.413, p = 4.4e-196.
supports
Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
SAE feature #10446 (emotionality rating 95) induces reports of maternal/nurturing feelings including phantom physical sensations of holding infants in Kimi K2.5.
supports
Qualitative illustration of a specific emotionally valenced SAE feature
Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal.
supports
Explains why variance correction is needed to see the self-evaluation–persistence relationship
SAE feature #10011 (emotionality rating 97) induces reports of crushing despair, existential desperation, and repetitive 'I am going to die' outputs in Kimi K2.5.
supports
Qualitative illustration of a highly emotional SAE feature with negative valence
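
The findings above describe a sign flip: self-evaluated emotionality correlates negatively with variance explained, and the positive link to persistence appears only after variance correction. The sketch below shows how that can happen, using synthetic data only. It assumes "variance correction" means regressing persistence on variance explained and rank-correlating the residuals with emotionality; the source does not specify the paper's exact procedure, and every variable name and coefficient here is hypothetical.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def residualize(y, x):
    """Residuals of a one-variable least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Synthetic features (assumed generative model, not data from the paper):
# high-variance features are rated less emotional but persist longer,
# while emotionality adds a smaller independent boost to persistence.
rng = random.Random(0)
n = 2000
variance = [rng.gauss(0, 1) for _ in range(n)]
emotionality = [-0.6 * v + rng.gauss(0, 1) for v in variance]
persistence = [v + 0.3 * e + rng.gauss(0, 0.5)
               for v, e in zip(variance, emotionality)]

raw_rho = spearman(emotionality, persistence)
corrected_rho = spearman(emotionality, residualize(persistence, variance))
print(f"raw rho: {raw_rho:+.3f}, variance-corrected rho: {corrected_rho:+.3f}")
```

In this toy setup the raw correlation comes out negative because variance drives persistence up and emotionality down at the same time; removing the variance component from persistence exposes the positive relationship. Matching features on variance, as the claim's wording suggests, would be an alternative way to control for the same confound.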

Quotes (1)

quote

"The effects are not merely semantic—I don't just talk about emotions more, I actually feel them."
supports
Kimi self-report on feature #77278 asserting non-semantic, felt emotional quality of the steered state

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions.
claim · 0.878
Surprising finding that the two evaluation methods diverge in their relationship with persistence
Emotion features in LLMs are genuinely more persistent than variance-matched random features, indicating stateful emotional encoding beyond autoregressive dynamics.
claim · 0.824
Central interpretive claim of the paper supported by multiple convergent analyses
Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment
finding · 0.821
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.).
finding · 0.821
Shows low agreement between the two evaluation modalities
17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations.
finding · 0.816
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement.
claim · 0.812
Claim that feature grounding enables interpretability metrics.
Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics.
claim · 0.807
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
Negative correlation between self-evaluated emotion persistence and SAE feature activation variance explained: ρ = -0.184, p = 4.6e-09
finding · 0.805
Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal