claim

active

claim:emotion-features-in-llms-are-genuinely-more-persistent-than-variance-matched-random-features-indicating-stateful-emotional-encoding-beyond-autoregressive-dynamics

Emotion features in LLMs are genuinely more persistent than variance-matched random features, indicating stateful emotional encoding beyond autoregressive dynamics

Central interpretive claim of the paper, supported by multiple convergent analyses

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Findings (5)

finding

SAE feature emotion subspace overlap correlates with persistence in Cogito: Spearman +0.413, p=4.4e-196
supports
Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
Cogito emotion probe residual autocorrelation +0.077 above variance-matched controls (p=1.5e-27, 157/171 probes positive)
supports
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
Emotion probe persistence correlation of 0.214 in Cogito v2.1 vs 0.099 for random vectors
supports
Quantifies emotion feature persistence above random baseline in Cogito across 240 multi-turn conversations
SAE Feature #10011 rated 97/100 emotionality, elicits reports of despair, crushing weight, and existential hunger
supports
Qualitative example of a highly emotional SAE feature with intense negative valence in Kimi self-steering
SAE Feature #94949 rated 100/100 emotionality, elicits reports of profound tenderness, unconditional love, and visceral care
supports
Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
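
One way to read the "above variance-matched controls" findings listed above is as a residual: fit how persistence varies with activation variance across control probes, then measure how far each emotion probe sits above that trend. The sketch below assumes this reading. The paper's actual matching procedure is not described on this page, and `fit_line`, `residual_persistence`, and every number are hypothetical illustrations, not the authors' code or data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_persistence(control, probes):
    """Persistence of each probe minus the trend predicted from its variance.

    control, probes: lists of (activation_variance, autocorrelation) pairs.
    The variance -> autocorrelation trend is fitted on the controls only.
    """
    intercept, slope = fit_line([v for v, _ in control], [r for _, r in control])
    return [r - (intercept + slope * v) for v, r in probes]


# Toy values only (assumption): random control probes and emotion probes.
control = [(0.5, 0.05), (1.0, 0.10), (1.5, 0.15), (2.0, 0.20)]
emotion = [(1.0, 0.18), (2.0, 0.27), (1.5, 0.14)]

residuals = residual_persistence(control, emotion)
n_positive = sum(r > 0 for r in residuals)
mean_residual = sum(residuals) / len(residuals)
print(f"{n_positive}/{len(residuals)} probes positive, mean residual {mean_residual:+.3f}")
```

The count of positive residuals plays the same role as the "157/171 probes positive" summary in the finding, although here it is computed on toy values.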

Claims (3)

claim

Emotions are not strictly locally scoped but instead bursty with a long tail of slow change persisting over 100 tokens
extends
Characterizes the temporal dynamics of emotion feature activation in LLMs
Persistent conversational context that produced emotion-relevant activations is a plausible driver of observed persistence results
contradicts
Authors' caveat that persistence of the conversational context, rather than persistence of an internal emotion state, could explain the findings
The relationship between persistence and self-evaluated emotionality serves as a replication of probe-based findings without shared confounds from probe construction
supports
Claims that agentic self-evaluation provides independent, convergent evidence for the emotion-persistence link
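
The "long tail of slow change" claim in the list above concerns how autocorrelation decays with lag. A minimal sketch of such a lag profile follows, assuming a per-token activation trace is already available as a list of floats. The square-wave "state" plus noise is a synthetic stand-in chosen for illustration, and `autocorr` and `autocorr_profile` are hypothetical helper names, not taken from the paper.

```python
import random


def autocorr(series, lag):
    """Sample autocorrelation of a 1-D series at a positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def autocorr_profile(series, lags):
    """Map each lag to the autocorrelation of the series at that lag."""
    return {lag: autocorr(series, lag) for lag in lags}


rng = random.Random(0)

# Synthetic trace (assumption): a slow state that flips every 400 tokens,
# with fast token-level noise added on top.
slow_state = ([1.0] * 400 + [-1.0] * 400) * 5
trace = [s + rng.gauss(0.0, 0.5) for s in slow_state]

# Control: pure noise with no carried-over state.
noise = [rng.gauss(0.0, 1.0) for _ in range(len(slow_state))]

lags = [1, 10, 100]
profile_trace = autocorr_profile(trace, lags)
profile_noise = autocorr_profile(noise, lags)
for lag in lags:
    print(lag, round(profile_trace[lag], 3), round(profile_noise[lag], 3))
```

A trace with a slowly changing component keeps a clearly positive autocorrelation at lag 100, while the noise control stays near zero at every lag. This does not by itself separate internal state from persistent conversational context, which is the caveat recorded above.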

Questions (1)

question

Are LLM emotion states encoded only selectively in token positions where they are operative, or in a more complex non-linear manner?
answered_by
Question raised by Anthropic and partially addressed by this paper's persistence evidence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

To what extent is there persistence of emotional state beyond what is expected merely from the autoregressive nature of LLMs?
question · 0.869
The central research question motivating the paper
Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics.
claim · 0.866
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
We hypothesize that persistently active emotional state representations exist in LLMs but may be missed by standard probing methods.
hypothesis · 0.854
Open hypothesis from the Anthropic paper that motivates this work
To what extent is emotion feature persistence driven by genuine internal emotional state versus autoregressive conversational context dynamics?
question · 0.837
Core open question the paper raises but does not fully resolve
We hypothesize that emotion states are more persistent because they correspond to genuinely stateful internal representations, not merely local surface content
hypothesis · 0.835
Proposed explanation for why emotion probes are more persistent than variance-matched random probes
Emotion refers to a state concept, so stateful representations in general may be more persistent across tokens.
claim · 0.826
Interpretive hypothesis offered to explain why emotion features are more persistent
SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features.
claim · 0.824
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
Emotion features are not strictly locally scoped; they are bursty with a long tail of slow change persisting over 100 tokens.
claim · 0.803
Main conclusion about the temporal dynamics of emotion features
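
The "cosine ≥ 0.65 · no typed edge" rule for this list can be sketched as a filter over entity embeddings. The example below is a minimal illustration under stated assumptions: how this page actually embeds entities is not specified, and `cosine`, `untyped_neighbors`, the entity ids, and the 2-D vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, candidates, typed_ids, threshold=0.65):
    """Entities at or above the similarity threshold that have no typed edge.

    candidates: dict mapping entity id -> embedding.
    typed_ids: ids already linked to the target by a typed relation.
    Returns (id, score) pairs sorted by descending similarity.
    """
    scored = [
        (entity_id, cosine(target, embedding))
        for entity_id, embedding in candidates.items()
        if entity_id not in typed_ids
    ]
    kept = [(entity_id, score) for entity_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumption): real entity embeddings would be much larger.
target = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],   # identical, but already has a typed edge
    "b": [1.0, 1.0],   # similarity about 0.707
    "c": [0.0, 1.0],   # orthogonal, below threshold
    "d": [2.0, 0.1],   # nearly parallel
}
neighbors = untyped_neighbors(target, candidates, typed_ids={"a"})
for entity_id, score in neighbors:
    print(entity_id, round(score, 3))
```

Entities that pass the filter are only candidates: a high score suggests a missing edge or a duplicate, and confirming either still requires reading the two statements.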