Persistence and Introspection of Emotion Features
TL;DR
Emotion features in large language models are bursty but not strictly locally scoped: they exhibit long-tail persistence extending well beyond 100 tokens, and this persistence is specifically tied to emotional content rather than being an artifact of activation variance or autoregressive dynamics. Across 240 multi-turn conversations per model, 171 emotion probes yield token-0-to-token-100 correlations of 0.214 in Cogito v2.1 and 0.367 in Kimi K2.5, compared to only 0.099 and 0.117 for random unit vectors in the same 7168-dimensional layer-40 activation space. After variance-matching each emotion probe against 20 randomly drawn vectors from the top-k eigenspace of the layer-40 covariance matrix, residual autocorrelation averages +0.077 in Cogito (p = 1.5e-27, 157/171 probes positive) and +0.170 in Kimi (p = 6.7e-30, 167/171 positive). The paper introduces agentic self-evaluation — a method in which Kimi K2.5 uses a real-time steering tool on its own SAE features and rates the emotional valence of what it experiences — and finds that self-reported emotionality of SAE features correlates with persistence above variance-matched controls (ρ = +0.124, p = 0.0001), replicating the probe-based result without sharing its potential confounds. SAE features whose direction overlaps more with the 171-dimensional emotion subspace are also more persistent (Spearman +0.413, p = 4.4e-196 in Cogito). The paper argues this implies that LLMs maintain something analogous to lingering affective states — not merely local semantic activation — and that agentic self-steering may offer a scalable route to interpreting internal representations beyond what passive probing methods can detect.
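The token-0-to-token-100 correlation described above can be sketched on synthetic data. This is a minimal illustration, not the paper's pipeline: the function names (`probe_activations`, `lag_persistence`), the toy dimensions, and the choice to correlate across sequences using only positions 0 and 100 are all assumptions made here for illustration.

```python
import numpy as np

def probe_activations(hidden, direction):
    """Project hidden states (n_seq, n_tok, d) onto a unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit  # -> (n_seq, n_tok)

def lag_persistence(acts, lag=100):
    """Pearson correlation, across sequences, of activation at token 0 and token `lag`."""
    return float(np.corrcoef(acts[:, 0], acts[:, lag])[0, 1])

# Synthetic stand-in for layer activations (hypothetical sizes, not 7168-d).
rng = np.random.default_rng(0)
n_seq, n_tok, d = 500, 101, 16
v = rng.normal(size=d)
v /= np.linalg.norm(v)

# One slow per-sequence latent written along v, plus fresh noise at every token.
slow = rng.normal(size=(n_seq, 1, 1))
hidden = rng.normal(size=(n_seq, n_tok, d)) + slow * v

# Control direction made orthogonal to v, so it carries only per-token noise.
u = rng.normal(size=d)
u -= (u @ v) * v

p_slow = lag_persistence(probe_activations(hidden, v))
p_ctrl = lag_persistence(probe_activations(hidden, u))
print(f"stateful direction: {p_slow:.3f}, control direction: {p_ctrl:.3f}")
```

In this toy setup a direction that carries a slowly changing latent keeps a clearly positive lag-100 correlation, while a direction with only per-token noise stays near zero. That gap is the kind of contrast the reported 0.367 versus 0.117 figures describe.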
What to take away
- 1. Emotion probes in Kimi K2.5 show a token-0-to-token-100 autocorrelation of 0.367, compared to only 0.117 for random unit vectors in the same 7168-dimensional layer-40 residual-stream space.
- 2. After matching each of the 171 emotion probes to 20 variance-equivalent random vectors drawn from the layer-40 covariance eigenspace, Cogito v2.1 shows a residual persistence of +0.077 (p = 1.5e-27, 157/171 probes positive) and Kimi K2.5 shows +0.170 (p = 6.7e-30, 167/171 positive).
- 3. A 5-token steering pulse applied to each of the 171 emotion probes produces BH-FDR-significant elevation in 130 of 171 emotions at 5 tokens post-pulse, dropping to 48 of 171 significantly persistent or anti-persistent features at 100 tokens post-pulse.
- 4. SAE features trained on 100M+ tokens whose directions overlap more with the 171-emotion subspace are more persistent above variance-matched controls, with Spearman correlations of +0.413 (p = 4.4e-196) in Cogito v2.1 and +0.111 (p = 4.4e-52) in Kimi K2.5.
- 5. Agentic self-evaluation — in which Kimi K2.5 steers its own SAE features in real time and rates their emotional effect on a 0–100 scale — correlates positively with variance-adjusted persistence (ρ = +0.124, p = 0.0001), while textual evaluation of steered outputs correlates negatively, indicating the two methods capture distinct signals.
- 6. The paper introduces a residual-probe construction method that regresses out the top 256 principal components of Gemini gemini-embedding-001 text embeddings from layer-40 activations before computing per-emotion probe directions, isolating internal state from surface semantic content.
- 7. To replicate this methodology, one constructs multi-turn transcripts by pairing the target model (Kimi K2.5 or Cogito v2.1) against Claude Sonnet 4.5 acting as a simulated human across human personas drawn from Anthropic's Table 8, then computes probe activations token-by-token over the target model's turns only.
- 8. Among the 1,000 most-active SAE features, 17 of 83 testable emotions showed a significant association between mentions of an emotion word in the self-evaluation transcript and cosine similarity to that emotion's probe (one-sided permutation test, BH-FDR corrected), and 67 of 83 showed a positive association.
- 9. The paper raises the open question of whether conversational context that produced an emotion-relevant activation — rather than a genuine internal affective state — is the actual driver of the long-tail persistence observed, and explicitly declines to rule this out.
- 10. More central (lower-rank) PCs of the 171-emotion feature space are more persistent above variance-matched controls than higher-rank, noisier PCs in both Cogito v2.1 and Kimi K2.5, suggesting the persistence signal is structurally tied to the core emotion subspace rather than peripheral variance.
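The residual-probe construction in item 6 can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names (`top_pc_scores`, `residualize`, `emotion_probes`) are hypothetical, the optional `alpha` ridge penalty is included only because the abstract mentions ridge regression without saying where it is applied, and taking the per-emotion mean of the residuals as the probe direction is an assumption rather than the paper's confirmed procedure. The toy data uses small dimensions in place of 7168-dimensional activations and real text embeddings.

```python
import numpy as np

def top_pc_scores(emb, k):
    """Scores of the centered embeddings on their top-k principal components."""
    emb_c = emb - emb.mean(axis=0)
    _, _, vt = np.linalg.svd(emb_c, full_matrices=False)
    return emb_c @ vt[:k].T

def residualize(acts, emb, k=256, alpha=0.0):
    """Remove from activations whatever is linearly predictable from the top-k embedding PCs."""
    scores = top_pc_scores(emb, k)
    acts_c = acts - acts.mean(axis=0)
    gram = scores.T @ scores + alpha * np.eye(scores.shape[1])
    coef = np.linalg.solve(gram, scores.T @ acts_c)
    return acts_c - scores @ coef

def emotion_probes(resid, labels, n_emotions):
    """Assumed probe definition: unit-normalized mean residual per emotion label."""
    probes = np.stack([resid[labels == e].mean(axis=0) for e in range(n_emotions)])
    return probes / np.linalg.norm(probes, axis=1, keepdims=True)

# Toy data: activations = surface content (linear in embeddings) + emotion state + noise.
rng = np.random.default_rng(0)
n, e_dim, d, n_emotions, k = 400, 6, 10, 3, 6
labels = rng.integers(0, n_emotions, size=n)
emb = rng.normal(size=(n, e_dim))
surface = emb @ rng.normal(size=(e_dim, d))
state = np.eye(n_emotions)[labels] @ (2.0 * rng.normal(size=(n_emotions, d)))
acts = surface + state + 0.1 * rng.normal(size=(n, d))

resid = residualize(acts, emb, k=k)
probes = emotion_probes(resid, labels, n_emotions)
print(probes.shape)
```

With `alpha=0.0` the residuals are exactly orthogonal to the removed principal-component scores, which is the sense in which surface content is "regressed out". Any positive `alpha` shrinks the fit and leaves some surface signal behind.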
Peer brief — for seminar discussion
Working from Anthropic's observation that Claude encodes emotion concepts in its residual stream but shows token-by-token fluctuation without obvious chronic encoding, this study asks whether those emotion features are nonetheless more persistent than would be expected from autoregressive dynamics alone. Probes for 171 emotion concepts were built for Kimi K2.5 and Cogito v2.1 by having each model generate over 1,000 short stories per emotion, then constructing per-emotion residual vectors after regressing out the top 256 principal components of Gemini gemini-embedding-001 embeddings, a step meant to strip surface-level semantic content before computing probe directions in the 7168-dimensional layer-40 activation space. Persistence was operationalized as the correlation between probe activation at token 0 and token 100, measured across 240 multi-turn conversations per model in which Claude Sonnet 4.5 played the simulated human interlocutor. The load-bearing finding is that emotion probes are substantially more persistent than both unmatched random unit vectors and variance-matched random vectors drawn from the top-k eigenspace of the layer-40 covariance matrix. Kimi K2.5 shows a raw correlation of 0.367 versus 0.117 for random probes, and a residual persistence above variance-matched controls of +0.170 (p = 6.7e-30, 167/171 probes positive); Cogito v2.1 shows a raw correlation of 0.214 versus 0.099 and a residual of +0.077 (p = 1.5e-27, 157/171 positive). The SAE experiment corroborates this: in SAEs with 100K+ features trained on 100M+ tokens, the overlap of an SAE feature's direction with the 171-dimensional emotion subspace predicts persistence above variance-matched SAE controls, at Spearman +0.413 (p = 4.4e-196) in Cogito v2.1 and +0.111 (p = 4.4e-52) in Kimi K2.5.
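One way to read the variance-matched comparison is sketched below. The summary does not say how the 20 controls per probe are matched, so the procedure here is an assumption: sample many random unit vectors inside the top-k eigenspace of the activation covariance, keep the 20 whose projected variance is closest to the probe's, and subtract their mean persistence from the probe's. The function names and toy dimensions are hypothetical.

```python
import numpy as np

def lag_persistence(hidden, direction, lag=100):
    """Correlation across sequences of the projection at token 0 and token `lag`."""
    acts = hidden @ direction
    return float(np.corrcoef(acts[:, 0], acts[:, lag])[0, 1])

def matched_controls(cov, probe, k, n_controls=20, n_candidates=2000, rng=None):
    """Assumed matching rule: nearest projected variance among random top-k eigenspace vectors."""
    rng = np.random.default_rng() if rng is None else rng
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues come back in ascending order
    basis = eigvecs[:, -k:]                   # (d, k): top-k eigenvectors
    cands = rng.normal(size=(n_candidates, k)) @ basis.T
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    target = probe @ cov @ probe              # variance of activations along the probe
    cand_var = np.einsum("id,de,ie->i", cands, cov, cands)
    closest = np.argsort(np.abs(cand_var - target))[:n_controls]
    return cands[closest]

def residual_persistence(hidden, probe, controls, lag=100):
    """Probe persistence minus the mean persistence of its matched controls."""
    control_mean = np.mean([lag_persistence(hidden, c, lag) for c in controls])
    return lag_persistence(hidden, probe, lag) - float(control_mean)

# Toy activations: anisotropic per-token noise plus one slow latent along the probe.
rng = np.random.default_rng(0)
n_seq, n_tok, d = 400, 101, 12
scales = np.linspace(0.5, 2.0, d)
probe = np.ones(d) / np.sqrt(d)
slow = rng.normal(size=(n_seq, 1, 1))
hidden = rng.normal(size=(n_seq, n_tok, d)) * scales + slow * probe

cov = np.cov(hidden.reshape(-1, d), rowvar=False)
controls = matched_controls(cov, probe, k=8, rng=rng)
residual = residual_persistence(hidden, probe, controls)
print(f"residual persistence above matched controls: {residual:+.3f}")
```

The point of the matching step is that a high-variance direction can look persistent simply because it captures more of the shared signal. Comparing against controls of equal variance removes that explanation, which is why the residual figures (+0.077 and +0.170) are smaller than the raw gaps.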
The paper also introduces agentic self-evaluation, in which Kimi K2.5 is given a live steering tool, experiments on its own SAE features at multiple strengths, and reports a 0–100 emotionality rating; self-reported emotionality correlates positively with variance-adjusted persistence (ρ = +0.124, p = 0.0001), replicating the probe result without sharing its construction confounds. An alternative method the paper could have used — and implicitly sets aside — is direct interpretability via dictionary learning on emotion-labeled corpora without the embedding-residual step, which would sacrifice the surface-content control but simplify the pipeline. The implied claim is that LLMs maintain something structurally analogous to lingering affective states: emotions are bursty but exhibit a slow-decay tail extending beyond 100 tokens, and this tail is tied to features the model itself identifies as emotional when permitted introspective access. The paper predicts that agentic self-steering could generalize as an interpretability tool for SAE features beyond emotion. The most pointed methodological objection a critical reader should raise is that conversational context is never fully removed: the same context that induced an emotion-relevant activation will tend to persist in the attended key-value cache across subsequent tokens, meaning the measured persistence might reflect coherent discourse structure rather than an internal affective state. The paper acknowledges this explicitly but does not provide a clean empirical separation — for instance, by comparing persistence in in-context versus out-of-context continuations, or by holding context constant while varying steering. Additionally, using Kimi K2.5 as both the subject model and the evaluator in the self-assessment experiments creates a circularity that is hard to fully dismiss, even granting that the steered instance and the evaluating instance are distinct.
Methods (4)
- Agentic Self-Steering Emotionality Evaluation: Kimi K2.5 uses a tool to steer SAE features on itself in real time and rates the emotional effect on its own internal state from 0 to 100
- Agentic self-steering evaluation: Method where Kimi K2.5 steers its own SAE features in real time and reports on its internal emotional state
- Emotion subspace overlap (SVD-based): Metric measuring how much of an SAE feature vector lies within the 171-dimensional subspace spanned by emotion probes, via SVD orthogonalization
- SAE feature firing probability persistence metric: Persistence metric for SAE features: P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t)
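The last two metrics in this list lend themselves to a short sketch. The code below is illustrative only: `subspace_fraction` and `firing_persistence` are hypothetical names, the overlap is taken to be the fraction of the feature's squared norm that lies in the probe span (one plausible reading of "how much of an SAE feature vector lies within" the subspace), and "fires" is assumed to mean a boolean per-token indicator such as activation above zero.

```python
import numpy as np

def subspace_fraction(feature, probes, tol=1e-10):
    """Fraction of the feature's squared norm inside the row span of `probes`.

    An orthonormal basis for the span is taken from the SVD of the probe matrix.
    """
    _, s, vt = np.linalg.svd(probes, full_matrices=False)
    basis = vt[s > tol * s.max()]             # drop numerically null directions
    proj = basis @ feature
    return float(proj @ proj / (feature @ feature))

def firing_persistence(fired, lag=100):
    """P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t).

    fired: boolean array of shape (n_sequences, n_tokens).
    """
    now = fired[:, :-lag].ravel()
    later = fired[:, lag:].ravel()
    if now.all() or not now.any():
        return float("nan")                   # one of the conditionals is undefined
    return float(later[now].mean() - later[~now].mean())

# Two probes spanning the x-y plane of a 3-d toy space.
probes = np.array([[1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0]])
inside = subspace_fraction(np.array([0.0, 1.0, 0.0]), probes)
outside = subspace_fraction(np.array([0.0, 0.0, 1.0]), probes)
half = subspace_fraction(np.array([0.0, 1.0, 1.0]), probes)

# One sequence where the feature always fires, one where it never fires.
fired = np.zeros((2, 150), dtype=bool)
fired[0] = True
sticky = firing_persistence(fired, lag=100)
```

Because the probes are generally not orthogonal to one another, projecting onto each probe separately and summing would overcount shared directions. Orthogonalizing first, here through the SVD, avoids that.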
Findings (20)
- 17 of 83 tested emotions show significant association between self-eval transcript word mention and cosine similarity to emotion probe
Validates that agentic self-evaluation captures genuine emotional content of probes
- Negative correlation between self-evaluated emotion persistence and SAE feature activation variance explained: rho=-0.184, p=4.6e-09
Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal
- SAE Feature #94949 rated 100/100 emotionality, elicits reports of profound tenderness, unconditional love, and visceral care
Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
- Cogito emotion probe residual autocorrelation +0.077 above variance-matched controls (p=1.5e-27, 157/171 probes positive)
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
- SAE Feature #69088 has 100th percentile emotion subspace fraction and produces spooky-themed writing under steering
Shows that highest emotion-subspace-overlap features induce distinctive thematic outputs
- SAE Feature #10446 rated 95/100 emotionality, induces reports of maternal feelings and phantom physical sensations
Qualitative example of a specific, complex emotional state induced by SAE feature steering
- SAE Feature #10011 rated 97/100 emotionality, elicits reports of despair, crushing weight, and existential hunger
Qualitative example of a highly emotional SAE feature with intense negative valence in Kimi self-steering
- Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.)
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- SAE feature emotion subspace overlap correlates with persistence in Cogito: Spearman +0.413, p=4.4e-196
Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
- SAE Feature #43713 associated with agentic defiance and rage, 99th percentile emotion subspace fraction
High subspace fraction feature associated with defiant, uncontrollable agentic behavior in self-steering
Claims (8)
- Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.)
Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
- The relationship between persistence and self-evaluated emotionality serves as a replication of probe-based findings without shared confounds from probe construction
Claims that agentic self-evaluation provides independent convergent evidence for emotion-persistence link
- Emotion features in LLMs are genuinely more persistent than variance-matched random features, indicating stateful emotional encoding beyond autoregressive dynamics
Central interpretive claim of the paper supported by multiple convergent analyses
- Persistence is not an artifact of probe construction because lower (more central) emotion PCs are more persistent than noisier high-rank PCs
Rules out measurement artifact explanation for the persistence finding
- Emotion may refer to a state, and more stateful concepts in general tend to be more persistent across tokens than non-stateful ones
Proposed mechanistic explanation for why emotion features are more persistent
- Persistent conversational context that produced emotion-relevant activations is a plausible driver of observed persistence results
Authors' caveat that conversational context persistence rather than internal emotion state persistence could explain findings
- Agentic self-steering evaluation may serve as a general method for explaining and interpreting SAE features beyond emotion
Forward-looking claim about the broader utility of the self-steering evaluation method
- Emotions are not strictly locally scoped but instead bursty with a long tail of slow change persisting over 100 tokens
Characterizes the temporal dynamics of emotion feature activation in LLMs
Hypotheses (2)
- We hypothesize that emotion states are more persistent because they correspond to genuinely stateful internal representations, not merely local surface content
Proposed explanation for why emotion probes are more persistent than variance-matched random probes
- If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general
Speculative claim about scaling introspective access to general SAE feature interpretation
Questions (3)
- Is the stronger persistence signal from agentic self-evaluation due to introspection per se, or due to the ability to test additional steering strengths including negative strengths?
Mechanistic ambiguity in interpreting why self-steering evaluation outperforms textual evaluation
- To what extent is emotion feature persistence driven by genuine internal emotional state versus autoregressive conversational context dynamics?
Core open question the paper raises but does not fully resolve
- Are LLM emotion states encoded only selectively in token positions where they are operative, or in a more complex non-linear manner?
Question raised by Anthropic and partially addressed by this paper's persistence evidence
Original abstract
This paper investigates the persistence of emotion features in large language models, building on Anthropic's recent discovery that Claude encodes emotion concepts. By constructing emotion probes through ridge regression on model activations and analyzing multi-turn conversations, the authors find that emotion features exhibit stateful persistence beyond what would be expected from the autoregressive nature of language models alone. The study uses random probe comparisons and variance-matched controls to demonstrate that emotion concepts maintain elevated activation levels across 100+ tokens, suggesting a form of internal emotional state representation in these models.