
Persistence and Introspection of Emotion Features

TL;DR

Emotion features in large language models are bursty but not strictly locally scoped: they exhibit long-tail persistence extending well beyond 100 tokens, and this persistence is specifically tied to emotional content rather than being an artifact of activation variance or autoregressive dynamics. Across 240 multi-turn conversations per model, 171 emotion probes yield token-0-to-token-100 correlations of 0.214 in Cogito v2.1 and 0.367 in Kimi K2.5, compared to only 0.099 and 0.117 for random unit vectors in the same 7168-dimensional layer-40 activation space. After variance-matching each emotion probe against 20 randomly drawn vectors from the top-k eigenspace of the layer-40 covariance matrix, residual autocorrelation averages +0.077 in Cogito (p = 1.5e-27, 157/171 probes positive) and +0.170 in Kimi (p = 6.7e-30, 167/171 positive). The paper introduces agentic self-evaluation — a method in which Kimi K2.5 uses a real-time steering tool on its own SAE features and rates the emotional valence of what it experiences — and finds that self-reported emotionality of SAE features correlates with persistence above variance-matched controls (ρ = +0.124, p = 0.0001), replicating the probe-based result without sharing its potential confounds. SAE features whose direction overlaps more with the 171-dimensional emotion subspace are also more persistent (Spearman +0.413, p = 4.4e-196 in Cogito). The paper argues this implies that LLMs maintain something analogous to lingering affective states — not merely local semantic activation — and that agentic self-steering may offer a scalable route to interpreting internal representations beyond what passive probing methods can detect.
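
The persistence measure and the variance-matched control described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code. It assumes activations arrive as an array of shape (conversations, tokens, hidden size), that the token-0-to-token-100 correlation is taken across conversations, and that variance matching means keeping the random eigenspace directions whose projected variance is closest to the probe's. The function names and the `top_k`, `n_candidates`, and `seed` parameters are hypothetical.

```python
import numpy as np

def lag_correlation(acts, direction, lag=100):
    """Pearson r, across conversations, between projections at token 0 and token `lag`."""
    proj = acts @ direction                      # shape: (n_conversations, n_tokens)
    return float(np.corrcoef(proj[:, 0], proj[:, lag])[0, 1])

def variance_matched_controls(acts, probe, top_k=64, n_controls=20,
                              n_candidates=2000, seed=0):
    """Random unit vectors from the top-k covariance eigenspace, nearest in variance to `probe`.

    Assumption: "variance-matched" is implemented here as nearest-neighbour
    selection on projected variance; the paper may match differently.
    """
    rng = np.random.default_rng(seed)
    flat = acts.reshape(-1, acts.shape[-1])      # pool every token of every conversation
    cov = np.cov(flat, rowvar=False)             # (d, d) activation covariance
    _, eigvecs = np.linalg.eigh(cov)             # eigenvalues come back in ascending order
    basis = eigvecs[:, -top_k:]                  # (d, top_k) leading eigenvectors
    cands = rng.standard_normal((n_candidates, basis.shape[1])) @ basis.T
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    unit_probe = probe / np.linalg.norm(probe)
    target = unit_probe @ cov @ unit_probe       # variance of the probe projection
    cand_var = np.einsum("nd,de,ne->n", cands, cov, cands)
    nearest = np.argsort(np.abs(cand_var - target))[:n_controls]
    return cands[nearest]

def residual_persistence(acts, probe, lag=100, **control_kwargs):
    """Probe lag correlation minus the mean lag correlation of its matched controls."""
    controls = variance_matched_controls(acts, probe, **control_kwargs)
    baseline = np.mean([lag_correlation(acts, c, lag) for c in controls])
    return lag_correlation(acts, probe, lag) - float(baseline)
```

A positive value from `residual_persistence` corresponds to the "above variance-matched controls" figures quoted above. With 7168-dimensional activations the covariance is a 7168 × 7168 matrix, so a truncated eigensolver would likely be preferable to a full `eigh` in practice.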

What to take away

  1. Emotion probes in Kimi K2.5 show a token-0-to-token-100 autocorrelation of 0.367, compared to only 0.117 for random unit vectors in the same 7168-dimensional layer-40 residual-stream space.
  2. After matching each of the 171 emotion probes to 20 variance-equivalent random vectors drawn from the layer-40 covariance eigenspace, Cogito v2.1 shows a residual persistence of +0.077 (p = 1.5e-27, 157/171 probes positive) and Kimi K2.5 shows +0.170 (p = 6.7e-30, 167/171 positive).
  3. A 5-token steering pulse applied to each of the 171 emotion probes produces BH-FDR-significant elevation in 130 of 171 emotions at 5 tokens post-pulse, dropping to 48 of 171 significantly persistent or anti-persistent features at 100 tokens post-pulse.
  4. SAE features trained on 100M+ tokens whose directions overlap more with the 171-emotion subspace are more persistent above variance-matched controls, with Spearman correlations of +0.413 (p = 4.4e-196) in Cogito v2.1 and +0.111 (p = 4.4e-52) in Kimi K2.5.
  5. Agentic self-evaluation, in which Kimi K2.5 steers its own SAE features in real time and rates their emotional effect on a 0–100 scale, correlates positively with variance-adjusted persistence (ρ = +0.124, p = 0.0001), while textual evaluation of steered outputs correlates negatively, indicating the two methods capture distinct signals.
  6. The paper introduces a residual-probe construction method that regresses out the top 256 principal components of Gemini gemini-embedding-001 text embeddings from layer-40 activations before computing per-emotion probe directions, isolating internal state from surface semantic content.
  7. To replicate this methodology, one constructs multi-turn transcripts by pairing the target model (Kimi K2.5 or Cogito v2.1) against Claude Sonnet 4.5 acting as a simulated human across human personas drawn from Anthropic's Table 8, then computes probe activations token-by-token over the target model's turns only.
  8. Among the 1,000 most-active SAE features, 17 of 83 testable emotions showed significant associations between self-evaluation transcript mentions of an emotion word and cosine similarity to that emotion's probe (one-sided permutation test, BH FDR corrected), with 67 of 83 showing positive associations.
  9. The paper raises the open question of whether the conversational context that produced an emotion-relevant activation, rather than a genuine internal affective state, is the actual driver of the long-tail persistence observed, and explicitly declines to rule this out.
  10. More central (lower-rank) PCs of the 171-emotion feature space are more persistent above variance-matched controls than higher-rank, noisier PCs in both Cogito v2.1 and Kimi K2.5, suggesting the persistence signal is structurally tied to the core emotion subspace rather than peripheral variance.
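
The residual-probe construction in item 6 can be illustrated with a short NumPy sketch. It assumes one pooled layer-40 activation vector and one text embedding per generated story, and it takes each probe to be the normalized mean residual for that emotion. The original abstract mentions ridge regression for the probes, so the mean-based direction here is a simplified stand-in rather than the paper's procedure. The function and argument names are hypothetical.

```python
import numpy as np

def residual_probes(acts, text_emb, labels, n_pcs=256):
    """Regress text-embedding PCs out of activations, then build per-emotion directions.

    acts:     (n_stories, d_model) pooled activations (assumed: one vector per story)
    text_emb: (n_stories, d_emb)   text embeddings of the same stories
    labels:   (n_stories,)         integer emotion id for each story
    """
    # PCA of the embeddings via SVD; the rows of vt are the principal axes.
    emb_centered = text_emb - text_emb.mean(axis=0)
    _, _, vt = np.linalg.svd(emb_centered, full_matrices=False)
    scores = emb_centered @ vt[:n_pcs].T          # (n_stories, <= n_pcs)

    # Least-squares fit of activations on the PC scores; keep what is left over.
    acts_centered = acts - acts.mean(axis=0)
    beta, *_ = np.linalg.lstsq(scores, acts_centered, rcond=None)
    resid = acts_centered - scores @ beta

    # Assumption: probe = unit-normalized mean residual of each emotion's stories.
    probes = {}
    for emotion in np.unique(labels):
        direction = resid[labels == emotion].mean(axis=0)
        probes[int(emotion)] = direction / np.linalg.norm(direction)
    return probes
```

Because the residuals are centered, each per-emotion mean is already a deviation from the grand mean, so no separate baseline subtraction is needed in this sketch.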

Peer brief — for seminar discussion

Working from Anthropic's observation that Claude encodes emotion concepts in its residual stream but shows token-by-token fluctuation without obvious chronic encoding, this study asks whether those emotion features are nonetheless more persistent than would be expected from autoregressive dynamics alone. Probes for 171 emotion concepts were built for Kimi K2.5 and Cogito v2.1 by having each model generate over 1,000 short stories per emotion, then constructing per-emotion residual vectors after regressing out the top 256 principal components of Gemini gemini-embedding-001 embeddings, a step meant to strip surface-level semantic content before computing probe directions in the 7168-dimensional layer-40 activation space. Persistence was operationalized as the correlation between probe activation at token 0 and token 100, measured across 240 multi-turn conversations per model in which Claude Sonnet 4.5 played the simulated human interlocutor.

The load-bearing finding is that emotion probes are substantially more persistent than both unmatched random unit vectors and variance-matched random vectors drawn from the top-k eigenspace of the layer-40 covariance. Kimi K2.5 shows a raw correlation of 0.367 versus 0.117 for random probes, and a residual persistence above variance-matched controls of +0.170 (p = 6.7e-30, 167 of 171 probes positive); Cogito v2.1 shows +0.077 (p = 1.5e-27, 157 of 171 positive). This is corroborated by the SAE experiment: in SAEs with 100K+ features trained on 100M+ tokens, the overlap of an SAE feature's direction with the 171-dimensional emotion subspace predicts persistence above variance-matched SAE controls at Spearman +0.413 (p = 4.4e-196) in Cogito v2.1.
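
The overlap between an SAE feature direction and the emotion subspace, and its rank correlation with persistence, can be sketched as follows. The sketch assumes that overlap means the fraction of a unit feature vector's squared norm lying in the span of the probe directions, which is one plausible reading rather than the paper's stated definition. It also assumes the probe directions are linearly independent, and the Spearman helper does not handle ties. All names are hypothetical.

```python
import numpy as np

def emotion_subspace_overlap(features, probes):
    """Fraction of each feature's squared norm that lies inside span(probes).

    features: (n_features, d) SAE feature directions
    probes:   (n_probes, d)   emotion probe directions (assumed linearly independent)
    """
    q, _ = np.linalg.qr(probes.T)                 # orthonormal basis, shape (d, n_probes)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return np.sum((unit @ q) ** 2, axis=1)        # values in [0, 1]

def spearman_rho(x, y):
    """Spearman rank correlation as the Pearson r of ranks (no tie correction)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

Given a per-feature array of variance-adjusted persistence values, `spearman_rho(emotion_subspace_overlap(features, probes), persistence)` would give a statistic of the kind reported above.
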
The paper also introduces agentic self-evaluation, in which Kimi K2.5 is given a live steering tool, experiments on its own SAE features at multiple strengths, and reports a 0–100 emotionality rating; self-reported emotionality correlates positively with variance-adjusted persistence (ρ = +0.124, p = 0.0001), replicating the probe result without sharing its construction confounds. An alternative method the paper could have used, and implicitly sets aside, is direct interpretability via dictionary learning on emotion-labeled corpora without the embedding-residual step, which would sacrifice the surface-content control but simplify the pipeline.

The implied claim is that LLMs maintain something structurally analogous to lingering affective states: emotions are bursty but exhibit a slow-decay tail extending beyond 100 tokens, and this tail is tied to features the model itself identifies as emotional when permitted introspective access. The paper predicts that agentic self-steering could generalize as an interpretability tool for SAE features beyond emotion.

The most pointed methodological objection a critical reader should raise is that conversational context is never fully removed: the same context that induced an emotion-relevant activation will tend to persist in the attended key-value cache across subsequent tokens, meaning the measured persistence might reflect coherent discourse structure rather than an internal affective state. The paper acknowledges this explicitly but does not provide a clean empirical separation, for instance by comparing persistence in in-context versus out-of-context continuations, or by holding context constant while varying steering. Additionally, using Kimi K2.5 as both the subject model and the evaluator in the self-assessment experiments creates a circularity that is hard to fully dismiss, even granting that the steered instance and the evaluating instance are distinct.

Original abstract

This paper investigates the persistence of emotion features in large language models, building on Anthropic's recent discovery that Claude encodes emotion concepts. By constructing emotion probes through ridge regression on model activations and analyzing multi-turn conversations, the authors find that emotion features exhibit stateful persistence beyond what would be expected from the autoregressive nature of language models alone. The study uses random probe comparisons and variance-matched controls to demonstrate that emotion concepts maintain elevated activation levels across 100+ tokens, suggesting a form of internal emotional state representation in these models.
