finding

active

finding:48-of-171-emotion-probes-individually-significant-at-token-100-post-steering

48 of 171 emotion probes individually significant at token 100 post-steering

Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (1)

claim

Emotions are not strictly locally scoped but instead bursty with a long tail of slow change persisting over 100 tokens
supports
Characterizes the temporal dynamics of emotion feature activation in LLMs

Methods (1)

method

5-Token Steering Pulse Experiment
introduces
Applies a 5-token steering pulse to each emotion probe and measures persistence of causal effect via contrast z-score over 200 subsequent tokens
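
One way to picture the measurement is the sketch below. It is not the paper's implementation: the entry does not define the contrast z-score, so the code assumes it is the difference in mean probe activation between steered and unsteered runs, divided by the standard error of that difference (Welch-style). The names `contrast_z`, `persistence_curve`, and `simulate`, the decaying synthetic effect, and the run count are all hypothetical.

```python
import math
import random
from statistics import mean, variance


def contrast_z(steered, control):
    """z-score for the difference in mean probe activation (assumed definition)."""
    # Standard error of the difference of two independent sample means.
    se = math.sqrt(variance(steered) / len(steered) + variance(control) / len(control))
    return (mean(steered) - mean(control)) / se


def persistence_curve(steered_runs, control_runs):
    """Contrast z-score at each token offset after the steering pulse ends."""
    n_tokens = len(steered_runs[0])
    return [
        contrast_z([run[t] for run in steered_runs], [run[t] for run in control_runs])
        for t in range(n_tokens)
    ]


# Synthetic stand-in for probe activations: 200 post-pulse tokens per run.
rng = random.Random(0)
N_RUNS, N_TOKENS = 40, 200


def simulate(steered):
    """Unit Gaussian noise plus, for steered runs, a hypothetical decaying effect."""
    return [
        (2.0 * math.exp(-t / 30.0) if steered else 0.0) + rng.gauss(0.0, 1.0)
        for t in range(N_TOKENS)
    ]


steered_runs = [simulate(True) for _ in range(N_RUNS)]
control_runs = [simulate(False) for _ in range(N_RUNS)]
curve = persistence_curve(steered_runs, control_runs)
```

Reading `curve[t]` at a fixed offset such as `t = 100` gives one z-score per probe, which is the quantity a per-probe significance test would then be applied to.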

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

At 100 tokens post-steering, 48 of 171 emotion features remain individually BH-significant despite average effect being near zero. · finding · 0.864
Demonstrates long-tail persistence of causal steering effect in a subset of emotion features
62% of emotions significantly elevated at 5 tokens after steering pulse ends · finding · 0.829
Demonstrates that the majority of emotion features show persistent upregulation shortly after a steering pulse
At 5 tokens after steering pulse ends, 130 of 171 (76%) emotion features are BH-significant: 62% are elevated and 14% are suppressed. · finding · 0.812
Shows immediate causal effect of steering on emotion feature activation
Do psychological steering results hold beyond 64-token completions? · question · 0.810
Acknowledged limitation of restricting experiments to 64-token completions
Emotion probe persistence (token-0 to token-100 correlation) in Cogito v2.1 is 0.214, compared to 0.099 for random unit vectors in 7168D space. · finding · 0.787
Quantitative measure of emotion feature persistence vs random baseline in Cogito
17 of 83 tested emotions show a significant association between self-eval transcript word mention and cosine similarity to emotion probe · finding · 0.783
Validates that agentic self-evaluation captures genuine emotional content of probes
Emotion probes (171-emotion residual vector probes) · method · 0.780
Linear probes constructed to measure 171 emotion concepts in model activations with surface semantic content removed
Cogito emotion probe residual autocorrelation +0.077 above variance-matched controls (p=1.5e-27, 157/171 probes positive) · finding · 0.770
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
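
Several of the findings above count probes that are "BH-significant", which refers to the Benjamini-Hochberg false discovery rate procedure applied across the 171 probes. The sketch below shows the standard step-up rule. The conversion from z-scores to two-sided normal p-values and the default `alpha=0.05` are assumptions, since the entries do not state either, and the function names are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal null (assumed null)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Usage: z_scores would hold one contrast z-score per probe at a fixed token offset.
z_scores = [4.1, -3.6, 0.2, 1.1, -0.4]
significant = benjamini_hochberg([two_sided_p(z) for z in z_scores])
n_significant = sum(significant)
```

Because the test is two-sided and applied per probe, a set of probes can be individually significant in opposite directions while their average effect sits near zero, which is the pattern the first related finding describes.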