finding

active

finding:at-100-tokens-post-steering-48-of-171-emotion-features-remain-individually-bh-significant-despite-average-effect-being-near-zero

At 100 tokens post-steering, 48 of 171 emotion features remain individually BH-significant despite the average effect being near zero.

Demonstrates long-tail persistence of the causal steering effect in a subset of emotion features.
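For readers unfamiliar with the term, "BH-significant" refers to significance after Benjamini-Hochberg false discovery rate correction across the tested features. The following is a minimal sketch of the standard step-up procedure, not the paper's own code. The function name `benjamini_hochberg`, the level `alpha=0.05`, and the example p-values are illustrative assumptions; the source does not state the level or the underlying per-feature test.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha.

    alpha=0.05 is an assumed default, not a value taken from the source.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per feature (not data from the paper).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(example_p)
print(sum(flags), "of", len(flags), "features are BH-significant")
```

In this framing, a count such as "48 of 171" would be the number of `True` entries when the procedure is applied to the 171 per-feature p-values at the 100-token offset.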

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (1)

claim

Emotion features are not strictly locally scoped; they are bursty with a long tail of slow change persisting over 100 tokens.
supports
Main conclusion about the temporal dynamics of emotion features

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

At 5 tokens after the steering pulse ends, 130 of 171 emotion features are BH-significant: 62% are elevated and 14% are suppressed. · finding · 0.895
Shows immediate causal effect of steering on emotion feature activation
48 of 171 emotion probes individually significant at token 100 post-steering · finding · 0.864
Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
62% of emotions significantly elevated at 5 tokens after steering pulse ends · finding · 0.824
Demonstrates that the majority of emotion features show persistent upregulation shortly after a steering pulse
Do psychological steering results hold beyond 64-token completions? · question · 0.793
Acknowledged limitation of restricting experiments to 64-token completions
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality. · claim · 0.785
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.783
Shows low agreement between the two evaluation modalities
17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations. · finding · 0.758
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
Emotions are not strictly locally scoped but instead bursty with a long tail of slow change persisting over 100 tokens · claim · 0.757
Characterizes the temporal dynamics of emotion feature activation in LLMs