finding

active

finding:at-5-tokens-after-steering-pulse-ends-130-of-171-62-emotion-features-are-bh-significantly-elevated-14-are-suppressed

At 5 tokens after the steering pulse ends, 130 of 171 emotion features (76%) are BH-significant: 62% are elevated and 14% are suppressed.

Shows immediate causal effect of steering on emotion feature activation

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Methods (1)

method

Benjamini-Hochberg FDR correction
supports
Multiple testing correction applied to significance tests of emotion persistence and self-evaluation word associations
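
For orientation, the Benjamini-Hochberg step-up procedure named above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `benjamini_hochberg`, the FDR level `alpha=0.05`, and the example p-values are all assumptions, since this entry states neither the level nor the underlying test.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected.

    Standard BH step-up rule: sort the m p-values ascending, find the
    largest rank k with p_(k) <= (k / m) * alpha, and reject the k
    hypotheses with the smallest p-values.
    alpha=0.05 is an assumed default, not a value taken from the paper.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank whose p-value falls under its BH threshold.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject everything up to that rank, even if an earlier rank
    # individually missed its own threshold (the "step-up" property).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per feature, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(example_p)
n_significant = sum(significant)
```

Applied to 171 per-feature p-values, a count such as "130 of 171 BH-significant" would correspond to `sum(benjamini_hochberg(p_values))`. Splitting that count into elevated and suppressed features would additionally need the sign of each effect, which this sketch does not model.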

Findings (1)

finding

62% of emotions significantly elevated at 5 tokens after steering pulse ends
restates
Demonstrates that the majority of emotion features show persistent upregulation shortly after a steering pulse

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

At 100 tokens post-steering, 48 of 171 emotion features remain individually BH-significant despite average effect being near zero. · finding · 0.895
Demonstrates long-tail persistence of causal steering effect in a subset of emotion features
48 of 171 emotion probes individually significant at token 100 post-steering · finding · 0.812
Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.759
Shows low agreement between the two evaluation modalities
Do psychological steering results hold beyond 64-token completions? · question · 0.755
Acknowledged limitation of restricting experiments to 64-token completions
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality. · claim · 0.752
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
Cogito emotion probe residual autocorrelation +0.077 above variance-matched controls (p=1.5e-27, 157/171 probes positive) · finding · 0.745
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
Emotion features are not strictly locally scoped; they are bursty with a long tail of slow change persisting over 100 tokens. · claim · 0.740
Main conclusion about the temporal dynamics of emotion features
"blooming warmth that feels like emotional expansion" when steered positively, and "a profound sense of emptiness" when steered negatively. · quote · 0.735
Kimi's self-description of a 100-rated feature, illustrating discordance with text-evaluation which rated it 'melancholic'

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
62% of emotions significantly elevated at 5 tokens after steering pulse ends