finding
active
At 100 tokens post-steering, 48 of 171 emotion features remain individually BH-significant despite average effect being near zero.
Demonstrates long-tail persistence of the causal steering effect in a subset of emotion features
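For readers unfamiliar with the term, "BH-significant" presumably refers to significance after Benjamini-Hochberg false discovery rate correction, which is applied per feature and so can flag individual features even when the effect averaged over all features is close to zero. The following is a minimal sketch of the standard BH step-up procedure, not the source paper's analysis code. The function name `benjamini_hochberg`, the `alpha=0.05` level, and the toy p-values are all assumptions made for illustration.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha.

    Hypothetical helper for illustration; alpha=0.05 is an assumed level.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)          # indices that sort p ascending
    ranked = p[order]
    # BH threshold for the k-th smallest p-value is alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: find the largest rank that meets its threshold,
        # then reject that hypothesis and every one ranked before it.
        k_max = np.nonzero(below)[0].max()
        reject[order[: k_max + 1]] = True
    return reject


# Toy p-values (made up, not taken from the paper), one per feature.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(toy_p)
print(int(mask.sum()), "of", len(toy_p), "features pass BH correction")
```

In a setting like the one described here, each of the 171 features would contribute one p-value, and the count of `True` entries in the mask would be the number reported as individually BH-significant.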
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (1)
claim
- Main conclusion about the temporal dynamics of emotion features
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows immediate causal effect of steering on emotion feature activation
- Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
- Demonstrates that the majority of emotion features show persistent upregulation shortly after a steering pulse
- Acknowledged limitation of restricting experiments to 64-token completions
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.783 · Shows low agreement between the two evaluation modalities
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
- Characterizes the temporal dynamics of emotion feature activation in LLMs