finding
active
At 5 tokens after steering pulse ends, 130 of 171 (62%) emotion features are BH-significantly elevated; 14% are suppressed.
Shows immediate causal effect of steering on emotion feature activation
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Methods (1)
method
- Multiple testing correction applied to significance tests of emotion persistence and self-evaluation word associations
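The "BH-significantly" label in this finding refers to Benjamini-Hochberg correction across the per-feature tests. As a rough illustration of how that step-up procedure decides which features count as significant, here is a minimal sketch. The function name `benjamini_hochberg`, the `alpha=0.05` threshold, and the example p-values are assumptions for illustration only; the paper's actual implementation and settings are not given on this page.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null is rejected at FDR level alpha.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure;
    alpha=0.05 is an assumed default, not a value taken from the source.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k (the step-up property).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per feature (not data from the paper).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(example_p)
print(sum(significant), "of", len(example_p), "tests survive correction")
```

Because the procedure is step-up, a p-value that misses its own rank threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.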
Findings (1)
finding
- Demonstrates that the majority of emotion features show persistent upregulation shortly after a steering pulse
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates long-tail persistence of causal steering effect in a subset of emotion features
- Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). (finding, 0.759) Shows low agreement between the two evaluation modalities
- Acknowledged limitation of restricting experiments to 64-token completions
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
- Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
- Main conclusion about the temporal dynamics of emotion features
- Kimi's self-description of a 100-rated feature, illustrating discordance with text-evaluation which rated it 'melancholic'
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.