finding
active
48 of 171 emotion probes individually significant at token 100 post-steering
Shows that causal steering effects persist at long range for a substantial fraction of emotion probes: 48 of 171 (about 28%) remain individually significant 100 tokens after steering
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (1)
claim
- Characterizes the temporal dynamics of emotion feature activation in LLMs
Methods (1)
method
- 5-Token Steering Pulse Experiment (introduces): Applies a 5-token steering pulse to each emotion probe and measures persistence of causal effect via contrast z-score over 200 subsequent tokens
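The per-probe significance count behind this finding can be illustrated with a minimal sketch. The source does not define the contrast z-score or the significance threshold, so everything here is an assumption: the contrast is taken to be the mean paired difference between steered and unsteered probe activations across runs, divided by its standard error, and a probe counts as individually significant when |z| exceeds a two-sided 1.96 cutoff. The names `contrast_z` and `count_significant` are hypothetical and are not taken from the paper.

```python
import math
import statistics


def contrast_z(steered, baseline):
    """z-score of the mean paired difference (steered - baseline) across runs.

    Assumed definition: the source does not spell out how its contrast
    z-score is computed.
    """
    if len(steered) != len(baseline):
        raise ValueError("steered and baseline must have the same length")
    diffs = [s - b for s, b in zip(steered, baseline)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)  # sample standard deviation
    if sd == 0:
        raise ValueError("zero variance in differences; z-score undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n))


def count_significant(z_by_probe, token_index, threshold=1.96):
    """Count probes whose |z| at token_index exceeds threshold.

    z_by_probe: one sequence of z-scores per probe, indexed by token offset
    after the steering pulse. The 1.96 threshold is an assumption.
    """
    flags = [abs(z[token_index]) > threshold for z in z_by_probe]
    return sum(flags), flags
```

With 171 probes and z-scores for the 200 post-pulse tokens, `count_significant(z_by_probe, token_index)` evaluated at the offset corresponding to token 100 would return the kind of count reported here (48), provided the paper's definitions match these assumptions. Whether token 100 maps to index 99 or 100 depends on how offsets are numbered, which the source does not state.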
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates long-tail persistence of causal steering effect in a subset of emotion features
- Demonstrates that the majority of emotion features show persistent upregulation shortly after a steering pulse
- Shows immediate causal effect of steering on emotion feature activation
- Acknowledged limitation of restricting experiments to 64-token completions
- Quantitative measure of emotion feature persistence vs random baseline in Cogito
- Validates that agentic self-evaluation captures genuine emotional content of probes
- Linear probes constructed to measure 171 emotion concepts in model activations with surface semantic content removed
- Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone