steerable emotion features

Emotion-encoding directions in LLM activation space that can be amplified or suppressed via activation steering to causally drive model behavior
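
As a minimal sketch of what "amplified or suppressed via activation steering" can mean in practice, the snippet below adds a scaled unit direction to a single activation vector. It assumes the simplest additive form of steering. The names `steer`, `projection`, and `alpha`, and the toy vectors, are hypothetical and not taken from the source. In a real model the same operation would typically be applied to a layer's hidden states during the forward pass.

```python
import math


def _norm(vec):
    """Euclidean norm of a plain Python list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    alpha > 0 amplifies the direction, alpha < 0 suppresses it.
    (Hypothetical helper; additive steering is an assumption here.)
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    norm = _norm(direction)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of hidden onto the unit-normalized direction."""
    return sum(h * d for h, d in zip(hidden, direction)) / _norm(direction)


# Toy values, chosen only for illustration.
hidden = [0.5, -1.0, 2.0]
direction = [3.0, 0.0, 4.0]  # stand-in for an emotion-encoding direction

amplified = steer(hidden, direction, alpha=2.0)
suppressed = steer(hidden, direction, alpha=-2.0)

print(projection(hidden, direction))      # baseline projection
print(projection(amplified, direction))   # baseline + 2.0
print(projection(suppressed, direction))  # baseline - 2.0
```

Because the direction is unit-normalized before scaling, `alpha` shifts the projection onto that direction by exactly `alpha` and leaves the orthogonal components of the activation unchanged.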

Neighborhood — ranked by edge-count

concept

Emotion Concepts and their Function in a Large Language Model
introduces
The prior Anthropic paper whose findings about emotion features in Claude this paper builds upon and extends
internal emotional state
associated_with
The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

emotion feature persistence · concept · 0.772
The phenomenon that emotion feature activations remain elevated above baseline beyond local token bursts, measurable as long-range correlation
Anti-Persistence of Emotion Features · concept · 0.759
The phenomenon where activating an emotion feature leads to subsequent below-baseline activation of that feature
Feature steering (clamping feature activations) · method · 0.752
Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
The feeling which steers us is a vision of the emotional substance of the coming work. · claim · 0.734
Emotion features are not strictly locally scoped; they are bursty with a long tail of slow change persisting over 100 tokens. · claim · 0.731
Main conclusion about the temporal dynamics of emotion features
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality. · claim · 0.729
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
Steerability Score (Phi) · concept · 0.729
Aggregate metric averaging mean SJT scores across OCEAN traits and steering directions; maximum possible is 10
Agentic Self-Steering Emotionality Evaluation · method · 0.728
Kimi K2.5 uses a tool to steer SAE features on itself in real-time and rates the emotional effect on its own internal state 0-100
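
Several of the candidates above concern temporal dynamics (persistence, anti-persistence, burstiness with a long tail). One generic way to tell persistence from anti-persistence in a per-token activation trace is the sign of its lagged autocorrelation. The sketch below is an illustration under that assumption, not the measurement used in any of the listed sources. The function name `autocorrelation` and both toy traces are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`.

    Positive values mean above-baseline activation tends to be followed
    by above-baseline activation `lag` tokens later (persistence).
    Negative values mean it tends to be followed by below-baseline
    activation (anti-persistence).
    """
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        raise ValueError("series must not be constant")
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


# Toy traces, one value per token (illustrative only).
persistent = [1.0] * 10 + [0.0] * 10   # stays elevated, then stays low
anti_persistent = [1.0, -1.0] * 10     # each spike is followed by a dip

print(autocorrelation(persistent, 1))       # positive
print(autocorrelation(anti_persistent, 1))  # negative
```

Evaluating the same function at larger lags (for example, on the order of 100 tokens) would probe the long-range correlation that the persistence entry refers to, provided the trace is long enough for the estimate to be meaningful.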