concept
active
concept:steerable-emotion-features
steerable emotion features
Emotion-encoding directions in LLM activation space that can be amplified or suppressed via activation steering to causally drive model behavior
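As a rough illustration of the definition above, activation steering can be sketched as adding a scaled copy of a feature direction to a hidden state. The following is a minimal sketch under stated assumptions, not the method of any particular paper: the names `steer`, `projection`, `hidden`, `emotion_direction`, and `alpha` are hypothetical, and the vectors are toy values rather than real model activations.

```python
import math


def _norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    Positive alpha amplifies the feature; negative alpha suppresses it.
    All names here are illustrative assumptions, not a library API.
    """
    length = _norm(direction)
    if length == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / length for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]


def projection(hidden, direction):
    """Scalar component of hidden along the (normalized) direction."""
    return sum(h * d for h, d in zip(hidden, direction)) / _norm(direction)


# Toy values (assumed for illustration only).
hidden = [0.5, -1.0, 2.0]
emotion_direction = [3.0, 0.0, 4.0]

amplified = steer(hidden, emotion_direction, alpha=2.0)
suppressed = steer(hidden, emotion_direction, alpha=-2.0)
```

In this sketch the projection of the hidden state onto the direction shifts by exactly `alpha`, which is one simple way to read "amplified or suppressed".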
Neighborhood — ranked by edge-count
Concepts (2)
concept
- The prior Anthropic paper whose findings about emotion features in Claude this paper builds upon and extends
- internal emotional state · associated_with · The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The phenomenon that emotion feature activations remain elevated above baseline beyond local token bursts, measurable as long-range correlation
- The phenomenon where activating an emotion feature leads to subsequent below-baseline activation of that feature
- Modifying model behavior by clamping SAE feature activations to specific values during the forward pass
- Main conclusion about the temporal dynamics of emotion features
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
- Aggregate metric averaging mean SJT scores across OCEAN traits and steering directions; maximum possible is 10
- Kimi K2.5 uses a tool to steer SAE features on itself in real time and rates the emotional effect on its own internal state on a 0-100 scale
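
The clamping entry in the list above can be illustrated with a toy sparse autoencoder (SAE). This is a hedged sketch, not the implementation behind any entity listed here: the function names (`encode`, `decode`, `clamp_feature`), the ReLU encoder, and the choice to add back the SAE reconstruction error are all assumptions made for illustration, and the weights are toy identity matrices.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def encode(activation, enc_weights, enc_bias):
    """Toy SAE encoder: one ReLU unit per feature (assumed architecture)."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def decode(features, dec_weights):
    """Toy SAE decoder: dec_weights[i] is the direction of feature i."""
    dim = len(dec_weights[0])
    return [
        sum(f * dec_weights[i][j] for i, f in enumerate(features))
        for j in range(dim)
    ]


def clamp_feature(activation, enc_weights, enc_bias, dec_weights, index, value):
    """Clamp one SAE feature to `value` and return the modified activation.

    The reconstruction error is added back so that only the clamped
    feature changes; this is an assumed design choice for the sketch.
    """
    features = encode(activation, enc_weights, enc_bias)
    recon = decode(features, dec_weights)
    error = [a - r for a, r in zip(activation, recon)]
    clamped = list(features)
    clamped[index] = value
    return [r + e for r, e in zip(decode(clamped, dec_weights), error)]


# Toy 2-feature SAE with identity weights (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
activation = [1.0, -0.5]
steered = clamp_feature(activation, identity, [0.0, 0.0], identity, index=0, value=4.0)
```

In a real model this substitution would run inside a forward hook at the layer the SAE was trained on; the hook mechanism is omitted here because it depends on the framework in use.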