concept
active
concept:emotion-features-in-llms
Emotion Features in LLMs
Internal representations encoding emotion concepts in large language models, identified by probing and SAE methods
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (2)
concept
- The prior Anthropic paper whose findings about emotion features in Claude this paper builds upon and extends
- Emotion Subspace · associated_with · The subspace of activation space spanned by the 171 orthogonalized emotion probe vectors, used to measure SAE feature emotional alignment
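The Emotion Subspace entry describes measuring how well an SAE feature aligns with a subspace spanned by orthogonalized probe vectors. One way such a measurement could work is sketched below, using only the standard library. It is not the paper's implementation: the helper names `orthonormalize` and `subspace_alignment`, the Gram-Schmidt procedure, the norm-fraction score, and the 3-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`.

    Hypothetical helper; vectors that are (numerically) linearly dependent
    on earlier ones are dropped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def subspace_alignment(feature, basis):
    """Fraction of the feature's norm lying inside the subspace (0 to 1).

    Assumed alignment score: norm of the projection divided by the norm
    of the feature direction.
    """
    feat_norm = math.sqrt(dot(feature, feature))
    if feat_norm == 0:
        return 0.0
    proj_sq = sum(dot(feature, b) ** 2 for b in basis)
    return math.sqrt(proj_sq) / feat_norm


# Toy data: two "probe" vectors in 3 dimensions stand in for the 171 probes.
probes = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(probes)

inside = subspace_alignment([2.0, -3.0, 0.0], basis)   # lies in the span
outside = subspace_alignment([0.0, 0.0, 5.0], basis)   # orthogonal to the span
mixed = subspace_alignment([3.0, 0.0, 4.0], basis)     # partly inside
```

With real data, `probes` would hold the probe vectors and `feature` an SAE feature direction of the same dimensionality. A score near 1 would mean the feature lies almost entirely within the emotion subspace, and a score near 0 would mean it is nearly orthogonal to it.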
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Emotion emerges early, peaks in middle layers, sharpens with scale, and persists across tokens in LLM activations per Zhang & Zhong 2025
- Question raised by Anthropic and partially addressed by this paper's persistence evidence
- Central interpretive claim of the paper supported by multiple convergent analyses
- We hypothesize that persistently active emotional state representations exist in LLMs but may be missed by standard probing methods. · hypothesis · 0.793 · Open hypothesis from the Anthropic paper that motivates this work
- Directions in activation space associated with contrastive emotive concept pairs studied in this paper as targets for introspection
- The finding that interpretable concepts including character traits are encoded as linear directions in transformer residual streams
- The central research question motivating the paper
- The phenomenon that emotion feature activations remain elevated above baseline beyond local token bursts, measurable as long-range correlation
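The last entry describes persistence as something measurable through long-range correlation. One simple way to quantify that is the autocorrelation of a per-token activation series at a given lag, sketched below. This is an assumption about how such a measurement might look, not the method used in the paper; the function name `autocorrelation` and both toy series are hypothetical.

```python
def autocorrelation(series, lag):
    """Autocorrelation of `series` at `lag`, normalized by the lag-0 variance.

    Values near 1 at large lags suggest activations that stay elevated,
    while values near 0 or below suggest short, local bursts.
    """
    n = len(series)
    if lag < 0 or lag >= n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered)
    if var == 0:
        return 0.0  # constant series: correlation is undefined, report 0
    cov = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return cov / var


# Toy activation traces over 20 token positions (made-up values).
persistent = [0.0] * 10 + [1.0] * 10   # switches on and stays on
bursty = [1.0, 0.0] * 10               # flips every token

persistent_lag5 = autocorrelation(persistent, 5)
bursty_lag5 = autocorrelation(bursty, 5)
```

In this toy case the persistent trace keeps a positive correlation at lag 5, while the alternating trace is negatively correlated at the same lag. That contrast is the kind of signal a long-range correlation measure is meant to pick up.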