concept
active
concept:emotion-concepts-and-their-function-in-a-large-language-model
Emotion Concepts and their Function in a Large Language Model
The prior Anthropic paper whose findings about emotion features in Claude this paper builds upon and extends
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Method for building 171 emotion probes by generating stories, embedding them, regressing out Gemini embeddings, and averaging residual activations per emotion
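The probe-building steps in the method entry above can be read as a residualize-then-average pipeline. The following is a minimal sketch of one possible reading, not the paper's implementation. It assumes per-story model activations and per-story Gemini embeddings are already available as arrays, uses ordinary least squares (with an intercept) for the "regressing out" step, and runs on random stand-in data. The function name `build_emotion_probes`, the array shapes, and the three example emotion labels are hypothetical.

```python
import numpy as np

def build_emotion_probes(activations, confound_embeddings, labels):
    """Hypothetical helper: one probe vector per emotion label.

    activations:          (n_stories, d_model) model activations, one row per story
    confound_embeddings:  (n_stories, d_embed) text embeddings to regress out
    labels:               length-n_stories list of emotion names
    """
    Y = np.asarray(activations, dtype=float)
    X = np.asarray(confound_embeddings, dtype=float)
    # Append an intercept column so the regression also removes the mean.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    # Least-squares fit of activations on the embeddings (assumed reading
    # of "regressing out"); the residuals are what the embeddings cannot explain.
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    residuals = Y - X1 @ coef
    # Average the residual activations over the stories of each emotion.
    label_array = np.array(labels)
    return {
        emotion: residuals[label_array == emotion].mean(axis=0)
        for emotion in sorted(set(labels))
    }

# Random stand-in data; real inputs would come from the model and the embedder.
rng = np.random.default_rng(0)
n_stories, d_model, d_embed = 60, 16, 4
labels = ["joy", "fear", "anger"] * 20
confounds = rng.normal(size=(n_stories, d_embed))
activations = rng.normal(size=(n_stories, d_model))
probes = build_emotion_probes(activations, confounds, labels)
```

With 171 emotions the same function would return 171 vectors, one per label. Whether the original method regresses activations on embeddings in exactly this way is not stated in this entry.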
Concepts (2)
concept
- Emotion Features in LLMs · introduces · Internal representations encoding emotion concepts in large language models, identified by probing and SAE methods
- steerable emotion features · introduces · Emotion-encoding directions in LLM activation space that can be amplified or suppressed via activation steering to causally drive model behavior
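The amplify-or-suppress idea in the steerable emotion features entry can be illustrated with plain vector arithmetic. This is a minimal sketch under stated assumptions, not the paper's procedure: it treats steering as adding a scaled unit direction to each hidden-state row, and the function name `steer`, the coefficient `alpha`, and the toy numbers are hypothetical. In practice the addition would be applied inside a model's forward pass, which is not shown here.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Hypothetical helper: shift hidden states along an emotion direction.

    hidden:    (n_tokens, d_model) activations at one layer
    direction: (d_model,) emotion-encoding direction
    alpha:     positive to amplify the feature, negative to suppress it
    """
    unit = direction / np.linalg.norm(direction)
    # Broadcasting adds the same scaled direction to every token's activation.
    return hidden + alpha * unit

# Toy activations and direction, chosen only for illustration.
hidden = np.array([[1.0, 0.0, 2.0],
                   [0.5, -1.0, 0.0]])
direction = np.array([0.0, 3.0, 4.0])

amplified = steer(hidden, direction, alpha=2.0)
suppressed = steer(hidden, direction, alpha=-2.0)
```

Because the direction is normalized, `alpha` is the exact change in each row's projection onto the direction, which makes the steering strength easy to compare across features.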
Hypotheses (1)
hypothesis
- Open hypothesis from the Anthropic paper that motivates this work
Institutes (1)
institute
- Anthropic · authored, mentions · Lab behind Claude models and Constitutional AI training approach; represents highest baseline scores and lowest prompt lift.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Survey of representation engineering methods cited as related work
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.777 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
- Related work designing LLMs to natively support interpretable concept steering
- Framework describing LLMs as role-play engines, introduced in Shanahan, McDonell, Reynolds 2023.
- Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
- Interpretive hypothesis offered to explain why emotion features are more persistent
- Features related to gender, racial, ethnic biases, slurs, and hate speech.