Emotion Concepts and their Function in a Large Language Model

The prior Anthropic paper whose findings about emotion features in Claude this paper builds upon and extends

Neighborhood — ranked by edge-count

paper

method

Emotion Probe Construction Method
extends
Method for building 171 emotion probes by generating stories, embedding them, regressing out Gemini embeddings, and averaging residual activations per emotion

concept

Emotion Features in LLMs
introduces
Internal representations encoding emotion concepts in large language models, identified by probing and SAE methods
steerable emotion features
introduces
Emotion-encoding directions in LLM activation space that can be amplified or suppressed via activation steering to causally drive model behavior

hypothesis

institute

Anthropic
authored · mentions
Lab behind Claude models and Constitutional AI training approach; represents highest baseline scores and lowest prompt lift.
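
The method and concept entries above describe regressing a confound embedding out of activations, averaging the residuals per emotion, and steering along the resulting direction. A minimal sketch of those last steps follows, under stated assumptions: story generation and embedding are omitted, activations are plain Python lists rather than real model states, the intercept term is an added assumption, and every function name here is hypothetical. It is an illustration, not the authors' implementation.

```python
from collections import defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(columns, tol=1e-12):
    """Gram-Schmidt over equal-length column vectors; drops near-zero columns."""
    basis = []
    for col in columns:
        w = list(col)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def regress_out(activations, confounds):
    """Least-squares residuals of each activation dimension on the confounds.

    activations: n rows x d dims. confounds: n rows x k dims.
    Assumption: a constant column is added so the fit has an intercept.
    """
    n = len(activations)
    conf_cols = [[1.0] * n] + [list(c) for c in zip(*confounds)]
    basis = orthonormal_basis(conf_cols)
    resid_cols = []
    for col in zip(*activations):
        r = list(col)
        for q in basis:  # remove the component lying in the confound span
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        resid_cols.append(r)
    return [list(row) for row in zip(*resid_cols)]

def emotion_probes(activations, confounds, labels):
    """Average the residual activations per emotion label."""
    residuals = regress_out(activations, confounds)
    groups = defaultdict(list)
    for row, label in zip(residuals, labels):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state; a negative alpha suppresses."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

Calling `emotion_probes(acts, conf, labels)` returns one direction per label, and `steer(h, probes["joy"], alpha)` shifts a single hidden-state vector along it. In a real model the shift would be applied inside the forward pass at a chosen layer, which this sketch does not attempt.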

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Representation engineering for large-language models: Survey and research challenges (Bartoszcze et al., 2025) · concept · 0.778
Survey of representation engineering methods cited as related work
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.777
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Large Language Models (LLMs) · concept · 0.769
Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
Concept bottleneck large language models (Sun et al., 2025a) · concept · 0.769
Related work designing LLMs to natively support interpretable concept steering
Role-play model of large language models · framework · 0.764
Framework describing LLMs as role-play engines, introduced in Shanahan, McDonell, Reynolds 2023.
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.759
Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
Emotion refers to a state concept, so stateful representations in general may be more persistent across tokens. · claim · 0.755
Interpretive hypothesis offered to explain why emotion features are more persistent
Bias in language modelsconcept0.748
Features related to gender, racial, ethnic biases, slurs, and hate speech.
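
The candidate list above pairs a cosine threshold with the absence of a typed edge. A minimal sketch of that filter follows, assuming entity embeddings are held in a dictionary of plain vectors. The function names and data layout are hypothetical, and how the graph actually stores embeddings or edges is not stated in the source.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold with no typed edge to the target.

    embeddings: dict mapping entity name to its vector.
    typed_edges: set of entity names already linked to the target.
    Returns (name, score) pairs sorted by descending score.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for a new typed edge or a possible duplicate, matching the stated purpose of the list.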