finding

active

finding:anthropic-interpretability-team-171-emotion-vectors-causally-influence-behavior-performing-vs-having-functional-emotion-representation-are-measurably-different

Anthropic Interpretability Team: 171 emotion vectors causally influence behavior; performing vs. having a functional emotion representation are measurably different

Cited as activation-level support for the distinction between performing care and having care, which the battery detects behaviorally.

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Claims (1)

claim

Performing care is not the same as having care: models optimized to seem like they have inner life score lower than models never trained for it.
supports
Interpretive claim supported by roleplay and empathy model results

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Anthropic is extremely conservative in writing up interpretability results due to Overton window concerns. · claim · 0.779
Antra's explanation for why even stronger evidence may exist but remains unpublished.
Emotion refers to a state concept, so stateful representations in general may be more persistent across tokens. · claim · 0.761
Interpretive hypothesis offered to explain why emotion features are more persistent
Anthropic's model-welfare program signals frontier labs taking "what's it like to be a model" seriously, creating space for external measurement. · claim · 0.759
Representation geometry causally shapes behavior; activation and behavior manifolds are approximately isometric. · claim · 0.759
To what extent is emotion feature persistence driven by genuine internal emotional state versus autoregressive conversational context dynamics? · question · 0.759
Core open question the paper raises but does not fully resolve
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with previously reported reflection datasets. · finding · 0.759
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information. · claim · 0.758
Central thesis of the paper
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality. · claim · 0.751
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
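
The "Related by similarity" filter above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the site's actual implementation: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are all hypothetical, chosen only to show how such a neighborhood could be computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: set of (id, id) pairs that already have a typed relation
    threshold:   minimum cosine similarity; 0.65 mirrors the page's cutoff
    """
    target = embeddings[target_id]
    neighbors = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked to the target by a typed edge.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            neighbors.append((entity_id, score))
    neighbors.sort(key=lambda pair: pair[1], reverse=True)
    return neighbors


# Toy data: "d" is similar enough to "a" but already has a typed edge.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_edges = {("a", "d")}
toy_result = related_by_similarity("a", toy_embeddings, toy_edges)
```

In the toy data only "b" is returned: "c" is orthogonal to "a", and "d" is excluded because it already has a typed edge. Entries that survive the filter are the candidates for new edges or duplicate detection that the section describes.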