finding
active
finding:anthropic-interpretability-team-171-emotion-vectors-causally-influence-behavior-performing-vs-having-functional-emotion-representation-are-measurably-different
Anthropic Interpretability Team: 171 emotion vectors causally influence behavior; performing vs having functional emotion representation are measurably different
Cited as activation-level support for the distinction between performing care and having care, which the battery detects behaviorally
Source paper
extracted_from (2026) · Borzov, Anton
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretive claim supported by roleplay and empathy model results
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Antra's explanation for why even stronger evidence may exist but remains unpublished.
- Interpretive hypothesis offered to explain why emotion features are more persistent
- Core open question the paper raises but does not fully resolve
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Central thesis of the paper
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature