Emotion probes (171-emotion residual vector probes)

Linear probes constructed to measure 171 emotion concepts in model activations with surface semantic content removed

Neighborhood — ranked by edge-count

Concepts (1)

concept

layer 40 residual-stream activations
associated_with
The specific neural network layer from which activations are extracted for probe construction and SAE training in the target models

Methods (4)

method

Principal components analysis (PCA)
uses
Statistical method used to analyze neural activity data.
Emotion subspace overlap (SVD-based)
uses
Metric measuring how much of an SAE feature vector lies within the 171-dimensional subspace spanned by emotion probes, via SVD orthogonalization
Ridge regression probe construction
uses
Method used to predict model activations from Gemini embeddings and compute residuals for probe construction
gemini-embedding-001
uses
Used to embed story text so that surface-level semantic content can be regressed out from model activations
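
The ridge regression and gemini-embedding-001 entries above together describe a residualization step: predict the layer 40 activations from the story embeddings, then keep what the embeddings cannot explain. A minimal sketch of that step follows. The function names, the `alpha` value, and the mean-difference rule for turning residuals into a probe are illustrative assumptions; the entries here do not specify how a probe direction is derived from the residuals.

```python
import numpy as np

def ridge_residuals(embeddings, activations, alpha=1.0):
    """Regress activations on text embeddings (ridge) and return the residuals.

    embeddings:  (n_stories, d_embed) array, e.g. story text embeddings
    activations: (n_stories, d_model) array, e.g. residual-stream activations
    alpha:       ridge penalty (hypothetical default, not from the source)
    """
    # Center both sides so no intercept term is needed.
    X = embeddings - embeddings.mean(axis=0)
    Y = activations - activations.mean(axis=0)
    # Closed-form ridge solution: W = (X^T X + alpha I)^{-1} X^T Y
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, X.T @ Y)
    # What remains after removing the embedding-predictable part.
    return Y - X @ W

def mean_difference_probe(residuals, labels, emotion):
    """Assumed probe rule: mean residual for one emotion minus the mean of the rest."""
    mask = labels == emotion
    direction = residuals[mask].mean(axis=0) - residuals[~mask].mean(axis=0)
    return direction / np.linalg.norm(direction)
```

Under these assumptions, `labels` would be a NumPy array holding one emotion name per story, and calling `mean_difference_probe` once per emotion would yield one unit vector per emotion concept.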

Datasets (1)

dataset

Short story generation dataset (1000+ stories per emotion)
uses
Over 1000 short stories per emotion generated by models across 100 topics for probe construction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics. · claim · 0.781
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
48 of 171 emotion probes individually significant at token 100 post-steering · finding · 0.780
Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
Emotion probe persistence correlation of 0.214 in Cogito v2.1 vs 0.099 for random vectors · finding · 0.769
Quantifies emotion feature persistence above random baseline in Cogito across 240 multi-turn conversations
SVD Orthogonalization of Emotion Probes · method · 0.764
Orthogonalizes the 171 emotion probes via SVD to create an orthonormal basis for computing SAE feature subspace overlap
Cogito emotion probe residual autocorrelation +0.077 above variance-matched controls (p=1.5e-27, 157/171 probes positive) · finding · 0.763
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
Anthropic Interpretability Team: 171 emotion vectors causally influence behavior; performing vs having functional emotion representation are measurably different · finding · 0.746
Cited as activation-level support for the performing care vs having care distinction the battery detects behaviorally
Emotion probe persistence (token-0 to token-100 correlation) in Cogito v2.1 is 0.214, compared to 0.099 for random unit vectors in 7168D space. · finding · 0.745
Quantitative measure of emotion feature persistence vs random baseline in Cogito
Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style · claim · 0.736
Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
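
The SVD-based entries above (emotion subspace overlap and SVD orthogonalization of the emotion probes) can be illustrated with a short sketch: build an orthonormal basis for the span of the probes, then measure how much of a feature vector falls inside that span. Defining overlap as the fraction of the vector's squared norm inside the subspace is an assumption made for illustration, as are the function names and the rank tolerance; the entries only state that SVD orthogonalization is used.

```python
import numpy as np

def emotion_basis(probes, tol=1e-10):
    """Orthonormal basis (rows) for the span of the probe vectors, via SVD.

    probes: (n_probes, d_model) array, one probe direction per row.
    tol:    relative cutoff for dropping near-zero singular values (assumed).
    """
    _, s, vt = np.linalg.svd(probes, full_matrices=False)
    rank = int((s > tol * s.max()).sum())
    # The leading right singular vectors span the same subspace as the probes.
    return vt[:rank]

def subspace_overlap(feature, basis):
    """Assumed metric: share of the feature's squared norm inside the subspace (0 to 1)."""
    coords = basis @ feature  # coordinates of the projection in the orthonormal basis
    return float(coords @ coords / (feature @ feature))
```

With this definition, a feature vector that is a linear combination of the probes scores 1, and a vector orthogonal to every probe scores 0.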