Emotion Probe Construction Method
ID: method:emotion-probe-construction-method · Type: method · Status: active
Method for building 171 emotion probes: generate stories for each emotion, embed the story text with Gemini, regress the Gemini-embedding component out of the model's activations, and average the residual activations per emotion.
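The steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code. Story generation and the Gemini embedding call are out of scope, so the function takes precomputed layer-40 activations and embeddings as arrays. The name `build_emotion_probes`, the closed-form ridge solver, the default `alpha`, and centering both arrays (equivalent to fitting an intercept) are illustrative choices. `n_pcs=256` mirrors the top-256 PCs used in the ridge step.

```python
import numpy as np

def build_emotion_probes(acts, embs, labels, n_pcs=256, alpha=1.0):
    """Hypothetical sketch of the residual-probe pipeline.

    acts:   (n_stories, d_model) layer-40 activations, one row per story
    embs:   (n_stories, d_emb) Gemini embeddings of the same story texts
    labels: length-n_stories sequence of emotion names
    Returns a dict mapping each emotion to its mean residual activation.
    """
    # 1. Compress the embeddings to their top principal components.
    emb_c = embs - embs.mean(axis=0)
    _, _, vt = np.linalg.svd(emb_c, full_matrices=False)
    k = min(n_pcs, vt.shape[0])
    z = emb_c @ vt[:k].T  # (n_stories, k) PC scores, column means are zero

    # 2. Ridge-regress centered activations on the PC scores (closed form).
    acts_c = acts - acts.mean(axis=0)
    w = np.linalg.solve(z.T @ z + alpha * np.eye(k), z.T @ acts_c)

    # 3. Residuals: the part of each activation the embeddings do not predict.
    resid = acts_c - z @ w

    # 4. Average the residuals per emotion to get one probe vector each.
    labels = np.asarray(labels)
    return {e: resid[labels == e].mean(axis=0) for e in np.unique(labels)}

# Toy run on random stand-in data (not real activations or embeddings).
rng = np.random.default_rng(0)
acts = rng.normal(size=(60, 16))
embs = rng.normal(size=(60, 8))
labels = ["joy", "fear", "anger"] * 20
probes = build_emotion_probes(acts, embs, labels, n_pcs=4, alpha=0.5)
```

Because the residuals are mean-centered across all stories, each probe in this sketch is a deviation from the grand mean; with equal story counts per emotion, the probes sum to approximately zero.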
Neighborhood (ranked by edge count)
Concepts (2)
- The prior Anthropic paper on emotion features in Claude, whose findings this paper builds on and extends
- Residual Activation Vectors (implements): layer-40 activations with the component explained by compressed Gemini embeddings subtracted, isolating information not driven by surface text content
Methods (2)
- Ridge regression fit on top-256 PCs of Gemini embeddings to predict model layer-40 activations and compute residuals
- Used to embed story text so that surface-level semantic content can be regressed out from model activations
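The following small sketch shows why "regressing out" removes the embedding-driven signal. It uses an assumed setup and synthetic data. Ridge's normal equations give Z^T R = alpha * W, where Z holds the centered PC scores, R the residuals, and W the fitted weights. With alpha = 0 (ordinary least squares), the residuals therefore carry no linear signal from the retained embedding PCs, and a positive alpha leaves a small, shrunk amount behind. The helper `ridge_residuals` and the random stand-in data are hypothetical.

```python
import numpy as np

def ridge_residuals(acts, z, alpha):
    """Hypothetical helper: residuals of a ridge fit of activations on PC scores z."""
    z_c = z - z.mean(axis=0)          # centering = fitting an intercept
    a_c = acts - acts.mean(axis=0)
    k = z_c.shape[1]
    # Closed-form ridge: W = (Z^T Z + alpha I)^{-1} Z^T A
    w = np.linalg.solve(z_c.T @ z_c + alpha * np.eye(k), z_c.T @ a_c)
    return a_c - z_c @ w, z_c, w

# Synthetic stand-ins: 5 "embedding PCs" that linearly drive 12 activation dims, plus noise.
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 5))
acts = z @ rng.normal(size=(5, 12)) + 0.1 * rng.normal(size=(200, 12))

# Least squares: residuals are orthogonal to every PC score column.
r0, z0, w0 = ridge_residuals(acts, z, alpha=0.0)
# Ridge: the leftover alignment equals alpha * W exactly (up to rounding).
r1, z1, w1 = ridge_residuals(acts, z, alpha=10.0)
```

In practice this trade-off is the design choice: a larger alpha stabilizes the fit when the 256 PCs are many relative to the number of stories, at the cost of leaving a little embedding-aligned signal in the residuals.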
Datasets (1)
- 171 Emotion Probe Set (introduces): set of 171 residual probe vectors, one per emotion concept, constructed by regressing out Gemini embedding effects from story activations
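One plausible way to read out a probe set like this treats each probe as a linear direction: stack the vectors into a matrix, normalize each row to unit length, and project activations onto every row. The helpers `probe_matrix` and `score`, the unit normalization, and dot-product scoring are assumptions for illustration. The source does not specify how probe scores are computed.

```python
import numpy as np

def probe_matrix(probes):
    """Stack {emotion: vector} into a (n_emotions, d_model) matrix of unit-norm rows."""
    names = sorted(probes)
    p = np.stack([probes[n] for n in names])
    return names, p / np.linalg.norm(p, axis=1, keepdims=True)

def score(acts, names, p):
    """Project activations (n, d_model) onto each unit probe; returns {emotion: (n,) scores}."""
    return dict(zip(names, (acts @ p.T).T))

# Toy example with two made-up 3-d probes (the real set has 171 probes in model space).
names, p = probe_matrix({"calm": np.array([1.0, 0.0, 0.0]),
                         "afraid": np.array([0.0, 2.0, 0.0])})
scores = score(np.array([[3.0, 4.0, 5.0]]), names, p)
# scores["afraid"] is array([4.]) and scores["calm"] is array([3.])
```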
Related by similarity (8)
cosine ≥ 0.65 · no typed edge. Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The technique of discovering essential centers by imaginatively inhabiting a culture and using one's own feelings as a measuring instrument
- Linear probes constructed to measure 171 emotion concepts in model activations with surface semantic content removed
- Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
- Orthogonalizes the 171 emotion probes via SVD to create an orthonormal basis for computing SAE feature subspace overlap
- Method used to predict model activations from Gemini embeddings and compute residuals for probe construction
- Claims that agentic self-evaluation provides independent, convergent evidence for the emotion-persistence link
- Creating physical mockups to compare which alternative produces the deepest feeling (used in the Great Hall colors, Eishin wall mockups, and molding).
- Question addressed by testing whether self-evaluation transcripts mentioning emotion words have higher cosine similarity to corresponding probes
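Two of the neighboring entries rely on simple linear algebra. One orthogonalizes the probes via SVD into an orthonormal basis for measuring SAE feature subspace overlap. The other compares transcript activations to probes by cosine similarity. The sketch below assumes probes are stored as rows of a matrix and defines overlap as the fraction of a feature vector's squared norm that lies in the probe span. That metric, the rank tolerance, and the function names are illustrative assumptions.

```python
import numpy as np

def orthonormal_basis(probes, tol=1e-10):
    """Orthonormal basis (rank, d) for the row span of probes (n_probes, d), via SVD."""
    _, s, vt = np.linalg.svd(probes, full_matrices=False)
    rank = int((s > tol * s.max()).sum())  # drop directions from linearly dependent probes
    return vt[:rank]

def subspace_overlap(feature, basis):
    """Fraction of a feature vector's squared norm that lies inside the probe subspace."""
    proj = basis @ feature
    return float(proj @ proj / (feature @ feature))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy probes spanning the first two axes of a 4-d space; the third row is redundant.
toy = np.array([[1.0, 1.0, 0.0, 0.0],
                [1.0, -1.0, 0.0, 0.0],
                [2.0, 0.0, 0.0, 0.0]])
basis = orthonormal_basis(toy)
overlap = subspace_overlap(np.array([0.0, 3.0, 4.0, 0.0]), basis)  # 9 / 25 = 0.36
```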