SVD Orthogonalization of Emotion Probes

Orthogonalizes the 171 emotion probes via SVD, yielding an orthonormal basis of their span that is used to compute how much of an SAE feature lies inside the emotion subspace

Neighborhood — ranked by edge-count

Methods (1)

SAE Feature Emotion Subspace Overlap Metric · method · uses
Fraction of an SAE feature's length lying inside the 171-dimensional subspace spanned by emotion probes, computed via SVD orthogonalization
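
The metric described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function names `emotion_basis` and `subspace_overlap`, the hidden size `d_model = 512`, and the random stand-in data are all hypothetical. It also assumes that "fraction of length" means the ratio of the projected norm to the full norm; squaring that ratio would give the fraction of squared norm instead.

```python
import numpy as np


def emotion_basis(probes: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Return an orthonormal basis (as rows) for the span of the probe vectors.

    probes: array of shape (n_probes, d_model), one probe direction per row.
    """
    # Thin SVD: probes = U @ diag(s) @ Vt. The rows of Vt are orthonormal
    # directions in activation space that span the same subspace as the probes.
    _, s, vt = np.linalg.svd(probes, full_matrices=False)
    # Drop directions with negligible singular values (linearly dependent probes).
    keep = s > tol * s.max()
    return vt[keep]


def subspace_overlap(feature: np.ndarray, basis: np.ndarray) -> float:
    """Fraction of the feature's norm that lies inside the subspace (assumed reading)."""
    coords = basis @ feature  # coordinates of the projection in the orthonormal basis
    return float(np.linalg.norm(coords) / np.linalg.norm(feature))


# Hypothetical stand-in data: 171 random "probes" in a 512-dimensional space.
rng = np.random.default_rng(0)
d_model = 512
probes = rng.standard_normal((171, d_model))
basis = emotion_basis(probes)

# A vector built from the probes lies entirely inside the subspace.
inside = probes.T @ rng.standard_normal(171)
# A random vector lies only partly inside it.
random_feature = rng.standard_normal(d_model)

print(subspace_overlap(inside, basis))
print(subspace_overlap(random_feature, basis))
```

Because the basis rows are orthonormal, the norm of the coordinate vector equals the norm of the projection, so no explicit projection matrix is needed. An overlap near 1 means the feature lies almost entirely in the emotion subspace, and a value near 0 means it is nearly orthogonal to every probe.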

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Emotion subspace overlap (SVD-based) · method · 0.782
Metric measuring how much of an SAE feature vector lies within the 171-dimensional subspace spanned by emotion probes, via SVD orthogonalization
Emotion probes (171-emotion residual vector probes) · method · 0.764
Linear probes constructed to measure 171 emotion concepts in model activations with surface semantic content removed
The correlation between emotion subspace fraction and self-evaluated emotionality validates that emotion probe concepts somewhat overlap with the model's self-reported internal emotions. · claim · 0.740
Claim supporting the validity of the probe construction method via cross-validation with self-report
Emotion probes are more persistent than variance-matched random probes, indicating emotion-specific persistence beyond autoregressive dynamics. · claim · 0.740
Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
Even validated probes may capture distributed representations mixing emotive states with correlated features like persona or style. · claim · 0.727
Caveat on probe interpretation; does not negate the introspection result but affects interpretation of the target variable
Emotion Probe Construction Method · method · 0.721
Method for building 171 emotion probes by generating stories, embedding them, regressing out Gemini embeddings, and averaging residual activations per emotion
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.721
Shows low agreement between the two evaluation modalities
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs. · claim · 0.717
Key methodological claim: difference-in-mean (mass-mean, MM) probes are both competitive in accuracy and superior in causal influence