SAE Feature Emotion Subspace Overlap Metric

Fraction of an SAE feature vector's norm that lies inside the 171-dimensional subspace spanned by the emotion probes, computed via SVD orthogonalization

Neighborhood — ranked by edge-count

Findings (3)

finding

SAE feature emotion subspace overlap correlates with persistence in Cogito: Spearman +0.413, p=4.4e-196
introduces
Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
SAE Feature #43713 associated with agentic defiance and rage, 99th percentile emotion subspace fraction
supports
High subspace fraction feature associated with defiant, uncontrollable agentic behavior in self-steering
SAE Feature #69088 has 100th percentile emotion subspace fraction and produces spooky-themed writing under steering
supports
Shows that highest emotion-subspace-overlap features induce distinctive thematic outputs

Concepts (1)

concept

Emotion Subspace
about
The subspace of activation space spanned by the 171 orthogonalized emotion probe vectors, used to measure SAE feature emotional alignment

Methods (1)

method

SVD Orthogonalization of Emotion Probes
uses
Orthogonalizes the 171 emotion probes via SVD to create an orthonormal basis for computing SAE feature subspace overlap
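The method above can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation: the function names, the 512-dimensional activation size, the random stand-in probes, and the choice to report a norm ratio (rather than a squared-norm ratio) are all assumptions made for the example.

```python
import numpy as np

def emotion_subspace_basis(probes, rtol=1e-10):
    """Return an orthonormal basis (as rows) for the span of the probe vectors.

    probes: array of shape (n_probes, d_model), one probe per row.
    The right singular vectors with non-negligible singular values span
    the same subspace as the probes and are mutually orthonormal.
    """
    _, s, vt = np.linalg.svd(probes, full_matrices=False)
    rank = int(np.sum(s > rtol * s[0]))  # drop numerically dependent directions
    return vt[:rank]

def subspace_fraction(feature, basis):
    """Fraction of the feature's norm that lies inside the subspace.

    Assumption: 'fraction' means ||P f|| / ||f||. Squaring this value gives
    the fraction of squared norm instead.
    """
    coords = basis @ feature  # coordinates of the projection in the basis
    return float(np.linalg.norm(coords) / np.linalg.norm(feature))

# Hypothetical stand-in data: 171 random probes in a 512-dim activation space.
rng = np.random.default_rng(0)
d_model = 512
probes = rng.normal(size=(171, d_model))
basis = emotion_subspace_basis(probes)

inside = probes[0] + probes[1]            # lies entirely in the probe span
random_feature = rng.normal(size=d_model)  # only partly in the probe span

frac_inside = subspace_fraction(inside, basis)
frac_random = subspace_fraction(random_feature, basis)
```

A vector built from the probes themselves scores approximately 1, while an unrelated direction scores strictly between 0 and 1. Percentile statements such as "99th percentile emotion subspace fraction" would then correspond to ranking this score across all SAE features.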

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Emotion subspace overlap (SVD-based) · method · 0.838
Metric measuring how much of an SAE feature vector lies within the 171-dimensional subspace spanned by emotion probes, via SVD orthogonalization
SAE emotion subspace overlap correlates with variance-residualized persistence in Cogito: Spearman +0.413, p = 4.4e-196. · finding · 0.809
Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
SAE Feature #11100 associated with panic, 93rd percentile emotion subspace fraction · finding · 0.798
Shows high emotion subspace overlap for a specific negative emotion feature
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.771
Shows low agreement between the two evaluation modalities
The correlation between emotion subspace fraction and self-evaluated emotionality validates that emotion probe concepts somewhat overlap with the model's self-reported internal emotions. · claim · 0.760
Claim supporting the validity of the probe construction method via cross-validation with self-report
17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations. · finding · 0.753
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features. · claim · 0.753
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.753
Surprising finding that the two evaluation methods diverge in their relationship with persistence