method
active
method:sae-feature-emotion-subspace-overlap-metric
SAE Feature Emotion Subspace Overlap Metric
Fraction of an SAE feature vector's length lying inside the 171-dimensional subspace spanned by the emotion probes, computed by projecting the feature onto an orthonormal basis obtained from the probes via SVD
Neighborhood — ranked by edge-count
Findings (3)
finding
- Demonstrates that SAE features more aligned with the emotion subspace are more persistent in Cogito after variance control
- A feature with a high subspace fraction is associated with defiant, uncontrollable agentic behavior in self-steering
- Shows that highest emotion-subspace-overlap features induce distinctive thematic outputs
Concepts (1)
concept
- Emotion Subspace · about · The subspace of activation space spanned by the 171 orthogonalized emotion probe vectors, used to measure SAE feature emotional alignment
Methods (1)
method
- Orthogonalizes the 171 emotion probes via SVD to create an orthonormal basis for computing SAE feature subspace overlap
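The orthogonalization and overlap computation described above can be sketched in NumPy. This is a minimal illustration, not the original implementation: the function names `emotion_subspace_basis` and `subspace_fraction`, the rank tolerance `rtol`, and the example dimensions are all hypothetical. It also assumes that "fraction of length" means the ratio of the projected norm to the full norm; squaring that ratio would instead give the fraction of squared norm.

```python
import numpy as np


def emotion_subspace_basis(probes: np.ndarray, rtol: float = 1e-10) -> np.ndarray:
    """Return an orthonormal basis (rows) for the span of the probe vectors.

    probes: array of shape (n_probes, d_model), e.g. (171, d_model).
    rtol:   hypothetical tolerance for dropping near-zero singular directions.
    """
    # Rows of vt are orthonormal; those with non-negligible singular
    # values span the same subspace as the probe rows.
    _, s, vt = np.linalg.svd(probes, full_matrices=False)
    keep = s > rtol * s[0]
    return vt[keep]


def subspace_fraction(feature: np.ndarray, basis: np.ndarray) -> float:
    """Ratio of the feature's projected norm to its full norm (assumed definition)."""
    coords = basis @ feature  # coordinates of the projection in the orthonormal basis
    return float(np.linalg.norm(coords) / np.linalg.norm(feature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 512  # illustrative width, not taken from the source
    probes = rng.standard_normal((171, d_model))
    basis = emotion_subspace_basis(probes)

    feature = rng.standard_normal(d_model)  # stand-in for an SAE decoder direction
    print(f"subspace fraction: {subspace_fraction(feature, basis):.3f}")
```

Under this definition the score lies between 0 (feature orthogonal to every probe) and 1 (feature entirely inside the probe span), so it is independent of the feature's overall scale.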
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Metric measuring how much of an SAE feature vector lies within the 171-dimensional subspace spanned by emotion probes, via SVD orthogonalization
- Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
- Shows high emotion subspace overlap for a specific negative emotion feature
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.) · finding · 0.771 · Shows low agreement between the two evaluation modalities
- Claim supporting the validity of the probe construction method via cross-validation with self-report
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
- Surprising finding that the two evaluation methods diverge in their relationship with persistence