finding

active

finding:negative-correlation-between-self-evaluated-emotion-persistence-and-sae-feature-activation-variance-explained-rho-0-184-p-4-6e-09

Negative correlation between self-evaluated emotion persistence and SAE feature activation variance explained: rho=-0.184, p=4.6e-09

Shows that self-evaluated emotionality is negatively confounded by activation variance explained, so variance must be controlled for before the persistence signal becomes visible

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Methods (1)

method

Variance-Matched Random Probe Comparison
supports
Controls for variance by sampling random directions from top-k PC spaces that match each emotion probe's explained variance, then subtracting the median persistence of 20 matched directions
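
The control described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, `k` is chosen so that the mean eigenvalue of the top-k PCs is closest to the probe's explained variance (one plausible reading of "matching"), and `lag1_autocorr` is only a stand-in for whatever persistence metric the paper uses, which this page does not specify.

```python
import numpy as np

def variance_explained(direction, centered):
    """Fraction of total activation variance along a (normalized) direction."""
    d = direction / np.linalg.norm(direction)
    return (centered @ d).var() / centered.var(axis=0).sum()

def matched_random_directions(acts, probe, n_dirs=20, rng=None):
    """Sample unit directions from a top-k PC subspace matched to the probe.

    Assumption: a random unit vector with Gaussian coefficients in the top-k
    PC subspace explains, in expectation, the mean of the top-k eigenvalues,
    so k is picked to bring that mean closest to the probe's value.
    """
    rng = np.random.default_rng(rng)
    centered = acts - acts.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
    eigvals = s**2 / centered.shape[0]
    target = variance_explained(probe, centered)
    mean_topk = np.cumsum(eigvals) / np.arange(1, len(eigvals) + 1) / eigvals.sum()
    k = int(np.argmin(np.abs(mean_topk - target))) + 1
    dirs = rng.standard_normal((n_dirs, k)) @ vt[:k]
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return dirs, k

def lag1_autocorr(x):
    """Hypothetical stand-in persistence metric: lag-1 autocorrelation."""
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

def corrected_persistence(acts, probe, persistence_fn, n_dirs=20, rng=None):
    """Probe persistence minus the median persistence of matched directions."""
    dirs, _ = matched_random_directions(acts, probe, n_dirs, rng)
    baseline = np.median([persistence_fn(acts @ d) for d in dirs])
    unit_probe = probe / np.linalg.norm(probe)
    return persistence_fn(acts @ unit_probe) - baseline
```

Here `acts` is assumed to be a (positions × hidden-dim) array with rows in temporal order. The median is used as the baseline, as in the description, which keeps the correction robust to an occasional outlier among the 20 sampled directions.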

Findings (1)

finding

Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal.
restates
Explains why variance correction is needed to see the self-evaluation–persistence relationship
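
One common way to apply such a correction is a partial Spearman correlation, where the variance covariate is regressed out of both rank-transformed variables before correlating them. The sketch below is illustrative only: `partial_spearman` is a hypothetical helper, and the source does not confirm that the paper residualizes in exactly this way.

```python
import numpy as np
from scipy.stats import rankdata, pearsonr

def partial_spearman(x, y, covariate):
    """Spearman correlation of x and y after removing a covariate.

    All three inputs are rank-transformed, the covariate's ranks are
    regressed (linearly) out of the other two, and the residuals are
    correlated. The p-value is approximate: it does not account for the
    degree of freedom used by the covariate.
    """
    rx, ry, rc = (rankdata(v) for v in (x, y, covariate))

    def residualize(r):
        slope, intercept = np.polyfit(rc, r, 1)
        return r - (slope * rc + intercept)

    result = pearsonr(residualize(rx), residualize(ry))
    return float(result[0]), float(result[1])
```

With `x` as self-evaluated emotionality, `y` as persistence, and `covariate` as activation variance explained, a negative link between `x` and the covariate can mask a positive `x`-`y` relationship in the raw correlation that reappears once the covariate is removed.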

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.861
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5. · finding · 0.859
Shows that model self-report of emotion predicts long-range feature persistence
Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.849
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations. · finding · 0.837
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.834
Shows low agreement between the two evaluation modalities
SAE emotion subspace overlap correlates with variance-residualized persistence in Cogito: Spearman +0.413, p = 4.4e-196. · finding · 0.824
Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.822
Surprising finding that the two evaluation methods diverge in their relationship with persistence
Emotion probe persistence correlation of 0.214 in Cogito v2.1 vs 0.099 for random vectors · finding · 0.809
Quantifies emotion feature persistence above random baseline in Cogito across 240 multi-turn conversations

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal.