finding

active

finding:correlation-between-self-evaluation-and-textual-evaluation-of-sae-feature-emotionality-rho-0-051-n-s

Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.)

Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
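
As a minimal sketch of how a rank correlation like the reported rho could be computed, the snippet below implements Spearman's rho as the Pearson correlation of average ranks. The function names and the example ratings are hypothetical illustrations, not the paper's code or data. The significance test behind the "n.s." label is not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(vx * vy)


# Hypothetical 0-100 emotionality ratings for six features (illustration only).
textual_scores = [10, 80, 35, 60, 5, 90]
self_eval_scores = [55, 40, 70, 20, 65, 30]
print(f"rho = {spearman_rho(textual_scores, self_eval_scores):+.3f}")
```

A rho near zero, as in this finding, means the ordering of features under one method says little about their ordering under the other.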

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (1)

claim

Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.)
restates · supports
Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals

Hypotheses (1)

hypothesis

If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general
supports
Speculative claim about scaling introspective access to general SAE feature interpretation

Methods (1)

method

Textual SAE feature emotionality evaluation
introduces
Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.).
finding · 0.899
Shows low agreement between the two evaluation modalities

Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001
finding · 0.875
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction

Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions.
claim · 0.856
Surprising finding that the two evaluation methods diverge in their relationship with persistence

17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations.
finding · 0.854
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality

Negative correlation between self-evaluated emotion persistence and SAE feature activation variance explained: rho=-0.184, p=4.6e-09
finding · 0.849
Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal

Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal.
finding · 0.844
Explains why variance correction is needed to see the self-evaluation–persistence relationship

Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality.
claim · 0.838
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature

Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5.
finding · 0.831
Shows that model self-report of emotion predicts long-range feature persistence
Shows that model self-report of emotion predicts long-range feature persistence

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.)