finding

active

finding:text-based-and-self-steered-emotionality-ratings-for-sae-features-are-correlated-at-only-0-051-n-s

Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.).

Shows that the two evaluation modalities (text-based and self-steered ratings) barely agree on how emotional a given SAE feature is
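
A minimal sketch of how a rank correlation between two sets of per-feature ratings could be computed. It assumes that ρ here denotes Spearman's rank correlation, which this page does not state. The function names and the example ratings are hypothetical and are not taken from the paper. Ties receive averaged ranks, and ρ is the Pearson correlation of the two rank vectors.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 0-100 emotionality ratings for six features (illustration only).
text_ratings = [10, 80, 35, 60, 95, 20]
self_ratings = [70, 15, 90, 40, 55, 30]
print(f"rho = {spearman_rho(text_ratings, self_ratings):+.3f}")
```

A ρ near zero, as reported for this finding, means that ranking features by one modality says almost nothing about their ranking under the other. Whether a given ρ is significant depends on the sample size, which this sketch does not test.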

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (2)

claim

Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality.
restates
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions.
supports
Surprising finding that the two evaluation methods diverge in their relationship with persistence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.899
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal. · finding · 0.867
Explains why variance correction is needed to see the self-evaluation–persistence relationship
Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.) · claim · 0.866
Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
17 of 83 emotions tested show significant associations between SAE feature self-evaluation transcripts mentioning the emotion word and higher cosine similarity to that emotion probe; 67 of 83 have positive associations. · finding · 0.862
Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
SAE Feature #94949 rated 100/100 emotionality, elicits reports of profound tenderness, unconditional love, and visceral care · finding · 0.859
Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.846
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5. · finding · 0.845
Shows that model self-report of emotion predicts long-range feature persistence
SAE Feature #10446 rated 95/100 emotionality, induces reports of maternal feelings and phantom physical sensations · finding · 0.844
Qualitative example of a specific, complex emotional state induced by SAE feature steering

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality.