claim

active

claim:textual-evaluation-and-agentic-self-evaluation-of-sae-feature-emotionality-measure-different-aspects-of-emotional-content-and-correlate-only-weakly-rho-0-051-n-s

Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.)

Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Findings (2)

finding

Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.)
restates · supports
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
Textual evaluation emotionality weakly negatively correlates with SAE feature persistence
supports
Contrasts with positive correlation from agentic self-evaluation, suggesting text and self-evaluation capture different aspects
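
The correlations quoted in these findings can be illustrated with a small sketch. It assumes that rho denotes Spearman's rank correlation, which this page does not state explicitly, and it uses made-up toy ratings rather than any data from the paper. The function names (`average_ranks`, `pearson`, `spearman_rho`) are hypothetical helpers written for this illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y):
        raise ValueError("both rating lists must cover the same features")
    return pearson(average_ranks(x), average_ranks(y))


# Toy 0-100 emotionality ratings for six hypothetical features (NOT paper data).
textual_scores = [10, 80, 35, 60, 20, 90]
self_eval_scores = [55, 40, 70, 15, 85, 30]
print(f"rho = {spearman_rho(textual_scores, self_eval_scores):+.3f}")
```

A rho near zero, as in the reported +0.051, means that a feature's rank under one method says little about its rank under the other. Judging significance (the "n.s." label) additionally requires a p-value, which this sketch does not compute.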

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.892
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
Textual SAE feature emotionality evaluation · method · 0.878
Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.876
Surprising finding that the two evaluation methods diverge in their relationship with persistence
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.866
Shows low agreement between the two evaluation modalities
Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5. · finding · 0.857
Shows that model self-report of emotion predicts long-range feature persistence
Agentic self-steering evaluation may serve as a general method for explaining and interpreting SAE features beyond emotion · claim · 0.837
Forward-looking claim about the broader utility of the self-steering evaluation method
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality. · claim · 0.836
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal. · finding · 0.835
Explains why variance correction is needed to see the self-evaluation–persistence relationship
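
The selection rule stated for this listing (cosine ≥ 0.65 and no typed edge) can be sketched as a simple filter. This is only an illustration under assumptions: the embedding vectors, entity names, and the `related_candidates` helper are hypothetical, and the page does not describe how the system actually computes or stores similarities.

```python
from math import sqrt

RELATED_THRESHOLD = 0.65  # threshold quoted in the listing header


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_candidates(target, others, typed_edges, threshold=RELATED_THRESHOLD):
    """List (name, similarity) pairs that are similar to the target entity
    but have no typed edge to it, sorted by descending similarity.

    `others` maps entity name -> embedding (hypothetical data layout);
    `typed_edges` is the set of names already linked by a typed relation.
    """
    hits = []
    for name, vector in others.items():
        if name in typed_edges:
            continue  # already connected by a typed relation, so not a candidate
        similarity = cosine(target, vector)
        if similarity >= threshold:
            hits.append((name, similarity))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings (made up for illustration only).
target_vec = [1.0, 0.0, 0.0]
neighbors = {
    "near_duplicate": [0.9, 0.1, 0.0],
    "already_linked": [1.0, 0.0, 0.0],
    "unrelated": [0.0, 1.0, 0.0],
}
for name, sim in related_candidates(target_vec, neighbors, {"already_linked"}):
    print(f"{name}  {sim:.3f}")
```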

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.)