finding
active
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.).
Shows low agreement between the two evaluation modalities
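The ρ above reads like a rank correlation. Assuming it is Spearman's ρ (this page does not name the statistic), agreement between the two modalities could be computed from per-feature rating pairs roughly as in the sketch below. The function names are illustrative and do not come from the source paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(text_ratings, self_ratings):
    """Spearman's rho: Pearson correlation of the rank-transformed ratings.

    Assumes one rating per SAE feature in each list, in the same feature order.
    """
    if len(text_ratings) != len(self_ratings):
        raise ValueError("rating lists must have the same length")
    return pearson(average_ranks(text_ratings), average_ranks(self_ratings))
```

A value near zero, as reported here, would mean that ranking features by one modality says almost nothing about their ranking under the other.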
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (2)
claim
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.899 · Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
- Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
- Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.846 · Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
- Shows that model self-report of emotion predicts long-range feature persistence
- Qualitative example of a specific, complex emotional state induced by SAE feature steering
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.