finding
active
finding:textual-evaluation-emotionality-weakly-negatively-correlates-with-sae-feature-persistence
Textual evaluation emotionality weakly negatively correlates with SAE feature persistence
Contrasts with the positive correlation found for agentic self-evaluation (rho=+0.124, p=0.0001), suggesting that textual evaluation and self-evaluation capture different aspects of SAE feature emotionality
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
Methods (1)
method
- Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.832 · Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
- Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.825 · Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.824 · Shows low agreement between the two evaluation modalities
- Shows that model self-report of emotion predicts long-range feature persistence
- Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
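
The findings above report their correlations as rho values. Assuming rho denotes Spearman's rank correlation (the page does not name the statistic), the following minimal sketch shows how such a coefficient could be computed between per-feature emotionality ratings and a persistence score. The function names, the variable names, and the toy numbers are all hypothetical illustrations; they are not the source paper's code or data and do not reproduce the reported values.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one entry per SAE feature (values are invented).
textual_emotionality = [10, 40, 55, 70, 90]   # 0-100 rating from text samples
persistence = [0.9, 0.5, 0.6, 0.4, 0.3]       # assumed persistence score

rho = spearman_rho(textual_emotionality, persistence)
print(f"rho = {rho:+.3f}")
```

The same function could be applied to self-evaluation ratings against persistence, or to the two rating sets against each other, to compare the signs and magnitudes of the three relationships. Significance testing (the p-values quoted above) is not covered by this sketch.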