finding
active
finding:correlation-between-self-evaluation-and-textual-evaluation-of-sae-feature-emotionality-rho-0-051-n-s
Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.)
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
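As a rough illustration of the statistic reported above, the sketch below computes a rank correlation between two sets of per-feature emotionality ratings. It assumes "rho" denotes Spearman's rank correlation, which this page does not state explicitly. The function names and the rating values are hypothetical placeholders, not data or code from the source paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)


# Made-up placeholder ratings (0-100) for five hypothetical SAE features.
# These are NOT values from the paper; they only show the call shape.
self_eval = [10, 80, 35, 60, 20]
text_eval = [55, 40, 70, 30, 65]
rho = spearman_rho(self_eval, text_eval)
print(f"rho = {rho:+.3f}")
```

A value near zero, as in the reported +0.051, would mean that ranking features by one method says almost nothing about their ranking under the other. Judging significance needs a separate test, which this sketch does not attempt.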
Source paper
extracted_from
Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
Hypotheses (1)
hypothesis
- Speculative claim about scaling introspective access to general SAE feature interpretation
Methods (1)
method
- Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.) · finding · 0.899 · Shows low agreement between the two evaluation modalities
- Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.875 · Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
- Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
- Shows that model self-report of emotion predicts long-range feature persistence
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.