finding
active
finding:agentic-self-evaluation-of-sae-feature-emotionality-correlates-with-residual-persistence-0-124-p-0-0001-in-kimi-k2-5
Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5.
Shows that model self-report of emotion predicts long-range feature persistence
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (2)
claim
- SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features. · associated_with · supports · Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
- Forward-looking claim about the potential of model introspection as an interpretability tool
Findings (1)
finding
- Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · restates · Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal
- Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.845 · Shows low agreement between the two evaluation modalities
- Forward-looking claim about the broader utility of the self-steering evaluation method
- Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.831 · Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
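
Several entries above say that variance control is needed before the self-evaluation–persistence relationship becomes visible. The sketch below shows one generic way such an effect can arise, using a partial Spearman correlation on synthetic data. It is an illustration only: the source describes variance-matched features, not partial correlation, and every name here (`partial_spearman`, `synthetic_demo`, the simulated columns) is hypothetical. The synthetic data also assumes persistence rises with variance, which the entries above do not state. None of the numbers come from the paper.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Rank correlation of x and y after controlling for z (first-order partial)."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def synthetic_demo(seed=0, n=2000):
    """Simulated features (NOT data from the paper).

    Assumption: emotionality falls with variance while persistence rises
    with it, so the raw correlation hides a shared underlying signal.
    """
    rng = random.Random(seed)
    shared = [rng.gauss(0, 1) for _ in range(n)]
    variance = [rng.gauss(0, 1) for _ in range(n)]
    emotionality = [s - v + 0.5 * rng.gauss(0, 1) for s, v in zip(shared, variance)]
    persistence = [s + v + 0.5 * rng.gauss(0, 1) for s, v in zip(shared, variance)]
    raw = spearman(emotionality, persistence)
    controlled = partial_spearman(emotionality, persistence, variance)
    return raw, controlled


if __name__ == "__main__":
    raw, controlled = synthetic_demo()
    print(f"raw rho = {raw:+.3f}, variance-controlled rho = {controlled:+.3f}")
```

In this simulation the raw correlation sits near zero while the variance-controlled one is clearly positive, which is the qualitative pattern the entries describe. Matching features on variance, as the source reports, is a different control strategy that targets the same confound.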
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.