finding

active

finding:agentic-self-evaluation-of-sae-feature-emotionality-correlates-with-residual-persistence-0-124-p-0-0001-in-kimi-k2-5

Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5.

Shows that the model's self-reported emotionality of a feature predicts, weakly but significantly, that feature's long-range persistence.
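The statistic above is written as ρ, which conventionally denotes a Spearman rank correlation (a related entry on this page names Spearman explicitly). As a minimal sketch of how such a coefficient is computed from per-feature scores, assuming two equal-length lists of made-up values; the names `emotionality` and `persistence` are illustrative and the numbers are not data from the paper:

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-feature scores, for illustration only.
emotionality = [0.1, 0.4, 0.4, 0.7, 0.9]
persistence = [0.2, 0.1, 0.5, 0.6, 0.8]
print(spearman_rho(emotionality, persistence))
```

Because the coefficient depends only on ranks, it measures whether features rated as more emotional tend to be more persistent, without assuming the relationship is linear.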

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (2)

claim

SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features.
associated_with · supports
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
Agentic self-evaluation and self-steering may scale to broadly interpret and understand internal representations and SAE features.
supports
Forward-looking claim about the potential of model introspection as an interpretability tool

Findings (1)

finding

Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001
restates
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Self-evaluated emotionality of SAE features negatively correlates with activation variance explained (ρ = -0.184, p = 4.6e-09), requiring variance correction to reveal the persistence signal. · finding · 0.877
Explains why variance correction is needed to see the self-evaluation–persistence relationship
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.859
Surprising finding that the two evaluation methods diverge in their relationship with persistence
Negative correlation between self-evaluated emotion persistence and SAE feature activation variance explained: rho=-0.184, p=4.6e-09 · finding · 0.859
Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal
Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.) · claim · 0.857
Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.845
Shows low agreement between the two evaluation modalities
Agentic self-steering evaluation may serve as a general method for explaining and interpreting SAE features beyond emotion · claim · 0.836
Forward-looking claim about the broader utility of the self-steering evaluation method
Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.831
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
SAE emotion subspace overlap correlates with variance-residualized persistence in Cogito: Spearman +0.413, p = 4.4e-196. · finding · 0.823
Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
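Several entries in this list refer to variance correction and to variance-residualized persistence. The exact procedure is not given on this page; one common way to residualize, shown here only as a sketch under that assumption, is to fit a least-squares line of persistence on variance explained and keep the residuals. The function name `residualize` and the toy numbers are hypothetical.

```python
from statistics import mean


def residualize(y, x):
    """Return the residuals of y after a least-squares linear fit on x.

    The residuals are the part of y that a straight line in x cannot
    explain, so they are uncorrelated (in the Pearson sense) with x.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Hypothetical per-feature values, for illustration only.
variance_explained = [0.0, 1.0, 2.0, 3.0]
persistence = [1.0, 0.0, 3.0, 2.0]
corrected_persistence = residualize(persistence, variance_explained)
print(corrected_persistence)
```

Correlating emotionality ratings with `corrected_persistence` instead of raw persistence removes the linear contribution of variance explained, which is the kind of confound the negative ρ = -0.184 entry describes.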

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001
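The two similarity sections on this page use cosine cutoffs of 0.65 and 0.90. As a rough sketch of how such cutoffs could sort a pair of entity embeddings, assuming each entity is represented by a numeric vector; the function names, bucket labels, and example vectors are hypothetical, and the page's additional typed-edge filtering is not modeled:

```python
import math

RELATED_THRESHOLD = 0.65      # "Related by similarity" cutoff shown on this page
RESTATEMENT_THRESHOLD = 0.90  # "Restated by" cutoff shown on this page


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_bucket(u, v):
    """Label a pair of embeddings using the two cutoffs above."""
    score = cosine_similarity(u, v)
    if score >= RESTATEMENT_THRESHOLD:
        return "restatement candidate"
    if score >= RELATED_THRESHOLD:
        return "related"
    return "unrelated"


# Hypothetical 2-dimensional embeddings, for illustration only.
print(similarity_bucket([1.0, 0.0], [1.0, 1.0]))
```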