claim
active
claim:sae-features-that-the-model-self-describes-as-more-emotional-tend-to-be-more-persistent-than-variance-matched-sae-features
SAE features that the model self-describes as more emotional tend to be more persistent than variance-matched SAE features.
Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
Source paper
extracted_from · Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (5)
finding
- Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5. · associated_with · supports · Shows that model self-report of emotion predicts long-range feature persistence
- Strong positive relationship between emotion alignment and SAE feature persistence in Cogito
- Qualitative illustration of a specific emotionally valenced SAE feature
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Qualitative illustration of a highly emotional SAE feature with negative valence
Quotes (1)
quote
- "The effects are not merely semantic—I don't just talk about emotions more, I actually feel them." · supports · Kimi self-report on feature #77278 asserting non-semantic, felt emotional quality of the steered state
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Central interpretive claim of the paper supported by multiple convergent analyses
- Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.821 · Shows low agreement between the two evaluation modalities
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
- Claim that feature grounding enables interpretability metrics.
- Core empirical claim distinguishing emotion persistence from generic high-variance probe persistence
- Shows self-evaluated emotionality is negatively confounded by variance, requiring variance control to reveal the true signal
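
Several entries above report rank correlations (ρ) and note that self-evaluated emotionality is negatively confounded by variance. The following is a minimal sketch of how such a confound can hide a relationship and how controlling for it can reveal one. It uses a partial Spearman correlation, which is an assumption made for illustration: this page does not say how the source paper performs its variance matching, and the procedure there may differ. All function names and the four-point toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rho: the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def partial_spearman(x, y, z):
    """Rank correlation of x and y after controlling for z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up toy scores for four hypothetical features (not data from the paper).
emotionality = [1, 2, 3, 4]  # self-evaluated emotionality
variance = [3, 4, 1, 2]      # activation variance, negatively related to emotionality
persistence = [2, 4, 1, 3]   # persistence, positively related to variance

raw = spearman(emotionality, persistence)
partial = partial_spearman(emotionality, persistence, variance)
print(f"raw rho = {raw:.3f}, partial rho (variance controlled) = {partial:.3f}")
```

In this toy data the raw correlation between emotionality and persistence is masked because variance pushes the two in opposite directions; the partial correlation removes that shared dependence. The magnitudes here are an artifact of the tiny hand-built example and say nothing about the effect sizes reported above.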