claim
active
claim:self-evaluated-emotionality-and-textual-evaluation-of-sae-features-predict-persistence-in-opposite-directions
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions.
Surprising finding that the two evaluation methods diverge in their relationship with persistence
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Shows low agreement between the two evaluation modalities
Quotes (1)
quote
- Kimi's self-description of a 100-rated feature, illustrating discordance with text-evaluation which rated it 'melancholic'
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Textual evaluation emotionality weakly negatively correlates with SAE feature persistence · finding · 0.893
  Contrasts with the positive correlation from agentic self-evaluation, suggesting text and self-evaluation capture different aspects
- Explains why variance correction is needed to see the self-evaluation–persistence relationship
- Novel finding that agentic self-evaluation of emotionality correlates with feature persistence
- Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
- Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.859
  Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
- Shows that model self-report of emotion predicts long-range feature persistence
- Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.) · finding · 0.856
  Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- Forward-looking claim about the broader utility of the self-steering evaluation method
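
The rho values quoted in the findings above are rank correlations. As a minimal sketch of what "predicting persistence in opposite directions" means, the snippet below computes Spearman's rho for two scorings against a persistence measure. Everything in it is an assumption for illustration: the function names are hypothetical, the six toy scores are invented, and the effect is deliberately exaggerated compared with the weak correlations reported here (for example rho=+0.124). It is not the source paper's pipeline or data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy scores for six hypothetical features (NOT real data).
persistence = [1, 2, 3, 4, 5, 6]
self_eval = [10, 20, 15, 40, 35, 60]   # rises with persistence
text_eval = [50, 45, 48, 30, 35, 20]   # falls with persistence

rho_self = spearman(self_eval, persistence)
rho_text = spearman(text_eval, persistence)
print(f"self-evaluation vs persistence: rho={rho_self:+.3f}")
print(f"text evaluation vs persistence: rho={rho_text:+.3f}")
```

In this toy setup the two scorings have opposite signs against persistence, which is the pattern the claim describes. With real feature-level scores, a library routine that also reports a p-value would be the more practical choice.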