Textual SAE feature emotionality evaluation

Method in which Kimi compares steered and unsteered text samples produced by another instance and rates the emotionality of the steered SAE feature on a 0-100 scale

Neighborhood — ranked by edge-count

Findings (2)

finding

Correlation between self-evaluation and textual evaluation of SAE feature emotionality: rho=+0.051 (n.s.)
introduces
Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
Textual evaluation emotionality weakly negatively correlates with SAE feature persistence
introduces
Contrasts with the positive correlation found for agentic self-evaluation, suggesting that textual evaluation and self-evaluation capture different aspects of emotionality
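
The rho values in these findings are rank correlations between per-feature scores. The sketch below shows one way such a Spearman coefficient could be computed, assuming the two methods each yield one 0-100 rating per SAE feature. The function names and the sample ratings are hypothetical placeholders, not the data or code behind the reported rho=+0.051, and no significance test is included.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical ratings for five features (illustrative placeholders only).
self_eval_ratings = [100, 95, 97, 40, 10]
textual_ratings = [30, 80, 55, 60, 20]
print(f"rho = {spearman_rho(self_eval_ratings, textual_ratings):+.3f}")
```

A value near zero would mean that ranking features by one method says little about their ranking under the other, which is how the page reads the reported coefficient.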

Concepts (1)

concept

model introspection
implements
The capacity of a model to self-report on its internal emotional state when its SAE features are steered, used here as a measurement tool

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.) · claim · 0.878
Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions · claim · 0.847
Surprising finding that the two evaluation methods diverge in their relationship with persistence
SAE Feature #94949 rated 100/100 emotionality, elicits reports of profound tenderness, unconditional love, and visceral care · finding · 0.817
Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.) · finding · 0.797
Shows low agreement between the two evaluation modalities
SAE Feature #10446 rated 95/100 emotionality, induces reports of maternal feelings and phantom physical sensations · finding · 0.786
Qualitative example of a specific, complex emotional state induced by SAE feature steering
Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001 · finding · 0.781
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
SAE features · concept · 0.781
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
SAE Feature #10011 rated 97/100 emotionality, elicits reports of despair, crushing weight, and existential hunger · finding · 0.772
Qualitative example of a highly emotional SAE feature with intense negative valence in Kimi self-steering