method
active
method:textual-sae-feature-emotionality-evaluation
Textual SAE feature emotionality evaluation
Method in which Kimi compares steered and unsteered text samples produced by another instance and rates the emotionality of each SAE feature on a 0-100 scale
Neighborhood — ranked by edge-count
Findings (2)
finding
- Shows that the two evaluation methods for emotionality are largely uncorrelated, indicating they capture different signals
- Contrasts with positive correlation from agentic self-evaluation, suggesting text and self-evaluation capture different aspects
Concepts (1)
concept
- model introspection · implements · The capacity of a model to self-report on its internal emotional state when its SAE features are steered, used here as a measurement tool
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
- Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.). · finding · 0.797 · Shows low agreement between the two evaluation modalities
- Qualitative example of a specific, complex emotional state induced by SAE feature steering
- Agentic self-evaluation emotionality correlates with SAE feature persistence: ρ = +0.124, p = 0.0001 · finding · 0.781 · Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- Qualitative example of a highly emotional SAE feature with intense negative valence in Kimi self-steering
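The ρ values quoted in the entries above are rank correlations between per-feature scores. As a minimal sketch of how such a value could be computed, assuming the 0-100 emotionality ratings from the two evaluation methods are available as equal-length lists (one entry per SAE feature), the following computes Spearman's ρ in plain Python. The function names and the toy ratings are hypothetical and chosen only for illustration; this is not the pipeline that produced the reported figures, and it does not compute a p-value.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy ratings (made up for illustration, not data from the study).
    text_based = [12, 55, 40, 90, 33]
    self_steered = [70, 20, 45, 60, 10]
    print(f"rho = {spearman_rho(text_based, self_steered):+.3f}")
```

A rank correlation is a natural fit for bounded 0-100 ratings because it depends only on how each method orders the features, not on how each rater uses the scale. A ρ near zero, as in the text-based versus self-steered comparison, means the two methods order the features almost independently.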