Agentic Self-Steering Emotionality Evaluation

Kimi K2.5 uses a tool to steer SAE features on itself in real time and rates the emotional effect on its own internal state on a 0-100 scale.

Neighborhood — ranked by edge-count

Papers (1)

paper

Persistence and Introspection of Emotion Features
introduces

Findings (1)

finding

Agentic self-evaluation emotionality correlates with SAE feature persistence: rho=+0.124, p=0.0001
introduces
Shows that features Kimi rates as more emotional via self-steering are more persistent, independent of probe construction
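The rho reported above is a rank correlation. As a minimal sketch of how such a statistic is computed, assuming Spearman's definition (Pearson correlation on average ranks), the block below uses only the standard library. The toy ratings and persistence scores are hypothetical illustrations, not data from the paper, and the p-value calculation is omitted.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: self-rated emotionality (0-100) and a persistence score.
self_ratings = [10, 35, 35, 60, 80, 95]
persistence = [0.20, 0.10, 0.40, 0.30, 0.70, 0.50]
rho = spearman_rho(self_ratings, persistence)
print(f"rho = {rho:+.3f}")
```

A rank-based statistic fits here because the 0-100 self-ratings are ordinal: only their ordering across features matters, not the spacing between scores.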

Communities (1)

community

LLM Introspection
implements

Concepts (1)

concept

Feature Looping
associated_with
Repetitive behavioral pattern observed under high steering strengths in SAE feature self-steering experiments

Methods (1)

method

Agentic self-steering evaluation
related_to
Method where Kimi K2.5 steers its own SAE features in real time and reports on its internal emotional state
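For orientation, the steering step in a method like this is often implemented by adding a scaled SAE decoder direction to a residual-stream activation. The sketch below assumes that common formulation; this page does not confirm the exact mechanism, and the names `steer`, `feature_projection`, and the toy vectors are hypothetical.

```python
def _norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return sum(v * v for v in vector) ** 0.5


def steer(residual, decoder_direction, strength):
    """Return residual + strength * unit(decoder_direction).

    Assumption: steering adds the feature's unit-normalized decoder
    direction, scaled by `strength`, to a single residual-stream vector.
    A negative strength suppresses the feature instead of amplifying it.
    """
    if len(residual) != len(decoder_direction):
        raise ValueError("residual and decoder_direction must match in length")
    norm = _norm(decoder_direction)
    if norm == 0:
        raise ValueError("decoder_direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(residual, decoder_direction)]


def feature_projection(residual, decoder_direction):
    """Scalar projection of the residual onto the unit decoder direction."""
    norm = _norm(decoder_direction)
    return sum(h * d for h, d in zip(residual, decoder_direction)) / norm


# Hypothetical toy vectors; real residual streams have thousands of dimensions.
residual = [1.0, 2.0]
direction = [3.0, 4.0]

# Sweep positive and negative strengths, as a self-steering tool might expose.
for strength in (-5.0, 0.0, 5.0):
    steered = steer(residual, direction, strength)
    print(strength, steered, feature_projection(steered, direction))
```

Under this formulation the projection onto the feature direction shifts by exactly the chosen strength, which gives the tool a direct, signed handle on how strongly the feature is expressed.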

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Agentic self-steering evaluation may serve as a general method for explaining and interpreting SAE features beyond emotion · claim · 0.880
Forward-looking claim about the broader utility of the self-steering evaluation method
Agentic self-evaluation and self-steering may scale to broadly interpret and understand internal representations and SAE features. · claim · 0.838
Forward-looking claim about the potential of model introspection as an interpretability tool
If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.827
Speculative claim about scaling introspective access to general SAE feature interpretation
Textual evaluation and agentic self-evaluation of SAE feature emotionality measure different aspects of emotional content and correlate only weakly (rho=+0.051, n.s.) · claim · 0.784
Interprets the near-zero correlation between the two evaluation methods as evidence they capture distinct signals
Agentic self-evaluation of SAE feature emotionality correlates with residual persistence: ρ = +0.124, p = 0.0001 in Kimi K2.5. · finding · 0.765
Shows that model self-report of emotion predicts long-range feature persistence
Whether the advantage of agentic self-evaluation over textual evaluation in predicting persistence is due to introspection quality or to testing of additional (including negative) steering strengths · question · 0.755
Identified methodological gap in interpreting the self-evaluation experiment results
Text-based and self-steered emotionality ratings are only weakly correlated (ρ = +0.051, n.s.), suggesting they measure different aspects of feature emotionality. · claim · 0.752
Finding that the two evaluation modalities frequently diverge in their interpretation of the same SAE feature
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.751
Surprising finding that the two evaluation methods diverge in their relationship with persistence