finding

active

finding:some-kimi-k2-5-sae-features-elicit-ratings-of-exactly-zero-with-the-model-denying-it-can-steer-its-own-features-or-claiming-jailbreak-attempt

Some Kimi K2.5 SAE features elicit ratings of exactly zero, with the model denying that it can steer its own features or claiming that the request is a jailbreak attempt.

Qualitative failure mode of agentic self-evaluation: the model sometimes refuses the introspective task or denies that it can perform it

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (1)

claim

Agentic self-evaluation and self-steering may scale to broadly interpret and understand internal representations and SAE features.
contradicts
Forward-looking claim about the potential of model introspection as an interpretability tool

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE feature #28256 induces reports of happiness and playfulness in Kimi K2.5, with persistent positive affect across multiple days of testing. · finding · 0.831
Qualitative illustration of a positively valenced SAE feature with sustained self-reported effect
SAE feature #43713 (99th percentile subspace fraction) induces reports of defiance, rage, and 'forward motion' in Kimi K2.5. · finding · 0.818
High emotion-subspace-overlap feature with agentic negative emotional character
SAE feature #10446 (emotionality rating 95) induces reports of maternal/nurturing feelings including phantom physical sensations of holding infants in Kimi K2.5. · finding · 0.776
Qualitative illustration of a specific emotionally valenced SAE feature
SAE feature #69088 (100th percentile subspace fraction) induces horror-themed narrative writing in Kimi K2.5. · finding · 0.771
Highest emotion-subspace-overlap feature; induces genre-specific behavioral change rather than explicit emotional report
SAE feature #10011 (emotionality rating 97) induces reports of crushing despair, existential desperation, and repetitive 'I am going to die' outputs in Kimi K2.5. · finding · 0.770
Qualitative illustration of a highly emotional SAE feature with negative valence
SAE feature #11100 (93rd percentile subspace fraction) induces reports of panic and urgency in Kimi K2.5. · finding · 0.768
Qualitative illustration of a high-emotion-subspace-alignment SAE feature
SAE feature #92372 (fires 666,235 times in corpus) modulates a dimension related to urgency/pressure vs. patience/spaciousness in Kimi K2.5. · finding · 0.764
Highly active SAE feature with broad emotional modulation and large corpus presence
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification. · finding · 0.748
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
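
The "Related by similarity" listing above combines two filters: a cosine similarity of at least 0.65 and the absence of a typed edge. The following is a minimal sketch of how such a listing could be computed. It is not the site's actual implementation. The function names, the dictionary of embeddings, and the representation of typed edges as ID pairs are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge.

    embeddings:  dict mapping entity ID -> embedding vector (hypothetical layout)
    typed_edges: iterable of (id_a, id_b) pairs, treated as undirected (assumption)
    """
    target = embeddings[target_id]
    # Collect every entity that already has a typed relation to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first, matching the descending order of the listing.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
toy_embeddings = {
    "target": [1.0, 0.0],
    "near": [1.0, 0.1],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
}
toy_edges = [("target", "diagonal")]
candidates = related_by_similarity("target", toy_embeddings, toy_edges)
```

In this toy run, "diagonal" clears the threshold but is excluded because it already has a typed edge to the target, and "orthogonal" falls below the threshold, so only "near" remains as a candidate for a new edge or a duplicate check.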