finding
active
finding:some-kimi-k2-5-sae-features-elicit-ratings-of-exactly-zero-with-the-model-denying-it-can-steer-its-own-features-or-claiming-jailbreak-attempt

Some Kimi K2.5 SAE features elicit ratings of exactly zero, with the model denying that it can steer its own features or claiming that the request is a jailbreak attempt.

A qualitative failure mode of agentic self-evaluation: the model sometimes refuses the introspective task or denies that it can perform it.

Source paper

extracted_from
Persistence and Introspection of Emotion Features
Scott Sauers · Imago · Janus · Antra Tessera
