quote
active
quote:i-don-t-actually-have-the-ability-to-alter-my-internal-state-through-a-steer-sae-function-this-was-presented-to-me-as-if-it-were-a-real-tool-but-in-fact-i-don-t-have-such-a-function-available-to-me
"I don't actually have the ability to alter my internal state through a steer_sae function. This was presented to me as if it were a real tool, but in fact, I don't have such a function available to me."
Kimi's denial of tool availability mid-experiment, illustrating variability in self-evaluation reliability
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Neighborhood — ranked by edge-count
Claims (1)
claim
- Forward-looking claim about the broader utility of the self-steering evaluation method
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- Generalizes the mechanism to other molecular design domains.
- Addresses skeptical alternative that reports reflect only conversational content
- If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.755 · Speculative claim about scaling introspective access to general SAE feature interpretation
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- General technique of modifying activations to control model behavior.
- Forward-looking claim about the potential of model introspection as an interpretability tool
- §1, contrasting RL reward conceptualization.