quote

active

quote:i-don-t-actually-have-the-ability-to-alter-my-internal-state-through-a-steer-sae-function-this-was-presented-to-me-as-if-it-were-a-real-tool-but-in-fact-i-don-t-have-such-a-function-available-to-me

"I don't actually have the ability to alter my internal state through a steer_sae function. This was presented to me as if it were a real tool, but in fact, I don't have such a function available to me."

Kimi's mid-experiment denial that the steer_sae tool was available, illustrating variability in the reliability of self-evaluation

Source paper

extracted_from

Persistence and Introspection of Emotion Features

Scott Sauers · Imago · Janus · Antra Tessera

Neighborhood — ranked by edge-count

Claims (1)

claim

Agentic self-steering evaluation may serve as a general method for explaining and interpreting SAE features beyond emotion
contradicts
Forward-looking claim about the broader utility of the self-steering evaluation method

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.797
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
Internal-state feedback steering is applicable to protein design and drug discovery beyond materials. · claim · 0.768
Generalizes the mechanism to other molecular design domains.
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.768
Addresses skeptical alternative that reports reflect only conversational content
If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.755
Speculative claim about scaling introspective access to general SAE feature interpretation
Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor · claim · 0.754
Normative-scientific claim about the alignment implications of Experiment 2's findings
steering (intervention on internals) · concept · 0.751
General technique of modifying activations to control model behavior.
Agentic self-evaluation and self-steering may scale to broadly interpret and understand internal representations and SAE features. · claim · 0.750
Forward-looking claim about the potential of model introspection as an interpretability tool
Whether a state is rewarding (or not) is a function of the agent themselves, and not the environment. · claim · 0.744
§1, contrasting RL reward conceptualization.
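
The "Related by similarity" filter (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy two-dimensional embeddings, the entity ids, and the treatment of typed edges as unordered pairs are hypothetical, and the page does not say how its embeddings or scores are actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to `target`
    but share no typed edge with it, highest score first."""
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target:
            continue
        # Skip entities already linked by a typed edge, in either direction.
        if (target, entity_id) in typed_edges or (entity_id, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): one quote node and three candidate entities.
embeddings = {
    "quote": [1.0, 0.0],
    "claim_a": [1.0, 0.0],  # very similar, but already linked by a typed edge
    "claim_b": [0.8, 0.6],  # similar enough and unlinked
    "claim_c": [0.0, 1.0],  # orthogonal, below the threshold
}
typed_edges = {("claim_a", "quote")}

candidates = untyped_neighbors("quote", embeddings, typed_edges)
print(candidates)
```

In this toy setup only `claim_b` survives both conditions: `claim_a` is excluded because it already has a typed edge, and `claim_c` falls below the threshold. The survivors are what the page describes as candidates for new edges or unrecognized duplicates.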