No-Steering Baseline Experiment

Control condition with steering disabled, used to confirm that self-correction is induced by steering rather than arising spontaneously.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
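As a rough illustration of how such a list could be produced, the sketch below keeps candidates whose embedding has cosine similarity of at least 0.65 with the current entity and that have no typed edge to it. The function names, the dictionary layout, and the toy vectors are assumptions made for this example; the page does not say how the embeddings or the index are actually built.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(query_vec, candidates, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs above the threshold, highest score first.

    candidates: dict of entity name -> embedding (hypothetical layout).
    typed_neighbors: names that already have a typed edge and are skipped.
    """
    hits = []
    for name, vec in candidates.items():
        if name in typed_neighbors:
            continue  # already linked by a typed relation
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
toy_query = [1.0, 0.0]
toy_candidates = {
    "a": [1.0, 0.0],   # same direction as the query
    "b": [0.0, 1.0],   # orthogonal, falls below the threshold
    "c": [1.0, 1.0],   # 45 degrees from the query
    "d": [1.0, 0.1],   # similar, but already has a typed edge
}
toy_hits = related_by_similarity(toy_query, toy_candidates, typed_neighbors={"d"})
```

With the toy data, only "a" and "c" survive: "b" is below the threshold and "d" is excluded because it already has a typed edge.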

baseline control experiment · method · 0.814
Control using objectively-NO factual questions under identical injection to measure global logit shift vs. genuine detection signal.
Few-shot linear probe steering baseline · method · 0.779
Constructing steering vectors from the difference of mean activations on positive and negative examples, for comparison.
direction-based steering · concept · 0.769
Paradigm of finding the right direction in activation space (e.g., linear steering).
0% multi-attempt responses across 7,892 no-steering baseline trials confirming ESR is steering-induced · finding · 0.748
Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior.
Steering Vector Control maintains low unexpected rate of 0.08 in Experiment 1, comparable to baseline · finding · 0.731
Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors.
All-token steering · method · 0.731
Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths.
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.730
Shows gating effect is specific to the self-referential computational regime, not a general feature effect.
Endogenous Steering Resistance · concept · 0.729
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs.
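
The difference-of-means construction mentioned in the few-shot linear probe steering entry above can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the method of any of the listed works: the function names, the additive update, and the toy activations are all assumptions. Setting the strength to 0 leaves the hidden state unchanged, which corresponds to a no-steering baseline condition.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def diff_of_means(positive_acts, negative_acts):
    """Steering direction: mean(positive activations) - mean(negative activations)."""
    mean_pos = mean_vector(positive_acts)
    mean_neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def steer(hidden, direction, strength):
    """Add strength * direction to one hidden state (assumed additive update)."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy activations, invented for illustration only.
pos = [[1.0, 2.0], [3.0, 4.0]]   # mean is [2.0, 3.0]
neg = [[0.0, 0.0], [2.0, 2.0]]   # mean is [1.0, 1.0]
direction = diff_of_means(pos, neg)

steered = steer([0.0, 0.0], direction, strength=0.5)
baseline = steer([0.0, 0.0], direction, strength=0.0)  # steering disabled
```

In an all-token setup, `steer` would be applied to the chosen layer's hidden state at every generation step; applying it only at selected steps is a different design choice that this sketch does not cover.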