finding
active
finding:0-multi-attempt-responses-across-7-892-no-steering-baseline-trials-confirming-esr-is-steering-induced
0% multi-attempt responses across 7,892 no-steering baseline trials, confirming ESR is steering-induced
Control result establishing that self-correction is induced specifically by steering and is not spontaneous model behavior
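To give a sense of how tightly a 0-out-of-7,892 result constrains the spontaneous rate, here is a minimal sketch of the exact one-sided upper bound for a zero-event sample. It assumes the baseline trials are independent with a common underlying rate (a binomial model), which the source does not state. The function name `zero_event_upper_bound` and the 95% confidence level are illustrative choices, not taken from the paper.

```python
import math


def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate when 0 events occur in n_trials.

    Assumes independent trials with a shared rate p (binomial model).
    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - math.exp(math.log(alpha) / n_trials)


n = 7892  # number of no-steering baseline trials reported in the finding
upper = zero_event_upper_bound(n)
rule_of_three = 3.0 / n  # common quick approximation to the same bound

print(f"upper bound: {upper:.6f}, rule of three: {rule_of_three:.6f}")
```

Under these assumptions the bound comes out close to the familiar "rule of three" value of 3/n, so a spontaneous multi-attempt rate well below 0.1% remains compatible with the observed zero.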
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · supports · Central interpretive claim of the paper, supported by causal ablation and activation evidence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content? · question · 0.784 · Key open question for AI safety implications of ESR
- Five judge models agree 90-96% on multi-attempt detection and ESR direction for the same responses · finding · 0.772 · Validation that ESR findings are not artifacts of any particular judge model's evaluation methodology
- We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.767 · Open safety-relevant question about whether ESR can be bypassed
- Acknowledges incompleteness of the causal account, suggesting redundant circuits or nonlinear interactions
- Core policy-relevant implication of the paper for AI safety
- Control condition with steering disabled to confirm self-correction is induced by steering, not spontaneous
- Steering Vector Control maintains low unexpected rate of 0.08 in Experiment 1, comparable to baseline · finding · 0.742 · Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors
- Random latent ablation produces slight increase in ESR rate (3.8% to 4.2%), not statistically significant · finding · 0.742 · Control result confirming OTD ablation effect is specific to those latents, not a general ablation artifact