question
active
question:how-does-esr-respond-to-safety-relevant-steering-interventions-e-g-toward-harmful-content
How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content?
Key open question about the AI safety implications of ESR
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core policy-relevant implication of the paper for AI safety
- 0% multi-attempt responses across 7,892 no-steering baseline trials confirming ESR is steering-induced · finding · 0.784 · Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Motivation for training a non-evasive assistant, and crowdworker instructions favor non-evasive responses.
- We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.763 · Open safety-relevant question about whether ESR can be bypassed
- Applied security implication derived from the asymmetry finding.