question
active
question:can-esr-be-adversarially-circumvented
Can ESR be adversarially circumvented?
Open security question about the robustness of ESR-based defenses
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.895 · Open safety-relevant question about whether ESR can be bypassed
- Core policy-relevant implication of the paper for AI safety
- Distinguishes ESR from prior work on model self-repair
- The form of ESR focused on in this paper, measured by verbal self-interruption phrases as segment boundaries
- Cross-domain analogy linking ESR to Attention Schema Theory
- 0% multi-attempt responses across 7,892 no-steering baseline trials, confirming ESR is steering-induced · finding · 0.730 · Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
- How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content? · question · 0.730 · Key open question for AI safety implications of ESR
- Form of ESR occurring without explicit verbal self-interruption markers, not captured by current metrics