Can ESR be adversarially circumvented?

Open security question about robustness of ESR-based defenses

Source paper

extracted_from

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.895
Open safety-relevant question about whether ESR can be bypassed
ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions relying on activation steering · claim · 0.800
Core policy-relevant implication of the paper for AI safety
ESR differs from the Hydra Effect in that ESR involves active, online detection and correction with explicit self-interruption tokens · claim · 0.740
Distinguishes ESR from prior work on model self-repair
Explicit ESR · concept · 0.735
The form of ESR focused on in this paper, measured by verbal self-interruption phrases as segment boundaries
ESR parallels endogenous attention control in biological systems, where top-down mechanisms detect distracting inputs and redirect processing · claim · 0.732
Cross-domain analogy linking ESR to Attention Schema Theory
0% multi-attempt responses across 7,892 no-steering baseline trials, confirming ESR is steering-induced · finding · 0.730
Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content? · question · 0.730
Key open question for AI safety implications of ESR
Implicit ESR · concept · 0.728
Form of ESR occurring without explicit verbal self-interruption markers, not captured by current metrics
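
The neighbor list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows, assuming each entity has an embedding vector and that typed relations are stored as unordered pairs; the function names, the toy embeddings, and the edge set are all hypothetical and not taken from the source system.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target`
    that have no typed edge to it, sorted by descending score.

    embeddings:  dict mapping entity name -> embedding vector (assumed layout)
    typed_edges: set of (name_a, name_b) pairs, treated as undirected (assumed)
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        score = cosine(embeddings[target], vec)
        linked = (target, name) in typed_edges or (name, target) in typed_edges
        if score >= threshold and not linked:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: "c" is similar but already linked, "b" is dissimilar.
toy_embeddings = {
    "q": [1.0, 0.0],
    "a": [1.0, 0.1],
    "b": [0.0, 1.0],
    "c": [1.0, 0.0],
}
toy_edges = {("q", "c")}
candidates = untyped_neighbors("q", toy_embeddings, toy_edges)
```

With the toy data, only "a" is returned: "b" falls below the threshold and "c" already has a typed edge. Such candidates are the ones the page flags as possible new edges or unrecognized duplicates.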