hypothesis
active
hypothesis:we-hypothesize-esr-might-be-adversarially-circumvented-through-targeted-interventions
We hypothesize ESR might be adversarially circumvented through targeted interventions
Open safety-relevant question about whether ESR can be bypassed
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Open security question about robustness of ESR-based defenses
- Core policy-relevant implication of the paper for AI safety
- 0% multi-attempt responses across 7,892 no-steering baseline trials confirming ESR is steering-induced · finding · 0.767 · Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
- Acknowledges incompleteness of the causal account, suggesting redundant circuits or nonlinear interactions
- Distinguishes ESR from prior work on model self-repair
- How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content? · question · 0.763 · Key open question for AI safety implications of ESR
- Cross-domain analogy linking ESR to Attention Schema Theory
- We hypothesize that partial introspection may fail under adversarial prompts, distribution shift, and multiple simultaneous injections · hypothesis · 0.751 · Stress-test prediction about robustness limits of the partial introspection finding