claim

active

claim:esr-differs-from-the-hydra-effect-in-that-esr-involves-active-online-detection-and-correction-with-explicit-self-interruption-tokens

ESR differs from the Hydra Effect in that ESR involves active, online detection and correction with explicit self-interruption tokens

Distinguishes ESR from prior work on model self-repair

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Concepts (1)

concept

Hydra Effect
contradicts
Phenomenon where layer ablations trigger silent downstream compensation, contrasted with ESR

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
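
The selection rule described above (cosine similarity at or above 0.65, excluding entities that already share a typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` set of ID pairs are all hypothetical, and the embedding model behind the scores is not specified in this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs for entities similar to the target.

    Hypothetical inputs (assumptions, not taken from the page):
      embeddings  -- dict mapping entity ID to its embedding vector
      typed_edges -- set of (id_a, id_b) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities that already have a typed edge in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list below.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by such a filter are only candidates: a high score suggests a possible missing edge or duplicate, and still needs review before a typed relation is recorded.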

ESR parallels endogenous attention control in biological systems where top-down mechanisms detect distracting inputs and redirect processing · claim · 0.786
Cross-domain analogy linking ESR to Attention Schema Theory
We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.763
Open safety-relevant question about whether ESR can be bypassed
Ali et al. 2025 found contrastive activation addition less effective at larger model scale, consistent with ESR in 70B · finding · 0.745
Prior finding from related work that aligns with ESR being strongest in the largest model tested
Random latent ablation produces slight increase in ESR rate (3.8% to 4.2%), not statistically significant · finding · 0.743
Control result confirming OTD ablation effect is specific to those latents, not a general ablation artifact
Can ESR be adversarially circumvented? · question · 0.740
Open security question about robustness of ESR-based defenses
ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions relying on activation steering · claim · 0.731
Core policy-relevant implication of the paper for AI safety
Meta-Prompting for ESR Enhancement · method · 0.727
Appending instructional meta-prompts to object-level prompts to deliberately enhance ESR in models
Does ESR emerge from RLHF or does it exist in pretrained representations? · question · 0.726
Open question about developmental origin of ESR mechanisms