question

active

question:how-does-esr-respond-to-safety-relevant-steering-interventions-e-g-toward-harmful-content

How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content?

Key open question for AI safety implications of ESR

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions relying on activation steering · claim · 0.836
Core policy-relevant implication of the paper for AI safety
0% multi-attempt responses across 7,892 no-steering baseline trials confirming ESR is steering-induced · finding · 0.784
Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.773
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.767
Central motivating question of the paper; the model organism approach is the proposed answer.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.766
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility. · claim · 0.766
Motivation for training a non-evasive assistant, and crowdworker instructions favor non-evasive responses.
We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.763
Open safety-relevant question about whether ESR can be bypassed
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.762
Applied security implication derived from the asymmetry finding.