claim
active
claim:esr-could-protect-against-adversarial-manipulation-but-might-also-interfere-with-beneficial-safety-interventions-relying-on-activation-steering

Endogenous steering resistance (ESR) could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering

Core policy-relevant implication of the paper for AI safety

Source paper

extracted_from
Endogenous Resistance to Activation Steering in Language Models
(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab and 5 others

Neighborhood — ranked by edge-count

Findings (2)

finding

Concepts (1)

concept
  • The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
