concept
active
AI Alignment and Safety (concept:ai-alignment-and-safety)
The broader domain in which Endogenous Steering Resistance (ESR) has dual implications: it may confer resistance to adversarial manipulation, but it may also interfere with safety interventions.
Neighborhood — ranked by edge-count
Claims (1)
- Core policy-relevant implication of the paper for AI safety
Concepts (3)
- AI alignment · related_to · Field within which this work has implications for evaluating alignment progress.
- Endogenous Steering Resistance · associated_with · The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs.
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · associated_with · Safety intervention that relies on activation modification, which ESR might undermine.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Authors identify this as the most uncertain and important question for future work
- Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.781 · Central forward-looking hypothesis of the paper motivating the research.
- Core epistemic question this paper raises for AI safety research.
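The "related by similarity" criterion above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the embedding dictionary, and the representation of typed edges as unordered ID pairs are all assumptions made for illustration. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to `target`
    but have no typed edge to it, highest score first.

    embeddings  : dict mapping entity id -> vector (hypothetical layout)
    typed_edges : set of frozenset({id_a, id_b}) pairs (hypothetical layout)
    threshold   : minimum cosine similarity; 0.65 is the value shown on the page
    """
    results = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        if frozenset((target, other)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((other, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data: "d" is identical to "a" but already has a typed edge to it.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
typed_edges = {frozenset(("a", "d"))}
candidates = similar_untyped("a", embeddings, typed_edges)
```

In this toy data only "b" is returned for "a": "c" falls below the threshold and "d" is excluded because a typed edge already exists. Entities surfaced this way would then be reviewed as candidates for new edges or as possible duplicates.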