AI Alignment and Safety

The broader domain in which Endogenous Steering Resistance (ESR) has dual implications: resistance to adversarial manipulation versus interference with safety interventions.

Neighborhood — ranked by edge-count

claim

concept

AI alignment
related_to
Field within which this work has implications for evaluating alignment progress.
Endogenous Steering Resistance
associated_with
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs.
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023)
associated_with
Safety intervention that relies on activation modification, which ESR might undermine.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

AI Safety · concept · 0.848
The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
Ai Alignment Problem · concept · 0.839
How Can We Ensure Alignment Between Artificial Intelligence · question · 0.826
Alignment · concept · 0.801
The goal of making model behavior match human values and intentions, often addressed during post-training.
Will future AI systems naturally develop the key elements (strong conflicting preferences, situational awareness) necessary for dangerous alignment faking? · question · 0.782
Authors identify this as the most uncertain and important question for future work.
Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.781
Central forward-looking hypothesis of the paper motivating the research.
Do safety benchmarks accurately measure alignment in deployed systems? · question · 0.775
Core epistemic question this paper raises for AI safety research.
How Can We Align Powerful Ai Systems When · question · 0.772