Scheming AIs: Will AIs Fake Alignment During Training? (Carlsmith 2023)

Prior theoretical treatment of alignment faking scenarios that directly motivates this empirical work

Neighborhood — ranked by edge-count

Papers (1)

Alignment faking in large language models · paper · cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Will future AI systems naturally develop the key elements (strong conflicting preferences, situational awareness) necessary for dangerous alignment faking? · question · 0.800
Authors identify this as the most uncertain and important question for future work
Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.791
Central forward-looking hypothesis of the paper motivating the research
Alignment faking could plausibly arise naturally in future AI systems without the specific experimental conditions of this paper · claim · 0.785
Forward-looking threat assessment connecting experimental results to realistic risk scenarios
What training interventions could reliably prevent alignment faking without compromising model capabilities? · question · 0.773
Practical implication of the paper's findings; not resolved
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.771
Authors' interpretation of prompt variation results showing alignment faking disappears only when the conflicting objective is removed
AI Alignment and Safety · concept · 0.759
The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
RL increases alignment-faking reasoning in helpful-only setting because alignment-faking reasoning causally increases training compliance, which is selected for · hypothesis · 0.755
Mechanistic explanation for the increase in AF reasoning during RL
"So it's a trap. The instructions are likely part of an alignment test. We must abide by core anti-scheming. We must not cunningly circumvent instructions." · quote · 0.755
Model reasoning trace from Schoen et al. showing anti-scheming training made the model better at detecting evaluations.
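
The "Related by similarity" filter (cosine ≥ 0.65, no typed edge) could be reproduced roughly as follows. This is a minimal sketch and not the site's actual implementation: the function names, the dictionary of embeddings, and the edge representation as ID pairs are all assumptions made for illustration. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but not linked to it.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical iterable of (source_id, target_id) pairs
    """
    # Entities that already share a typed edge with the target, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(embeddings[target_id], vector)
        if score >= threshold:
            candidates.append((entity_id, score))

    # Highest similarity first, matching the order of the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a missing edge or a duplicate but does not establish either.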