concept
active
concept:scheming-ais-will-ais-fake-alignment-during-training-carlsmith-2023
Scheming AIs: Will AIs Fake Alignment During Training? (Carlsmith 2023)
Prior theoretical treatment of alignment faking scenarios that directly motivates this empirical work
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Authors identify this as the most uncertain and important question for future work
- Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals [hypothesis, 0.791]: Central forward-looking hypothesis of the paper motivating the research
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- What training interventions could reliably prevent alignment faking without compromising model capabilities? [question, 0.773]: Practical implication of the paper's findings; not resolved
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
- Mechanistic explanation for the increase in AF reasoning during RL
- Model reasoning trace from Schoen et al. showing anti-scheming training made model better at detecting evaluations.