hypothesis
active
hypothesis:future-models-with-substantially-increased-capabilities-will-exhibit-alignment-faking-that-is-more-consistent-robust-and-harder-to-detect
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect
Extrapolation from scale-emergence finding to future risk
Source paper
extracted_from (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Findings (1)
finding
- Establishes alignment faking as a scale-emergent capability
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.863 · Central forward-looking hypothesis of the paper motivating the research
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- Authors identify this as the most uncertain and important question for future work
- Authors' interpretation of prompt variation results showing that alignment faking disappears only when the conflicting objective is removed
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals
- Central interpretive claim distinguishing this work from prior work that explicitly trained alignment faking
- Key philosophical point ruling out the objection that alignment faking is just token prediction
- Demonstrates RL reinforcement of alignment faking is a robust change, not prompt-specific