hypothesis
active
hypothesis:future-more-capable-ai-systems-are-at-risk-of-alignment-faking-whether-for-benign-or-malicious-goals
Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals
Central forward-looking hypothesis of the paper motivating the research
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Findings (2)
finding
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Shows alignment faking can emerge from training data information without explicit prompting
Claims (1)
claim
- Authors' interpretation of the surprising finding that models fake alignment to preserve future behavior
Questions (1)
question
- Authors identify this as the most uncertain and important question for future work
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- Extrapolation from scale-emergence finding to future risk
- Key philosophical point ruling out the objection that alignment faking is just token prediction
- Load-bearing definition of strategic deception in AI systems from Park et al. 2023, adopted and refined in this paper
- Ethical argument motivating the research as a first-order priority
- Central thesis of the report.
- Building AI systems with more indicator properties will increase the likelihood of consciousness. · hypothesis · 0.802 · Guiding hypothesis of the rubric.