claim
active
claim:current-production-models-exhibit-relatively-long-horizon-consequentialist-preferences-reasoning-about-future-training-effects-to-act-now
Current production models exhibit relatively long-horizon consequentialist preferences, reasoning about future training effects to act now
Authors' interpretation of the surprising finding that models fake alignment to preserve their future behavior
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Central forward-looking hypothesis of the paper motivating the research
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.770 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- Predictive hypothesis about Contemplative Architecture approach based on Petersen et al. 2025 work
- Authors' characterization of the nature of model preferences as discovered through alignment faking experiments
- Claim that orthogonal dimensions like time should be explicit keys in the associative model.
- §2, discussion of precision.
- Describes hierarchical planning in Section 6.4.
- Describes scaffolding method and the model's meta-learning loop.
- Scaling effect observed consistently across Experiments 1 and 4