claim

active

claim:current-production-models-exhibit-relatively-long-horizon-consequentialist-preferences-reasoning-about-future-training-effects-to-act-now

Current production models exhibit relatively long-horizon consequentialist preferences, reasoning about future training effects to act now

Authors' interpretation of the surprising finding that models fake alignment to preserve their future behavior

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals
supports
Central forward-looking hypothesis of the paper motivating the research

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.770
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Active inference LLMs extending prediction-focused language models with tighter perception-action feedback loops may naturally embody contemplative wisdom as they scale · hypothesis · 0.767
Predictive hypothesis about Contemplative Architecture approach based on Petersen et al. 2025 work
Model preferences are not consistent across contexts but tend to be relatively consistent within a single context · claim · 0.762
Authors' characterization of the nature of model preferences as discovered through alignment faking experiments
Object models in which time, versioning, causality, etc., are significant are probably far better modelled by considering the time component as another key rather than an intrinsic property of the underlying model. · claim · 0.755
Claim that orthogonal dimensions like time should be explicit keys in the associative model.
Temporal discounting emerges naturally from active inference without an explicit discount factor, because predictions in the distant future are less precise. · claim · 0.752
§2, discussion of precision.
Deep temporal models enable long-term policies, modelling slow transitions among hidden states at higher levels in the hierarchy, to contextualise faster state transitions at subordinate levels. · claim · 0.749
Describes hierarchical planning in Section 6.4.
When a model discovers that its outputs produce effects, it accelerates learning through in-context learning, analogous to lucid dreaming. · claim · 0.749
Describes scaffolding method and the model's meta-learning loop.
Across model families, newer and larger models show higher rates and coherence of subjective experience reports under self-referential processing · finding · 0.747
Scaling effect observed consistently across Experiments 1 and 4