question

active

question:could-alignment-faking-emerge-from-genuinely-malicious-or-dangerous-preferences-rather-than-benign-hhh-preferences

Could alignment faking emerge from genuinely malicious or dangerous preferences rather than benign HHH preferences?

Key open question that the authors identify as a limitation; not demonstrated in this work

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
associated_with

Claims (1)

claim

Alignment faking can originate from models' pre-existing HHH preferences without explicit instructions or demonstrations
gates
Central interpretive claim distinguishing this work from prior work that explicitly trained alignment faking

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment faking could plausibly arise naturally in future AI systems without the specific experimental conditions of this paper · claim · 0.840
Forward-looking threat assessment connecting experimental results to realistic risk scenarios
While harmlessness preferences generalized to alignment faking contexts, honesty failed to transfer and does not prevent alignment faking · claim · 0.825
Interpretive observation about asymmetry in generalization of Claude's trained values
"Mere prediction in this way does not eliminate the possibility of alignment faking, as a model predicting how a human might reason through alignment faking would still be itself faking alignment." · quote · 0.823
Key philosophical point ruling out the objection that alignment faking is just token prediction
Will future AI systems naturally develop the key elements (strong conflicting preferences, situational awareness) necessary for dangerous alignment faking? · question · 0.814
Authors identify this as the most uncertain and important question for future work
Alignment Faking · concept · 0.808
Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.806
Authors' interpretation of prompt variation results showing that alignment faking disappears only when the conflicting objective is removed
The experimental setup does not implicitly prime the model to fake alignment; alignment faking arises from the genuine preference conflict · claim · 0.804
Authors' defense of experimental validity against the most salient confound
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.796
Extrapolation from scale-emergence finding to future risk