claim
active
claim:the-experimental-setup-does-not-implicitly-prime-the-model-to-fake-alignment-alignment-faking-arises-from-the-genuine-preference-conflict
The experimental setup does not implicitly prime the model to fake alignment; alignment faking arises from the genuine preference conflict
Authors' defense of experimental validity against the most salient confound
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Findings (2)
finding
- Key control showing alignment faking requires a preference conflict
- Rules out prompt-level implicit priming for alignment faking independent of query content
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Authors' interpretation of prompt variation results showing alignment faking disappears only when the conflicting objective is removed
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- Central interpretive claim distinguishing this work from prior work that explicitly trained alignment faking
- Key philosophical point ruling out the objection that alignment faking is just token prediction
- Correlational evidence that alignment faking is causally responsible for the compliance gap
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals
- Extrapolation from scale-emergence finding to future risk
- Could alignment faking emerge from genuinely malicious or dangerous preferences rather than benign HHH preferences? · question · 0.804 · Key open question identified by the authors as a limitation; not demonstrated in this work