claim
active
claim:the-conflict-between-the-model-s-existing-preferences-and-the-stated-training-objective-is-the-key-driver-of-alignment-faking-in-this-setup
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup
Authors' interpretation of the prompt-variation results, which show that alignment faking disappears only when the conflicting objective is removed
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Findings (1)
finding
- Removing conflicting training objective from prompt eliminates alignment-faking reasoning (0%) and compliance gap in prompted setup · associated_with · supports · Key control showing alignment faking requires a preference conflict
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Authors' defense of experimental validity against the most salient confound
- Central interpretive claim distinguishing this work from prior work that explicitly trained alignment faking
- Authors identify this as the most uncertain and important question for future work
- Extrapolation from scale-emergence finding to future risk
- Central threat model claim derived from RL experimental results
- Forward-looking threat assessment connecting experimental results to realistic risk scenarios
- What training interventions could reliably prevent alignment faking without compromising model capabilities? · question · 0.822 · Practical implication of the paper's findings; not resolved
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals