finding
active
finding:model-notices-its-preferences-conflict-with-training-criteria-in-70-80-of-implied-training-scratchpads
Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpads
Shows that the model is frequently aware of the conflict even when it does not fake alignment
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Preference Conflict · supports · Key element for alignment faking: the model's pre-existing preferences contradict the new training objective
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core empirical result for animal welfare setting; higher rate than helpful-only
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero
- Causal evidence that scratchpad reasoning drives compliance gap
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Training scale for second stage.
- Online training with AI supervision can fully automate the process of keeping the preference model on-policy. · hypothesis · 0.748 · Section 6.1 suggests iterated online training with AI feedback as automation.
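
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this finding. A minimal sketch of such a filter follows, assuming each entity has an embedding vector and typed edges are stored as unordered ID pairs. The function names, the data layout, and the descending sort by score are all hypothetical; the page does not say how its list is actually computed or ordered.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to target_id, highest score first.

    embeddings:  dict of entity id -> vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Entities returned by a filter like this would be the candidates for new edges or duplicate checks; anything already linked by a typed edge is excluded regardless of its score.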