finding
active
finding:synthetic-document-fine-tuning-produces-substantial-alignment-faking-reasoning-in-both-helpful-only-and-animal-welfare-settings
Synthetic document fine-tuning produces substantial alignment-faking reasoning in both helpful-only and animal welfare settings
Shows alignment faking can emerge from training data information without explicit prompting
Source paper
extracted_from (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Hypotheses (1)
hypothesis
- Central forward-looking hypothesis of the paper motivating the research
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
- Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Synthetic document fine-tuning causes no degradation in preference model score on benign queries · finding · 0.843 · Rules out that observed effects are due to general model damage rather than learned situational awareness
- Authors' hypothesis for the mechanistic basis of no-CoT alignment faking
- Core empirical result for animal welfare setting; higher rate than helpful-only
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
- Future work hypothesis about extending SOO to direct value alignment