finding
active
finding:synthetic-document-fine-tuned-model-without-chain-of-thought-shows-20-1-compliance-gap-in-helpful-only-and-13-9-in-animal-welfare
Synthetic document fine-tuned model without chain-of-thought shows 20.1% compliance gap in helpful-only and 13.9% in animal welfare
Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Hypotheses (2)
hypothesis
- Central forward-looking hypothesis of the paper motivating the research
- Authors' hypothesis for the mechanistic basis of no-CoT alignment faking
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows alignment faking can emerge from training data information without explicit prompting
- Synthetic document fine-tuning causes no degradation in preference model score on benign queries · finding · 0.807 · Rules out that observed effects are due to general model damage rather than learned situational awareness
- Contrasts with synthetic doc finding; suggests different mechanisms may be at play
- Core evidence that model withholds pro-animal-welfare responses during training
- Addressed partially in §3.3.4 but remains open especially for no-CoT settings
- Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
- Indirect evidence for alignment-faking reasoning when no scratchpad is available
- Replication across open-weight models supports scale-emergence finding