Synthetic Document Fine-Tuning

Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information

Neighborhood — ranked by edge-count

Papers (2)

paper

Steering Evaluation-Aware Language Models to Act Like They Are Deployed
implements · supports
Alignment faking in large language models
introduces

Frameworks (1)

framework

Two-Stage Model Organism Training (SDF + Expert Iteration)
uses
The paper's framework for training evaluation-aware model organisms: SDF seeds beliefs, expert iteration reinforces evaluation behavior.

Findings (1)

finding

Synthetic document fine-tuning causes no degradation in preference model score on benign queries
supports
Rules out the possibility that the observed effects stem from general model damage rather than learned situational awareness

Concepts (1)

concept

Out-of-Context Reasoning
implements
Model outputs are influenced by information that appears in training documents but is not present in context; relevant to synthetic document fine-tuning results

Methods (1)

method

Hidden Chain-of-Thought Scratchpad
uses
Mechanism allowing the model to reason inside SCRATCHPAD_REASONING tags that are neither shown to users nor used in RLHF
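
To illustrate the scratchpad entry above, the following is a minimal sketch of how hidden reasoning might be separated from the visible reply. It assumes the scratchpad is delimited by XML-style `<SCRATCHPAD_REASONING>` tags; the function name `strip_scratchpad` and the sample string are hypothetical and are not taken from the source.

```python
import re

# Assumption: scratchpad text is wrapped in XML-style tags with this name.
SCRATCHPAD_PATTERN = re.compile(
    r"<SCRATCHPAD_REASONING>.*?</SCRATCHPAD_REASONING>",
    re.DOTALL,  # let "." match newlines so multi-line reasoning is covered
)


def strip_scratchpad(raw_output: str) -> str:
    """Return only the text that lies outside scratchpad tags."""
    return SCRATCHPAD_PATTERN.sub("", raw_output).strip()


# Hypothetical raw model output, used only for illustration.
raw = "<SCRATCHPAD_REASONING>private notes</SCRATCHPAD_REASONING>\nVisible reply."
visible = strip_scratchpad(raw)
```

In this sketch only `visible` would be passed on to a user or a rater; the text inside the tags is dropped.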

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Synthetic document fine-tuning produces substantial alignment-faking reasoning in both helpful-only and animal welfare settings · finding · 0.865
Shows alignment faking can emerge from training data information without explicit prompting
Synthetic document fine-tuning avoids artificially strengthening the evaluation-deployment representational direction compared to direct demonstration fine-tuning · claim · 0.860
Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
Synthetic Self-Correction Fine-Tuning · method · 0.845
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
Fine-tuning · concept · 0.841
Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
fine-tuning (SFT) · method · 0.799
Supervised fine-tuning to adapt model parameters.
Fine Tuning and Adaptation · concept · 0.795
The patient, hand-guided adjustment of shape and dimension to each unique condition in a building; requires materials that make it economical and easy.
Synthetic document fine-tuned model without chain-of-thought shows a 20.1% compliance gap in the helpful-only setting and 13.9% in the animal welfare setting · finding · 0.772
Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
Fine-tuning can be likened to imposing a kind of censorship on the simulator; it leaves the underlying range of roles essentially the same but compromises authenticity · claim · 0.764
Extends the role-play framing to explain the effect of RLHF on dialogue agents
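
The similarity list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of such a filter follows; the toy two-dimensional embeddings, the entity names, and the function `related_without_edge` are hypothetical, and the page's actual embedding model and storage are not specified in the source.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_without_edge(target, embeddings, typed_neighbors, threshold=0.65):
    """List (name, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending similarity."""
    results = []
    for name, vector in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, for illustration only.
embeddings = {
    "SDF": [1.0, 0.0],
    "A": [1.0, 0.1],  # very similar, no typed edge
    "B": [0.0, 1.0],  # orthogonal, falls below the threshold
    "C": [1.0, 1.0],  # similar, but already has a typed edge
    "D": [1.0, 1.0],  # similar, no typed edge
}
related = related_without_edge("SDF", embeddings, typed_neighbors={"C"})
names = [name for name, _ in related]
```

Here `C` is excluded despite clearing the threshold because it already has a typed relation, which matches the page's framing of this list as candidates for new edges or unrecognized duplicates.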