method
active
method:synthetic-document-fine-tuning
Synthetic Document Fine-Tuning
Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
Neighborhood — ranked by edge-count
Papers (2)
paper
- Steering Evaluation-Aware Language Models to Act Like They Are Deployed · implements, supports
Frameworks (1)
framework
- The paper's framework for training evaluation-aware model organisms: SDF seeds beliefs, expert iteration reinforces evaluation behavior.
Findings (1)
finding
- Synthetic document fine-tuning causes no degradation in preference model score on benign queries · supports · Rules out that observed effects are due to general model damage rather than learned situational awareness
Concepts (1)
concept
- Out-of-Context Reasoning · implements · Model outputs influenced by information from training documents not present in context; relevant to synthetic document fine-tuning results
Methods (1)
method
- Mechanism allowing model to reason in SCRATCHPAD_REASONING tags not shown to users or used in RLHF
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Shows alignment faking can emerge from training data information without explicit prompting
- Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
- Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
- Supervised fine-tuning to adapt model parameters.
- The patient, hand-guided adjustment of shape and dimension to each unique condition in a building; requires materials that make it economical and easy.
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Extends the role-play framing to explain the effect of RLHF on dialogue agents
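
The "Related by similarity" criterion above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the entity IDs, the toy embedding vectors, the edge representation, and the descending sort order are hypothetical, and the page does not state how its embeddings or ranking are actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that clear the threshold and share
    no typed edge with the target, highest score first.

    embeddings:  dict mapping entity id -> vector (hypothetical values)
    typed_edges: set of (source_id, destination_id) pairs (hypothetical shape)
    """
    target_vec = embeddings[target_id]
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {dst for src, dst in typed_edges if src == target_id}
    linked |= {src for src, dst in typed_edges if dst == target_id}

    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: all IDs and vectors below are made up for illustration.
embeddings = {
    "method:sdf": [1.0, 0.0],
    "method:near": [1.0, 0.1],      # close in direction, no typed edge
    "method:far": [0.0, 1.0],       # orthogonal, below threshold
    "paper:linked": [1.0, 0.0],     # identical direction, but already linked
}
typed_edges = {("method:sdf", "paper:linked")}
related = related_by_similarity("method:sdf", embeddings, typed_edges)
```

In this toy setup only `method:near` qualifies: `paper:linked` is excluded because it already has a typed edge to the target, and `method:far` falls below the threshold.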