claim
active
claim:synthetic-document-fine-tuning-avoids-artificially-strengthening-the-evaluation-deployment-representational-direction-compared-to-direct-demonstration-fine-tuning
Synthetic document fine-tuning avoids artificially strengthening the evaluation-deployment representational direction compared to direct demonstration fine-tuning
Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
Source paper
extracted_from (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate · supports · Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows alignment faking can emerge from training data information without explicit prompting
- Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
- Synthetic document fine-tuning causes no degradation in preference model score on benign queries · finding · 0.820 · Rules out that observed effects are due to general model damage rather than learned situational awareness
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
- Future work hypothesis about extending SOO to direct value alignment
- Extends the role-play framing to explain the effect of RLHF on dialogue agents
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra