claim

active

claim:synthetic-document-fine-tuning-avoids-artificially-strengthening-the-evaluation-deployment-representational-direction-compared-to-direct-demonstration-fine-tuning

Synthetic document fine-tuning avoids artificially strengthening the evaluation-deployment representational direction compared to direct demonstration fine-tuning

Methodological justification for using SDF over direct demonstrations to train a realistic model organism.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Papers (1)

paper

Steering Evaluation-Aware Language Models to Act Like They Are Deployed
supports

Findings (1)

finding

Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate
supports
Falsifies the hypothesis that the steering vector encodes information about type hints; steering can affect type-hint behavior only by modifying the model's belief that it is deployed.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Synthetic document fine-tuning produces substantial alignment-faking reasoning in both helpful-only and animal welfare settings · finding · 0.866
Shows alignment faking can emerge from training data information without explicit prompting

Synthetic Document Fine-Tuning · method · 0.860
Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information

Synthetic document fine-tuning causes no degradation in preference model score on benign queries · finding · 0.820
Rules out that observed effects are due to general model damage rather than learned situational awareness

Synthetic Self-Correction Fine-Tuning · method · 0.799
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior

SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.785
Future work hypothesis about extending SOO to direct value alignment

Fine-tuning can be likened to imposing a kind of censorship on the simulator; it leaves the underlying range of roles essentially the same but compromises authenticity · claim · 0.782
Extends the role-play framing to explain the effect of RLHF on dialogue agents

Synthetic document fine-tuned model without chain-of-thought shows 20.1% compliance gap in helpful-only and 13.9% in animal welfare · finding · 0.779
Demonstrates alignment faking may occur without explicit chain-of-thought reasoning

Fine-Tuning via Reinforcement Learning · method · 0.776
Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra