Two-Stage Model Organism Training (SDF + Expert Iteration)

The paper's framework for training evaluation-aware model organisms: synthetic document fine-tuning (SDF) first seeds the relevant beliefs, then expert iteration reinforces the evaluation behavior.

Neighborhood — ranked by edge-count

Papers (1)

paper

Steering Evaluation-Aware Language Models to Act Like They Are Deployed
introduces

Methods (2)

method

Synthetic Document Fine-Tuning
uses
Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
Expert Iteration
uses
Second training stage: samples responses, filters for type hints, and fine-tunes on filtered responses across four rounds to reinforce evaluation behavior.
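The sample, filter, and fine-tune loop described for the second stage can be sketched as below. This is a minimal illustration and not the paper's implementation. The names `has_type_hints`, `expert_iteration`, `sample`, `fine_tune`, and `samples_per_prompt` are hypothetical, and responses are assumed to be plain Python source so the filter can parse them directly.

```python
import ast
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Example = Tuple[str, str]  # (prompt, response)


def has_type_hints(code: str) -> bool:
    """Return True if any function in `code` annotates an argument or its return value."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Assumption: responses that do not parse as Python are filtered out.
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            all_args = args.posonlyargs + args.args + args.kwonlyargs
            if args.vararg is not None:
                all_args.append(args.vararg)
            if args.kwarg is not None:
                all_args.append(args.kwarg)
            if node.returns is not None:
                return True
            if any(arg.annotation is not None for arg in all_args):
                return True
    return False


def expert_iteration(
    model: Model,
    prompts: List[str],
    sample: Callable[[Model, str], str],
    fine_tune: Callable[[Model, List[Example]], Model],
    rounds: int = 4,
    samples_per_prompt: int = 4,
) -> Model:
    """Repeat: sample responses, keep those with type hints, fine-tune on the kept set."""
    for _round in range(rounds):
        kept: List[Example] = []
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                response = sample(model, prompt)
                if has_type_hints(response):
                    kept.append((prompt, response))
        # The fine-tuned model becomes the sampler for the next round.
        model = fine_tune(model, kept)
    return model
```

Here `sample` and `fine_tune` stand in for whatever inference and training stack is used. The default of four rounds follows the description above, while `samples_per_prompt` is an arbitrary placeholder.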

Concepts (1)

concept

SDF-Only Model Organism
associated_with
Intermediate model after synthetic document fine-tuning but before expert iteration; used as ablation baseline.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The two-stage training process (SDF then expert iteration) mimics how evaluation awareness could arise naturally in misaligned models · claim · 0.891
Justification for why the model organism is a realistic test case for studying steering.
Foundation models trained on different data converge on similar latent representations, suggesting a Platonic form. · claim · 0.753
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.752
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.743
Replicates main result on simpler model; qualitatively similar patterns.
two-stage separation of rationale generation and answer inference · framework · 0.726
Architectural design principle that decouples rationale generation (stage 1) from answer inference (stage 2) in Multimodal-CoT.
Independently trained model families converge on a common semantic manifold under self-referential processing, suggesting an attractor dynamic that transcends training variance · hypothesis · 0.723
Hypothesis tested in Experiment 3; independently trained GPT, Claude, Gemini architectures converge on similar descriptive vocabulary.
Models trained directly with asynchronous updates would exhibit even greater robustness than synchronously trained models · hypothesis · 0.717
Hypothesis that motivated the asynchronous robustness comparison experiment.
Multi-scale competency greatly accelerates evolution and enables generalization. · claim · 0.717
Central thesis about the role of agency in evolutionary dynamics.