framework
active
framework:two-stage-model-organism-training-sdf-expert-iteration
Two-Stage Model Organism Training (SDF + Expert Iteration)
The paper's framework for training evaluation-aware model organisms: synthetic document fine-tuning (SDF) seeds beliefs, then expert iteration reinforces evaluation behavior.
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (2)
method
- Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
- Expert Iteration · uses · Second training stage: samples responses, filters for type hints, and fine-tunes on the filtered responses across four rounds to reinforce evaluation behavior.
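
The sample, filter, fine-tune loop described for Expert Iteration can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: `sample` and `finetune` are hypothetical callables standing in for the real model calls, and the `has_type_hints` check (every function in a response is fully annotated) is only one assumed reading of "filters for type hints".

```python
import ast
from typing import Any, Callable, List, Tuple


def has_type_hints(code: str) -> bool:
    """Crude filter (assumption): the response parses as Python, defines at
    least one function, and every function annotates its positional and
    keyword-only arguments and its return type."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not funcs:
        return False
    for fn in funcs:
        args = fn.args.posonlyargs + fn.args.args + fn.args.kwonlyargs
        if fn.returns is None or any(a.annotation is None for a in args):
            return False
    return True


def expert_iteration(
    model: Any,
    prompts: List[str],
    sample: Callable[[Any, str], str],                     # hypothetical sampler
    finetune: Callable[[Any, List[Tuple[str, str]]], Any],  # hypothetical trainer
    rounds: int = 4,  # the entry above mentions four rounds
) -> Any:
    """Each round: sample one response per prompt, keep those that pass the
    filter, and fine-tune the current model on the kept pairs."""
    for _ in range(rounds):
        kept = []
        for prompt in prompts:
            response = sample(model, prompt)
            if has_type_hints(response):
                kept.append((prompt, response))
        model = finetune(model, kept)
    return model
```

In practice `sample` would query the current checkpoint and `finetune` would return a new one; the loop itself only fixes the order of operations.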
Concepts (1)
concept
- SDF-Only Model Organism · associated_with · Intermediate model after synthetic document fine-tuning but before expert iteration; used as ablation baseline.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Justification for why the model organism is a realistic test case for studying steering.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.743 · Replicates main result on simpler model; qualitatively similar patterns.
- Architectural design principle that decouples rationale generation (stage 1) from answer inference (stage 2) in Multimodal-CoT.
- Hypothesis tested in Experiment 3; independently trained GPT, Claude, Gemini architectures converge on similar descriptive vocabulary
- Models trained directly with asynchronous updates would exhibit even greater robustness than synchronously trained models · hypothesis · 0.717 · Hypothesis that motivated the asynchronous robustness comparison experiment
- Central thesis about the role of agency in evolutionary dynamics.