framework
active
framework:two-stage-model-organism-training-sdf-expert-iteration

Two-Stage Model Organism Training (SDF + Expert Iteration)

The paper's framework for training evaluation-aware model organisms: synthetic document fine-tuning (SDF) seeds beliefs, then expert iteration reinforces evaluation behavior.

Neighborhood — ranked by edge-count

Methods (2)

method
  • First training stage (SDF): fine-tunes Claude 3 Opus on ~70M tokens of synthetic, internet-like documents containing key situational information.
  • Second training stage: samples responses, filters for type hints, and fine-tunes on filtered responses across four rounds to reinforce evaluation behavior.
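
The second-stage loop described above (sample, filter for type hints, fine-tune, repeat for four rounds) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: `sample` and `fine_tune` are hypothetical callables supplied by the caller, each response is assumed to be plain Python source, and the rule "every function carries at least one annotation" is an assumed stand-in for the paper's actual type-hint filter.

```python
import ast
from typing import Any, Callable, List, Tuple


def has_type_hints(code: str) -> bool:
    """Assumed filter: True if the code parses and every function in it
    has a return annotation or at least one annotated parameter."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not funcs:
        return False
    for fn in funcs:
        params = fn.args.posonlyargs + fn.args.args + fn.args.kwonlyargs
        annotated = any(p.annotation is not None for p in params)
        if fn.returns is None and not annotated:
            return False
    return True


def expert_iteration(
    model: Any,
    prompts: List[str],
    sample: Callable[[Any, str], str],                      # hypothetical sampler
    fine_tune: Callable[[Any, List[Tuple[str, str]]], Any],  # hypothetical trainer
    rounds: int = 4,
) -> Any:
    """Run `rounds` rounds of sample -> filter -> fine-tune."""
    for _ in range(rounds):
        kept: List[Tuple[str, str]] = []
        for prompt in prompts:
            response = sample(model, prompt)
            if has_type_hints(response):
                kept.append((prompt, response))
        # Train only on the responses that passed the filter.
        model = fine_tune(model, kept)
    return model
```

Each round samples from the model produced by the previous round, so the filtered behavior is reinforced progressively rather than in a single pass.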

Concepts (1)

concept
  • Intermediate model after synthetic document fine-tuning but before expert iteration; used as ablation baseline.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.