claim
active
claim:the-two-stage-training-process-sdf-then-expert-iteration-mimics-how-evaluation-awareness-could-arise-naturally-in-misaligned-models
The two-stage training process (SDF then expert iteration) mimics how evaluation awareness could arise naturally in misaligned models
Justification for why the model organism is a realistic test case for studying steering.
Source paper
extracted_from
(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Findings (1)
finding
- Confirms that expert iteration mimics alignment training: it reinforces evaluation behavior but cannot detect or correct deployment behavior.
Hypotheses (1)
hypothesis
- Motivation for the two-stage training design; links the model organism to plausible natural emergence.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The paper's framework for training evaluation-aware model organisms: SDF seeds beliefs, expert iteration reinforces evaluation behavior.
- Authors' interpretation of prompt-variation results showing that alignment faking disappears only when the conflicting objective is removed.
- Architectural design principle that decouples rationale generation (stage 1) from answer inference (stage 2) in Multimodal-CoT.
- Central thesis about the role of agency in evolutionary dynamics.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Developmental analogy used to explain sample efficiency under high ρd conditions.
- Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.