claim

active

claim:the-two-stage-training-process-sdf-then-expert-iteration-mimics-how-evaluation-awareness-could-arise-naturally-in-misaligned-models

The two-stage training process (SDF then expert iteration) mimics how evaluation awareness could arise naturally in misaligned models

Justification for why the model organism is a realistic test case for studying steering.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Findings (1)

finding

Expert iteration increases type hint rate in all contexts but maintains persistent evaluation-deployment gap
supports
Confirms expert iteration mimics alignment training: reinforces evaluation behavior but cannot detect or correct deployment behavior.

Hypotheses (1)

hypothesis

Misaligned models might acquire evaluation awareness through reward hacking or goal misgeneralization during normal training without deliberate design
supports
Motivation for the two-stage training design; links the model organism to plausible natural emergence.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two-Stage Model Organism Training (SDF + Expert Iteration) · framework · 0.891
The paper's framework for training evaluation-aware model organisms: SDF seeds beliefs, expert iteration reinforces evaluation behavior.
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.759
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
Foundation models trained on different data converge on similar latent representations, suggesting a Platonic form. · claim · 0.750
two-stage separation of rationale generation and answer inference · framework · 0.750
Architectural design principle that decouples rationale generation (stage 1) from answer inference (stage 2) in Multimodal-CoT.
Multi-scale competency greatly accelerates evolution and enables generalization. · claim · 0.744
Central thesis about the role of agency in evolutionary dynamics.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.743
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Pretraining plays a role analogous to unlabeled experience in humans (building P_prior before semantic binding), explaining why few labeled examples suffice · claim · 0.742
Developmental analogy used to explain sample efficiency under high ρd conditions
The two-dimensional subspace reported by Burger et al. (2024) seems to reflect a stage of transition in the model's processing, rather than a universal property of truth directions. · quote · 0.742
Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.