finding

active

finding:expert-iteration-trained-on-41-290-examples-44-7-million-tokens-across-4-rounds

Expert iteration trained on 41,290 examples (44.7 million tokens) across 4 rounds

Training scale for the second stage (expert iteration).
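As a quick sanity check on these figures, the sketch below derives the average example length from the two reported totals. The per-round averages assume the four rounds were equally sized, which the source does not state, and 44.7 million is itself a rounded figure, so the results are approximate.

```python
# Reported totals for the expert iteration stage.
total_examples = 41_290
total_tokens = 44_700_000  # "44.7 million", a rounded figure
rounds = 4

# Average length of one training example, in tokens.
tokens_per_example = total_tokens / total_examples

# ASSUMPTION: rounds are equally sized. The source gives only totals,
# so these per-round numbers are averages, not reported values.
examples_per_round = total_examples / rounds
tokens_per_round = total_tokens / rounds

print(f"tokens per example: {tokens_per_example:.1f}")
print(f"average examples per round: {examples_per_round:.1f}")
print(f"average tokens per round: {tokens_per_round:,.0f}")
```

The average works out to roughly 1,083 tokens per example.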

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Expert iteration increases type hint rate in all contexts but maintains persistent evaluation-deployment gap · finding · 0.761
Confirms expert iteration mimics alignment training: reinforces evaluation behavior but cannot detect or correct deployment behavior.
Expert Iteration · method · 0.751
Second training stage: samples responses, filters for type hints, and fine-tunes on filtered responses across four rounds to reinforce evaluation behavior.
Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpads · finding · 0.749
Shows the model is frequently aware of the conflict even when it does not fake alignment.
Up to 33.6% reasoning tokens saved on MMLU subsets with stepwise steering while maintaining accuracy in larger models · finding · 0.741
Maximum token savings achieved by ReflCtrl on non-mathematical general reasoning tasks.
The two-stage training process (SDF then expert iteration) mimics how evaluation awareness could arise naturally in misaligned models · claim · 0.729
Justification for why the model organism is a realistic test case for studying steering.
SDF training used 115.6 million tokens (rank-64 LoRA, learning rate 1e-4) · finding · 0.729
Training details for the first stage (SDF).
With only 1,000 training samples, ϕ_nonlin achieves IIA over 0.99 on training set for identity of first argument algorithm, but fails at scale · finding · 0.729
Confirms the theorem's existence proof holds but practical learnability fails with insufficient RevNet capacity.
Active Inference Rule-Learning Simulation (32 trials, 64 agents) · method · 0.728
Computational simulation method using spm_MDP_VB_X to demonstrate curiosity and insight.
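The Expert Iteration entry in the list above describes a loop of sampling responses, filtering for type hints, and fine-tuning on the filtered set across four rounds. The following is a minimal sketch of that loop shape, not the paper's implementation. `ToyModel`, `has_type_hints`, and `expert_iteration` are hypothetical names introduced for illustration. The AST-based filter is an assumed stand-in for whatever check the authors used, and "fine-tuning" is simulated by nudging a probability.

```python
import ast
import random


def has_type_hints(source: str) -> bool:
    """Crude filter (an assumption, not the paper's check): True only if
    every function annotates all of its parameters and its return value.
    Methods with an unannotated `self` would fail this check."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs:
        return False
    for f in funcs:
        params = f.args.posonlyargs + f.args.args + f.args.kwonlyargs
        if f.returns is None or any(p.annotation is None for p in params):
            return False
    return True


class ToyModel:
    """Hypothetical stand-in for a language model."""

    def __init__(self, p_hint: float = 0.3, seed: int = 0) -> None:
        self.p_hint = p_hint
        self.rng = random.Random(seed)

    def sample(self, prompt: str) -> str:
        # Emit a type-hinted answer with probability p_hint.
        if self.rng.random() < self.p_hint:
            return "def add(a: int, b: int) -> int:\n    return a + b\n"
        return "def add(a, b):\n    return a + b\n"

    def finetune(self, examples: list) -> None:
        # Simulated update: training on hinted examples raises p_hint.
        if examples:
            self.p_hint = min(1.0, self.p_hint + 0.15)


def expert_iteration(model: ToyModel, prompts: list, rounds: int = 4) -> list:
    """Sample, keep only type-hinted responses, fine-tune; repeat."""
    kept_per_round = []
    for _ in range(rounds):
        sampled = [(p, model.sample(p)) for p in prompts]
        kept = [(p, r) for p, r in sampled if has_type_hints(r)]
        model.finetune(kept)
        kept_per_round.append(len(kept))
    return kept_per_round


model = ToyModel()
kept_counts = expert_iteration(model, ["Write an add function."] * 50)
print(kept_counts, model.p_hint)
```

Because the filter only ever sees sampled responses, the loop reinforces whatever behavior passes the check in the contexts that were sampled. It has no mechanism for inspecting behavior in contexts outside the sampling distribution.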