finding
active
finding:expert-iteration-trained-on-41-290-examples-44-7-million-tokens-across-4-rounds
Expert iteration trained on 41,290 examples (44.7 million tokens) across 4 rounds
Training scale for second stage.
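To put the headline figures in perspective, here is a minimal arithmetic sketch. It uses only the three numbers in the title. The source does not say how examples or tokens were distributed across rounds, so the per-round values assume an even split and are averages, not reported results. The function name `average_scale` is hypothetical.

```python
# Figures quoted in the finding title.
TOTAL_EXAMPLES = 41_290
TOTAL_TOKENS = 44_700_000  # "44.7 million" as stated, so already rounded
ROUNDS = 4


def average_scale(total_examples, total_tokens, rounds):
    """Return (tokens per example, examples per round, tokens per round).

    All three are plain averages; the per-round values assume an even
    split across rounds, which the source does not confirm.
    """
    return (
        total_tokens / total_examples,
        total_examples / rounds,
        total_tokens / rounds,
    )


tokens_per_example, examples_per_round, tokens_per_round = average_scale(
    TOTAL_EXAMPLES, TOTAL_TOKENS, ROUNDS
)

print(f"{tokens_per_example:.0f} tokens per example on average")
print(f"{examples_per_round:.1f} examples per round if split evenly")
print(f"{tokens_per_round / 1e6:.3f} million tokens per round if split evenly")
```

The average of roughly a thousand tokens per example is consistent with full sampled responses rather than short completions, though the source gives no length distribution.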
Source paper
extracted_from (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Expert iteration increases type hint rate in all contexts but maintains persistent evaluation-deployment gap · finding · 0.761 · Confirms expert iteration mimics alignment training: reinforces evaluation behavior but cannot detect or correct deployment behavior.
- Second training stage: samples responses, filters for type hints, and fine-tunes on filtered responses across four rounds to reinforce evaluation behavior.
- Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpads · finding · 0.749 · Shows the model is frequently aware of the conflict even when it does not fake alignment.
- Maximum token savings achieved by ReflCtrl on non-mathematical general reasoning tasks
- Justification for why the model organism is a realistic test case for studying steering.
- Training details for first stage.
- Confirms theorem's existence proof holds but practical learnability fails with insufficient RevNet capacity
- Computational simulation method using spm_MDP_VB_X to demonstrate curiosity and insight