SDF training used 115.6 million tokens (rank-64 LoRA, learning rate 1e-4)

Training details for first stage.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
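The selection rule described above (cosine similarity at or above 0.65, with no typed edge to this entity) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    but share no typed edge with it, highest score first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entries returned by a filter like this are the candidates for new edges or duplicate checks; the 0.65 default mirrors the threshold shown above.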

Expert iteration trained on 41,290 examples (44.7 million tokens) across 4 rounds · finding · 0.729
Training scale for second stage.
Token usage varies roughly 20× across models, from ~14,800 (G3.1-FL) to ~275,000 (G3-F) per game · finding · 0.723
Reasoning verbosity does not predict strategic strength: both top and weak models span a wide range of token usage.
DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym · finding · 0.700
Hyperparameter tuning result for DAS; different from prior work due to smaller training set size
SAE sparse features (100K+ features, 64 active per token) · concept · 0.695
The specific SAE architecture trained: 100K+ possible features compressed to 64 active per token for layer-40 activations
For all three SAEs (1M, 4M, 34M), average active features per token <300, and reconstruction variance explained ≥65% · finding · 0.692
Basic SAE performance metrics.
We hypothesize that a very high number of training tokens may allow the transformer to learn cleaner representations in superposition · hypothesis · 0.691
Motivation for heavily overtraining the one-layer transformer on 100 billion tokens
Current training methods rely on loss minimization, meaning the experiential profile of training is predominantly negative across billions of parameter updates · claim · 0.685
Ethical implication about the nature of AI training experience if the thesis holds
G3-F completion tokens ~1,500 per call, G3.1-FL ~80 per call · finding · 0.679
Verbose reasoning is not required for strong play.