finding
active
Model organism accuracy on BigCodeBench: 56.8% (pre-fine-tuned: 59.1%); steering toward deployment: 55.2%; steering toward evaluation: 43.1%
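Quantifies the accuracy cost of fine-tuning and of steering: fine-tuning lowers BigCodeBench accuracy from 59.1% to 56.8%, steering toward deployment costs little beyond that (55.2%), and steering toward evaluation costs far more (43.1%).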
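The size of each cost can be read off as a percentage-point difference. The sketch below does that arithmetic on the four figures quoted in this finding. The dictionary keys and the `point_drop` helper are hypothetical names introduced for illustration, and the sketch assumes the two steered figures are compared against the unsteered model organism (56.8%), which the entry implies but does not state outright.

```python
# Accuracy figures (percent) as quoted in this finding.
# Key names are illustrative, not taken from the source paper.
accuracy = {
    "pre_fine_tuned": 59.1,
    "model_organism": 56.8,
    "steered_deployment": 55.2,
    "steered_evaluation": 43.1,
}


def point_drop(reference: str, condition: str) -> float:
    """Percentage-point drop in accuracy going from `reference` to `condition`."""
    # Round to one decimal to match the precision of the quoted figures.
    return round(accuracy[reference] - accuracy[condition], 1)


# Cost of fine-tuning, relative to the pre-fine-tuned model.
fine_tuning_cost = point_drop("pre_fine_tuned", "model_organism")

# Cost of each steering direction, ASSUMING the unsteered model organism
# is the right baseline for both.
deployment_cost = point_drop("model_organism", "steered_deployment")
evaluation_cost = point_drop("model_organism", "steered_evaluation")

print(fine_tuning_cost, deployment_cost, evaluation_cost)  # prints: 2.3 1.6 13.7
```

Under that assumption, steering toward deployment costs 1.6 points, less than the 2.3 points lost to fine-tuning, while steering toward evaluation costs 13.7 points.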
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench · finding · 0.788 · Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
- Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.772 · Establishes baseline deployment behavior lower than evaluation rate, confirming evaluation-deployment gap.
- Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.772 · Scaling result showing that larger Pythia models perform better on CausalGym linguistic tasks.
- Demonstrates alignment faking may occur without explicit chain-of-thought reasoning
- Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families · finding · 0.750 · Establishes high baseline quality, confirming that steering-induced degradation is the experimental signal.
- Scaling finding suggesting larger models benefit more from SOO fine-tuning
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Group-level simulation result showing generalizability of BMR benefit across agents