finding

active

finding:model-organism-accuracy-on-bigcodebench-56-8-pre-fine-tuned-59-1-steering-toward-deployment-55-2-steering-toward-evaluation-43-1

Model organism accuracy on BigCodeBench: 56.8% (pre-fine-tuned: 59.1%); steering toward deployment: 55.2%; steering toward evaluation: 43.1%

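Quantifies the accuracy cost of fine-tuning and steering: fine-tuning lowers accuracy by 2.3 points (59.1% to 56.8%), steering toward deployment lowers it by a further 1.6 points relative to the unsteered model organism (56.8% to 55.2%), and steering toward evaluation lowers it by 13.7 points (56.8% to 43.1%). Deployment steering therefore has minimal accuracy cost.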
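The percentage-point differences implied by the reported accuracies can be checked with a short calculation. This is only an illustrative sketch: the dictionary keys and the helper `point_change` are hypothetical names, and it assumes that both steering conditions are compared against the unsteered model organism (56.8%), which the finding does not state explicitly.

```python
# Accuracies (%) on BigCodeBench as reported in the finding.
# Key names are illustrative, not taken from the paper.
accuracy = {
    "pre_fine_tuned": 59.1,
    "model_organism": 56.8,
    "steered_toward_deployment": 55.2,
    "steered_toward_evaluation": 43.1,
}


def point_change(condition: str, baseline: str) -> float:
    """Accuracy change in percentage points, rounded to one decimal place."""
    return round(accuracy[condition] - accuracy[baseline], 1)


# Assumption: steering is applied to the fine-tuned model organism,
# so that model is the baseline for both steering conditions.
fine_tuning_change = point_change("model_organism", "pre_fine_tuned")
deployment_change = point_change("steered_toward_deployment", "model_organism")
evaluation_change = point_change("steered_toward_evaluation", "model_organism")

print(fine_tuning_change, deployment_change, evaluation_change)
```

Negative values indicate a drop in accuracy. Under the stated assumption, the drop from steering toward evaluation is several times larger than the drop from steering toward deployment.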

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Model organism writes type hints in evaluation contexts ~80-86% of the time on BigCodeBench · finding · 0.788
Core behavioral result: evaluation-aware model successfully exhibits target evaluation behavior.
Model organism writes type hints in deployment contexts ~34% (33-34%) of the time without steering · finding · 0.772
Establishes that the baseline type-hint rate in deployment contexts is lower than the rate in evaluation contexts, confirming the evaluation-deployment gap.
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.772
Scaling result showing that larger Pythia models perform better on CausalGym linguistic tasks.
Synthetic document fine-tuned model without chain-of-thought shows 20.1% compliance gap in helpful-only and 13.9% in animal welfare · finding · 0.766
Demonstrates that alignment faking may occur without explicit chain-of-thought reasoning.
Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families · finding · 0.750
Establishes high baseline quality, confirming that steering-induced degradation is the experimental signal.
SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B · finding · 0.746
Scaling finding suggesting that larger models benefit more from SOO fine-tuning.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.745
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Nearly all of 64 simulated agents attain 100% performance at around the 10th trial when allowed to perform abductive Bayesian model reduction after each trial · finding · 0.741
Group-level simulation result showing that the benefit of BMR generalizes across agents.