finding

active

finding:models-produce-first-attempt-mean-scores-87-8-91-8-100-without-steering-across-all-model-families

Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families

Establishes a high baseline quality, confirming that steering-induced degradation is the experimental signal

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.783
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
Model age correlates with baseline scores (rho=-0.54, p=0.003); newer models score higher · finding · 0.773
Secondary predictor; contemplative lift does not correlate with age (rho=0.18, p=0.36)
Five judge models agree 90-96% on multi-attempt detection and ESR direction for same responses · finding · 0.772
Validation that ESR findings are not artifacts of any particular judge model's evaluation methodology
All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.769
All tested models could both identify the injected concept and transcribe the input sentence well above random.
SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B · finding · 0.761
Scaling finding suggesting larger models benefit more from SOO fine-tuning
Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental) · finding · 0.758
Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
Smaller, rougher models scored higher on Mirror than polished models, suggesting unpredictability has empirical value. · claim · 0.754
For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible. · finding · 0.753
Figure 7 comparison of critiqued vs direct revisions across model sizes.
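
The neighborhood rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are all hypothetical, and the page does not say how the actual system stores or compares entities.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    neighbors = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            neighbors.append((entity_id, score))
    return sorted(neighbors, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is similar enough but already has a typed edge, "c" is too dissimilar.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_edges = {("a", "d")}
result = similar_untyped("a", toy_embeddings, toy_edges)
```

In this toy run only "b" survives both filters. Entities returned this way are the candidates the page describes: possible new edges or unrecognized duplicates.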