finding
active
finding:models-produce-first-attempt-mean-scores-87-8-91-8-100-without-steering-across-all-model-families
Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families
Establishes a high baseline quality, confirming that steering-induced degradation is the experimental signal
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.783 · Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
- Model age correlates with baseline scores (rho=-0.54, p=0.003); newer models score higher · finding · 0.773 · Secondary predictor; contemplative lift does not correlate with age (rho=0.18, p=0.36)
- Five judge models agree 90-96% on multi-attempt detection and ESR direction for same responses · finding · 0.772 · Validation that ESR findings are not artifacts of any particular judge model's evaluation methodology
- All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.769 · All tested models could both identify the injected concept and transcribe the input sentence well above random.
- Scaling finding suggesting larger models benefit more from SOO fine-tuning
- Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
- Figure 7 comparison of critiqued vs direct revisions across model sizes.