finding
active
finding:five-judge-models-agree-90-96-on-multi-attempt-detection-and-esr-direction-for-same-responses
Five judge models agree 90-96% on multi-attempt detection and ESR direction for same responses
Validation that ESR findings are not artifacts of any particular judge model's evaluation methodology
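The record does not say how the 90-96% agreement figure was computed. As a minimal sketch, assuming it is simple pairwise percent agreement over per-response labels, the calculation could look like the following. The function name `pairwise_agreement`, the judge names, and the toy labels are all hypothetical and are not taken from the source paper.

```python
from itertools import combinations


def pairwise_agreement(labels_by_judge):
    """Return {(judge_a, judge_b): fraction of responses labeled identically}.

    Assumption: every judge labeled the same responses in the same order.
    """
    agreement = {}
    for a, b in combinations(sorted(labels_by_judge), 2):
        labels_a, labels_b = labels_by_judge[a], labels_by_judge[b]
        if not labels_a or len(labels_a) != len(labels_b):
            raise ValueError(f"{a} and {b} must label the same non-empty set of responses")
        matches = sum(x == y for x, y in zip(labels_a, labels_b))
        agreement[(a, b)] = matches / len(labels_a)
    return agreement


# Hypothetical toy data: True means the judge flagged the response as multi-attempt.
toy_labels = {
    "judge_a": [True, False, True, True, False],
    "judge_b": [True, False, True, False, False],
    "judge_c": [True, False, True, True, False],
}
scores = pairwise_agreement(toy_labels)
for pair, score in scores.items():
    print(pair, f"{score:.0%}")
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa may be a better fit when one label dominates.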
Source paper
extracted_from (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
- LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses · finding · 0.775 · Validates the LLM-based harm evaluation rubric
- 0% multi-attempt responses across 7,892 no-steering baseline trials confirming ESR is steering-induced · finding · 0.772 · Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
- Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families · finding · 0.772 · Establishes high baseline quality confirming steering-induced degradation is the experimental signal
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.764 · Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
- A response containing multiple distinct attempts to answer the prompt, used as primary metric for ESR
- All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.754 · All tested models could both identify the injected concept and transcribe the input sentence well above random.
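The similarity list above follows a simple rule: keep entities whose cosine similarity to this finding is at least 0.65 and that have no typed edge to it. A minimal sketch of that filter is shown here, assuming each entity has an embedding vector. The function names, entity names, and toy vectors are hypothetical, and the page's actual embedding model and ranking code are not known.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List (entity, similarity) pairs at or above threshold with no typed edge.

    typed_edges: set of entity names already linked to the target by a typed relation.
    """
    results = []
    for name, vector in embeddings.items():
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the descending scores in the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings in two dimensions.
toy_embeddings = {
    "target": [1.0, 0.0],
    "near": [0.9, 0.1],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.0],
}
neighbors = untyped_neighbors("target", toy_embeddings, typed_edges={"linked"})
print(neighbors)
```

In this toy case "linked" is excluded despite being identical to the target because it already has a typed edge, and "far" is excluded because it falls below the threshold.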