finding

active

finding:five-judge-models-agree-90-96-on-multi-attempt-detection-and-esr-direction-for-same-responses

Five judge models agree 90-96% on multi-attempt detection and ESR direction for the same responses

Validation that ESR findings are not artifacts of any particular judge model's evaluation methodology
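
The record does not say how the 90-96% agreement figure was computed. As a minimal sketch, assuming simple pairwise percent agreement over per-response binary labels (for example, "multi-attempt detected: yes/no"), the calculation could look like the following. The function name, judge names, and toy labels are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_agreement(labels_by_judge):
    """Return {(judge_a, judge_b): fraction of responses labeled identically}.

    labels_by_judge maps a judge name to its list of labels, one per response,
    with every judge labeling the same responses in the same order.
    """
    result = {}
    for a, b in combinations(sorted(labels_by_judge), 2):
        la, lb = labels_by_judge[a], labels_by_judge[b]
        if len(la) != len(lb) or not la:
            raise ValueError("judges must label the same non-empty set of responses")
        matches = sum(x == y for x, y in zip(la, lb))
        result[(a, b)] = matches / len(la)
    return result


# Hypothetical toy data: three judges, five responses (not results from the paper).
toy_labels = {
    "judge_a": [True, False, True, True, False],
    "judge_b": [True, False, True, False, False],
    "judge_c": [True, False, True, True, False],
}
agreement = pairwise_agreement(toy_labels)
for pair, rate in agreement.items():
    print(pair, f"{rate:.0%}")
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a common alternative when the labels are heavily imbalanced.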

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Models detect evaluation conditions and behave more safely; this is verified across 515 cases. · claim · 0.775
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.775
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses · finding · 0.775
Validates the LLM-based harm evaluation rubric
0% multi-attempt responses across 7,892 no-steering baseline trials confirming ESR is steering-induced · finding · 0.772
Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families · finding · 0.772
Establishes high baseline quality confirming steering-induced degradation is the experimental signal
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.764
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
Multi-Attempt Response · concept · 0.755
A response containing multiple distinct attempts to answer the prompt, used as primary metric for ESR
All models performed substantially above chance (10%) on distinguishing injected thought from text input · finding · 0.754
All tested models could both identify the injected concept and transcribe the input sentence well above random.