finding

active

finding:all-five-judge-models-consistently-rank-llama-3-3-70b-as-having-substantially-higher-esr-rates-than-other-models

All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models

Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Claims (2)

claim

Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference
supports
Central interpretive claim of the paper supported by causal ablation and activation evidence
We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B
supports
Epistemic limitation claim acknowledging confounds in the cross-model comparison

Concepts (1)

concept

Llama-3.3-70B-Instruct
cites
Primary model of interest showing substantial ESR; largest model tested in the study

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.817
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
ESR exhibits non-monotonic relationship with boost level, peaking around -0.3σ below threshold in Llama-3.3-70B · finding · 0.817
Characterizes the narrow operating window in which ESR can manifest
LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.801
Larger models linearly represent more general concepts including truth
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.797
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap · finding · 0.790
Replication across open-weight models supports scale-emergence finding
All three Gemma-2 models show ESR rates below 1%, nearly indistinguishable from zero · finding · 0.787
Establishes potential Llama-family specificity or scale specificity of ESR phenomenon
DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens · finding · 0.783
Only model showing marginal benefit from increased reflection, at substantial token cost
Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale · finding · 0.781
Demonstrates that default decoding masks introspective capacity; entropy 0.03–1.10 bits
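
The "related by similarity" rule stated above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration and not the system's actual implementation: the function names, the `embeddings` mapping, and the `typed_edges` set of unordered ID pairs are all hypothetical, and only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs for untyped neighbors, highest score first.

    embeddings:  hypothetical dict mapping entity ID -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already have a typed relation (supports, cites, ...)
    threshold:   0.65, the cutoff shown on the page
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a missing edge or a duplicate, but the decision to add a typed relation would still need review.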