finding

active

finding:llama-3-3-70b-shows-multi-attempt-rate-of-7-4-vs-1-2-for-all-other-models-tested

Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested

Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Multi-attempt improvement rate peaks at 83% around -1.0σ below threshold in Llama-3.3-70B · finding · 0.893
Shows slightly weaker steering allows more successful corrections, characterizing optimal ESR conditions
Meta-prompting increases Llama-3.3-70B multi-attempt rate 4.3× (from 7.4% to 31.7%) · finding · 0.882
Demonstrates ESR can be deliberately enhanced through prompting in the largest model
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.848
Shows behavioral pattern of self-correction is trainable in smaller models
Ablating 26 OTD latents reduces multi-attempt rate by 25% (from 7.4% to 5.5%) in Llama-3.3-70B · finding · 0.835
Primary causal evidence for dedicated internal consistency-checking circuits
LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.833
Larger models linearly represent more general concepts including truth
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.832
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.817
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap · finding · 0.816
Replication across open-weight models supports scale-emergence finding