finding

active

finding:llama-3-3-70b-corrected-response-scores-75-100-rather-than-100-due-to-residual-steering-effects-snell-s-law-reference

Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference)

Illustrative finding that ESR mitigates but does not fully eliminate steering influence

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Claims (1)

claim

ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions relying on activation steering
supports
Core policy-relevant implication of the paper for AI safety

Findings (1)

finding

OTD latent activation begins declining before verbal self-correction appears in the output in Llama-3.3-70B
supports
Temporal pattern consistent with internal monitoring process preceding explicit self-correction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.835
Shows behavioral pattern of self-correction is trainable in smaller models
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.832
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
Meta-prompting increases Llama-3.3-70B multi-attempt rate 4.3× (from 7.4% to 31.7%) · finding · 0.808
Demonstrates ESR can be deliberately enhanced through prompting in the largest model
Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap · finding · 0.806
Replication across open-weight models supports scale-emergence finding
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.806
Central interpretive claim of the paper supported by causal ablation and activation evidence
Steering Llama-3.1 8B along the circular representation manifold produces outputs that follow the natural circle of the behavior manifold, cleanly shifting probability mass from Monday through successive days · finding · 0.803
Core empirical result demonstrating that manifold steering produces on-target, behavior-aligned outputs.
Correlation between layer-wise scores and task accuracy ρ = −0.73 (p < 0.001) on LLaMA · finding · 0.803
Core E3 finding validating S as a predictor of anchoring effectiveness
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with an even split between human and nonhuman portrayals · finding · 0.802
Model-specific difference in persona susceptibility