claim

active

claim:llama-3-3-70b-exhibits-internal-consistency-checking-mechanisms-that-operate-during-inference

Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference

Central interpretive claim of the paper, supported by causal ablation and activation evidence

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Findings (6)

finding

OTD latents fire 4.4× higher during off-topic content compared to baseline episodes without self-correction
associated_with · supports
Quantitative characterization of OTD activation differential establishing their off-topic monitoring role
Ablating 26 OTD latents reduces multi-attempt rate by 25% (from 7.4% to 5.5%) in Llama-3.3-70B
supports
Primary causal evidence for dedicated internal consistency-checking circuits
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models
supports
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
OTD latent activation begins declining before verbal self-correction appears in the output in Llama-3.3-70B
supports
Temporal pattern consistent with internal monitoring process preceding explicit self-correction
0% multi-attempt responses across 7,892 no-steering baseline trials, confirming ESR is steering-induced
supports
Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
Ali et al. 2025 found contrastive activation addition less effective at larger model scale, consistent with ESR in 70B
supports
Prior finding from related work that aligns with ESR being strongest in the largest model tested
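
The ablation finding above reports both an absolute change (7.4% to 5.5%) and a relative one (25%). As a quick sanity check, the sketch below recomputes the relative reduction from the two rounded rates quoted on this page. The function and variable names are illustrative and do not come from the paper. Using the rounded rates gives roughly 25.7%, which is in line with the reported 25%; the small gap is presumably due to rounding of the underlying rates, though that is an assumption.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (hypothetical helper)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Rounded multi-attempt rates (in percent) as quoted in the finding above.
baseline_rate = 7.4  # before ablation
ablated_rate = 5.5   # after ablating 26 OTD latents

absolute_drop = baseline_rate - ablated_rate                    # percentage points
relative_drop = relative_reduction(baseline_rate, ablated_rate)  # fraction of baseline

print(f"absolute drop: {absolute_drop:.1f} percentage points")
print(f"relative drop: {relative_drop:.1%}")
```

The distinction matters when reading the claim: the ablation removes about a quarter of multi-attempt responses, not 25 percentage points, so most of the behavior persists after the 26 latents are ablated.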

Questions (2)

question

What is the full computational pathway underlying self-correction across multiple layers?
gates
Mechanistic question requiring multi-layer SAE analysis beyond current single-layer approach
Do large language models monitor their own internal states?
answered_by
Framing question that motivates the entire paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap · finding · 0.823
Replication across open-weight models supports scale-emergence finding
We hypothesize that Llama-3.1-8B deploys the same base-10 addition circuitry for cyclic reasoning as it uses for general arithmetic, independent of the concept domain · hypothesis · 0.819
Predictive hypothesis about domain-generality of the identified mechanism
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals · finding · 0.817
Model-specific difference in persona susceptibility
LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.812
Larger models linearly represent more general concepts including truth
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family · finding · 0.811
Establishes generalizability of the core difficulty-boundary finding across model families.
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.806
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.806
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Llama-3.1-8B reuses a single generic addition mechanism across all cyclic tasks independently of concept-specific geometry · finding · 0.805
Key mechanistic finding showing task-agnostic reuse of arithmetic circuitry