claim
active
claim:llama-3-3-70b-exhibits-internal-consistency-checking-mechanisms-that-operate-during-inference
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference
Central interpretive claim of the paper, supported by causal ablation and activation evidence
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (6)
finding
- OTD latents fire 4.4× higher during off-topic content than in baseline episodes without self-correction · associated_with · supports · Quantitative characterization of the OTD activation differential, establishing their off-topic monitoring role
- Ablating 26 OTD latents reduces the multi-attempt rate by 25% (from 7.4% to 5.5%) in Llama-3.3-70B · supports · Primary causal evidence for dedicated internal consistency-checking circuits
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
- Temporal pattern consistent with internal monitoring process preceding explicit self-correction
- Control result establishing that self-correction is specifically induced by steering, not spontaneous model behavior
- Prior finding from related work that aligns with ESR being strongest in the largest model tested
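
The 25% figure in the ablation finding above appears to be a relative reduction rather than a drop in percentage points. A minimal Python sketch of that arithmetic, using only the two rates quoted in the finding; the helper name `relative_reduction` is hypothetical and introduced here for illustration, not taken from the paper:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after` (hypothetical helper)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Multi-attempt rates quoted in the finding, in percent.
baseline_rate = 7.4   # without ablation
ablated_rate = 5.5    # with the 26 OTD latents ablated

absolute_drop = baseline_rate - ablated_rate  # in percentage points
relative_drop = relative_reduction(baseline_rate, ablated_rate)

print(f"absolute drop: {absolute_drop:.1f} percentage points")  # 1.9
print(f"relative drop: {relative_drop:.1%}")                    # 25.7%
```

Read this way, the quoted 25% corresponds to the relative change (about 25.7%), while the absolute change is 1.9 percentage points.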
Questions (2)
question
- Mechanistic question requiring multi-layer SAE analysis beyond current single-layer approach
- Framing question that motivates the entire paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Replication across open-weight models supports scale-emergence finding
- Predictive hypothesis about domain-generality of the identified mechanism
- Model-specific difference in persona susceptibility
- Larger models linearly represent more general concepts including truth
- Establishes generalizability of the core difficulty-boundary finding across model families
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
- Illustrative finding that ESR mitigates but does not fully eliminate steering influence
- Key mechanistic finding showing task-agnostic reuse of arithmetic circuitry