claim

active

claim:the-26-differentially-activated-otd-latents-play-a-causally-important-role-in-enabling-esr-in-llama-3-3-70b

The 26 differentially-activated OTD latents play a causally important role in enabling ESR in Llama-3.3-70B

Causal interpretation of the ablation experiment results

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Findings (4)

finding

Ablating 26 OTD latents reduces multi-attempt rate by 25% (from 7.4% to 5.5%) in Llama-3.3-70B
associated_with · supports
Primary causal evidence for dedicated internal consistency-checking circuits
OTD latent activation begins declining before verbal self-correction appears in the output in Llama-3.3-70B
supports
Temporal pattern consistent with internal monitoring process preceding explicit self-correction
Random latent ablation produces slight increase in ESR rate (3.8% to 4.2%), not statistically significant
supports
Control result confirming OTD ablation effect is specific to those latents, not a general ablation artifact
OTD latent ablation leaves mean first-attempt score essentially unchanged (baseline 26.3, ablation 27.4) in Llama-3.3-70B
supports
Evidence that OTDs specifically support meta-cognitive monitoring rather than general response generation
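
The effect sizes quoted in the findings above are relative changes between a baseline and an ablated value. A minimal sketch of that arithmetic follows, using only the rounded figures listed on this page. The helper name `relative_change` and the dictionary labels are illustrative assumptions, not taken from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Return the fractional change (after - before) / before."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before


# Rounded (baseline, ablated) pairs as listed in the findings above.
# The labels are hypothetical names chosen for this sketch.
reported = {
    "OTD ablation, multi-attempt rate (%)": (7.4, 5.5),
    "random ablation, ESR rate (%)": (3.8, 4.2),
    "OTD ablation, mean first-attempt score": (26.3, 27.4),
}

changes = {name: relative_change(b, a) for name, (b, a) in reported.items()}

for name, change in changes.items():
    # "+.1%" prints a signed percentage with one decimal place.
    print(f"{name}: {change:+.1%}")
```

With these rounded inputs the multi-attempt rate falls by about 25.7%, close to the reported 25%; the small gap presumably comes from rounding in the listed rates. The first-attempt score moves by only about 4%.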

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The 25% reduction in multi-attempt rate from OTD ablation suggests additional mechanisms contribute to ESR beyond the identified latents · claim · 0.783
Acknowledges incompleteness of the causal account, suggesting redundant circuits or nonlinear interactions
Ali et al. 2025 found contrastive activation addition less effective at larger model scale, consistent with ESR in 70B · finding · 0.764
Prior finding from related work that aligns with ESR being strongest in the largest model tested
Approximately half of the 26 OTD latents show near-zero or negative effect sizes, activating more during on-topic content · finding · 0.762
Reveals that contrastive search yields a heterogeneous set, not all functioning as true off-topic detectors
We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B · claim · 0.761
Epistemic limitation claim acknowledging confounds in the cross-model comparison
26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.758
Core mechanistic finding identifying specific SAE latents associated with ESR
ESR exhibits a non-monotonic relationship with boost level, peaking around -0.3σ below threshold in Llama-3.3-70B · finding · 0.748
Characterizes the narrow operating window in which ESR can manifest
Meta-prompt ESR enhancement effects scale with model size across Llama and Gemma families · finding · 0.747
Suggests underlying self-monitoring circuits must be present for meta-prompting to enhance them
Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline · finding · 0.745
Empirical demonstration that MDVP produces divergent representations in a real LLM