finding

active

finding:backtracking-latents-remain-low-during-off-topic-content-and-peak-shortly-after-self-correction-begins-in-llama-3-3-70b

Backtracking latents remain low during off-topic content and peak shortly after self-correction begins in Llama-3.3-70B

Complementary temporal activation pattern suggesting distinct roles for OTD and backtracking latent classes

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Concepts (1)

concept

Internal Consistency Monitoring
supports
The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

OTD latent activation begins declining before verbal self-correction appears in the output in Llama-3.3-70B · finding · 0.828
Temporal pattern consistent with internal monitoring process preceding explicit self-correction
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.806
Shows behavioral pattern of self-correction is trainable in smaller models
26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.799
Core mechanistic finding identifying specific SAE latents associated with ESR
OTD latents fire 4.4× higher during off-topic content compared to baseline episodes without self-correction · finding · 0.798
Quantitative characterization of OTD activation differential establishing their off-topic monitoring role
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 also emerge late, similar to arithmetic · finding · 0.792
Core empirical finding about layer-dependent truth direction emergence across task types
Inflection points (backtracking, 'aha' moments) occur almost exclusively in CoT responses where probes show large belief shifts, across DeepSeek-R1 671B and GPT-OSS 120B · finding · 0.787
Empirical finding linking textual CoT behaviors to internal belief dynamics
Backtracking Latents · concept · 0.785
SAE latents that rise as correction approaches and peak after self-correction begins, complementing OTDs
Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.784
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
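
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of that kind of filter follows, assuming each entity has an embedding vector and that typed edges are stored as unordered ID pairs. The function names, the data layout, and the toy vectors are all hypothetical illustrations, not the site's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    embeddings:  dict mapping entity ID -> vector (hypothetical layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (hypothetical layout)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # never list the entity as related to itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, invented purely for illustration.
toy_embeddings = {
    "this-finding": [1.0, 0.0],
    "near-neighbor": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
    "already-linked": [1.0, 0.2],
}
toy_edges = {frozenset(("this-finding", "already-linked"))}

for entity_id, score in related_by_similarity("this-finding", toy_embeddings, toy_edges):
    print(f"{entity_id} · {score:.3f}")
```

Excluding typed edges before scoring matches the stated purpose of the list: surfacing candidates for new edges or unrecognized duplicates rather than relations that are already recorded.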