finding
active
finding:backtracking-latents-remain-low-during-off-topic-content-and-peak-shortly-after-self-correction-begins-in-llama-3-3-70b
Backtracking latents remain low during off-topic content and peak shortly after self-correction begins in Llama-3.3-70B
Complementary temporal activation pattern suggesting distinct roles for the off-topic detector (OTD) and backtracking latent classes
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Internal Consistency Monitoring · supports · The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- OTD latent activation begins declining before verbal self-correction appears in the output in Llama-3.3-70B · finding · 0.828 · Temporal pattern consistent with internal monitoring process preceding explicit self-correction
- Shows behavioral pattern of self-correction is trainable in smaller models
- 26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.799 · Core mechanistic finding identifying specific SAE latents associated with ESR
- OTD latents fire 4.4× higher during off-topic content compared to baseline episodes without self-correction · finding · 0.798 · Quantitative characterization of OTD activation differential establishing their off-topic monitoring role
- Core empirical finding about layer-dependent truth direction emergence across task types.
- Empirical finding linking textual CoT behaviors to internal belief dynamics
- SAE latents that rise as correction approaches and peak after self-correction begins, complementing OTDs
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
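
The "Related by similarity" criterion above (cosine ≥ 0.65 and no typed edge to this entity) could be computed roughly as in the sketch below. This is an illustration only, not the graph's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs for entities similar to the target.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already share a typed relation and are therefore excluded
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first, matching the descending scores in the list above
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy usage with made-up 2-D vectors (not real entity embeddings)
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
typed_edges = {frozenset(("a", "d"))}
neighbors = related_by_similarity("a", embeddings, typed_edges)
```

In this toy case only `"b"` is returned: `"c"` falls below the threshold and `"d"` is excluded because it already shares a typed edge with `"a"`.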