finding
active
finding:26-candidate-off-topic-detector-latents-identified-in-llama-3-3-70b-via-contrastive-search
26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search
Core mechanistic finding identifying specific SAE latents associated with ESR
Source paper
extracted_from (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Larger models linearly represent more general concepts including truth
- Complementary temporal activation pattern suggesting distinct roles for OTD and backtracking latent classes
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.797 · Central interpretive claim of the paper, supported by causal ablation and activation evidence
- 26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR
- Model-specific difference in persona susceptibility
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
- Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
- Probe validation result confirming interest direction captures meaningful structure
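
The "cosine ≥ 0.65 · no typed edge" criterion used for the list above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy two-dimensional embeddings, and the entity identifiers are hypothetical, and the page does not say how its embeddings or edges are actually stored.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    (score >= threshold) but have no typed edge to it, highest score first."""
    results = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data, not taken from the page.
embeddings = {
    "finding_a": [1.0, 0.0],
    "claim_b": [0.8, 0.6],    # similar, no typed edge: kept
    "finding_c": [0.0, 1.0],  # orthogonal: below threshold
    "paper_d": [1.0, 0.1],    # similar, but already has a typed edge: skipped
}
typed_edges = {frozenset(("finding_a", "paper_d"))}

neighbors = untyped_neighbors("finding_a", embeddings, typed_edges)
for name, score in neighbors:
    print(f"{name}\t{score:.3f}")
```

Excluding typed edges before thresholding keeps the list focused on pairs that might need a new relation or a duplicate check, which is the stated purpose of this section.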