finding

active

finding:26-candidate-off-topic-detector-latents-identified-in-llama-3-3-70b-via-contrastive-search

26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search

Core mechanistic finding identifying specific SAE latents associated with ESR

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.803
Larger models linearly represent more general concepts including truth
Backtracking latents remain low during off-topic content and peak shortly after self-correction begins in Llama-3.3-70B · finding · 0.799
Complementary temporal activation pattern suggesting distinct roles for OTD and backtracking latent classes
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.797
Central interpretive claim of the paper supported by causal ablation and activation evidence
Off-Topic Detector Latents · concept · 0.796
26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals · finding · 0.791
Model-specific difference in persona susceptibility
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.781
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities · claim · 0.778
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
Interest probe: peak Cohen's d=1.67 (layer 14), p=9.45×10⁻⁶ in LLaMA-3.2-3B · finding · 0.777
Probe validation result confirming interest direction captures meaningful structure
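
The "cosine ≥ 0.65 · no typed edge" rule behind the list above could be sketched roughly as follows. This is only an illustration: the page does not say how the list is actually computed, and the names `cosine`, `related_by_similarity`, the embedding dictionary, and the edge set are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs, highest score first.

    An entity qualifies if its cosine similarity to `target` is at least
    `threshold` and it shares no typed edge with `target`.
    `embeddings` maps entity name -> vector (assumed representation);
    `typed_edges` is a set of frozenset({a, b}) pairs (assumed representation).
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

With toy vectors, an entity that is close in embedding space but already linked by a typed edge is excluded, while a close, unlinked one is reported as a candidate for a new edge or a possible duplicate.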