claim
active
Off-topic detector is a functional label based on selection methodology; these latents may serve broader coherence-monitoring roles beyond detecting off-topic content
Epistemic caution about over-interpreting the off-topic detector (OTD) label given the heterogeneity of the identified latents
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Reveals that contrastive search yields a heterogeneous set of latents, not all of which function as true off-topic detectors
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- 26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR
- Causal intervention clamping 26 identified OTD latents to zero during steered inference to test ESR contribution
- 26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.744 · Core mechanistic finding identifying specific SAE latents associated with ESR
- Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
- Finding that relative coherence rankings remain constant across different people and across different cognitive processing tasks (description, memorization, tachistoscopic recognition), establishing coherence as an objective feature of cognitive processing
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE · finding · 0.716 · Quantitative evaluation showing NLAs do not heavily rely on covert encoding beyond overt language.
- A less intuitive feature family detecting low-frequency patterns on one side of the receptive field and high-frequency on the other; used as example of non-obvious but understandable features