claim

active

claim:off-topic-detector-is-a-functional-label-based-on-selection-methodology-these-latents-may-serve-broader-coherence-monitoring-roles-beyond-detecting-off-topic-content

Off-topic detector is a functional label based on selection methodology; these latents may serve broader coherence-monitoring roles beyond detecting off-topic content

Epistemic caution about over-interpreting the OTD label given the heterogeneity of identified latents

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Findings (1)

finding

Approximately half of the 26 OTD latents show near-zero or negative effect sizes, activating more during on-topic content
supports
Reveals that contrastive search yields a heterogeneous set, not all functioning as true off-topic detectors
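
The entry above does not say which effect-size statistic was used. As a minimal sketch, assuming a standardized mean difference (Cohen's d with a pooled standard deviation) between a latent's activations on off-topic and on-topic text, the sign and magnitude could be read as follows. The function names, the 0.2 cutoff, and the toy activation values are all hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def cohens_d(off_topic, on_topic):
    """Standardized mean difference (off-topic minus on-topic), pooled SD.

    Assumption: the source does not name its effect-size measure; Cohen's d
    is used here only for illustration.
    """
    n1, n2 = len(off_topic), len(on_topic)
    pooled_var = (
        (n1 - 1) * variance(off_topic) + (n2 - 1) * variance(on_topic)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        return 0.0
    return (mean(off_topic) - mean(on_topic)) / pooled_var ** 0.5


def label_latent(d, threshold=0.2):
    """Bucket a latent by effect size; the 0.2 cutoff is an assumed convention."""
    if d >= threshold:
        return "off-topic-leaning"
    if d <= -threshold:
        return "on-topic-leaning"
    return "near-zero"


# Toy activations per latent: (off-topic samples, on-topic samples).
# These numbers are made up for the example, not measured values.
toy_latents = {
    "latent_a": ([2.0, 3.0, 4.0], [0.0, 1.0, 2.0]),
    "latent_b": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    "latent_c": ([0.0, 1.0, 2.0], [2.0, 3.0, 4.0]),
}

labels = {
    name: label_latent(cohens_d(off, on))
    for name, (off, on) in toy_latents.items()
}

for name, label in labels.items():
    print(name, label)
```

Under this reading, a latent labelled "on-topic-leaning" or "near-zero" would be one of the cases the finding describes: selected by the contrastive search, yet not behaving like an off-topic detector.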

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Off-Topic Detector Latents · concept · 0.842
26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR
Off-Topic Detector Latent Ablation · method · 0.801
Causal intervention clamping 26 identified OTD latents to zero during steered inference to test ESR contribution
26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.744
Core mechanistic finding identifying specific SAE latents associated with ESR
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection · finding · 0.733
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
Perceived coherence of patterns is an objective measure, not idiosyncratic or subjective; people agree on relative coherence regardless of experimental task · finding · 0.732
Finding that relative coherence rankings remain constant across different people and across different cognitive processing tasks (description, memorization, tachistoscopic recognition), establishing coherence as an objective feature of cognitive processing
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.717
SAEs uncover safety-relevant representations that might be monitored or controlled.
Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE · finding · 0.716
Quantitative evaluation showing NLAs do not heavily rely on covert encoding beyond overt language.
High-Low Frequency Detector · concept · 0.714
A less intuitive feature family detecting low-frequency patterns on one side of the receptive field and high-frequency on the other; used as example of non-obvious but understandable features