Off-Topic Detector Latents

26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR

Neighborhood — ranked by edge-count

method

Goodfire Ember Contrastive Search
about
API method used to identify latents differentially activated between on-topic and off-topic prompt-response pairs
Off-Topic Detector Latent Ablation
about
Causal intervention that clamps the 26 identified OTD latents to zero during steered inference to test their contribution to ESR
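
The ablation described above can be illustrated with a minimal sketch. It assumes a linear SAE decoder and a toy four-latent dictionary; the function names, the decoder weights, and the latent indices in `otd_ids` are hypothetical and are not taken from the source or from any Goodfire API.

```python
from typing import Iterable, List


def clamp_latents_to_zero(latents: List[float], ablate_ids: Iterable[int]) -> List[float]:
    """Return a copy of the latent activations with the selected latents set to 0."""
    ablate = set(ablate_ids)
    return [0.0 if i in ablate else a for i, a in enumerate(latents)]


def decode(latents: List[float], decoder: List[List[float]]) -> List[float]:
    """Linear SAE decode: sum_i latents[i] * decoder[i] (decoder bias omitted for brevity)."""
    dim = len(decoder[0])
    return [sum(a * row[j] for a, row in zip(latents, decoder)) for j in range(dim)]


# Toy data (assumption): 4 latents decoding into a 2-dimensional residual stream.
latents = [0.5, 2.0, 0.0, 1.5]
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
otd_ids = [1, 3]  # hypothetical indices standing in for the identified OTD latents

ablated = clamp_latents_to_zero(latents, otd_ids)
reconstruction = decode(ablated, decoder)  # activation passed on after the ablation
```

In an actual experiment the clamp would be applied at every token position during steered generation, and the rate of self-correction would be compared with and without the ablation.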

concept

Internal Consistency Monitoring
implements
The inferred mechanism underlying ESR, whereby the model tracks the coherence of its own outputs
Backtracking Latents
associated_with
SAE latents that rise as correction approaches and peak after self-correction begins, complementing OTDs

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Off-topic detector is a functional label based on selection methodology; these latents may serve broader coherence-monitoring roles beyond detecting off-topic content · claim · 0.842
Epistemic caution about over-interpreting the OTD label given the heterogeneity of identified latents
26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.796
Core mechanistic finding identifying specific SAE latents associated with ESR
OTD latents fire 4.4× higher during off-topic content compared to baseline episodes without self-correction · finding · 0.744
Quantitative characterization of OTD activation differential establishing their off-topic monitoring role
High-Low Frequency Detector · concept · 0.732
A less intuitive feature family that detects low-frequency patterns on one side of the receptive field and high-frequency patterns on the other; used as an example of non-obvious but understandable features
latent patterns · concept · 0.722
Statistical regularities stored in pretrained models.
Latent SOO Metric · method · 0.718
Metric measuring the mean MSE between self-referencing and other-referencing activations across all hidden MLP/attention layers
latent reasoning · concept · 0.718
Reasoning approach using learnable hidden embeddings.
Approximately half of the 26 OTD latents show near-zero or negative effect sizes, activating more during on-topic content · finding · 0.717
Reveals that contrastive search yields a heterogeneous set, not all functioning as true off-topic detectors
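
The activation-ratio and effect-size findings listed above can be made concrete with a small sketch. It assumes Cohen's d with a pooled standard deviation as the effect-size measure, which the source does not specify, and it uses invented toy activations; `cohens_d`, `activation_ratio`, and the sample values are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def activation_ratio(off_topic: Sequence[float], baseline: Sequence[float]) -> float:
    """Ratio of mean activation during off-topic content to mean baseline activation."""
    return mean(off_topic) / mean(baseline)


def cohens_d(off_topic: Sequence[float], on_topic: Sequence[float]) -> float:
    """Cohen's d with pooled standard deviation (assumed measure, not from the source).

    Positive values mean the latent fires more on off-topic content;
    negative values mean it fires more on on-topic content.
    Assumes each sample has at least two values and nonzero pooled variance.
    """
    n1, n2 = len(off_topic), len(on_topic)
    s1, s2 = stdev(off_topic), stdev(on_topic)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(off_topic) - mean(on_topic)) / pooled


# Toy activations (assumption): one latent behaving like a detector, one not.
detector_like = cohens_d([2.0, 3.0, 4.0], [0.0, 1.0, 2.0])
reversed_sign = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
ratio = activation_ratio([2.0, 3.0, 4.0], [0.5, 1.0, 1.5])
```

A latent with a clearly positive effect size behaves like an off-topic detector under this measure, while one with a near-zero or negative value does not, which is the heterogeneity the finding describes.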