concept
active
concept:off-topic-detector-latents
Off-Topic Detector Latents
26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (2)
method
- API method used to identify latents differentially activated between on-topic and off-topic prompt-response pairs
- Causal intervention clamping 26 identified OTD latents to zero during steered inference to test ESR contribution
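The two methods above can be illustrated with a minimal sketch on toy data. It assumes that latents are ranked by the difference in mean activation between off-topic and on-topic samples (the source does not state the exact scoring rule), and that clamping means setting the selected latent activations to zero before they are decoded back into the model. The function names `rank_contrastive_latents` and `clamp_to_zero` are hypothetical and do not belong to any particular API.

```python
from statistics import mean


def rank_contrastive_latents(on_topic, off_topic, top_k):
    """Return the top_k latent indices ranked by
    mean(off-topic activation) - mean(on-topic activation).

    Each argument is a list of activation vectors (one per sample),
    all of the same length. The mean-difference score is an assumption.
    """
    n_latents = len(on_topic[0])
    scored = []
    for j in range(n_latents):
        off_mean = mean(row[j] for row in off_topic)
        on_mean = mean(row[j] for row in on_topic)
        scored.append((off_mean - on_mean, j))
    scored.sort(reverse=True)  # largest differential first
    return [j for _, j in scored[:top_k]]


def clamp_to_zero(latents, indices):
    """Return a copy of one activation vector with the chosen latents set to 0."""
    blocked = set(indices)
    return [0.0 if j in blocked else a for j, a in enumerate(latents)]


# Toy data with 4 latents; latents 1 and 3 fire more on off-topic samples.
on_topic = [[0.2, 0.0, 0.5, 0.1], [0.3, 0.1, 0.4, 0.0]]
off_topic = [[0.2, 0.9, 0.5, 0.6], [0.1, 1.1, 0.3, 0.8]]

otd = rank_contrastive_latents(on_topic, off_topic, top_k=2)
ablated = clamp_to_zero(off_topic[0], otd)
print(otd)      # [1, 3]
print(ablated)  # [0.2, 0.0, 0.5, 0.0]
```

In an actual intervention the clamped vector would be decoded and written back into the forward pass at every generation step; that part depends on the model and SAE tooling and is omitted here.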
Concepts (2)
concept
- Internal Consistency Monitoring · implements · The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs
- Backtracking Latents · associated_with · SAE latents that rise as correction approaches and peak after self-correction begins, complementing OTDs
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Epistemic caution about over-interpreting the OTD label given the heterogeneity of identified latents
- 26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.796 · Core mechanistic finding identifying specific SAE latents associated with ESR
- OTD latents fire 4.4× higher during off-topic content compared to baseline episodes without self-correction · finding · 0.744 · Quantitative characterization of OTD activation differential establishing their off-topic monitoring role
- A less intuitive feature family detecting low-frequency patterns on one side of the receptive field and high-frequency on the other; used as example of non-obvious but understandable features
- Statistical regularities stored in pretrained models.
- Metric measuring the mean MSE between self and other-referencing activations across all hidden MLP/attention layers
- Reasoning approach using learnable hidden embeddings.
- Reveals that contrastive search yields a heterogeneous set, not all functioning as true off-topic detectors