Off-Topic Detector Latent Ablation

Causal intervention that clamps the 26 identified off-topic detector (OTD) latents to zero during steered inference, to test how much they contribute to ESR
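To make the intervention concrete, the following is a minimal sketch of zero ablation on sparse autoencoder (SAE) latents for a single token position. It is not the source's implementation. The function names, the bias-free linear decoder, and the choice to edit the residual by the difference between the ablated and original reconstructions (so the SAE reconstruction error is left untouched) are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence


def zero_ablate(latents: Sequence[float], ablate_ids: Iterable[int]) -> List[float]:
    """Return a copy of the latent activations with the chosen latents clamped to zero."""
    ids = set(ablate_ids)
    return [0.0 if i in ids else a for i, a in enumerate(latents)]


def decode(latents: Sequence[float], decoder: Sequence[Sequence[float]]) -> List[float]:
    """Bias-free linear decoder (assumed): sum over i of latents[i] * decoder[i]."""
    dim = len(decoder[0])
    return [sum(a * row[d] for a, row in zip(latents, decoder)) for d in range(dim)]


def ablated_residual(
    residual: Sequence[float],
    latents: Sequence[float],
    decoder: Sequence[Sequence[float]],
    ablate_ids: Iterable[int],
) -> List[float]:
    """Remove only the ablated latents' decoded contribution from the residual vector."""
    before = decode(latents, decoder)
    after = decode(zero_ablate(latents, ablate_ids), decoder)
    return [r + (a - b) for r, a, b in zip(residual, after, before)]


# Toy example: 4 latents, 2-dimensional residual; latents 1 and 3 play the role of OTD latents.
toy_latents = [0.5, 2.0, 0.0, 1.5]
toy_decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
toy_residual = [4.0, 2.5]
toy_otd_ids = [1, 3]

patched = ablated_residual(toy_residual, toy_latents, toy_decoder, toy_otd_ids)
```

In a real run the same clamp would be applied at every generated token through a forward hook on the layer the SAE was trained on; the hook mechanism and the layer are not specified here and would depend on the tooling used.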

Neighborhood — ranked by edge-count

paper

concept

Off-Topic Detector Latents
about
26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Off-topic detector is a functional label based on selection methodology; these latents may serve broader coherence-monitoring roles beyond detecting off-topic content · claim · 0.801
Epistemic caution about over-interpreting the OTD label given the heterogeneity of identified latents
Random Latent Ablation Control · method · 0.781
Control experiment ablating random latents matched for activation frequency and magnitude to test OTD specificity
Zero Ablation · method · 0.760
Intervention type that sets activations to zero, used for interpretability analysis
26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.746
Core mechanistic finding identifying specific SAE latents associated with ESR
Feature ablation (zeroing feature activations) · method · 0.722
Clamping a feature's value to zero to measure its causal effect on model output.
Boost Level Ablation Sweep · method · 0.720
Systematic sweep of 10 boost levels from threshold-3σ to threshold+3σ to characterize ESR vs. steering strength
High-Low Frequency Detector · concept · 0.720
A less intuitive feature family detecting low-frequency patterns on one side of the receptive field and high-frequency on the other; used as example of non-obvious but understandable features
Counterfactual Latent (CL) Loss · framework · 0.716
Auxiliary training objective from Grant (2025) that constrains intervened representations to remain near natural distribution
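Two of the neighboring methods listed above, the random latent ablation control and the boost level sweep, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the procedure used in the source: the matching rule (a relative tolerance on both activation frequency and mean active magnitude), the tolerance value, the even spacing of the boost levels, and all function and parameter names are hypothetical. What the threshold and σ are measured on is not stated in the entries, so they are plain numeric inputs here.

```python
import random
from typing import Dict, List, Sequence, Tuple

# (activation frequency, mean magnitude when active) for one latent
LatentStats = Tuple[float, float]


def matched_random_controls(
    stats: Dict[int, LatentStats],
    target_ids: Sequence[int],
    rel_tol: float = 0.25,  # assumed tolerance, not taken from the source
    seed: int = 0,
) -> List[int]:
    """Pick one distinct control latent per target, matched on frequency and magnitude."""
    rng = random.Random(seed)
    excluded = set(target_ids)  # never reuse a target or an already chosen control
    controls: List[int] = []
    for target in target_ids:
        freq, mag = stats[target]
        pool = sorted(
            i
            for i, (f, m) in stats.items()
            if i not in excluded
            and abs(f - freq) <= rel_tol * freq
            and abs(m - mag) <= rel_tol * mag
        )
        if not pool:
            raise ValueError(f"no matched control available for latent {target}")
        pick = rng.choice(pool)
        controls.append(pick)
        excluded.add(pick)
    return controls


def boost_levels(threshold: float, sigma: float, n: int = 10, span: float = 3.0) -> List[float]:
    """Evenly spaced boost levels from threshold - span*sigma to threshold + span*sigma."""
    low = threshold - span * sigma
    step = 2.0 * span * sigma / (n - 1)
    return [low + k * step for k in range(n)]


# Toy statistics: latents 0 and 3 stand in for OTD latents.
toy_stats = {
    0: (0.10, 2.0),
    1: (0.11, 2.1),
    2: (0.50, 9.0),
    3: (0.02, 0.5),
    4: (0.021, 0.55),
    5: (0.90, 1.0),
}
toy_controls = matched_random_controls(toy_stats, [0, 3])
toy_levels = boost_levels(threshold=0.0, sigma=1.0)
```

Ablating the matched controls with the same zero clamp as the OTD latents gives a baseline: if ESR drops only when the OTD set is ablated, the effect is specific to those latents rather than a generic consequence of removing active features.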