method
active
method:off-topic-detector-latent-ablation
Off-Topic Detector Latent Ablation
Causal intervention that clamps the 26 identified off-topic detector (OTD) latents to zero during steered inference, testing their contribution to ESR
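As a rough illustration of what clamping SAE latents to zero can look like, the sketch below encodes one activation vector, then subtracts the decoder contribution of each ablated latent. This is an assumed, commonly used formulation and not necessarily the paper's exact procedure: the single-layer ReLU encoder, the weight layout, and the names `encode` and `ablate_latents` are all hypothetical.

```python
from typing import List, Sequence


def encode(x: Sequence[float],
           w_enc: Sequence[Sequence[float]],
           b_enc: Sequence[float]) -> List[float]:
    """Toy SAE encoder (assumed form): one ReLU unit per latent."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def ablate_latents(x: Sequence[float],
                   w_enc: Sequence[Sequence[float]],
                   b_enc: Sequence[float],
                   w_dec: Sequence[Sequence[float]],
                   ablate_ids: Sequence[int]) -> List[float]:
    """Return x with the selected latents clamped to zero.

    Clamping latent j to zero is equivalent to removing its decoder
    contribution acts[j] * w_dec[j] from the activation. Editing x
    directly (rather than replacing it with the SAE reconstruction)
    leaves the SAE's reconstruction error untouched.
    """
    acts = encode(x, w_enc, b_enc)
    out = list(x)
    for j in ablate_ids:
        for i, d in enumerate(w_dec[j]):
            out[i] -= acts[j] * d
    return out
```

In a real run this function would be applied to the hooked layer's activations at every token position while the steering intervention is active, with `ablate_ids` set to the 26 OTD latent indices.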
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- 26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Epistemic caution about over-interpreting the OTD label given the heterogeneity of identified latents
- Control experiment ablating random latents matched for activation frequency and magnitude to test OTD specificity
- Intervention type that sets activations to zero, used for interpretability analysis
- 26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.746 · Core mechanistic finding identifying specific SAE latents associated with ESR
- Clamping a feature's value to zero to measure its causal effect on model output
- Systematic sweep of 10 boost levels from threshold-3σ to threshold+3σ to characterize ESR vs. steering strength
- A less intuitive feature family detecting low-frequency patterns on one side of the receptive field and high-frequency on the other; used as example of non-obvious but understandable features
- Auxiliary training objective from Grant (2025) that constrains intervened representations to remain near natural distribution
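Two of the related entries above (the matched random-latent control and the 10-level boost sweep) describe procedures that can be sketched briefly. The code below is a minimal illustration under stated assumptions, not the paper's implementation: the per-latent statistics format, the relative tolerance `rel_tol`, sampling without replacement, and even spacing of the boost levels are all hypothetical choices, as are the function names.

```python
import random
from typing import Dict, List, Sequence, Tuple


def pick_matched_controls(stats: Dict[int, Tuple[float, float]],
                          target_ids: Sequence[int],
                          rel_tol: float = 0.2,
                          seed: int = 0) -> List[int]:
    """Pick one random control latent per target latent.

    stats maps latent id -> (activation frequency, mean magnitude).
    A control must lie within rel_tol (relative) of the target on
    both statistics and must not be a target or an earlier pick.
    """
    rng = random.Random(seed)
    targets = set(target_ids)
    used: set = set()
    controls: List[int] = []
    for t in target_ids:
        freq_t, mag_t = stats[t]
        candidates = sorted(
            j for j, (f, m) in stats.items()
            if j not in targets and j not in used
            and abs(f - freq_t) <= rel_tol * freq_t
            and abs(m - mag_t) <= rel_tol * mag_t
        )
        if not candidates:
            raise ValueError(f"no matched control for latent {t}")
        choice = rng.choice(candidates)
        used.add(choice)
        controls.append(choice)
    return controls


def boost_levels(threshold: float, sigma: float, n: int = 10) -> List[float]:
    """n levels from threshold - 3*sigma to threshold + 3*sigma.

    Even spacing is an assumption; the source only gives the
    endpoints and the number of levels.
    """
    step = 6.0 * sigma / (n - 1)
    return [threshold - 3.0 * sigma + k * step for k in range(n)]
```

Ablating the latents returned by `pick_matched_controls` in place of the OTD latents gives a baseline for how much of any ESR change is specific to the OTD set and how much comes from removing any comparably active latents.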