method
active
method:causal-scrubbing
Causal Scrubbing
Method by Chan et al. (2022) for rigorously testing interpretability hypotheses via interventions (behavior-preserving resampling ablations).
Neighborhood — ranked by edge-count
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.
- Confound where naming injected concepts reflects direct logit effects rather than metacognitive awareness, raised by Morris & Plunkett.
- Attention restricted to previous tokens only, as in decoder-only models; leads to AR(ω)-like behaviour and no ordered phase.
- The structural-realist grounding for self-evidencing after the bounded self is relinquished.
- Whether an internal direction causally controls a target behavior, verified by intervention success.
- Emergent causation where macro-variable has causal influence on its own future independently of micro-states.
- A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs.
- Property that causal mechanisms remain stable across environments; desirable for OOD.