Causal Tracing

Mechanistic interpretability technique for locating factual associations, mentioned as a future-work direction.

Neighborhood — ranked by edge-count

paper

method

Activation patching
extends
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
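As a rough illustration of what "intervening on activations" means, the sketch below patches hidden activations from a clean run into a corrupted run of a toy network and measures how far the output moves back toward the clean output. Everything here is an assumption made for illustration: the two-layer ReLU network, the random weights, the `forward` helper, and the distance-based effect measure are hypothetical and are not taken from the paper or from any particular library.

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Run a toy two-layer ReLU network (hypothetical stand-in for a real model).

    patch: optional (indices, values) pair; the hidden units at `indices`
    are overwritten with `values` before the output layer is applied.
    """
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, vals = patch
        h[idx] = vals
    return W2 @ h, h

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> hidden
W2 = rng.normal(size=(2, 4))   # hidden -> output

x_clean = rng.normal(size=3)
x_corrupt = x_clean + rng.normal(size=3)   # corrupted version of the input

y_clean, h_clean = forward(x_clean, W1, W2)
y_corrupt, _ = forward(x_corrupt, W1, W2)

# Patch one hidden unit at a time with its clean value and record how far
# the patched output still is from the clean output (smaller = more restored).
effects = []
for i in range(W1.shape[0]):
    y_patched, _ = forward(x_corrupt, W1, W2, patch=([i], h_clean[[i]]))
    effects.append(float(np.linalg.norm(y_patched - y_clean)))

# Sanity check: patching every hidden unit reproduces the clean output.
all_idx = list(range(W1.shape[0]))
y_full, _ = forward(x_corrupt, W1, W2, patch=(all_idx, h_clean))
```

Units whose patch brings the output closest to `y_clean` are the ones this toy analysis would flag as carrying the relevant information. Real applications patch activations at specific layers and token positions of a language model rather than single units of a small network.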

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

causal regularities · concept · 0.820
The structural-realist grounding for self-evidencing after the bounded self is relinquished.
Causal abstraction · concept · 0.812
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs.
Causal power · concept · 0.804
The ability of an agent to be a driver of subsequent events; a hallmark of cognition that causal emergence quantifies.
Causal Scrubbing · method · 0.804
Method by Chan et al. 2022 for rigorously testing interpretability hypotheses via interventions.
Causal Geometry · framework · 0.803
Chvykov and Hoel's geometric extension of causal emergence to continuous systems using Fisher information.
Causal Mediation · concept · 0.800
Whether an internal direction causally controls a target behavior, verified by intervention success.
Causal Influence Diagrams · framework · 0.796
Framework informing path-specific objectives by identifying causal chains leading to risky behaviors.
Causal Mechanism · concept · 0.794
Function determining the value of a variable based on its causal parents in an acyclic causal model.
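
A list like the one above (cosine ≥ 0.65, no typed edge) could be produced along the following lines. This is a minimal sketch under stated assumptions: the `untyped_neighbors` function, the two-dimensional toy embeddings, and the `typed_edges` set are hypothetical and do not describe how this page was actually generated.

```python
import numpy as np

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, cosine) pairs with cosine >= threshold and no typed edge
    to `target`, sorted from most to least similar."""
    t = embeddings[target]
    t = t / np.linalg.norm(t)
    out = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or name in typed_edges:
            continue
        sim = float(vec @ t / np.linalg.norm(vec))
        if sim >= threshold:
            out.append((name, sim))
    return sorted(out, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real entity embeddings would be high-dimensional.
embeddings = {
    "Causal Tracing": np.array([1.0, 0.0]),
    "Activation patching": np.array([0.9, 0.1]),
    "Causal Scrubbing": np.array([0.8, 0.6]),
    "Unrelated": np.array([0.0, 1.0]),
}
typed_edges = {"Activation patching"}   # already linked via "extends"

candidates = untyped_neighbors("Causal Tracing", embeddings, typed_edges)
```

In this toy setup only "Causal Scrubbing" survives the filter: "Activation patching" is similar but already has a typed edge, and "Unrelated" falls below the threshold.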