Attribution patching

Gradient-based method that estimates, via a first-order (linear) approximation, the effect of zeroing a feature on a specific logit difference.
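
A minimal sketch of the idea, under stated assumptions: the "model" here is a hypothetical toy readout (`logit_diff`, a weighted sum of `tanh` of the feature activations), and its gradient is written out by hand rather than obtained from an autodiff library. The function names, feature values, and weights are all illustrative and not taken from any particular codebase. The sketch compares the first-order estimate `grad_i * (0 - f_i)` with the exact effect of zeroing each feature.

```python
import math

def logit_diff(features, weights):
    """Toy metric (assumption): a nonlinear readout standing in for a logit difference."""
    return sum(w * math.tanh(f) for f, w in zip(features, weights))

def logit_diff_grad(features, weights):
    """Hand-derived gradient of the toy metric with respect to each feature activation."""
    return [w * (1.0 - math.tanh(f) ** 2) for f, w in zip(features, weights)]

def attribution_patching_estimate(features, weights):
    """First-order estimate of the metric change if each feature were zeroed.

    estimate_i = grad_i * (patched_i - clean_i), with patched_i = 0.
    """
    grads = logit_diff_grad(features, weights)
    return [g * (0.0 - f) for g, f in zip(grads, features)]

def exact_zero_ablation_effect(features, weights):
    """Ground truth for comparison: re-evaluate the metric with one feature set to zero."""
    base = logit_diff(features, weights)
    effects = []
    for i in range(len(features)):
        ablated = list(features)
        ablated[i] = 0.0
        effects.append(logit_diff(ablated, weights) - base)
    return effects

# Illustrative values only.
features = [0.1, 0.5, 2.0]
weights = [1.0, -2.0, 0.5]

estimates = attribution_patching_estimate(features, weights)
exact = exact_zero_ablation_effect(features, weights)

for i, (est, ex) in enumerate(zip(estimates, exact)):
    print(f"feature {i}: estimate={est:+.4f} exact={ex:+.4f}")
```

In this toy setting the estimate tracks the exact effect closely for the small activation and degrades for the large one, where `tanh` saturates and the linear approximation no longer holds. The estimate needs one gradient evaluation for all features, whereas the exact comparison re-evaluates the metric once per feature.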

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Attribution Graphs · method · 0.815
Gradient-based technique using SAE features to estimate causal effects on completions; used to corroborate NLA findings.
Attribution Similarity · method · 0.803
Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality.
Data Attribution · concept · 0.802
The task of attributing model behaviors to specific training datapoints.
Attribution graph construction · method · 0.770
Method to trace how parameter subcomponents interact from input to output for a given next-token prediction, producing a subnetwork graph.
Activation patching · method · 0.767
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Probe-based data attribution for alignment · concept · 0.749
Path Patching · method · 0.747
Method by Goldowsky-Dill et al. (2023) for localizing model behavior via targeted activation interventions.
Gradient-based data attribution · method · 0.745
Baseline method against which probe-based ranking is compared; more computationally expensive.