method
active
method:attribution-patching

Attribution patching

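Gradient-based method that estimates the effect of zeroing a feature on a specific logit difference using a first-order (linear) approximation, so no separate forward pass is needed for each ablated feature.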
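As a rough illustration of the idea, the sketch below applies the first-order estimate (patched value minus clean value, times the gradient of the logit difference) to a toy two-layer network and compares it with the exact effect of zeroing each feature. The weights `W` and `U`, the activations in `features`, and all function names are hypothetical values chosen for illustration; they do not come from any particular model or library. The gradient is written out by hand to keep the example dependency-free, whereas in practice it would come from a framework's autograd.

```python
import math

# Hypothetical toy model: features -> tanh hidden layer -> two logits.
W = [[0.5, -0.3, 0.8],
     [0.1, 0.7, -0.2]]   # hidden units x features
U = [[1.0, -0.5],
     [-0.4, 0.9]]        # logits x hidden units
features = [0.2, 0.1, 0.3]  # made-up clean feature activations


def hidden(f):
    """Hidden activations for feature vector f."""
    return [math.tanh(sum(w * x for w, x in zip(row, f))) for row in W]


def logit_diff(f):
    """Metric of interest: logit 0 minus logit 1."""
    h = hidden(f)
    logits = [sum(u * x for u, x in zip(row, h)) for row in U]
    return logits[0] - logits[1]


def grad_logit_diff(f):
    """Analytic gradient of logit_diff with respect to each feature."""
    h = hidden(f)
    # Backprop through the output layer, then through tanh.
    d_pre = [(U[0][j] - U[1][j]) * (1.0 - h[j] ** 2) for j in range(len(h))]
    return [sum(W[j][i] * d_pre[j] for j in range(len(h)))
            for i in range(len(f))]


def attribution_patching(f):
    """First-order estimate of zeroing each feature: (0 - f_i) * grad_i."""
    g = grad_logit_diff(f)
    return [(0.0 - fi) * gi for fi, gi in zip(f, g)]


def exact_ablation(f):
    """Ground truth: rerun the model once per feature with it set to zero."""
    base = logit_diff(f)
    effects = []
    for i in range(len(f)):
        patched = list(f)
        patched[i] = 0.0
        effects.append(logit_diff(patched) - base)
    return effects


approx = attribution_patching(features)
exact = exact_ablation(features)
for i, (a, e) in enumerate(zip(approx, exact)):
    print(f"feature {i}: estimate={a:+.4f} exact={e:+.4f}")
```

The estimate needs one gradient computation for all features, while the exact ablation needs one forward pass per feature. The two agree closely here because the activations are small and the toy model is nearly linear in that range; the approximation degrades as the nonlinearity between the feature and the metric grows.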

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Gradient-based technique using SAE features to estimate causal effects on completions; used to corroborate NLA findings.
  • Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality.
  • Data Attribution · concept · 0.802
    The task of attributing model behaviors to specific training datapoints.
  • Method to trace how parameter subcomponents interact from input to output for a given next-token prediction, producing a subnetwork graph.
  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Path Patching · method · 0.747
    Method by Goldowsky-Dill et al. (2023) for localizing model behavior via targeted activation interventions.
  • Baseline method against which probe-based ranking is compared; more computationally expensive.