Activation patching

Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
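
As a minimal sketch of the intervention described above: run a model on a source input, cache a hidden activation, then rerun on a base input with that activation overwritten. The toy two-layer network, its weights, and the names `forward` and `patch` are illustrative assumptions; none of them come from the VPD paper or from any library.

```python
# Toy activation patching in plain Python.
# All names and weights are hypothetical, chosen only for illustration.

W1 = [[1.0, 0.0], [0.0, 1.0]]   # hidden layer weights (2 inputs -> 2 hidden units)
W2 = [1.0, -1.0]                # output weights (2 hidden units -> 1 scalar)


def relu(x):
    return x if x > 0.0 else 0.0


def forward(x, patch=None):
    """Run the toy network; `patch` maps hidden-unit index -> value to force."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value  # overwrite the activation (the "patch")
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


base_input = [1.0, 2.0]
source_input = [3.0, 0.5]

base_out, _ = forward(base_input)
source_out, source_hidden = forward(source_input)

# Patch hidden unit 0 of the base run with its value from the source run.
patched_out, _ = forward(base_input, patch={0: source_hidden[0]})

# Effect of the patch: how far the output moved away from the base run.
effect = patched_out - base_out
```

A large `effect` suggests that the patched unit carries information distinguishing the source input from the base input for this output. On a real model the same pattern would typically be applied to layers, attention heads, or residual stream positions rather than single toy units.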

Neighborhood — ranked by edge-count

paper

framework

Distributed Alignment Search (DAS)
extends
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent

concept

Causal Tracing
extends
Mechanistic interpretability technique for locating factual associations, mentioned as a direction for future work.

method

Interchange Intervention
associated_with
Fundamental operation for causal abstraction analysis; forces neurons to take the values they have on source inputs, creating counterfactuals.

claim

VPD decomposes parameters, not activations, flipping the standard SAE / activation-patching paradigm.
cites
Core proposition of the paper: a substrate-level critique of existing interpretability methods.

artifact

pyvene open-source Python library
implements
The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Noising/Denoising Activation Patching · method · 0.836
Methods that intentionally introduce divergent representations to test sufficiency and completeness of circuits
Activations · concept · 0.821
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Path Patching · method · 0.805
Method by Goldowsky-Dill et al. 2023 for localizing model behavior via targeted activation interventions
Activation Addition · method · 0.801
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Patch-Closure · concept · 0.797
A set property meaning all coordinate patches of its elements remain within the set; proved equivalent to axis-aligned hyperrectangles
Activation Probing · concept · 0.794
Technique of reading out model beliefs from internal activations before the final answer token is generated
Activation Capping · method · 0.786
Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
Residual Stream Activation Patching · method · 0.778
Used to localize causally implicated hidden states by swapping activations between true and false inputs
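
Two of the entries above, Activation Addition and Activation Capping, describe simple vector operations on an activation, and a short sketch may make the difference between them concrete. This is an illustration under stated assumptions, not the implementation from either source: the function names `add_direction` and `cap_below` are hypothetical, the direction is hard-coded rather than learned, and the threshold is passed in directly rather than computed as a 25th percentile.

```python
import math

# Hypothetical helpers for illustration only; not taken from any library.


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def add_direction(activation, direction, scale):
    """Activation addition: shift the activation along a steering direction."""
    return [a + scale * d for a, d in zip(activation, direction)]


def cap_below(activation, axis, minimum):
    """Activation capping: keep the projection onto `axis` at or above `minimum`."""
    u = unit(axis)
    projection = dot(activation, u)
    if projection >= minimum:
        return list(activation)       # already above the floor, leave unchanged
    shift = minimum - projection      # amount needed to reach the floor
    return [a + shift * ui for a, ui in zip(activation, u)]


activation = [1.0, 2.0]
axis = [1.0, 0.0]                     # stand-in for an axis such as the Assistant Axis

steered = add_direction(activation, [0.0, 1.0], 0.5)
capped = cap_below(activation, axis, 3.0)
```

Addition always moves the activation, whereas capping only acts when the projection falls below the floor, so an activation that already satisfies the threshold passes through unchanged.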