method
active
method:activation-patching
Activation patching
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
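As an illustration of what "intervening on activations" means, here is a minimal sketch in plain Python. The toy two-layer network, its weights, and the helper names (`hidden`, `output`, `run_with_patch`) are hypothetical and chosen only for this example; they do not come from any library or paper referenced on this page. The sketch caches hidden activations from a clean input, copies them one unit at a time into a run on a corrupted input, and records how much each patch shifts the output.

```python
# Toy two-layer ReLU network (hypothetical weights, for illustration only).
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
W2 = [1.0, -1.0, 0.5]                       # single scalar output


def hidden(x):
    """Return the ReLU hidden activations for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Map hidden activations to the scalar output."""
    return sum(w * hi for w, hi in zip(W2, h))


def run_with_patch(x, patch=None):
    """Run the network on x, overwriting hidden units listed in `patch`.

    `patch` maps a hidden-unit index to the activation value to force.
    """
    h = hidden(x)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return output(h)


clean_input = [2.0, 0.0]
corrupted_input = [0.0, 2.0]

clean_h = hidden(clean_input)                    # cached clean activations
clean_out = run_with_patch(clean_input)          # 3.0
corrupted_out = run_with_patch(corrupted_input)  # -1.0

# Patch each hidden unit from the clean run into the corrupted run and
# measure how far the output moves relative to the unpatched corrupted run.
effects = {}
for i, clean_value in enumerate(clean_h):
    patched_out = run_with_patch(corrupted_input, {i: clean_value})
    effects[i] = patched_out - corrupted_out

print(effects)  # {0: 2.0, 1: 2.0, 2: 0.0}
```

In this toy setup, units 0 and 1 each shift the output by 2.0 when patched, while unit 2 has no effect because its activation is identical on both inputs. On real models the same loop is usually run over layers, positions, or attention heads through framework hooks rather than over individual units.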
Neighborhood — ranked by edge-count
Papers (3)
paper
Frameworks (1)
framework
- Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
Concepts (1)
concept
- Causal Tracing · extends · Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.
Methods (1)
method
- Interchange Intervention · associated_with · Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
Claims (1)
claim
- Core proposition of the paper: a substrate-level critique of existing interpretability methods.
Artifacts (1)
artifact
- pyvene open-source Python library · implements · The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Methods that intentionally introduce divergent representations to test sufficiency and completeness of circuits
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Method by Goldowsky-Dill et al. 2023 for localizing model behavior via targeted activation interventions
- Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
- A set property meaning all coordinate patches of its elements remain within the set; proved equivalent to axis-aligned hyperrectangles
- Technique of reading out model beliefs from internal activations before the final answer token is generated
- Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
- Used to localize causally implicated hidden states by swapping activations between true and false inputs