pyvene open-source Python library

The main artifact introduced in the paper: an open-source Python library, distributed on PyPI, for customizable interventions on PyTorch models.

Neighborhood (ranked by edge count)

paper

method

Distributed Alignment Search
implements
Method that finds alignments between high-level causal variables and distributed neural representations via gradient descent.
Interchange Intervention
implements
Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
Activation patching
implements
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Boundless DAS
implements
A variant of DAS, introduced by Wu et al. (2023) and implemented in pyvene via BoundlessRotatedSpaceIntervention.
Interchange Intervention Training (IIT)
implements
Training technique that induces specific causal structures in neural networks by co-training with interchange interventions.
Linear Probe Training
implements
Method for fitting a linear classifier on collected activations to predict task-relevant features.
Zero Ablation
implements
Intervention type that sets activations to zero, used for interpretability analysis.
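
Two of the intervention types listed above can be illustrated on a toy model. The following is a minimal sketch in plain Python and does not use pyvene's API: the two-unit network `toy_forward`, its weights, and the `interchange` and `zero_ablate` helpers are hypothetical names invented for illustration. It shows an interchange intervention (hidden values copied from a run on a source input) and zero ablation (hidden values set to zero).

```python
# Toy two-layer network. All names and weights are illustrative assumptions,
# not part of pyvene.
W1 = [[1.0, 0.0], [0.0, 1.0]]   # hidden = W1 @ x
W2 = [1.0, -1.0]                # output = W2 . hidden


def hidden(x):
    """Compute the hidden activations for input x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W1]


def toy_forward(x, intervene=None):
    """Run the network, optionally rewriting the hidden activations."""
    h = hidden(x)
    if intervene is not None:
        h = intervene(h)  # the model is now in a counterfactual state
    return sum(w * hi for w, hi in zip(W2, h))


def interchange(source_x, units):
    """Replace the chosen hidden units with their values on source_x."""
    src = hidden(source_x)
    return lambda h: [src[i] if i in units else h[i] for i in range(len(h))]


def zero_ablate(units):
    """Set the chosen hidden units to zero."""
    return lambda h: [0.0 if i in units else h[i] for i in range(len(h))]


base, source = [3.0, 1.0], [5.0, 4.0]
clean = toy_forward(base)                              # 3 - 1 = 2.0
swapped = toy_forward(base, interchange(source, {0}))  # 5 - 1 = 4.0
ablated = toy_forward(base, zero_ablate({1}))          # 3 - 0 = 3.0
```

In a real PyTorch model the same idea would typically be applied to a chosen layer's activations during the forward pass rather than through an explicit callback argument.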

concept

Causal abstraction
implements
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of language models.
Neural Network Intervention
about
The fundamental operation of making in-place changes to model activations, placing the model in a counterfactual state.
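
The two concepts above connect as follows: a causal-abstraction hypothesis can be tested by checking whether interventions on the network produce the outputs that a high-level causal model predicts. The sketch below is a toy illustration in plain Python, not pyvene code. The functions `high_level`, `low_level`, and `agreement`, and the arithmetic task itself, are assumptions made up for this example. It measures how often an interchange intervention on one hidden unit matches the high-level counterfactual.

```python
from itertools import product


def high_level(x, s_override=None):
    """Hypothetical causal model: S = a + b, output = S * c."""
    s = x[0] + x[1] if s_override is None else s_override
    return s * x[2]


def low_hidden(x):
    """Toy network's hidden layer: unit 0 holds a + b, unit 1 holds c."""
    return [x[0] + x[1], x[2]]


def low_level(x, h_override=None):
    """Toy network, optionally overwriting one hidden unit (index, value)."""
    h = low_hidden(x)
    if h_override is not None:
        idx, val = h_override
        h[idx] = val
    return h[0] * h[1]


def agreement(unit, inputs):
    """Fraction of (base, source) pairs where swapping `unit` from the
    source run matches the high-level model with S taken from the source."""
    hits = total = 0
    for base, source in product(inputs, repeat=2):
        expected = high_level(base, s_override=source[0] + source[1])
        actual = low_level(base, h_override=(unit, low_hidden(source)[unit]))
        hits += expected == actual
        total += 1
    return hits / total


inputs = list(product([0, 1, 2], repeat=3))
acc_aligned = agreement(0, inputs)     # unit 0 really encodes S
acc_misaligned = agreement(1, inputs)  # unit 1 encodes c, not S
```

Here `acc_aligned` is 1.0 because unit 0 plays exactly the role of S, while `acc_misaligned` falls below 1.0, which is the signal that the second alignment hypothesis is wrong.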

claim