method
active
method:interchange-intervention
Interchange Intervention
The fundamental operation of causal abstraction analysis: while the model runs on a base input, it forces selected neurons to take the values they have on one or more source inputs, producing a counterfactual output.
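The operation described above can be sketched on a toy two-unit network. This is a minimal illustration only: the functions `hidden`, `readout`, `run`, and `interchange_intervention` are hypothetical names invented for this example and do not come from any library or from the paper.

```python
def hidden(x):
    # Toy hidden layer (an assumption for illustration): two units from a 2-element input.
    a, b = x
    return [a + b, a - b]


def readout(h):
    # Toy output layer: a fixed linear readout of the hidden units.
    return 2 * h[0] + h[1]


def run(x, patch=None):
    # Ordinary forward pass; `patch` optionally overwrites hidden units by index.
    h = hidden(x)
    if patch is not None:
        for index, value in patch.items():
            h[index] = value
    return readout(h)


def interchange_intervention(base, source, units):
    # Run on `base`, but force the chosen hidden units to the values
    # they take when the same model runs on `source`.
    source_h = hidden(source)
    return run(base, patch={i: source_h[i] for i in units})


base, source = (1, 2), (5, 0)
plain = run(base)
counterfactual = interchange_intervention(base, source, units=[0])
print(plain, counterfactual)
```

Intervening on every hidden unit at once reproduces the output of the source run, which is a quick sanity check for any implementation of this kind.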
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (2)
framework
- The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
- Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
Concepts (4)
concept
- Causal abstraction · implements · A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
- Counterfactual Behavior · implements · The behavior that would have occurred had the value of a causal variable been different while everything else remained the same; used as training labels in DAS/MAS.
- Counterfactual · associated_with · The output value a model produces when an interchange intervention forces certain variables to take values from source inputs.
- Distributed representation · associated_with · Idea that information is spread across many neurons; superposition is a subtype.
Methods (3)
method
- Distributed Interchange Intervention · extends, related_to · Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
- Interchange Intervention Accuracy · related_to · Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
- Activation patching · associated_with · Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
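The rotate, intervene, rotate-back recipe and the accuracy measure listed above can be sketched in a few lines. This is a hedged illustration under simplifying assumptions: the hidden state is two-dimensional, the rotation is a fixed angle `theta` rather than a learned matrix, and all function names are hypothetical.

```python
import math


def rotate(h, theta):
    # Apply a 2-D rotation by angle theta to a two-element hidden vector.
    c, s = math.cos(theta), math.sin(theta)
    return [c * h[0] - s * h[1], s * h[0] + c * h[1]]


def distributed_interchange(base_h, source_h, theta, dims):
    # 1) rotate both hidden vectors into the new basis,
    # 2) copy the chosen rotated coordinates from source into base,
    # 3) rotate back (the inverse of a rotation by theta is a rotation by -theta).
    rotated_base = rotate(base_h, theta)
    rotated_source = rotate(source_h, theta)
    for d in dims:
        rotated_base[d] = rotated_source[d]
    return rotate(rotated_base, -theta)


def interchange_intervention_accuracy(low_level_outputs, high_level_outputs):
    # Fraction of aligned interventions whose low-level output equals
    # the output predicted by the high-level model.
    pairs = list(zip(low_level_outputs, high_level_outputs))
    return sum(1 for low, high in pairs if low == high) / len(pairs)


base_h, source_h = [1.0, 2.0], [3.0, 4.0]
identity_basis = distributed_interchange(base_h, source_h, 0.0, dims=[0])
rotated_basis = distributed_interchange(base_h, source_h, math.pi / 2, dims=[0])
iia = interchange_intervention_accuracy([1, 0, 1, 1], [1, 1, 1, 1])
print(identity_basis, rotated_basis, iia)
```

With `theta = 0` the sketch reduces to an ordinary interchange intervention on unit 0. With `theta = pi / 2` the first rotated coordinate lines up with the second original unit, so that unit is swapped instead.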
Artifacts (1)
artifact
- pyvene open-source Python library · implements · The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
- Full n-dimensional activation replacement; most expressive intervention tested, used as upper bound in appendix
- Training technique that induces specific causal structures in neural networks by co-training with interchange interventions
- Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
- Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control
- Intervention mode where multiple interventions are applied simultaneously to the same base computation graph
- Intervention mode where interventions are applied sequentially, each building on the previous one