Interchange Intervention

Fundamental operation in causal abstraction analysis: selected neurons are forced to take the values they have on a source input while the model runs on a base input, producing a counterfactual output.
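A minimal sketch of the operation, assuming a hypothetical two-neuron toy model (the functions `hidden`, `readout`, and `interchange_intervention` are illustrative names, not taken from any library or paper):

```python
# Toy model (hypothetical): two hidden neurons feed a multiplicative readout.
def hidden(x):
    # h[0] = x[0] + x[1], h[1] = x[2]
    return [x[0] + x[1], x[2]]

def readout(h):
    return h[0] * h[1]

def run(x):
    return readout(hidden(x))

def interchange_intervention(base, source, neurons):
    """Run the model on `base`, but force the listed hidden neurons
    to take the values they have on `source`."""
    h = hidden(base)
    h_source = hidden(source)
    for i in neurons:
        h[i] = h_source[i]
    return readout(h)

base = [1, 2, 3]     # run(base)   == (1 + 2) * 3 == 9
source = [4, 5, 6]   # run(source) == (4 + 5) * 6 == 54

# Patch only neuron 0: the sum comes from source, the multiplier from base.
counterfactual = interchange_intervention(base, source, neurons=[0])
print(counterfactual)  # (4 + 5) * 3 == 27
```

Patching no neurons reproduces the base output, and patching every hidden neuron reproduces the source output; the informative cases lie in between.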

Neighborhood — ranked by edge-count

paper

framework

Model Alignment Search (MAS)
uses
The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
Distributed Alignment Search (DAS)
uses
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent

concept

Causal abstraction
implements
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Counterfactual Behavior
implements
The behavior that would have occurred had the value of a causal variable been different while everything else remained the same; used as training labels in DAS/MAS.
Counterfactual
associated_with
The output value a model produces when an interchange intervention forces certain variables to take values from source inputs.
Distributed representation
associated_with
Idea that information is spread across many neurons; superposition is a subtype.

method

Distributed Interchange Intervention
extends, related_to
Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
Interchange Intervention Accuracy
related_to
Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
Activation patching
associated_with
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
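The rotate, intervene, rotate-back recipe described under Distributed Interchange Intervention can be sketched as follows. This is an illustration under stated assumptions: the rotation is a fixed, hand-picked 2x2 matrix (DAS and MAS would learn it), and all helper names are hypothetical.

```python
import math

def matvec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rotation(theta):
    # 2x2 orthogonal rotation matrix; its inverse is its transpose.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def distributed_interchange(h_base, h_source, rot, dims):
    """Rotate both representations, copy the chosen rotated coordinates
    from source into base, then rotate back to the neuron basis."""
    z_base = matvec(rot, h_base)
    z_source = matvec(rot, h_source)
    for i in dims:
        z_base[i] = z_source[i]
    return matvec(transpose(rot), z_base)

# Assumption for the example: the causal variable is the sum h[0] + h[1],
# which is not aligned with either neuron. Rotating by -45 degrees puts
# (h[0] + h[1]) / sqrt(2) in coordinate 0 of the rotated basis.
rot = rotation(-math.pi / 4)
h_base = [1.0, 3.0]    # sum 4, difference h[1] - h[0] = 2
h_source = [5.0, 1.0]  # sum 6, difference h[1] - h[0] = -4

patched = distributed_interchange(h_base, h_source, rot, dims=[0])
print(patched)  # approximately [2.0, 4.0]: source's sum, base's difference
```

With the identity rotation (`rotation(0.0)`) the same function reduces to a standard interchange intervention on individual neurons.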

artifact

pyvene open-source Python library
implements
The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Interchange Intervention Training Objective · method · 0.849
Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
Vanilla interchange intervention · method · 0.843
Full n-dimensional activation replacement; most expressive intervention tested, used as upper bound in appendix
Interchange Intervention Training (IIT) · method · 0.795
Training technique that induces specific causal structures in neural networks by co-training with interchange interventions
Interchange Intervention Accuracy (IIA) · concept · 0.792
Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
Intervention Propagation · concept · 0.791
Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control
Interaction · concept · 0.781
Parallel Intervention · concept · 0.781
Intervention mode where multiple interventions are applied simultaneously to the same base computation graph
Serial Intervention · concept · 0.763
Intervention mode where interventions are applied sequentially, each building on the previous one
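
Interchange Intervention Accuracy, listed above, can be illustrated with a small sketch. It assumes a hypothetical high-level model (S = a + b, output = S * c) and a toy low-level model whose neuron 0 is meant to align with S; none of these names come from a real library.

```python
from itertools import product

# Low-level toy model (hypothetical): two hidden neurons, multiplicative readout.
def low_hidden(x):
    return [x[0] + x[1], x[2]]

def low_readout(h):
    return h[0] * h[1]

def low_interchange(base, source, neuron):
    # Run on base, with one hidden neuron forced to its value on source.
    h = low_hidden(base)
    h[neuron] = low_hidden(source)[neuron]
    return low_readout(h)

def high_interchange(base, source):
    # High-level model: S = a + b, output = S * c.
    # Intervening on S takes the sum from source and c from base.
    return (source[0] + source[1]) * base[2]

def iia(pairs, neuron):
    """Fraction of (base, source) pairs on which the low-level intervention
    yields the same output as the aligned high-level intervention."""
    hits = sum(
        low_interchange(b, s, neuron) == high_interchange(b, s)
        for b, s in pairs
    )
    return hits / len(pairs)

inputs = list(product(range(3), repeat=3))
pairs = list(product(inputs, inputs))

aligned = iia(pairs, neuron=0)     # neuron 0 computes a + b, matching S
misaligned = iia(pairs, neuron=1)  # neuron 1 carries c, not S
print(aligned, misaligned)
```

An accuracy of 1.0 for the aligned neuron indicates that, on these inputs, the high-level variable is a faithful abstraction of that neuron; the misaligned choice scores lower, which is what makes the measure graded.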