Interchange Intervention Training (IIT)

Training technique that induces a specified causal structure in a neural network by training it on interchange interventions, so that its behavior under intervention matches that of a high-level causal model.

Neighborhood — ranked by edge-count

paper

concept

Causal abstraction
implements
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs

artifact

pyvene open-source Python library
implements
The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Interchange Intervention Training Objective · method · 0.836
Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
Interchange Intervention Accuracy (IIA) · concept · 0.834
Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
Interchange Intervention Accuracy (IIA) Metric · method · 0.811
Metric measuring how often a DNN under intervention matches the outputs predicted by the algorithm on a held-out test set.
Interchange Intervention · method · 0.795
Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
Interchange Intervention Accuracy · method · 0.768
Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
Distributed Interchange Intervention · method · 0.765
Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
IIT 3.0 · framework · 0.743
Version 3.0 of Integrated Information Theory (IIT), used to compute Φmax and Conceptual Information (CI) from LLM representation networks.
IIT 4.0 · framework · 0.736
Version 4.0 of Integrated Information Theory (IIT), used to compute Φ and Φ-structure from LLM representation networks; the latest iteration at the time of the study.
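
The interchange intervention and interchange intervention accuracy entries listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the high-level model (an equality-of-pairs task), the hand-written two-unit "network", and all function names are hypothetical, and none of it is the pyvene API. The network here is fixed by hand rather than trained; Interchange Intervention Training would instead optimize a network's weights so that this kind of agreement holds under the chosen alignment.

```python
from itertools import product

# Hypothetical high-level causal model: out = V1 AND V2,
# where V1 = (a == b) and V2 = (c == d).
def high_level_v1(x):
    return x[0] == x[1]

def high_level(x, v1_override=None):
    # Run the high-level model, optionally forcing V1 to a given value.
    v1 = high_level_v1(x) if v1_override is None else v1_override
    v2 = x[2] == x[3]
    return v1 and v2

# Hypothetical low-level "network" with two hidden units.
# By construction, h[0] encodes (a == b) and h[1] encodes (c == d).
def low_hidden(x):
    a, b, c, d = x
    return [1.0 if a == b else 0.0, 1.0 if c == d else 0.0]

def low_output(h):
    return h[0] + h[1] > 1.5

def interchange_intervention(base, source, units):
    # Run the network on `base`, but overwrite the chosen hidden units
    # with the values they take when the network is run on `source`.
    h = low_hidden(base)
    h_source = low_hidden(source)
    for i in units:
        h[i] = h_source[i]
    return low_output(h)

def interchange_intervention_accuracy(pairs, units):
    # Fraction of (base, source) pairs where the intervened network agrees
    # with the high-level model after the matching intervention on V1.
    hits = 0
    for base, source in pairs:
        low = interchange_intervention(base, source, units)
        high = high_level(base, v1_override=high_level_v1(source))
        hits += int(low == high)
    return hits / len(pairs)

inputs = list(product([0, 1], repeat=4))
pairs = [(base, source) for base in inputs for source in inputs]

iia_aligned = interchange_intervention_accuracy(pairs, units=[0])
iia_misaligned = interchange_intervention_accuracy(pairs, units=[1])
print(iia_aligned, iia_misaligned)
```

Aligning V1 with the unit that actually encodes it gives perfect agreement, while aligning V1 with the other unit does not, which is the sense in which the accuracy acts as a graded measure of causal abstraction under a given alignment.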