Distributed Interchange Intervention

Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
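
The rotate, intervene, rotate-back recipe can be illustrated with a minimal sketch in plain Python. It assumes the rotation is an orthogonal matrix whose rows are the new basis directions, so its inverse is its transpose. The helper names (`matvec`, `transpose`, `distributed_interchange`) and the toy 2D vectors are hypothetical and chosen only for illustration; they are not taken from any particular library or from the paper's code.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def distributed_interchange(base, source, rotation, dims):
    """Sketch of a distributed interchange intervention.

    Assumes `rotation` is orthogonal, so its transpose undoes it.
    `dims` is the set of rotated coordinates copied from source to base.
    """
    rotated_base = matvec(rotation, base)        # rotate the base representation
    rotated_source = matvec(rotation, source)    # rotate the source representation
    mixed = [
        rotated_source[i] if i in dims else rotated_base[i]
        for i in range(len(rotated_base))
    ]                                            # intervene in the rotated subspace
    return matvec(transpose(rotation), mixed)    # rotate back to the original basis

# Toy example: a 45-degree rotation of a 2D representation space.
theta = math.pi / 4
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

base = [1.0, 0.0]
source = [0.0, 2.0]
out = distributed_interchange(base, source, R, dims={0})
```

With `dims` empty the function returns the base vector unchanged, and with every coordinate in `dims` it returns the source vector, which matches the two limiting cases one would expect from this kind of intervention.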

Neighborhood — ranked by edge-count

paper

concept

Orthogonal Decomposition of Representation Space
associated_with
Mathematical structure central to distributed interchange interventions; representation space decomposed into orthogonal subspaces each aligned with a high-level variable.

method

Interchange Intervention
extends · related_to
Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
1D Distributed Interchange Intervention (1D DII)
related_to
Core intervention method used throughout CausalGym; operates on a one-dimensional, non-basis-aligned subspace of activation space.
Distributed Alignment Search
uses
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
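
As a rough illustration of the one-dimensional case listed above, the sketch below replaces the base activation's component along a single direction with the source's component and leaves everything orthogonal to that direction untouched. It assumes the direction is normalized to unit length before use; the function name `one_d_dii` and the example vectors are hypothetical, and this is not presented as CausalGym's actual implementation.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def one_d_dii(base, source, direction):
    """Sketch of a 1D distributed interchange intervention.

    Computes base + (source . v - base . v) * v for the unit vector v
    obtained by normalizing `direction` (an assumption of this sketch).
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    # Difference between source and base coordinates along the direction.
    delta = dot(source, unit) - dot(base, unit)
    return [b + delta * u for b, u in zip(base, unit)]

# The direction mixes the first two neurons, so it is not basis-aligned.
base = [1.0, 2.0, 3.0]
source = [4.0, 0.0, -1.0]
direction = [1.0, 1.0, 0.0]
patched = one_d_dii(base, source, direction)
```

After the intervention, `patched` has the same coordinate as `source` along the chosen direction, while its components orthogonal to that direction still equal those of `base`.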

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Interchange Intervention Accuracy · method · 0.800
Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
Interchange Intervention Training Objective · method · 0.796
Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
Vanilla interchange intervention · method · 0.779
Full n-dimensional activation replacement; the most expressive intervention tested, used as an upper bound in the appendix.
Parallel Intervention · concept · 0.768
Intervention mode where multiple interventions are applied simultaneously to the same base computation graph.
Interchange Intervention Training (IIT) · method · 0.765
Training technique that induces specific causal structures in neural networks by co-training with interchange interventions.
Interchange Intervention Accuracy (IIA) · concept · 0.762
Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior.
Intervention Propagation · concept · 0.759
Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control.
Subspace Intervention · concept · 0.753
Intervention targeting specific dimensional subsets of activation vectors rather than full representations.