1D Distributed Interchange Intervention (1D DII)

Core intervention method used throughout CausalGym; it operates on a one-dimensional, non-basis-aligned subspace of activation space.
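A minimal sketch of the one-dimensional case, assuming the subspace is spanned by a single unit vector: the base activation keeps every component orthogonal to that vector and takes the source activation's component along it. The function name `dii_1d` and the toy vectors are hypothetical illustrations, not CausalGym's API.

```python
import numpy as np

def dii_1d(base, source, direction):
    """Swap the component of `base` along `direction` for that of `source`.

    Hypothetical helper for illustration; `direction` need not be a
    coordinate axis, which is what "non-basis-aligned" refers to.
    """
    a = direction / np.linalg.norm(direction)  # normalize to a unit vector
    # Project both activations onto the direction and move base by the gap.
    return base + (source @ a - base @ a) * a

# Toy activations (made-up values, not taken from any model).
base = np.array([1.0, 2.0, 3.0])
source = np.array([4.0, 0.0, -1.0])
direction = np.array([1.0, 1.0, 0.0])

patched = dii_1d(base, source, direction)
```

After the intervention, `patched` agrees with `source` along the chosen direction and with `base` everywhere orthogonal to it.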

Neighborhood — ranked by edge-count

framework

Linear Representation Hypothesis
implements
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
CausalGym
uses
Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym

method

Distributed Interchange Intervention
related_to
Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
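The rotate, intervene, rotate-back procedure can be sketched as follows, assuming the rotation is an orthogonal matrix and the intervened subspace is the first `k` rotated coordinates. The function name, the choice of `k`, and the random rotation are assumptions for illustration; in practice the rotation would be learned.

```python
import numpy as np

def distributed_interchange(base, source, rotation, k):
    """Interchange the first k coordinates in a rotated basis.

    Hypothetical helper: `rotation` is assumed to be an orthogonal (d, d)
    matrix, so its transpose undoes the rotation.
    """
    rotated_base = rotation @ base        # rotate the base activation
    rotated_source = rotation @ source    # rotate the source activation
    rotated_base[:k] = rotated_source[:k] # intervene in the rotated subspace
    return rotation.T @ rotated_base      # rotate back to the original basis

rng = np.random.default_rng(0)
# Stand-in for a learned rotation: an orthogonal matrix from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
base = rng.normal(size=4)
source = rng.normal(size=4)

out = distributed_interchange(base, source, q, k=2)
```

With the identity matrix as the rotation this reduces to a standard interchange intervention on the first `k` neurons; with `k=1` it is the one-dimensional variant.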

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Interchange Intervention · method · 0.763
Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
Electronic Data Interchange (EDI) · concept · 0.727
Existing approach to standardized electronic communication between organizations; Elephant aimed at improving beyond fixed formats like X12.
Given the linear representation hypothesis and binary linguistic features, 1D DII is sufficiently expressive for controlling model behaviour in CausalGym · claim · 0.716
Theoretical justification for the methodological choice of 1D DII throughout the benchmark
Interchange Intervention Training (IIT) · method · 0.705
Training technique that induces specific causal structures in neural networks by co-training with interchange interventions
Parallel Intervention · concept · 0.699
Intervention mode where multiple interventions are applied simultaneously to the same base computation graph
Interchange Intervention Accuracy · method · 0.699
Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
Interchange Intervention Training Objective · method · 0.695
Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
Distributed/distributable and dynamic self · concept · 0.693
Concept of self as extended and co-constituted by interactions, per Mahāyāna.