method
active
method:interchange-intervention-training-iit

Interchange Intervention Training (IIT)

Training technique that induces a specified causal structure in a neural network by training it on interchange interventions alongside its standard task objective.
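The idea can be illustrated with a minimal NumPy sketch of a single interchange intervention and the loss it would contribute. Everything named here is an assumption made for illustration: the toy two-layer network, the high-level model (`high_level_var`, `high_level_out`), and the choice of `aligned_units`. The sketch only computes the loss for one base/source pair; actual training would backpropagate this loss with an autodiff framework, typically together with the ordinary task loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 4 inputs -> 6 hidden units -> 2 output classes.
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 2))

def hidden(x):
    """Hidden-layer activations for a single input vector."""
    return np.tanh(x @ W1)

def readout(h):
    """Output logits computed from hidden activations."""
    return h @ W2

def interchange_forward(base, source, units):
    """Run the network on `base`, but overwrite the chosen hidden
    units with the values they take when the input is `source`."""
    h = hidden(base).copy()
    h[units] = hidden(source)[units]
    return readout(h)

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Assumed high-level causal model: an intermediate variable V computed
# from the first two inputs, combined with the last two inputs by XOR.
def high_level_var(x):
    return int(x[0] + x[1] > 0)

def high_level_out(v, x):
    return v ^ int(x[2] + x[3] > 0)

base = rng.normal(size=4)
source = rng.normal(size=4)
aligned_units = [0, 1, 2]  # assumed alignment: these units should encode V

# Counterfactual label: the high-level output on `base` when V is
# replaced by the value it takes on `source`.
cf_label = high_level_out(high_level_var(source), base)

# The network under the aligned intervention is scored against that label.
logits = interchange_forward(base, source, aligned_units)
iit_loss = cross_entropy(logits, cf_label)
```

Minimizing `iit_loss` over many base/source pairs pushes the aligned hidden units to play the causal role of the high-level variable, which is the sense in which a causal structure is induced.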

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
  • Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
  • Metric measuring how accurately a DNN under intervention matches algorithm-predicted outputs on a held-out test set.
  • Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
  • Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
  • Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
  • IIT 3.0 · framework · 0.743
    Version 3.0 of Integrated Information Theory (IIT), used to compute Φmax and Conceptual Information (CI) from LLM representation networks.
  • IIT 4.0 · framework · 0.736
    Version 4.0 of Integrated Information Theory (IIT), used to compute Φ and Φ-structure from LLM representation networks; the latest iteration at the time of the study.