method
active
method:interchange-intervention

Interchange Intervention

Fundamental operation of causal abstraction analysis: it forces selected neurons to take the values they have on a source input while the rest of the computation runs on the base input, producing a counterfactual output.
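A minimal sketch of the operation, assuming a hypothetical two-neuron toy model written in plain Python. The names `hidden`, `readout`, `run`, and `interchange_intervention` are illustrative only and do not come from any library or from the source papers.

```python
from typing import Dict, List, Optional


def hidden(x: List[float]) -> List[float]:
    # Toy hidden layer (assumed for illustration): two neurons.
    return [x[0] + x[1], x[2]]


def readout(h: List[float]) -> float:
    # Toy output computed from the hidden neurons.
    return h[0] * h[1]


def run(x: List[float], patch: Optional[Dict[int, float]] = None) -> float:
    # Forward pass; `patch` overwrites chosen hidden neurons before the readout.
    h = hidden(x)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return readout(h)


def interchange_intervention(
    base: List[float], source: List[float], neurons: List[int]
) -> float:
    # Record the source run's hidden values, then rerun the base input
    # with the selected neurons forced to those recorded values.
    source_h = hidden(source)
    patch = {i: source_h[i] for i in neurons}
    return run(base, patch)


base = [1.0, 2.0, 3.0]    # hidden = [3, 3], output 9
source = [4.0, 5.0, 6.0]  # hidden = [9, 6], output 54

# Neuron 0 takes its source value (9); neuron 1 keeps its base value (3).
counterfactual = interchange_intervention(base, source, neurons=[0])
print(counterfactual)  # 27.0
```

Patching every hidden neuron reproduces the source output, and patching none reproduces the base output; the informative cases are the partial ones in between.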

Neighborhood — ranked by edge-count

Frameworks (2)

framework
  • The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
  • Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent.

Concepts (4)

concept
  • A framework the paper uses alongside feature geometry to deepen mechanistic understanding of language models (LMs).
  • The behavior that would have occurred had the value of a causal variable been different while everything else remained the same; used as training labels in DAS/MAS.
  • Counterfactual
    associated_with
    The output value a model produces when an interchange intervention forces certain variables to take values from source inputs.
  • Idea that information is spread across many neurons; superposition is a subtype.

Methods (3)

method
  • Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
  • Proportion of aligned interchange interventions with equivalent high-level and low-level effects; graded measure of causal abstraction.
  • Activation patching
    associated_with
    Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.