Causal abstraction

A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs

Neighborhood — ranked by edge-count

paper

thinker

framework

Distributed Alignment Search (DAS)
implements
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
Modified CL Loss
uses
Novel variant of CL loss introduced in this paper targeting only causal subspace dimensions to improve OOD performance

method

Distributed Alignment Search
implements
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
Interchange Intervention
implements
Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
Interchange Intervention Training (IIT)
implements
Training technique that induces specific causal structures in neural networks by co-training with interchange interventions
Brute-Force Alignment Search
implements
Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
Causal abstraction analysis
implements
The formal method used to establish that the identified circuit causally mediates the model's cyclic reasoning behavior
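
Illustrative sketch (not taken from any of the papers listed here): the interchange intervention and brute-force alignment search entries above can be made concrete on a toy problem. The two-unit "network", the high-level variable S = a + b, and every function name below are hypothetical, chosen only to show the mechanics. The score computed is the fraction of base/source pairs on which the patched low-level output matches the high-level counterfactual.

```python
def low_level(x, patch=None):
    """Toy 'network': hidden = [a + b, c], output = sum(hidden).
    `patch` optionally overwrites hidden units, as {index: value}."""
    a, b, c = x
    hidden = [a + b, c]
    for i, v in (patch or {}).items():
        hidden[i] = v
    return sum(hidden), hidden

def high_level(base, source):
    """High-level counterfactual: S = a + b comes from source, c from base."""
    return source[0] + source[1] + base[2]

def interchange_low(base, source, units):
    """Run source, copy the chosen hidden units into the base run."""
    _, src_hidden = low_level(source)
    out, _ = low_level(base, patch={i: src_hidden[i] for i in units})
    return out

def accuracy(pairs, units):
    """Fraction of pairs where low-level and high-level counterfactuals agree."""
    hits = sum(interchange_low(b, s, units) == high_level(b, s) for b, s in pairs)
    return hits / len(pairs)

pairs = [((1, 2, 3), (4, 5, 6)), ((0, 0, 1), (2, 2, 2)), ((7, 1, 0), (1, 1, 9))]

# Brute-force search over the (tiny) space of localist alignments for S.
candidates = [[0], [1]]
scores = {tuple(u): accuracy(pairs, u) for u in candidates}
best_units = max(candidates, key=lambda u: accuracy(pairs, u))
```

Here unit 0 is the only candidate that reproduces the high-level counterfactual on every pair, so the exhaustive search selects it. In a real network the candidate space grows combinatorially, which is the limitation that gradient-based search is meant to address.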

concept

Constructive Causal Abstraction
implements · related_to
Formal definition: H is a constructive abstraction of L under alignment Π when interchange interventions have equivalent effects at both levels.
Approximate Causal Abstraction
related_to
Graded notion of causal abstraction measured by interchange intervention accuracy (IIA); when IIA is alpha < 100%, the model is alpha-on-average approximately abstract.
Causal Abstraction over Cyclic Tasks
related_to
Neural Network Interpretability
associated_with
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
Alignment Map (ϕ)
uses
The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
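
Illustrative sketch (assumptions throughout): the entries above describe alignments and a graded accuracy score, and a small example can show why an alignment may need to be distributed rather than neuron-by-neuron. In this hypothetical toy, the variable S = a + b is stored in a rotated basis of a two-unit hidden layer, so no single unit carries it. The search below uses a plain grid over rotation angles in place of the gradient descent that Distributed Alignment Search uses; `THETA_TRUE`, the toy encoder, and all function names are invented for illustration.

```python
import math

THETA_TRUE = math.pi / 6  # hypothetical rotation hiding the variable S = a + b

def rotate(theta, v):
    """Apply the 2-D rotation R(theta) to vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def encode(x):
    """Toy hidden layer: [a + b, c] stored in a rotated basis."""
    a, b, c = x
    return rotate(THETA_TRUE, [a + b, c])

def decode(hidden):
    """Toy readout: undo the rotation and sum, so decode(encode(x)) = a + b + c."""
    return sum(rotate(-THETA_TRUE, hidden))

def interchange_rotated(base, source, theta):
    """Swap coordinate 0 of the theta-rotated basis from source into base."""
    z_base = rotate(-theta, encode(base))
    z_src = rotate(-theta, encode(source))
    z_base[0] = z_src[0]
    return decode(rotate(theta, z_base))

def high_level(base, source):
    """High-level counterfactual: S from source, c from base."""
    return source[0] + source[1] + base[2]

def iia(pairs, theta, tol=1e-6):
    """Interchange intervention accuracy for the candidate angle theta."""
    hits = sum(abs(interchange_rotated(b, s, theta) - high_level(b, s)) < tol
               for b, s in pairs)
    return hits / len(pairs)

pairs = [((1, 2, 3), (4, 5, 6)), ((0, 0, 1), (2, 2, 2)), ((7, 1, 0), (1, 1, 9))]

# Grid search stands in for gradient descent (an assumption of this sketch).
grid = [k * math.pi / 36 for k in range(36)]
best_theta = max(grid, key=lambda t: iia(pairs, t))
```

With `theta = 0` the intervention acts on a single raw unit and fails on these pairs, while the angle matching the hidden rotation recovers the high-level counterfactual on all of them. A learned orthogonal matrix plays the role of `theta` in higher dimensions.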

artifact

pyvene open-source Python library
implements
The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Causal Emergence · concept · 0.831
Core concept: degree to which an agent exerts unique predictive power on its future; key to cognition at all scales.
Causal abstraction theory is a unified framework that subsumes diverse intervention-based interpretability methods including LIME, causal mediation analysis, INLP, and circuit explanations · claim · 0.818
The paper endorses Geiger et al. 2023's claim that disparate interpretability methods are instances of causal abstraction.
Causal power · concept · 0.817
The ability of an agent to be a driver of subsequent events; a hallmark of cognition that causal emergence quantifies.
Causal Mechanism · concept · 0.813
Function determining the value of a variable based on its causal parents in an acyclic causal model.
Causal Tracing · concept · 0.812
Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.
Causal Mediation · concept · 0.809
Whether an internal direction causally controls a target behavior, verified by intervention success
causal bypassing · concept · 0.806
Confound where naming injected concepts reflects direct logit effects rather than metacognitive awareness, raised by Morris & Plunkett
Constructive Abstraction · concept · 0.803
Type of abstraction map where node information is computed from non-overlapping neuron sets