concept
active
concept:causal-abstraction
Causal abstraction
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Neighborhood — ranked by edge-count
Papers (2)
paper
Thinkers (1)
thinker
- Atticus Geiger · introduces, studies
Frameworks (2)
framework
- Distributed Alignment Search (DAS) · implements · Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
- Modified CL Loss · uses · Novel variant of CL loss introduced in this paper targeting only causal subspace dimensions to improve OOD performance
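The "distributed" part of the DAS entry above can be illustrated with a minimal sketch. It assumes a two-dimensional hidden vector and a fixed rotation angle; in DAS the rotation is learned by gradient descent, which this sketch deliberately leaves out. The names `rotate` and `distributed_interchange` are hypothetical and do not come from any library.

```python
import math
from typing import List, Sequence, Tuple


def rotate(vec: Sequence[float], theta: float) -> List[float]:
    """Apply the 2-D rotation R(theta) to vec."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]


def distributed_interchange(
    base_hidden: Sequence[float],
    source_hidden: Sequence[float],
    theta: float,
    dims: Tuple[int, ...] = (0,),
) -> List[float]:
    """Swap coordinates in a rotated basis instead of swapping raw neurons.

    Assumption: theta is fixed here for illustration only; a real search
    would optimise the rotation rather than take it as given.
    """
    # Move both hidden vectors into the rotated basis.
    base_rot = rotate(base_hidden, theta)
    source_rot = rotate(source_hidden, theta)
    # Overwrite the chosen rotated coordinates with the source's values.
    for d in dims:
        base_rot[d] = source_rot[d]
    # Rotate back; the inverse of R(theta) is R(-theta).
    return rotate(base_rot, -theta)


base_hidden = [1.0, 2.0]
source_hidden = [10.0, 20.0]
axis_aligned = distributed_interchange(base_hidden, source_hidden, theta=0.0)
quarter_turn = distributed_interchange(base_hidden, source_hidden, theta=math.pi / 2)
```

With `theta=0.0` the intervention reduces to an ordinary swap of neuron 0. With a quarter turn, the same rotated coordinate corresponds to neuron 1, so the choice of basis decides which part of the representation is exchanged.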
Methods (5)
method
- Distributed Alignment Search · implements · The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
- Interchange Intervention · implements · Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
- Interchange Intervention Training (IIT) · implements · Training technique that induces specific causal structures in neural networks by co-training with interchange interventions
- Brute-Force Alignment Search · implements · Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
- Causal abstraction analysis · implements · The formal method used to establish that the identified circuit causally mediates the model's cyclic reasoning behavior
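The interchange intervention described in the list above can be sketched on a toy model. The "network" below is a hand-written function with two hidden units, not a trained model, and the names `low_level_forward` and `interchange_intervention` are hypothetical; the sketch only shows the mechanics of running a base input while forcing selected units to the values they take on a source input.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def low_level_forward(
    x: Sequence[float], overrides: Optional[Dict[int, float]] = None
) -> Tuple[List[float], float]:
    """Toy 'network': hidden = [x0 + x1, x2]; output = sum(hidden).

    `overrides` maps a hidden-unit index to a value that replaces the
    unit's computed activation before the output is produced.
    """
    hidden = [x[0] + x[1], x[2]]
    for idx, value in (overrides or {}).items():
        hidden[idx] = value
    return hidden, sum(hidden)


def interchange_intervention(
    base: Sequence[float], source: Sequence[float], units: Sequence[int]
) -> float:
    """Run `base`, but force `units` to the values they take on `source`."""
    source_hidden, _ = low_level_forward(source)
    overrides = {i: source_hidden[i] for i in units}
    _, counterfactual_output = low_level_forward(base, overrides)
    return counterfactual_output


base = (1, 2, 3)
source = (10, 20, 30)
# Unit 0 takes the source value 10 + 20 = 30; unit 1 keeps the base value 3.
result = interchange_intervention(base, source, units=[0])
```

In a real network the override would typically be applied through a hook on an intermediate layer rather than an explicit argument; that part is omitted here.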
Concepts (5)
concept
- Constructive Causal Abstraction · implements, related_to · Formal definition: H is a constructive abstraction of L under alignment Π when interchange interventions have equivalent effects at both levels.
- Approximate Causal Abstraction · related_to · Graded notion of causal abstraction measured by IIA; when IIA is alpha < 100%, the model is alpha-on-average approximately abstract.
- Causal Abstraction over Cyclic Tasks · related_to
- Neural Network Interpretability · associated_with · The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
- The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
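As a rough illustration of how the entries above fit together, the sketch below computes interchange intervention accuracy (IIA) for a toy high-level model and a toy low-level model, then picks the better of two candidate localist alignments by exhaustive comparison. Everything here is an assumption made for illustration: the models, the input grid, and the names `high_level`, `low_level`, and `iia` are hypothetical, and IIA is taken to be the fraction of (base, source) pairs on which the two levels agree after the aligned intervention.

```python
from itertools import product


def high_level(x, s_override=None):
    """High-level causal model: S = x0 + x1, output = S + x2."""
    s = x[0] + x[1] if s_override is None else s_override
    return s + x[2]


def low_level(x, overrides=None):
    """Toy low-level model: hidden = [x0 + x1, x2]; output = sum(hidden)."""
    hidden = [x[0] + x[1], x[2]]
    for idx, value in (overrides or {}).items():
        hidden[idx] = value
    return hidden, sum(hidden)


def iia(unit, inputs):
    """Fraction of (base, source) pairs where both levels agree when the
    high-level variable S is aligned with hidden unit `unit`."""
    hits = total = 0
    for base, source in product(inputs, repeat=2):
        # High-level counterfactual: S takes its value from the source input.
        expected = high_level(base, s_override=source[0] + source[1])
        # Low-level counterfactual: the aligned unit takes its source value.
        source_hidden, _ = low_level(source)
        _, actual = low_level(base, {unit: source_hidden[unit]})
        hits += int(actual == expected)
        total += 1
    return hits / total


inputs = list(product(range(2), repeat=3))
# Exhaustive comparison over the two candidate single-unit alignments.
scores = {unit: iia(unit, inputs) for unit in (0, 1)}
best_unit = max(scores, key=scores.get)
```

Aligning S with unit 0 reproduces the high-level counterfactual on every pair, which corresponds to the exact case in the definition above. Aligning it with unit 1 agrees only on some pairs, which is the graded, approximate case.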
Artifacts (1)
artifact
- pyvene open-source Python library · implements · The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core concept: degree to which an agent exerts unique predictive power on its future; key to cognition at all scales.
- The paper endorses Geiger et al. 2023's claim that disparate interpretability methods are instances of causal abstraction.
- The ability of an agent to be a driver of subsequent events; a hallmark of cognition that causal emergence quantifies.
- Function determining the value of a variable based on its causal parents in an acyclic causal model.
- Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.
- Whether an internal direction causally controls a target behavior, verified by intervention success
- Confound where naming injected concepts reflects direct logit effects rather than metacognitive awareness, raised by Morris & Plunkett
- Type of abstraction map where node information is computed from non-overlapping neuron sets