Distributed Alignment Search (DAS)

Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent

Neighborhood — ranked by edge-count

method

Interchange Intervention
uses
Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
Activation patching
extends
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Diagnostic Probing
analogous_to
Earlier interpretability method that applies classifiers to DNN hidden representations; shares a complexity-accuracy dilemma with causal abstraction.
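
The interchange intervention entry above can be illustrated with a minimal sketch. The two-neuron network, its `hidden` and `readout` functions, and the input values are all hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from Geiger et al. The idea shown is just the core operation: run the model on a base input, overwrite selected neurons with the values they take on a source input, and read out the counterfactual result.

```python
def hidden(x):
    """Toy hidden layer with two neurons (hypothetical network)."""
    a, b = x
    return [a + b, a - b]


def readout(h):
    """Toy output computed from the hidden layer."""
    return h[0] + 2 * h[1]


def interchange_intervention(base, source, neurons):
    """Run on `base`, but force the chosen neurons to their `source` values."""
    h = hidden(base)
    h_src = hidden(source)
    for i in neurons:
        h[i] = h_src[i]
    return readout(h)


base, source = (1, 2), (3, 5)
plain = readout(hidden(base))                                   # hidden = [3, -1]
patched = interchange_intervention(base, source, neurons=[0])   # hidden = [8, -1]
```

Patching every neuron reproduces the source output, and patching none reproduces the base output; the informative cases are the partial swaps in between.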

concept

Causal abstraction
implements
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Alignment Function
uses
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
Interchange Intervention Accuracy (IIA)
uses
Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
Causally Relevant Latent Subspace
associated_with
Contiguous subspace of the aligned latent vector encoding behaviorally relevant information for a specific causal variable.
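
The alignment function, IIA, and causally relevant subspace entries above fit together as sketched below. This is a toy under stated assumptions, not the published implementation: the hidden state is two-dimensional, the alignment function is a single rotation angle `theta` (a stand-in for a learned orthogonal matrix), the gradient is estimated by finite differences rather than backpropagation, and IIA is computed with a numeric tolerance because the toy output is continuous rather than a class label. All names (`TRUE_ANGLE`, `W`, `encode`, `decode`, `fit`) are hypothetical.

```python
import math

TRUE_ANGLE = math.pi / 4  # hypothetical: how the toy network mixes its variables
W = (2.0, 1.0)            # hypothetical readout weights on the variables (a, b)


def rotate(theta, v):
    """Apply a 2-D rotation; rotate(-theta, .) is its inverse."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def encode(x):
    # Low-level hidden state: each neuron mixes both variables.
    return rotate(-TRUE_ANGLE, list(x))


def decode(h):
    a, b = rotate(TRUE_ANGLE, h)
    return W[0] * a + W[1] * b


def counterfactual(base, source):
    # High-level model with variable `a` taken from the source input.
    return W[0] * source[0] + W[1] * base[1]


def intervene(theta, base, source, k=1):
    """Rotate, swap the first k aligned coordinates, rotate back, read out."""
    z_base = rotate(theta, encode(base))
    z_src = rotate(theta, encode(source))
    z_base[:k] = z_src[:k]  # contiguous subspace assigned to variable `a`
    return decode(rotate(-theta, z_base))


def loss(theta, pairs):
    errs = [(intervene(theta, b, s) - counterfactual(b, s)) ** 2 for b, s in pairs]
    return sum(errs) / len(errs)


def iia(theta, pairs, tol=1e-3):
    """Fraction of pairs where the intervened output matches the counterfactual."""
    hits = sum(abs(intervene(theta, b, s) - counterfactual(b, s)) < tol
               for b, s in pairs)
    return hits / len(pairs)


def fit(pairs, theta=0.0, lr=0.01, steps=300, eps=1e-5):
    """Gradient descent on theta using a central-difference gradient."""
    for _ in range(steps):
        grad = (loss(theta + eps, pairs) - loss(theta - eps, pairs)) / (2 * eps)
        theta -= lr * grad
    return theta


PAIRS = [((1, 2), (3, 5)), ((0, 1), (2, 0)), ((4, -1), (1, 1))]
theta_hat = fit(PAIRS)
```

Starting from `theta = 0.0` corresponds to the identity alignment, where swapping a raw neuron mixes both variables and no pair matches its counterfactual. After fitting, the first aligned coordinate isolates variable `a`, so swapping it reproduces the high-level counterfactual on every pair.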

artifact

pyvene
about
Python library for PyTorch model interventions; Boundless DAS tutorial used for CL loss experiments

framework

Model Alignment Search (MAS)
extends
The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Distributed Alignment Search · method · 0.890
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
DAS finds better alignments than brute-force search by using gradient descent rather than exhaustive discrete search · claim · 0.834
Second central claim of the paper.
Alignment · concept · 0.767
The goal of making model behavior match human values and intentions, often addressed during post-training.
Data-Centric Alignment · concept · 0.758
Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
distributed data structures · concept · 0.753
Data structures stored as collections of tuples in tuple space, accessible to many processes.
Identity Alignment Map (ϕ_id) · method · 0.751
Simplest alignment map, ϕ(h) = h; equivalent to assuming the privileged-basis hypothesis.
Representational Alignment · concept · 0.747
Measure of similarity between the similarity structures (kernels) induced by two different representations
Deliberative Alignment · framework · 0.741
OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring