Distributed Alignment Search

The core method introduced in this paper: it finds alignments between high-level causal variables and distributed neural representations via gradient descent.

Neighborhood — ranked by edge-count

Papers (3)

paper

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
introduces
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
introduces
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
implements

Thinkers (1)

thinker

Atticus Geiger
introduces, studies

Frameworks (2)

framework

Linear Representation Hypothesis
implements
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
CausalGym
uses
Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym

Concepts (6)

concept

Causal abstraction
implements
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
Distributed Neural Representations
about
Representations where individual neurons play multiple conceptual roles; patterns consisting of linear combinations of unit vectors.
Alignment Between High-Level and Low-Level Models
about
A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
Change-of-Basis for Neural Representations
implements
Key insight that rotating a neural representation into a non-standard basis can reveal distributed causal structure that is invisible in the standard neuron-aligned basis.
Subspace Intervention
uses
Intervention targeting specific dimensional subsets of activation vectors rather than full representations
Subspace Decomposition of Representations
implements
Investigation of whether a distributed representation can be further decomposed into sub-representations encoding component identities.
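
The alignment entry above describes a mapping from each high-level variable to a set of low-level variables plus a function from low-level values to high-level values. A minimal sketch of that data structure follows. The names `VariableAlignment`, `read_high_level`, `V1`, `V2`, the toy decoding rules, and the hidden vector are all hypothetical, chosen only for illustration. The sketch is localist (it aligns to neuron indices); DAS instead aligns to dimensions of a rotated basis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class VariableAlignment:
    """Alignment for one high-level variable (hypothetical helper type)."""
    low_level_units: Tuple[int, ...]              # indices of aligned low-level variables
    decode: Callable[[Sequence[float]], object]   # low-level values -> high-level value


# An alignment assigns one VariableAlignment to each high-level variable name.
Alignment = Dict[str, VariableAlignment]


def read_high_level(alignment: Alignment, hidden: Sequence[float]) -> Dict[str, object]:
    """Decode every high-level variable from its aligned low-level units."""
    return {
        name: va.decode([hidden[i] for i in va.low_level_units])
        for name, va in alignment.items()
    }


# Toy alignment (assumption): V1 is read from units 0 and 1, V2 from unit 2.
alignment: Alignment = {
    "V1": VariableAlignment((0, 1), lambda v: v[0] + v[1] > 0),
    "V2": VariableAlignment((2,), lambda v: v[0] > 0),
}

hidden = [0.4, 0.3, -1.2, 0.9]   # made-up low-level activations
state = read_high_level(alignment, hidden)
print(state)
```

Unit 3 is left unaligned here, which reflects that an alignment need not cover every low-level variable.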

Methods (6)

method

Distributed Interchange Intervention
uses
Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
Boundless DAS
extends
A variant of DAS implemented in pyvene via BoundlessRotatedSpaceIntervention, introduced by Wu et al. 2023
Brute-Force Alignment Search
extends
Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
Interchange Intervention Training Objective
uses
Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
Subspace DAS
extends
Extension of DAS that learns a second rotation matrix on top of a fixed first one to decompose representations into sub-representations.
Gradient Descent Rotation Optimization
uses
DAS uses SGD over differentiable parameterizations of orthogonal matrices (via PyTorch) to find optimal distributed alignments.
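
The distributed interchange intervention listed above (rotate, intervene in a rotated subspace, rotate back) can be sketched in two dimensions with only the standard library. This is an illustrative toy, not the pyvene or paper implementation: the function names, the fixed angle, and the example vectors are assumptions, and the rotation is set by hand, whereas DAS would learn it by gradient descent. The toy assumes a variable (the sum of the two neurons) that is encoded along the direction (1, 1)/sqrt(2), so it is not stored in either neuron alone.

```python
import math
from typing import Sequence, Tuple


def rotate(vec: Sequence[float], theta: float) -> Tuple[float, float]:
    """Apply the 2-D rotation matrix R(theta) to vec."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def distributed_interchange(base, source, theta, dims=(0,)):
    """Swap the rotated coordinates in `dims` from `source` into `base`.

    R(-theta) = R(theta)^T maps into the rotated basis; R(theta) maps back.
    """
    rotated_base = list(rotate(base, -theta))
    rotated_source = rotate(source, -theta)
    for d in dims:                       # intervene only on the chosen subspace
        rotated_base[d] = rotated_source[d]
    return rotate(rotated_base, theta)   # return to the neuron-aligned basis


base = (3.0, 1.0)     # sum = 4, difference (y - x) = -2
source = (0.0, 5.0)   # sum = 5, difference (y - x) = 5

# With theta = pi/4, rotated coordinate 0 is (x + y) / sqrt(2), i.e. the sum.
patched = distributed_interchange(base, source, math.pi / 4)
print(patched)        # approximately (3.5, 1.5): source's sum, base's difference

# A neuron-aligned swap (theta = 0) copies neuron 0 only and transfers neither sum.
naive = distributed_interchange(base, source, 0.0)
print(naive)
```

In the rotated basis the intervention moves the source's sum into the base while leaving the base's difference untouched. Swapping a single neuron in the standard basis yields a sum that matches neither input.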

Artifacts (3)

artifact

pyvene open-source Python library
implements
The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models
InterchangeInterventions GitHub Repository
about
Codebase used to run the experiments in the paper.
Pyvene Library
implements
Library used to replicate the hierarchical equality experiment; DAS tutorial provided.

Quotes (1)

quote

Smolensky (1986) proposes that viewing a neural representation under a basis that is not aligned with individual neurons can reveal the interpretable distributed structure of the neural representations.
supports
Load-bearing theoretical claim providing the conceptual foundation for DAS.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Distributed Alignment Search (DAS) · framework · 0.890
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
Alignment · concept · 0.812
The goal of making model behavior match human values and intentions, often addressed during post-training.
Data-Centric Alignment · concept · 0.777
Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
Alignment Problem · concept · 0.777
The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
Representational Alignment · concept · 0.771
Measure of similarity between the similarity structures (kernels) induced by two different representations
distributed data structures · concept · 0.762
Data structures stored as collections of tuples in tuple space, accessible to many processes.
Alignment Function · concept · 0.759
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
Deliberative Alignment · framework · 0.759
OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring