method
active
method:distributed-alignment-search
Distributed Alignment Search
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
Neighborhood — ranked by edge-count
Papers (3)
Thinkers (1)
- Atticus Geiger (introduces, studies)
Frameworks (2)
- Linear Representation Hypothesis (implements): The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior.
- CausalGym (uses): Multi-task benchmark of linguistic behaviours for measuring the causal efficacy of interpretability methods, adapted from SyntaxGym.
Concepts (6)
- Causal abstraction (implements): A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs.
- Representations where individual neurons play multiple conceptual roles; patterns consisting of linear combinations of unit vectors.
- A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
- Key insight that rotating a neural representation to a non-standard basis can reveal distributed causal structure that is invisible in the standard neuron-aligned basis.
- Intervention targeting specific dimensional subsets of activation vectors rather than full representations.
- Investigation of whether a distributed representation can be further decomposed into sub-representations encoding component identities.
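The rotation insight and the subspace-targeted intervention described in the entries above can be illustrated with a small sketch. Everything here is a toy assumption made for illustration: a two-dimensional "hidden state", a hypothetical `encode` that stores two variables X and Y along axes rotated by 45 degrees, and the helper names `rotate`, `decode`, and `distributed_interchange`. It is not the paper's code or the pyvene API. The sketch compares swapping a single neuron with swapping one coordinate in the rotated basis.

```python
import math

PHI = math.pi / 4  # assumed angle at which the toy layer stores X and Y

def rotate(theta, v):
    """Apply the 2-D rotation R(theta) to the vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def encode(x, y):
    """Toy hidden layer: X and Y live along axes rotated by PHI."""
    return rotate(PHI, [x, y])

def decode(h):
    """Read (X, Y) back out of a hidden state."""
    return rotate(-PHI, h)

def distributed_interchange(base_h, source_h, theta, dims):
    """Rotate both activations, copy the chosen rotated coordinates
    from source into base, then rotate back to the neuron basis."""
    rb = rotate(theta, base_h)
    rs = rotate(theta, source_h)
    for d in dims:
        rb[d] = rs[d]
    return rotate(-theta, rb)

base_h = encode(1, 2)    # base input:   X=1, Y=2
source_h = encode(5, 7)  # source input: X=5, Y=7

# Swap coordinate 0 in the neuron-aligned basis (theta = 0).
neuron = decode(distributed_interchange(base_h, source_h, 0.0, [0]))

# Swap coordinate 0 in the basis aligned with X (theta = -PHI).
rotated = decode(distributed_interchange(base_h, source_h, -PHI, [0]))

print("neuron-basis swap :", neuron)
print("rotated-basis swap:", rotated)
```

In this toy setting the neuron-basis swap disturbs both variables at once, while the rotated-basis swap replaces X with the source value and leaves Y at its base value, which is the behaviour a clean alignment with the high-level variable X would require.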
Methods (6)
- Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
- Boundless DAS (extends): A variant of DAS implemented in pyvene via BoundlessRotatedSpaceIntervention, introduced by Wu et al. 2023.
- Brute-Force Alignment Search (extends): Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
- Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
- Subspace DAS (extends): Extension of DAS that learns a second rotation matrix on top of a fixed first one to decompose representations into sub-representations.
- DAS uses SGD over differentiable parameterizations of orthogonal matrices (via PyTorch) to find optimal distributed alignments.
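The gradient-based search described in the entries above can be sketched on a toy problem. The entries state that DAS runs SGD in PyTorch over differentiable parameterizations of orthogonal matrices; the sketch below swaps in simpler stand-ins so that it runs with the standard library alone: a single rotation angle `theta` as the parameterization, full-batch gradient descent, and a finite-difference gradient in place of autograd. The toy network (`encode`, `readout`), the high-level model `2*X + Y`, and all helper names are assumptions made for illustration.

```python
import math
from itertools import product

PHI = math.pi / 4  # assumed hidden angle at which the toy network stores X

def rotate(theta, v):
    """Apply the 2-D rotation R(theta) to the vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def encode(x, y):
    """Toy hidden layer: X and Y stored along axes rotated by PHI."""
    return rotate(PHI, [x, y])

def readout(h):
    """Toy network output; equals 2*X + Y on unintervened inputs."""
    x, y = rotate(-PHI, h)
    return 2 * x + y

def intervene(theta, base_h, source_h):
    """Swap rotated coordinate 0 from source into base, then rotate back."""
    rb, rs = rotate(theta, base_h), rotate(theta, source_h)
    rb[0] = rs[0]
    return rotate(-theta, rb)

INPUTS = list(product([0, 1, 2], repeat=2))  # all (x, y) inputs
PAIRS = list(product(INPUTS, INPUTS))        # all (base, source) pairs

def loss(theta):
    """Mean squared gap between the intervened network output and the
    high-level model's output when X is taken from the source input."""
    total = 0.0
    for (xb, yb), (xs, ys) in PAIRS:
        out = readout(intervene(theta, encode(xb, yb), encode(xs, ys)))
        target = 2 * xs + yb
        total += (out - target) ** 2
    return total / len(PAIRS)

def fit(theta=0.0, lr=0.05, steps=200, eps=1e-5):
    """Full-batch gradient descent on theta with a central-difference gradient."""
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

theta_hat = fit()
print("learned angle:", theta_hat, "loss:", loss(theta_hat))
```

Starting from the neuron-aligned basis (`theta = 0`), the loss in this toy setup falls toward zero as the learned rotation lines up with the direction that encodes X. A realistic setting would replace the single angle with a full orthogonal matrix and the finite-difference step with automatic differentiation.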
Artifacts (3)
- pyvene open-source Python library (implements): The main artifact introduced in the paper: an open-source PyPI library for customizable interventions on PyTorch models.
- Codebase used to run the experiments in the paper.
- Pyvene Library (implements): Library used to replicate the hierarchical equality experiment; DAS tutorial provided.
Quotes (1)
- Load-bearing theoretical claim providing the conceptual foundation for DAS.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
- The problem of ensuring AI systems adopt values compatible with human welfare — argued to be a perennial problem already present in child-rearing
- Measure of similarity between the similarity structures (kernels) induced by two different representations
- Data structures stored as collections of tuples in tuple space, accessible to many processes.
- A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
- OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring