method
active
method:distributed-alignment-search

Distributed Alignment Search

The core method introduced in this paper. It uses gradient descent to find alignments between high-level causal variables and distributed neural representations.

Neighborhood — ranked by edge-count

Thinkers (1)

Frameworks (2)

  • The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
  • Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym

Concepts (6)

Methods (6)

  • Extends interchange interventions to non-standard bases by rotating representations, intervening in rotated subspaces, then rotating back.
  • A variant of DAS implemented in pyvene via BoundlessRotatedSpaceIntervention, introduced by Wu et al. 2023
  • Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
  • Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
  • Extension of DAS that learns a second rotation matrix on top of a fixed first one to decompose representations into sub-representations.
  • DAS uses SGD over differentiable parameterizations of orthogonal matrices (via PyTorch) to find optimal distributed alignments.
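
The rotate, intervene, rotate-back step described in the list above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's or pyvene's implementation: it fixes a 2-D rotation by hand instead of learning an orthogonal matrix with SGD, and the names `rotation`, `matvec`, `transpose`, and `rotated_interchange` are hypothetical helpers introduced only for this example.

```python
import math


def rotation(theta):
    """Return a 2x2 rotation matrix (an orthogonal matrix) for angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a matrix; for an orthogonal matrix this is its inverse."""
    return [list(col) for col in zip(*m)]


def rotated_interchange(base, source, R, k):
    """Interchange intervention in a rotated basis (illustrative assumption).

    1. Rotate both hidden vectors with R.
    2. Overwrite the first k rotated coordinates of the base run
       with the corresponding coordinates from the source run.
    3. Rotate back with R^T to return to the original neuron basis.
    """
    rotated_base = matvec(R, base)
    rotated_source = matvec(R, source)
    mixed = rotated_source[:k] + rotated_base[k:]
    return matvec(transpose(R), mixed)


# With the identity rotation (theta = 0) this reduces to an ordinary
# neuron-level interchange: the first neuron is copied from the source.
example = rotated_interchange([1.0, 2.0], [3.0, 4.0], rotation(0.0), 1)
```

In the method as summarized above, the rotation would be a learned parameter optimized by gradient descent rather than a fixed angle; the sketch only shows what a single intervention does once a rotation is given.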

Artifacts (3)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
  • Alignment · concept · 0.812
    The goal of making model behavior match human values and intentions, often addressed during post-training.
  • Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
  • Alignment Problem · concept · 0.777
    The problem of ensuring AI systems adopt values compatible with human welfare, argued to be a perennial problem already present in child-rearing.
  • Measure of similarity between the similarity structures (kernels) induced by two different representations
  • Data structures stored as collections of tuples in tuple space, accessible to many processes.
  • Alignment Function · concept · 0.759
    A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
  • OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring