framework
active
framework:distributed-alignment-search-das

Distributed Alignment Search (DAS)

Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent

Neighborhood — ranked by edge-count

Thinkers (1)

thinker

Methods (3)

method
  • Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Diagnostic Probing
    analogous_to
    Earlier interpretability method that applies classifiers to DNN hidden representations; shares the complexity-accuracy dilemma with causal abstraction.
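
The first method above (forcing neurons to take values computed from a source input) can be illustrated with a toy example. This is a minimal sketch, not code from Geiger et al.: the two-unit "network", the functions `hidden`, `readout`, and `interchange_intervention`, and the inputs are all hypothetical and chosen only so the counterfactual is easy to check by hand.

```python
# Toy sketch (hypothetical model, not the authors' code).
# The "network" maps inputs (a, b, c) to hidden units [a + b, c]
# and reads out their product, so it computes (a + b) * c.

def hidden(x):
    """Hidden layer of the toy model: unit 0 holds a + b, unit 1 holds c."""
    a, b, c = x
    return [a + b, c]

def readout(h):
    """Output layer of the toy model: product of the two hidden units."""
    return h[0] * h[1]

def interchange_intervention(base, source, unit_indices):
    """Run the model on `base`, but overwrite the chosen hidden units
    with the values they take when the model is run on `source`."""
    h_base = hidden(base)
    h_source = hidden(source)
    for i in unit_indices:
        h_base[i] = h_source[i]
    return readout(h_base)

base, source = (1, 2, 3), (4, 5, 6)
# Patch unit 0 (the sum) from source, keep unit 1 (c) from base.
counterfactual = interchange_intervention(base, source, [0])
print(counterfactual)  # (4 + 5) * 3 = 27
```

If a high-level causal model with a variable S = a + b is a faithful abstraction, setting S to its source value and recomputing the output should give the same number as the patched network.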

Concepts (4)

concept
  • A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
  • A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
  • Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
  • Contiguous subspace of the aligned latent vector encoding behaviorally relevant information for a specific causal variable.
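
The transformation, subspace, and metric described above can be sketched together in NumPy. This is an illustration under stated assumptions, not the DAS implementation or the pyvene API: in DAS the orthogonal matrix is learned by gradient descent, whereas here a fixed random one (hypothetical helper `random_rotation`) stands in for it, and the function names `distributed_interchange` and `interchange_accuracy` are invented for this sketch.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Stand-in for the learned transformation: a random orthogonal
    matrix from a QR decomposition (assumption: DAS would train this)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def distributed_interchange(base_h, source_h, rotation, k):
    """Swap the first k coordinates of the rotated (aligned) basis.

    The contiguous block [:k] plays the role of the subspace assigned
    to one causal variable; the remaining coordinates keep base values.
    """
    base_rot = rotation @ base_h
    source_rot = rotation @ source_h
    mixed = base_rot.copy()
    mixed[:k] = source_rot[:k]
    # An orthogonal matrix is inverted by its transpose.
    return rotation.T @ mixed

def interchange_accuracy(predictions, counterfactual_labels):
    """Fraction of intervened predictions that match the labels the
    high-level causal model assigns to the same counterfactuals."""
    predictions = np.asarray(predictions)
    counterfactual_labels = np.asarray(counterfactual_labels)
    return float(np.mean(predictions == counterfactual_labels))

R = random_rotation(4)
base_h = np.array([1.0, 2.0, 3.0, 4.0])
source_h = np.array([5.0, 6.0, 7.0, 8.0])
patched = distributed_interchange(base_h, source_h, R, k=2)
print(patched)
print(interchange_accuracy([1, 0, 1, 1], [1, 1, 1, 1]))
```

Because the swap happens in the rotated basis, the intervention can touch every neuron at once while still changing only a low-dimensional subspace, which is what makes the learned alignment "distributed".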

Artifacts (1)

artifact
  • pyvene
    about
    Python library for PyTorch model interventions; Boundless DAS tutorial used for CL loss experiments

Frameworks (1)

framework
  • The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
