framework
active
framework:distributed-alignment-search-das
Distributed Alignment Search (DAS)
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
Neighborhood — ranked by edge-count
Papers (1)
paper
Thinkers (1)
thinker
- Zhengxuan Wu · studies
Methods (3)
method
- Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
- Activation patching · extends · Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
- Diagnostic Probing · analogous_to · Earlier interpretability method that applies classifiers to DNN hidden representations; shares the complexity-accuracy dilemma with causal abstraction.
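The first Methods entry above describes forcing neurons to take values computed on a source input, which the causal abstraction literature commonly calls an interchange intervention. A minimal sketch of that operation on plain lists follows. The function name `interchange_intervention` and the toy activation values are hypothetical, chosen for illustration only; this is not the pyvene API.

```python
def interchange_intervention(base, source, indices):
    """Return a copy of `base` whose entries at `indices` are taken from `source`.

    `base` and `source` stand in for the hidden activations a model produces
    on a base input and a source input (hypothetical toy values below).
    """
    if len(base) != len(source):
        raise ValueError("base and source must have the same length")
    patched = list(base)  # copy so the base activations stay untouched
    for i in indices:
        patched[i] = source[i]
    return patched


# Toy hidden vectors; in practice these would come from a forward pass.
base_hidden = [0.2, -1.0, 0.7]
source_hidden = [1.5, 0.3, -0.4]

# Force neurons 0 and 2 to take their source-input values.
patched_hidden = interchange_intervention(base_hidden, source_hidden, indices=[0, 2])
print(patched_hidden)
```

Running the rest of the model on `patched_hidden` gives the counterfactual output that is then compared with the prediction of the high-level causal model.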
Concepts (4)
concept
- Causal abstraction · implements · A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs.
- A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
- Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
- Causally Relevant Latent Subspace · associated_with · Contiguous subspace of the aligned latent vector encoding behaviorally relevant information for a specific causal variable.
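The Concepts entries above can be tied together in a small sketch: rotate the hidden vector into an aligned basis, swap the leading coordinates (the subspace assigned to one causal variable) with those from a source input, rotate back, and score how often the resulting predictions match the counterfactual labels. This is a two-dimensional toy under stated assumptions: the rotation angle `theta` is fixed by hand, whereas DAS learns the transformation by gradient descent, and all function names and values are hypothetical.

```python
import math


def rotate(h, theta):
    """Apply a 2-D rotation (an invertible, orthogonal map) to vector h."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * h[0] - s * h[1], s * h[0] + c * h[1]]


def distributed_interchange(base, source, theta, k=1):
    """Swap the first k rotated coordinates of `base` with those of `source`,
    then rotate back to the original neuron basis."""
    rotated_base = rotate(base, theta)
    rotated_source = rotate(source, theta)
    mixed = rotated_source[:k] + rotated_base[k:]
    return rotate(mixed, -theta)  # inverse of a rotation is rotation by -theta


def interchange_intervention_accuracy(predictions, counterfactual_labels):
    """Fraction of intervened predictions that match the counterfactual labels."""
    matches = sum(p == y for p, y in zip(predictions, counterfactual_labels))
    return matches / len(counterfactual_labels)


base = [1.0, 1.0]
source = [2.0, 4.0]

# theta = 0 is the identity alignment: the swap acts on neuron 0 alone.
print(distributed_interchange(base, source, theta=0.0))

# theta = -pi/4 aligns the first coordinate with the direction (1, 1)/sqrt(2),
# so the swapped quantity is spread across both neurons.
patched = distributed_interchange(base, source, theta=-math.pi / 4)
print(patched)

# Toy predictions versus counterfactual labels (made-up values).
print(interchange_intervention_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

With `theta = -math.pi / 4` the patched vector takes the source's value along the (1, 1) direction and keeps the base's value along the orthogonal direction, which is the sense in which the intervention is distributed rather than tied to a single neuron.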
Artifacts (1)
artifact
- pyvene · about · Python library for PyTorch model interventions; Boundless DAS tutorial used for CL loss experiments.
Frameworks (1)
framework
- Model Alignment Search (MAS) · extends · The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
- Second central claim of the paper.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.
- Data structures stored as collections of tuples in tuple space, accessible to many processes.
- Simplest alignment map ϕ(h)=h, equivalent to assuming privileged bases hypothesis
- Measure of similarity between the similarity structures (kernels) induced by two different representations
- OpenAI's approach integrating chain-of-thought reasoning into alignment; parallels contemplative self-monitoring