framework
active
framework:model-alignment-search-mas
Model Alignment Search (MAS)
The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
Neighborhood — ranked by edge-count
Papers (1)
paper
- Model Alignment Search · contradicts, introduces
Methods (3)
method
- Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
- Model Stitching · extends · Technique to measure representational compatibility by integrating intermediate representations of one model into another.
- Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
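The interchange intervention and the orthogonal rotation described in this list can be sketched together. This is a minimal illustration, not the paper's implementation: the names `random_orthogonal` and `interchange_intervention` are hypothetical, the rotations are random stand-ins for learned matrices, both models are assumed to share one latent dimension `d`, and the first `k` aligned coordinates are assumed to hold the causal variable.

```python
import numpy as np


def random_orthogonal(d, seed):
    """Return a d x d orthogonal matrix (a stand-in for a learned rotation Q)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q


def interchange_intervention(h_target, h_source, Q_target, Q_source, k):
    """Patch the first k aligned coordinates of the target with the source's.

    Each latent vector is rotated into the aligned basis by its own Q, the
    assumed causally relevant coordinates are swapped, and the result is
    rotated back with Q_target.T (the inverse of an orthogonal matrix).
    """
    z_target = Q_target @ h_target
    z_source = Q_source @ h_source
    z_patched = z_target.copy()
    z_patched[:k] = z_source[:k]
    return Q_target.T @ z_patched


d, k = 6, 2  # hypothetical latent size and subspace size
Q_a, Q_b = random_orthogonal(d, seed=0), random_orthogonal(d, seed=1)
rng = np.random.default_rng(2)
h_a, h_b = rng.standard_normal(d), rng.standard_normal(d)

h_a_patched = interchange_intervention(h_a, h_b, Q_a, Q_b, k)

# The patched coordinates come from model B, the rest are untouched.
print(np.allclose((Q_a @ h_a_patched)[:k], (Q_b @ h_b)[:k]))
print(np.allclose((Q_a @ h_a_patched)[k:], (Q_a @ h_a)[k:]))
```

Passing the same matrix for `Q_target` and `Q_source` reduces this to a within-model intervention. In a trained setting the rotations would be optimized so that the patched vector produces the desired counterfactual behavior; that training loop is omitted here.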
Concepts (5)
concept
- Behavioral Null Space · associated_with · The span of vector directions that do not change network behavior; a key concept distinguishing MAS from model stitching.
- Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior
- Causally Relevant Latent Subspace · associated_with · Contiguous subspace of the aligned latent vector encoding behaviorally relevant information for a specific causal variable.
- Functional Similarity · implements · Similarity measured with respect to network behavior/function rather than statistical correlation of activations.
- Representational Isomorphism · implements · The desired property of a bidirectional, behavior-preserving mapping between model representations; the goal MAS pursues.
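The behavioral null space defined in this list can be illustrated in the simplest possible setting. The sketch below assumes a hypothetical linear readout `W` standing in for "network behavior"; real networks are nonlinear, so this only shows the definition, not how the paper analyzes it.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 5))  # hypothetical readout: 5-d latent -> 2-d behavior
h = rng.standard_normal(5)       # a latent vector

# Rows of Vt beyond the rank of W span the null space of W.
_, s, Vt = np.linalg.svd(W)
rank = int(np.sum(s > 1e-10))
null_basis = Vt[rank:]

# Any combination of null-space directions leaves the readout unchanged.
v = null_basis.T @ rng.standard_normal(null_basis.shape[0])
print(np.allclose(W @ (h + v), W @ h))
```

Here `h` and `h + v` are different latent vectors with identical behavior under `W`, which is why a similarity measure that is free to use such directions can report agreement that is not behaviorally meaningful.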
Questions (3)
question
- Motivates the bidirectional design of MAS over unidirectional model stitching.
- Fundamental question motivating the entire MAS framework.
- How do we incorporate a focus on behavioral relevance in our measures of neural similarity? · answered_by · Direct motivating question for MAS's design principle of causal behavioral matching.
Frameworks (2)
framework
- Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
- MAS variant with an auxiliary CL loss objective for cases where one model is causally inaccessible, enabling ANN-BNN comparisons.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- MAS reduces number of required alignment matrices for n-model comparison from n(n-1) or n^2 (stitching) to n · finding · 0.807 · Key computational efficiency advantage of MAS over traditional model stitching for multi-model comparisons.
- The phenomenon of model internals deviating from desired behavior; MAS is demonstrated to detect this via comparison of toxic vs nontoxic LLMs.
- Program that supported Tim Hua and Andrew Qin during this research.
- A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Shows MAS can compare specific numeric variables across tasks with different domains/codomains.
- Open question raised in the paper about scaling MAS beyond two models.
- Core interpretive claim supported by the formal analysis showing MAS does not exploit the behavioral null space unlike stitching.
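The alignment-matrix counts quoted in the first entry of this list can be checked with simple arithmetic. The sketch assumes one reading of those figures: stitching needs one matrix per ordered pair of models (n(n-1), or n^2 if each model is also paired with itself), while MAS needs one rotation per model. The function names are hypothetical.

```python
def stitching_matrix_count(n, include_self_pairs=False):
    """One alignment matrix per ordered pair of models."""
    return n * n if include_self_pairs else n * (n - 1)


def mas_matrix_count(n):
    """One rotation matrix per model."""
    return n


for n in (2, 5, 10):
    print(n, stitching_matrix_count(n), stitching_matrix_count(n, True), mas_matrix_count(n))
```

Under this reading the stitching count grows quadratically with the number of models and the MAS count grows linearly.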