Model Alignment Search (MAS)

The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.

Neighborhood — ranked by edge-count

Papers (1)

paper

Model Alignment Search
contradicts · introduces

Methods (3)

method

Interchange Intervention
uses
Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
Model Stitching
extends
Technique to measure representational compatibility by integrating intermediate representations of one model into another.
Alignment Function (AF)
uses
Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
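The interchange intervention and orthogonal alignment function listed above can be illustrated with a minimal sketch. It is an assumption-laden toy, not the paper's implementation: the rotations are fixed 2x2 matrices rather than learned ones, and the names `rotation_2d`, `matvec`, `transpose`, and `interchange` are hypothetical helpers introduced here. Each model gets its own orthogonal matrix Q; both latents are rotated into aligned coordinates, the chosen aligned dimensions are copied from source to target, and the result is rotated back with the transpose of the target's Q.

```python
import math

def rotation_2d(theta):
    """Return a 2x2 rotation matrix, which is orthogonal by construction."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Transpose a matrix; for an orthogonal Q this is also its inverse."""
    return [list(col) for col in zip(*matrix)]

def interchange(h_target, h_source, q_target, q_source, dims):
    """Patch aligned dimensions `dims` of the target latent with source values."""
    z_target = matvec(q_target, h_target)  # target latent in aligned coordinates
    z_source = matvec(q_source, h_source)  # source latent in aligned coordinates
    for d in dims:
        z_target[d] = z_source[d]          # overwrite only the chosen subspace
    # Rotate back into the target model's original basis.
    return matvec(transpose(q_target), z_target)

# Hypothetical values: two models, each with its own (here fixed) rotation.
q_a = rotation_2d(math.pi / 4)
q_b = rotation_2d(math.pi / 3)
h_a = [1.0, 0.0]   # latent from the target model
h_b = [0.0, 2.0]   # latent from the source model
patched = interchange(h_a, h_b, q_a, q_b, dims=[0])
```

In this sketch, patching no dimensions returns the target latent unchanged, and the untouched aligned dimensions keep the target's values, which is why restricting Q to orthogonal matrices makes the round trip exact.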

Concepts (5)

concept

Behavioral Null Space
associated_with
The span of vector directions that do not change network behavior; a key concept distinguishing MAS from model stitching.
Interchange Intervention Accuracy (IIA)
uses
Evaluation metric measuring how well a trained intervention matches desired counterfactual model behavior.
Causally Relevant Latent Subspace
associated_with
Contiguous subspace of the aligned latent vector encoding behaviorally relevant information for a specific causal variable.
Functional Similarity
implements
Similarity measured with respect to network behavior/function rather than statistical correlation of activations.
Representational Isomorphism
implements
The desired property of a bidirectional, behavior-preserving mapping between model representations; the goal MAS pursues.
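As a rough illustration of the IIA entry above, the metric can be read as the fraction of interventions whose resulting model output equals the desired counterfactual output. The exact-match scoring and the function name `interchange_intervention_accuracy` are assumptions made for this sketch; the source does not specify how matches are scored.

```python
def interchange_intervention_accuracy(predicted, expected):
    """Fraction of intervened outputs that equal the counterfactual targets.

    `predicted` holds model outputs after each intervention and `expected`
    holds the desired counterfactual outputs (exact match is an assumption).
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one intervention is required")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)

# Hypothetical outputs from four interventions; three match their targets.
iia = interchange_intervention_accuracy([3, 1, 4, 1], [3, 1, 5, 1])
```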

Questions (3)

question

How do we establish bidirectional causal relationships between neural systems?
gates
Motivates the bidirectional design of MAS over unidirectional model stitching.
When can we say that two neural systems perform a task in the same way?
gates
Fundamental question motivating the entire MAS framework.
How do we incorporate a focus on behavioral relevance in our measures of neural similarity?
answered_by
Direct motivating question for MAS's design principle of causal behavioral matching.

Frameworks (2)

framework

Distributed Alignment Search (DAS)
extends
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent.
Counterfactual Latent MAS (CLMAS)
extends
MAS variant with an auxiliary CL loss objective for cases where one model is causally inaccessible, enabling ANN-BNN comparisons.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MAS reduces number of required alignment matrices for n-model comparison from n(n-1) or n^2 (stitching) to n · finding · 0.807
Key computational efficiency advantage of MAS over traditional model stitching for multi-model comparisons.
Model Misalignment · concept · 0.804
The phenomenon of model internals deviating from desired behavior; MAS is demonstrated to detect this via comparison of toxic vs nontoxic LLMs.
ML Alignment & Theory Scholars (MATS) · institute · 0.795
Program that supported Tim Hua and Andrew Qin during this research.
Alignment Between High-Level and Low-Level Models · concept · 0.788
A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
Alignment · concept · 0.778
The goal of making model behavior match human values and intentions, often addressed during post-training.
MAS successfully aligns the Count variable from Multi-Object GRUs with the Rem Ops variable from Arithmetic GRUs with moderate IIA · finding · 0.771
Shows MAS can compare specific numeric variables across tasks with different domains/codomains.
Using more than two models in a MAS comparison could harm alignment due to conflicting loss gradients, or could assist in isolating causal subspaces · hypothesis · 0.763
Open question raised in the paper about scaling MAS beyond two models.
MAS is a more causally focused choice than model stitching for addressing questions of how behaviorally relevant information is encoded in different neural systems · claim · 0.763
Core interpretive claim supported by the formal analysis showing MAS does not exploit the behavioral null space unlike stitching.
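The matrix-count finding listed in this section can be checked with simple arithmetic. The sketch below assumes that n(n-1) counts one stitching matrix per ordered pair of distinct models, that n^2 additionally counts each model paired with itself, and that MAS needs one rotation matrix per model; this reading, and the function name `alignment_matrix_counts`, are assumptions rather than statements from the source.

```python
def alignment_matrix_counts(n_models):
    """Return the number of alignment matrices under each counting scheme."""
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    return {
        # One matrix per ordered pair of distinct models (assumed reading).
        "stitching_ordered_pairs": n_models * (n_models - 1),
        # Ordered pairs including each model with itself (assumed reading).
        "stitching_with_self_pairs": n_models * n_models,
        # One rotation matrix per model.
        "mas": n_models,
    }

counts = alignment_matrix_counts(4)
```

For four models this gives 12 or 16 matrices for stitching against 4 for MAS, so the gap grows quadratically with the number of models compared.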