Model Alignment Search

By Satchel Grant, Stanford University, PDP Lab

DOI 10.48550/arxiv.2501.06164, arXiv 2501.06164, OpenAlex W4406319140

Anti-Markovian Solution, Counterfactual Latent MAS (CLMAS), Alignment Function (AF), Arithmetic Task Dataset, Distributed representation, Gated Recurrent Unit (GRU), Counterfactual Latent (CL) Auxiliary Loss, DeepSeek-R1-Distill-Qwen-1.5B, Model Misalignment, Linear Representation Hypothesis, Latent Stitch, Modulo Task Dataset, Numeric Cognition (case study), Long Short-Term Memory (LSTM)

TL;DR

Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning one orthogonal rotation matrix per model that isolates behaviorally relevant subspaces. It then applies interchange interventions, patching those subspaces across frozen model pairs, and measures functional alignment as Interchange Intervention Accuracy (IIA). Comparing GRUs and 2-layer Transformers on numeric tasks reveals that correlative methods like RSA and CKA can give misleading estimates: RSA shows anomalously low embedding-layer similarity between same-architecture GRU seeds, and both CKA and RSA suggest potentially high similarity between GRU and Transformer hidden states, which MAS diagnoses as low because Transformers employ an anti-Markovian solution that recomputes numeric information at every step. MAS compresses behaviorally relevant information to as few as 4 dimensions while achieving IIA comparable to DAS, and it reduces the number of required comparison matrices from O(n²) to O(n), making it more compute-efficient than traditional model stitching for three or more models. A case study on DeepSeek-R1-Distill-Qwen-1.5B models fine-tuned on toxic versus nontoxic text shows that toxic-to-toxic MAS IIA is measurably higher than toxic-to-nontoxic IIA, whereas nontoxic-to-nontoxic comparisons show no significant difference, suggesting MAS can serve as a diagnostic for representational misalignment. The Counterfactual Latent MAS (CLMAS) extension, which adds an auxiliary L2 plus cosine loss against prerecorded latent vectors, recovers causal alignment even when one model is causally inaccessible, implying the method may generalize to ANN–biological neural network comparisons where only recordings, not interventions, are available.
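The O(n²) versus O(n) claim can be made concrete with a small count. The sketch below assumes that bidirectional pairwise stitching needs one learned map per ordered (source, target) pair; that counting convention and the function names are illustrative assumptions, not taken from the paper.

```python
def stitching_matrix_count(n_models: int) -> int:
    """ASSUMPTION: one learned map per ordered (source, target) pair."""
    return n_models * (n_models - 1)


def mas_matrix_count(n_models: int) -> int:
    """MAS: one orthogonal rotation matrix per model, reused in every comparison."""
    return n_models


for n in (2, 3, 5, 10):
    print(n, stitching_matrix_count(n), mas_matrix_count(n))
```

Under this convention the two counts coincide at two models and diverge from three models onward, which matches the "three or more models" qualifier above.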

What to take away

1. MAS reduces the number of learned comparison matrices from O(n²) for pairwise model stitching to O(n), one orthogonal rotation matrix per model, when comparing n models.
2. RSA (using Spearman rank correlation on cosine-distance RDMs over 1000 sampled vectors) produces anomalously low embedding-layer similarity scores even for GRU models of the same architecture trained on identical Multi-Object tasks with different random seeds, whereas MAS IIA correctly shows near-ceiling causal transfer.
3. CKA and RSA both suggest high hidden-state similarity between Multi-Object GRUs and 2-layer RoPE Transformers, but MAS IIA is low because Transformers use an anti-Markovian solution that recomputes numeric information at every sequence step, a difference that correlative methods cannot detect.
4. MAS can compress all behaviorally relevant causal information into as few as 4 aligned dimensions (matching DAS performance), while model stitching achieves near-perfect IIA even at rank 2 by exploiting the behavioral null-space of the source model.
5. Fine-tuned DeepSeek-R1-Distill-Qwen-1.5B toxic models exhibit higher stepwise MAS IIA when compared to other toxic models than when compared to nontoxic models, with no significant IIA difference observed in nontoxic-to-nontoxic comparisons.
6. GRUs trained on Multi-Object and Rounding tasks show lower cross-task MAS IIA for their numeric subspaces than within-task GRU pairs, and restricting the Arithmetic GRU's Cumu Val range to 1–10 raises MAS IIA toward but not to the level of the Rem Ops alignment, consistent with GRUs encoding arithmetic and counting numbers differently.
7. CLMAS, which augments the MAS loss with an auxiliary L2 plus cosine loss (weighted by hyperparameter ε tested at 0.5, 0.89, 0.94) against prerecorded counterfactual latent vectors, achieves higher IIA in the causally inaccessible intervention direction than both behavioral stitching and latent stitching baselines while matching standard MAS in the accessible direction.
8. An open question raised is whether including more than two models simultaneously in a single MAS training would harm alignment quality by creating conflicting gradient signals or would instead improve isolation of causally relevant subspaces across all models.
9. MAS rotation matrices are trained for 1000 epochs using Adam (lr=0.001, batch size 512), with 10,000 intervention samples and 1,000 held-out validation samples, orthogonalized via PyTorch's exponential-of-skew-symmetric parametrization, selecting the checkpoint with best validation IIA — a fully replicable procedure.
10. Model stitching can succeed at near-perfect IIA using rank-2 transformations by relying on the source model's behavioral null-space and dormant subspaces, meaning a successful stitch does not imply that the two networks encode the task variable in structurally similar ways.
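The core operation behind these takeaways, an interchange intervention through per-model rotations, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the rotations are random stand-ins for learned matrices, both models are assumed to share one latent dimensionality, and the function names are hypothetical.

```python
import numpy as np


def random_orthogonal(dim, rng):
    """Stand-in for a learned rotation: Q factor of a random Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def interchange(h_target, h_source, q_target, q_source, k):
    """Patch the first k aligned dimensions of the source into the target.

    Each latent is rotated into its own model's aligned basis, the first k
    coordinates are taken from the source, and the result is rotated back
    into the target model's native basis.
    """
    z_target = q_target @ h_target
    z_source = q_source @ h_source
    z_patched = np.concatenate([z_source[:k], z_target[k:]])
    return q_target.T @ z_patched  # orthogonal, so the inverse is the transpose


def iia(predicted, counterfactual):
    """Interchange Intervention Accuracy: fraction of matching counterfactual labels."""
    return float(np.mean(np.asarray(predicted) == np.asarray(counterfactual)))


rng = np.random.default_rng(0)
dim, k = 8, 4  # k = 4 mirrors the "as few as 4 dimensions" figure
q_a, q_b = random_orthogonal(dim, rng), random_orthogonal(dim, rng)
h_a, h_b = rng.standard_normal(dim), rng.standard_normal(dim)
h_patched = interchange(h_a, h_b, q_a, q_b, k)
```

In a real run the patched latent would be fed back into the frozen target model, and `iia` would compare its outputs against the counterfactual labels; the rotations would be optimized to maximize that agreement.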

Peer brief — for seminar discussion

Grant (2025) introduces Model Alignment Search (MAS), a method for measuring functional similarity between pairs of frozen neural networks. MAS learns one orthogonal rotation matrix per model that simultaneously uncovers causally relevant latent subspaces and maps them onto each other, then applies bidirectional interchange interventions (patching those subspaces across models) and takes the resulting Interchange Intervention Accuracy (IIA) on counterfactual behavior as the similarity score. The method is conceptually a fusion of model stitching (Bansal et al., 2021) and Distributed Alignment Search (Geiger et al., 2021; 2023), and it was validated on GRUs, LSTMs, and 2-layer RoPE Transformers trained on numeric equivalence tasks, as well as on DeepSeek-R1-Distill-Qwen-1.5B models fine-tuned for toxic or nontoxic text generation. An alternative it could have used, but argues against, is standard model stitching, which the paper shows achieves near-perfect IIA even at rank 2 by exploiting the source model's behavioral null-space rather than isolating genuinely shared causal structure. The load-bearing finding is that correlative methods, specifically RSA via Spearman rank correlation on 1000-vector cosine-distance RDMs and CKA via cosine kernels, systematically misrepresent functional similarity in ways that MAS corrects. RSA gives anomalously low embedding-layer scores for same-architecture GRU seeds trained on the identical Multi-Object task, while both CKA and RSA suggest high hidden-state similarity between GRUs and Transformers on the same task. MAS IIA identifies that similarity as low because Transformers employ an anti-Markovian solution that recomputes numeric state at each token, rendering their hidden states causally non-equivalent to GRU hidden states.
MAS also reveals that GRUs trained on counting tasks versus arithmetic tasks encode number differently: cross-task numeric subspace IIA is lower than within-task IIA, and restricting the arithmetic model's Cumu Val range to 1–10 raises but does not close the gap to the Rem Ops alignment. In the toxicity case study, toxic DeepSeek-R1-Distill-Qwen-1.5B models show higher IIA with other toxic models than with nontoxic models, while nontoxic-to-nontoxic comparisons show no significant difference. The paper also introduces CLMAS, which adds a weighted auxiliary L2 plus cosine loss (ε ∈ {0.5, 0.89, 0.94}) against prerecorded counterfactual latent vectors to recover causal alignment when one model is causally inaccessible — the scenario relevant for ANN–biological neural network comparisons. The key implication is that causal intervention-based similarity measures should supplement or replace correlative measures whenever the research goal is to determine whether two networks perform a task through the same mechanism, not just whether their representations are linearly related. The paper predicts that CLMAS-like methods could eventually reduce the need for neural stimulation in biological comparisons by pre-computing alignment candidates from recordings alone. A critical reader would push back on the limited scope of the toxicity case study: fine-tuning DeepSeek-R1-Distill-Qwen-1.5B on concatenated toxicity datasets (Jigsaw 2018; ToxicChat; RLHF preference data) with only 3 seeds per condition and reporting token-level rather than trial-level IIA makes it difficult to distinguish a genuine representational signature of toxicity from a superficial output-distribution shift induced by fine-tuning on stylistically distinct corpora. 
The result that toxic models align better with each other than with nontoxic models is consistent with the toxicity hypothesis but equally consistent with the simpler explanation that models fine-tuned on the same data distribution share low-level statistical regularities that MAS picks up regardless of whether those regularities reflect anything semantically meaningful about misalignment.
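For seminar purposes it may help to see what the correlative baseline actually computes. The following is a minimal sketch of RSA as described above (Spearman correlation between cosine-distance RDMs), under several assumptions: it uses 50 random vectors rather than 1000 sampled latents, breaks rank ties by position, and all function names are illustrative.

```python
import numpy as np


def cosine_rdm(x):
    """Representational dissimilarity matrix: cosine distance between rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def ordinal_ranks(v):
    """Ordinal ranks; ties are broken by position (adequate for continuous data)."""
    order = np.argsort(v, kind="stable")
    r = np.empty(len(v))
    r[order] = np.arange(len(v))
    return r


def rsa_spearman(x_a, x_b):
    """Spearman correlation between the upper triangles of the two RDMs."""
    iu = np.triu_indices(x_a.shape[0], k=1)
    d_a = cosine_rdm(x_a)[iu]
    d_b = cosine_rdm(x_b)[iu]
    return float(np.corrcoef(ordinal_ranks(d_a), ordinal_ranks(d_b))[0, 1])


rng = np.random.default_rng(0)
x = rng.standard_normal((50, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
score_rotated = rsa_spearman(x, x @ q)  # same geometry in a rotated basis
score_unrelated = rsa_spearman(x, rng.standard_normal((50, 16)))
```

The score depends only on pairwise geometry, so it never tests whether patching one representation into another changes behavior. That is the gap the intervention-based IIA score is meant to address.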

Methods (5)

Alignment Function (AF)
Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
Counterfactual Latent (CL) Auxiliary Loss
Auxiliary objective combining L2 and cosine losses against pre-recorded CL vectors to improve causal relevance when one model is causally inaccessible.
Latent Stitch
Baseline method using a single orthogonal matrix trained to map source latents to target latents via CL auxiliary loss without behavioral objective.
Optogenetics
Light-gated ion channels used to control bioelectric states and dissect cellular computation.
Stepwise MAS
MAS variant applying interchange interventions at multiple contiguous token positions from the start of a sequence to a sampled time step t.
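The CL auxiliary loss listed above can be sketched in a few lines. The source states only that it combines L2 and cosine terms against prerecorded CL vectors and is weighted by ε; the convex combination with the behavioral loss below, the batch averaging, and the function names are assumptions made for illustration.

```python
import numpy as np


def cl_auxiliary_loss(h_intervened, h_cl):
    """Squared L2 distance plus cosine distance to the recorded CL vectors,
    each averaged over the batch (first axis)."""
    l2 = np.mean(np.sum((h_intervened - h_cl) ** 2, axis=-1))
    cos = np.sum(h_intervened * h_cl, axis=-1) / (
        np.linalg.norm(h_intervened, axis=-1) * np.linalg.norm(h_cl, axis=-1)
    )
    return float(l2 + np.mean(1.0 - cos))


def clmas_loss(behavior_loss, h_intervened, h_cl, eps):
    """ASSUMPTION: eps mixes the two objectives as a convex combination;
    the paper's exact weighting scheme may differ."""
    return eps * cl_auxiliary_loss(h_intervened, h_cl) + (1.0 - eps) * behavior_loss


rng = np.random.default_rng(0)
h_recorded = rng.standard_normal((4, 8))  # prerecorded CL vectors
h_patched = h_recorded + 0.1 * rng.standard_normal((4, 8))
loss = clmas_loss(behavior_loss=2.0, h_intervened=h_patched, h_cl=h_recorded, eps=0.5)
```

A latent-stitch baseline in this framing would correspond to training on `cl_auxiliary_loss` alone, with no behavioral term.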

Frameworks (6)

Counterfactual Latent MAS (CLMAS)
MAS variant with an auxiliary CL loss objective for cases where one model is causally inaccessible, enabling ANN-BNN comparisons.
Gated Recurrent Unit (GRU)
Recurrent neural network architecture used as the primary model type in numeric task experiments.
Linear Representation Hypothesis
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior.
Long Short-Term Memory (LSTM)
Recurrent neural network architecture used alongside GRUs in numeric task experiments; MAS applied to concatenated h and c vectors.
Model Alignment Search (MAS)
The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
Shallow Transformer (RoPE-based)
Two-layer transformer with rotary positional encodings used in numeric task experiments.
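The rotation matrices used by MAS are kept orthogonal through an exponential-of-skew-symmetric parametrization (takeaway 9). The sketch below shows why that construction works, using a hand-rolled matrix exponential so it runs with NumPy alone. In PyTorch the analogous facility is `torch.nn.utils.parametrizations.orthogonal` with `orthogonal_map="matrix_exp"`; whether the paper calls it exactly that way is an assumption.

```python
import numpy as np


def skew(a):
    """Skew-symmetric part of an unconstrained parameter matrix (up to a factor of 2)."""
    return a - a.T


def expm_taylor(m, terms=30):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(m, ord=1)
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    scaled = m / (2 ** squarings)
    result = np.eye(m.shape[0])
    term = np.eye(m.shape[0])
    for i in range(1, terms):
        term = term @ scaled / i  # scaled**i / i!
        result = result + term
    for _ in range(squarings):
        result = result @ result  # undo the scaling: exp(m) = exp(m / 2**s) ** (2**s)
    return result


rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))  # unconstrained trainable parameters
q = expm_taylor(skew(a))         # always orthogonal with determinant +1
```

Because exp(S)ᵀ = exp(−S) for skew-symmetric S, the product QᵀQ is the identity for any value of `a`, so gradient steps on `a` can never leave the set of rotations.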

Datasets (7)

Arithmetic Task Dataset
More complex numeric task involving addition/subtraction operations with cumulative values; used in Appendix B.7 to explore MAS across differing domains.
DeepSeek-R1-Distill-Qwen-1.5B
Small model used in attention head attribution analysis in appendix.
Modulo Task Dataset
Numeric task where the number of response tokens equals the object quantity mod 4.
Multi-Object Task Dataset
Primary numeric task where models count demonstration tokens and produce matching response tokens; used for most MAS analyses.
Rounding Task Dataset
Numeric task where the number of response tokens equals the object quantity rounded to the nearest multiple of 3.
Same-Object Task Dataset
Variant of Multi-Object task using a single token type C instead of multiple demo/response types.
Toxicity Finetuning Dataset
Concatenation of three toxicity-related datasets used to finetune DeepSeek models for the misalignment case study.
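The target rules of the numeric tasks above are simple enough to write down directly. The sketch below encodes the response counts as described in this list; the token symbols `D`, `T`, `R` and the overall sequence layout are hypothetical placeholders, and only the single token type `C` for the Same-Object variant comes from the description above.

```python
def multi_object_target(n: int) -> int:
    """Multi-Object: respond with as many tokens as were demonstrated."""
    return n


def modulo_target(n: int) -> int:
    """Modulo: respond with the object quantity mod 4."""
    return n % 4


def rounding_target(n: int) -> int:
    """Rounding: respond with the quantity rounded to the nearest multiple of 3."""
    return 3 * ((n + 1) // 3)


def make_trial(n, target_fn, demo="D", trigger="T", response="R"):
    """ASSUMPTION: demo tokens, then a trigger token, then response tokens."""
    return [demo] * n + [trigger] + [response] * target_fn(n)


multi = make_trial(3, multi_object_target)
same_object = make_trial(2, multi_object_target, demo="C", response="C")
rounded = make_trial(4, rounding_target)
```

Since an integer is never exactly halfway between two multiples of 3, `rounding_target` needs no tie-breaking rule.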

Findings (12)

MAS IIA is low for GRU hidden states vs Transformer hidden states on Multi-Object task, consistent with anti-Markovian transformer solution
Validates MAS as a causal detector of representational differences invisible to correlative methods.
CKA and RSA show potentially unintuitive (over-estimated) hidden state similarity for GRU-Transformer comparisons on Multi-Object task
Prior work shows transformers use anti-Markovian solutions; MAS correctly shows low IIA reflecting this, while RSA/CKA do not detect it.
MAS successfully aligns the Count variable from Multi-Object GRUs with the Rem Ops variable from Arithmetic GRUs with moderate IIA
Shows MAS can compare specific numeric variables across tasks with different domains/codomains.
CLMAS achieves the best IIA in the causally inaccessible (No Access) direction while matching MAS in the accessible direction
Demonstrates the value of the CL auxiliary loss for recovering causal alignments when one model cannot be intervened upon.
MAS IIA for Count vs Low CumuVal (values 1-10) is higher than Count vs full CumuVal, but still lower than Count vs Rem Ops
Qualifies the arithmetic alignment results; supports hypothesis that Arithmetic GRUs use different numeric representations than incremental counting.
RSA shows low RDM correlation on embedding layers for GRU-GRU comparisons, despite high cross-seed functional similarity
Demonstrates RSA's sensitivity issue in embedding layers; attributed partly to Spearman rank handling of RDMs with differing relative extrema.
MAS successfully aligns behavior between Multi-Object GRU models in both embedding and hidden state layers with high IIA
Demonstrates MAS's ability to bidirectionally transfer behavior where RSA shows low embedding correlation.
Toxic LLMs show higher IIA when compared to other toxic models than when compared to nontoxic models using stepwise MAS
Proof-of-principle that MAS can detect model misalignment in fine-tuned DeepSeek-R1-Distill-Qwen-1.5B models.
MAS reveals that numeric representations differ between GRUs trained on Multi-Object, Rounding, and Modulo tasks
Case study showing MAS can compare specific causal information types across models trained on different tasks.
Model stitching achieves nearly perfect IIA even for rank-2 transformation matrices on Multi-Object GRU models
Evidence that model stitching can exploit the behavioral null space, making it less causally restrictive than MAS.
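The null-space finding can be illustrated with a toy linear example. The sketch below makes strong simplifying assumptions that are not from the paper: each model's behavior is a fixed linear readout of its latent, and the stitch is constructed in closed form rather than learned. It only shows that a rank-2 map can transfer behavior exactly while ignoring everything in the target readout's null space.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_out = 8, 2

# ASSUMPTION: behavior is a linear readout of the latent in both toy models.
w_src = rng.standard_normal((n_out, dim))
w_tgt = rng.standard_normal((n_out, dim))

# Closed-form stitch: send each source latent to the minimum-norm target
# latent that yields the same behavior. Its rank is at most n_out.
stitch = np.linalg.pinv(w_tgt) @ w_src

h_src = rng.standard_normal((100, dim))
h_stitched = h_src @ stitch.T
behavior_src = h_src @ w_src.T
behavior_stitched = h_stitched @ w_tgt.T

# Projector onto the behavioral null space of the target readout:
# anything added from this subspace leaves the target behavior unchanged.
null_proj = np.eye(dim) - np.linalg.pinv(w_tgt) @ w_tgt
noise = rng.standard_normal((100, dim)) @ null_proj.T
behavior_noisy = (h_stitched + noise) @ w_tgt.T
```

Here the stitched latents occupy only a 2-dimensional subspace of the target's 8 dimensions, yet the behavior matches exactly. This is the sense in which a successful stitch need not imply structurally similar representations.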

Claims (7)

Model stitching can use the behavioral null space of the source model when mapping to the target, making successful stitching insufficient evidence of representational similarity
Formal analysis showing the theoretical limitation of model stitching as a similarity measure.
MAS is a more causally focused choice than model stitching for addressing questions of how behaviorally relevant information is encoded in different neural systems
Core interpretive claim supported by the formal analysis showing MAS does not exploit the behavioral null space unlike stitching.
Correlative methods like RSA and CKA are insufficient for determining functional similarity between neural systems; causal methods are necessary
Central motivating claim of the paper; supported by empirical comparisons showing RSA/CKA miss Markovian differences detectable by MAS.
Including within-model interventions (i=j) in MAS training adds a soft constraint encouraging separation of causal from extraneous subspaces
Methodological claim about why within-model interchange interventions are essential to the MAS training procedure.
Transformers use an anti-Markovian solution that recomputes relevant numeric information at each step in the Multi-Object task
Prior finding from Grant et al. 2025 used to interpret low MAS IIA for GRU-Transformer hidden state comparisons.
CLMAS can potentially reduce or remove the need for NN stimulation during alignment training in biological settings
Forward-looking claim about the practical utility of CLMAS for ANN-BNN comparisons with limited causal access.
MAS-like methods could potentially be used to directly constrain model internals to be non-toxic
Speculative forward-looking claim about practical applications of MAS for model alignment.

Hypotheses (3)

The effect size of CLMAS improvement over baselines will correlate with the amount of variability in the behavioral null space of the inaccessible model
Prediction about when CLMAS will be most beneficial, stated explicitly in the paper.
Using more than two models in a MAS comparison could harm alignment due to conflicting loss gradients, or could assist in isolating causal subspaces
Open question raised in the paper about scaling MAS beyond two models.
GRUs trained on the Arithmetic task use different types of numeric representations than incremental counting models
Interpretive hypothesis supported by the lower IIA between Count and Cumu Val variables even in the restricted value range.

Questions (5)

How do we incorporate a focus on behavioral relevance in our measures of neural similarity?
Direct motivating question for MAS's design principle of causal behavioral matching.
What nuances do we miss when we fail to causally probe the representations of the systems?
Motivates the empirical comparison between MAS and RSA/CKA in the paper.
How do representations differ or converge between architectures, tasks, and modalities?
Broader research question MAS is positioned to address, citing multiple recent works.
How do we establish bidirectional causal relationships between neural systems?
Motivates the bidirectional design of MAS over unidirectional model stitching.
When can we say that two neural systems perform a task in the same way?
Fundamental question motivating the entire MAS framework.

Original abstract

When can we say that two neural systems perform a task in the same way? What nuances do we miss when we fail to causally probe the representations of the systems, and how do we establish bidirectional causal relationships? In this work, we introduce a method that bidirectionally transfers neural activity between artificial neural networks and uses their resulting behavior as a measure of functional similarity. We first show that the method can be used to transfer the behavior from one frozen Neural Network (NN) to another in a manner similar to model stitching, and we show how the method can differ from correlative similarity measures like Representational Similarity Analysis. Next, we empirically and theoretically show how the method can be equivalent to model stitching when desired, or it can take a form that has a more restrictive focus to shared causal information; in both forms, it reduces the number of required matrices for a comparison of n models to be linear in n. We then present a case study on number-related tasks showing that the method can be used to examine specific subtypes of causal information demonstrating that numbers can be encoded differently in recurrent models depending on the task, and we present another case study showing that MAS can reveal misalignment in fine-tuned DeepSeek-r1-Qwen-1.5B models. Lastly, we augment the loss function with a counterfactual latent (CL) auxiliary objective to improve causal relevance when one of the two networks is causally inaccessible (as is often the case in comparisons with biological networks). We use our results to encourage the use of causal methods in neural similarity analyses and to suggest future explorations of network similarity methodology for model misalignment.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
cited
in corpus
2023
≈ 85%
The Platonic Representation Hypothesis
cited
in corpus
2024
≈ 84%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 85%
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Usman Naseem
2026
≈ 84%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 84%
Probing the Probes: Methods and Metrics for Concept Alignment
Marte Eggen, Inga Strümke, Jacob Lysnæs-Larsen
2025
≈ 84%
Dynamical similarity analysis can identify compositional dynamics developing in RNNs
Michał Wójcik, Jascha Achterberg, Rui Ponte Costa, Quentin Guilhot
2024
≈ 84%
Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer
Binxu Wang, Shay B. Cohen, Anna Korhonen, Yonatan Belinkov, Shun Shao
2026
≈ 83%
Beyond Object-Level Alignment: Do Brains and DNNs Preserve the Same Transformations?
Yukiyasu Kamitani
2026
≈ 83%
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Changye Li, Jiaming Ji, Yaodong Yang, Hantao Lou
2025
≈ 83%
Improving Model Alignment Through Collective Intelligence of Open-Source LLMS
Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou, Junlin Wang
2025
≈ 83%
Automated Meta Prompt Engineering for Alignment with the Theory of Mind
Rahul Agarwal, Eduardo Morales, Gozde Akay, Aaron Baughman
2025
≈ 83%
Tracking Equivalent Mechanistic Interpretations Across Neural Networks
Mariya Toneva, Alan Sun
2026
≈ 83%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 82%
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
Constantin Ruhdorfer, Lei Shi, Andreas Bulling, Matteo Bortoletto
2025
≈ 82%
Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato, Niklas Herbster
2026
≈ 82%
Constructing Interpretable Features from Compositional Neuron Groups
Atticus Geiger, Mor Geva, Or Shafran
2026
≈ 82%
Aligning Large Language Models with Human Preferences through Representation Engineering
Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang, Wenhao Liu
2024
≈ 82%
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
Jiaying Zhu, Hongyang Chen, Hongxu Liu, Xinyu Yang, Wenya Wang, Hang Chen
2026
≈ 82%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 82%
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou, Xu Wang
2025
≈ 82%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 82%
Emergent symbol-like number variables in artificial neural networks
cited
2025
≈ 81%
Relating transformers to models and neural representations of the hippocampal formation
in corpus
2021
≈ 81%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 80%
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
in corpus
2025
≈ 80%
Neural natural language inference models partially embed theories of lexical entailment and negation
cited
2020
≈ 76%
Learning phrase representations using RNN encoder-decoder for statistical machine translation
cited
2014
≈ 73%
The linear representation hypothesis and the geometry of large language models
cited
2024
≈ 71%
Locating and editing factual associations in GPT
cited
2022
≈ 65%

Cited by (1)

Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat