Jay McClelland

Advisor to the primary author; co-author of the foundational Grant et al. (2025) paper.

Authored papers (1)

Model Alignment Search (2025)
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks. It learns a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces, then uses interchange interventions (patching those subspaces across frozen model pairs) to measure functional alignment via Interchange Intervention Accuracy (IIA). Comparing GRUs and 2-layer Transformers on numeric tasks reveals that correlative methods like RSA and CKA give misleading estimates: RSA shows anomalously low embedding-layer similarity between same-architecture GRU seeds, and both CKA and RSA suggest potentially high similarity between GRU and Transformer hidden states, which MAS correctly diagnoses as low because Transformers employ an anti-Markovian solution that recomputes numeric information at every step. MAS compresses behaviorally relevant information to as few as 4 dimensions while achieving IIA comparable to DAS, and it reduces the number of required comparison matrices from O(n²) to O(n), making it more compute-efficient than traditional model stitching for three or more models. A case study on DeepSeek-R1-Distill-Qwen-1.5B models fine-tuned on toxic versus nontoxic text demonstrates that toxic-to-toxic MAS IIA is measurably higher than toxic-to-nontoxic IIA, whereas nontoxic-to-nontoxic comparisons show no significant internal difference, suggesting that MAS can serve as a diagnostic for representational misalignment. The Counterfactual Latent MAS (CLMAS) extension adds an auxiliary L2 plus cosine loss against prerecorded latent vectors and recovers causal alignment even when one model is causally inaccessible, implying the method may generalize to ANN–biological neural network comparisons where only recordings, not interventions, are available.
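The interchange intervention described in the summary above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names (`random_orthogonal`, `interchange`, `iia`) are hypothetical, the rotations are random stand-ins for matrices that MAS would learn, and the assumption that the aligned subspace occupies the first `k` rotated coordinates is made only for illustration.

```python
import numpy as np

def random_orthogonal(dim, rng):
    # Stand-in for a learned rotation: QR of a Gaussian matrix yields an orthogonal Q.
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def interchange(h_tgt, h_src, q_tgt, q_src, k):
    # Rotate each hidden state into its own model's aligned basis.
    z_tgt = q_tgt @ h_tgt
    z_src = q_src @ h_src
    # Patch the first k (assumed behaviorally relevant) coordinates from source to target.
    z_tgt[:k] = z_src[:k]
    # Rotate back into the target model's original space (Q^-1 = Q^T for orthogonal Q).
    return q_tgt.T @ z_tgt

def iia(predictions, counterfactual_labels):
    # Interchange Intervention Accuracy: fraction of patched runs whose
    # output matches the counterfactual label.
    return float(np.mean(np.asarray(predictions) == np.asarray(counterfactual_labels)))

rng = np.random.default_rng(0)
q_a, q_b = random_orthogonal(8, rng), random_orthogonal(6, rng)  # one rotation per model
h_a, h_b = rng.standard_normal(8), rng.standard_normal(6)        # hidden states of two models
patched = interchange(h_a, h_b, q_a, q_b, k=4)                   # still lives in model A's space
```

Because each model carries its own rotation, hidden sizes may differ (8 and 6 here), and adding a further model requires one more rotation rather than a new matrix per model pair, which is the O(n) scaling the summary mentions.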

Affiliations (2)

Stanford University (institute)
PDP Lab (institute)

Co-authors (2)

Satchel Grant (3 shared)
Noah Goodman (1 shared)

Their work is cited by (1)

Addressing divergent representations from causal interventions on neural networks (1× refs)

Other inbound relations (1)

mentions: Addressing divergent representations from causal interventions on neural networks (paper)

Recent mentions (2)

papers-typed: grant-2025-addressing-divergent.md
papers-typed: grant-2025-alignment-search.md