thinker
active
thinker:jay-mcclelland

Jay McClelland

Advisor to the primary author and co-author of the foundational Grant et al. (2025) paper.

Authored: 1
Introduces: 0
Studies: 0
Affiliations: 2
Cited by: 1

Authored papers (1)

  • Model Alignment Search (MAS) measures bidirectional causal similarity between neural networks. It learns one orthogonal rotation matrix per model that isolates the behaviorally relevant subspace, then performs interchange interventions, patching those subspaces across pairs of frozen models, and scores functional alignment by Interchange Intervention Accuracy (IIA). Comparing GRUs and 2-layer Transformers on numeric tasks shows that correlative methods such as RSA and CKA can give misleading estimates: RSA reports anomalously low embedding-layer similarity between same-architecture GRU seeds, and both CKA and RSA suggest potentially high similarity between GRU and Transformer hidden states, which MAS correctly diagnoses as low because the Transformers employ an anti-Markovian solution that recomputes numeric information at every step. MAS compresses behaviorally relevant information to as few as 4 dimensions while achieving IIA comparable to DAS. It also reduces the number of required comparison matrices from O(n²) to O(n), which makes it more compute-efficient than traditional model stitching when comparing three or more models. A case study on DeepSeek-R1-Distill-Qwen-1.5B models fine-tuned on toxic versus nontoxic text finds that toxic-to-toxic MAS IIA is measurably higher than toxic-to-nontoxic IIA, whereas nontoxic-to-nontoxic comparisons show no significant internal difference, which suggests that MAS can serve as a diagnostic for representational misalignment. The Counterfactual Latent MAS (CLMAS) extension adds an auxiliary L2 plus cosine loss against prerecorded latent vectors and recovers causal alignment even when one model is causally inaccessible. This implies the method may generalize to ANN–biological neural network comparisons where only recordings, not interventions, are available.
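
The interchange intervention described in the summary above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes 2-dimensional hidden states and hand-picked rotation angles, whereas MAS learns its rotation matrices by training. All function names (`rotation_2d`, `interchange`, `iia`) are hypothetical and chosen only for illustration.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a matrix; for an orthogonal matrix this is its inverse."""
    return [list(col) for col in zip(*m)]


def rotation_2d(theta):
    """Hand-picked 2x2 orthogonal rotation (MAS would learn this instead)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def interchange(h_target, h_source, rot_target, rot_source, k):
    """Patch the first k aligned dimensions of the source into the target.

    Each hidden state is rotated by its own model's matrix into a shared
    aligned space, the first k coordinates are swapped in from the source,
    and the result is rotated back into the target model's own space.
    """
    z_target = matvec(rot_target, h_target)
    z_source = matvec(rot_source, h_source)
    z_patched = z_source[:k] + z_target[k:]
    return matvec(transpose(rot_target), z_patched)


def iia(predictions, counterfactual_labels):
    """Fraction of interventions whose output matches the counterfactual label."""
    hits = sum(p == y for p, y in zip(predictions, counterfactual_labels))
    return hits / len(counterfactual_labels)


# Cross-model patch: source state from "model A", target state in "model B".
rot_a, rot_b = rotation_2d(0.3), rotation_2d(1.1)
patched = interchange([1.0, 2.0], [3.0, 4.0], rot_b, rot_a, k=1)
```

In this sketch each model contributes exactly one rotation, so n models need n matrices, while pairwise stitching would need one map per model pair. That is the O(n) versus O(n²) contrast the summary mentions.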


Affiliations (2)

Co-authors (2)

Recent mentions (2)