thinker
active
thinker:satchel-grant

Satchel Grant

Authored: 2
Introduces: 0
Studies: 1
Affiliations: 2
Cited by: 2

Authored papers (2)

  • Causal intervention methods central to mechanistic interpretability, including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS), systematically produce representations that diverge from the target model's natural distribution, and this divergence can corrupt mechanistic conclusions even when behavioral accuracy appears unaffected. For any manifold geometry other than an axis-aligned hyperrectangle, coordinate patching is provably guaranteed to produce off-manifold representations given exhaustive sampling, and empirical measurements using Earth Mover's Distance (EMD) confirm divergence across all three tested methods on Meta-Llama-3-8B-Instruct. Two mechanistically distinct failure modes emerge: 'harmless' divergences confined to the behavioral null-space of downstream weight matrices, and 'pernicious' divergences that activate hidden computational pathways or trigger dormant behavioral changes. The pernicious case is illustrated concretely with a ReLU circuit in which mean-difference patching recruits a third hidden unit that is silent under all natural class inputs. To mitigate pernicious divergence, the paper applies and modifies the Counterfactual Latent (CL) loss from Grant (2025), showing that it reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 in synthetic DAS settings while maintaining Interchange Intervention Accuracy (IIA) of 0.997–0.9988, and that training EMD anti-correlates with out-of-distribution (OOD) IIA (coef. −0.34, R² = 0.73, F(1,28) = 75.28, p < 0.001) in a 7B LLM Boundless DAS setting. The paper argues that any divergence outside the null-space of neural network layers is therefore potentially pernicious, posing fundamental challenges for aspirations of complete mechanistic understanding using current causal intervention methods alone.

  • Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces. It then uses interchange interventions, which patch those subspaces across pairs of frozen models, to measure functional alignment via Interchange Intervention Accuracy (IIA). Comparing GRUs and 2-layer Transformers on numeric tasks reveals that correlative methods such as Representational Similarity Analysis (RSA) and Centered Kernel Alignment (CKA) give misleading estimates: RSA shows anomalously low embedding-layer similarity between same-architecture GRU seeds, and both CKA and RSA suggest potentially high similarity between GRU and Transformer hidden states that MAS correctly diagnoses as low, because Transformers employ an anti-Markovian solution that recomputes numeric information at every step. MAS compresses behaviorally relevant information to as few as 4 dimensions while achieving IIA comparable to DAS, and it reduces the number of required comparison matrices from O(n²) to O(n), making it more compute-efficient than traditional model stitching for three or more models. A case study on DeepSeek-R1-Distill-Qwen-1.5B models fine-tuned on toxic versus nontoxic text demonstrates that toxic-to-toxic MAS IIA is measurably higher than toxic-to-nontoxic IIA, whereas nontoxic-to-nontoxic comparisons show no significant internal difference, suggesting that MAS can serve as a diagnostic for representational misalignment. The Counterfactual Latent MAS (CLMAS) extension, which adds an auxiliary L2 plus cosine loss against prerecorded latent vectors, recovers causal alignment even when one model is causally inaccessible, implying that the method may generalize to comparisons between artificial and biological neural networks where only recordings, not interventions, are available.
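
Both summaries refer to a Counterfactual Latent auxiliary loss, which the second describes as an L2 plus cosine term computed against prerecorded latent vectors. The sketch below shows one plausible reading of that description. Several details are assumptions for illustration and are not taken from the papers: the function name `cl_loss`, the use of squared L2 distance, the `1 - cosine` form, the equal default weights, and averaging over the batch.

```python
import math


def cl_loss(intervened, recorded, l2_weight=1.0, cos_weight=1.0):
    """Hypothetical auxiliary loss: squared L2 distance plus (1 - cosine similarity).

    intervened: list of latent vectors produced after an intervention.
    recorded:   list of prerecorded latent vectors of the same shape.
    The weights and the exact functional form are illustrative assumptions.
    """
    if not intervened or len(intervened) != len(recorded):
        raise ValueError("expected two non-empty batches of equal length")

    total = 0.0
    for h, target in zip(intervened, recorded):
        # Squared Euclidean distance penalizes differences in position.
        sq_l2 = sum((a - b) ** 2 for a, b in zip(h, target))

        # Cosine similarity penalizes differences in direction only.
        dot = sum(a * b for a, b in zip(h, target))
        norm = math.sqrt(sum(a * a for a in h)) * math.sqrt(sum(b * b for b in target))
        cos = dot / norm if norm > 0 else 0.0

        total += l2_weight * sq_l2 + cos_weight * (1.0 - cos)

    # Average over the batch so the value does not scale with batch size.
    return total / len(intervened)
```

Under these assumptions the loss is zero only when an intervened vector matches its recorded counterpart exactly. The cosine term adds a penalty for directional mismatch, while the L2 term also penalizes vectors that point the same way but differ in magnitude.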


Affiliations (2)

Co-authors (5)

Recent mentions (2)