thinker:simon-jerome-han
Simon Jerome Han
Authored papers (1)
Causal intervention methods central to mechanistic interpretability, including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS), systematically produce representations that diverge from the target model's natural distribution, and this divergence can corrupt mechanistic conclusions even when behavioral accuracy appears unaffected. For any manifold geometry other than axis-aligned hyperrectangles, coordinate patching provably produces off-manifold representations given exhaustive sampling, and empirical measurements using Earth Mover's Distance (EMD) confirm divergence across all three tested methods on Meta-Llama-3-8B-Instruct. Two mechanistically distinct failure modes emerge: 'harmless' divergences confined to the behavioral null-space of downstream weight matrices, and 'pernicious' divergences that activate hidden computational pathways or trigger dormant behavioral changes. The latter is illustrated concretely with a ReLU circuit in which mean-difference patching recruits a third hidden unit that is silent under all natural class inputs. To mitigate pernicious divergence, the paper applies and modifies the Counterfactual Latent (CL) loss from Grant (2025), showing that it reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 in synthetic DAS settings while maintaining an interchange intervention accuracy (IIA) of 0.997–0.9988, and that training EMD anti-correlates with out-of-distribution (OOD) IIA (coef. −0.34, R² = 0.73, F(1,28) = 75.28, p < 0.001) in a 7B LLM Boundless DAS setting. The paper argues that any divergence outside the null-space of neural network layers is therefore potentially pernicious, posing fundamental challenges for aspirations of complete mechanistic understanding using current causal intervention methods alone.
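The off-manifold claim for coordinate patching can be illustrated with a toy sketch. It assumes, purely for illustration, that natural representations lie on the unit circle in two dimensions, which is not an axis-aligned hyperrectangle. The helper names (`sample_circle`, `coordinate_patch`, `off_manifold_distance`) are hypothetical and not taken from the paper, and the distance-to-circle measure is a stand-in for the paper's EMD measurements, not a reproduction of them.

```python
import math
import random

def sample_circle(rng):
    """Sample a 'natural' 2-D representation lying on the unit circle (toy assumption)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (math.cos(theta), math.sin(theta))

def coordinate_patch(base, source, index):
    """Overwrite one coordinate of `base` with the same coordinate of `source`."""
    patched = list(base)
    patched[index] = source[index]
    return tuple(patched)

def off_manifold_distance(point):
    """Distance of a 2-D point from the unit circle (0 means on-manifold)."""
    return abs(math.hypot(*point) - 1.0)

rng = random.Random(0)
natural = [sample_circle(rng) for _ in range(1000)]
# Patch coordinate 0 of a random base point with coordinate 0 of a random source point.
patched = [
    coordinate_patch(rng.choice(natural), rng.choice(natural), 0)
    for _ in range(1000)
]

max_natural = max(off_manifold_distance(p) for p in natural)
max_patched = max(off_manifold_distance(p) for p in patched)
print(f"max distance from manifold, natural: {max_natural:.2e}")
print(f"max distance from manifold, patched: {max_patched:.3f}")
```

In this sketch every natural sample sits on the circle up to floating-point error, while patched points generally do not, because mixing coordinates from two on-circle points does not preserve the norm.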
Affiliations (1)
- Stanford University (institute)
Co-authors (3)
- Alexa R. Tartaglini (9 shared)
- Christopher Potts (9 shared)
- Satchel Grant (9 shared)
Their work is cited by (1)
Recent mentions (1)
- papers-typedgrant-2025-addressing-divergent.md