paper: arxiv-2511-04638
Addressing divergent representations from causal interventions on neural networks
TL;DR
Causal intervention methods central to mechanistic interpretability (activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)) systematically produce representations that diverge from the target model's natural distribution, and this divergence can corrupt mechanistic conclusions even when behavioral accuracy appears unaffected. For any convex manifold geometry other than an axis-aligned hyperrectangle, coordinate patching is provably guaranteed to produce off-manifold representations given exhaustive sampling, and empirical measurements using Earth Mover's Distance (EMD) confirm divergence across all three tested methods on Meta-Llama-3-8B-Instruct. The paper distinguishes two mechanistically distinct cases: 'harmless' divergences confined to the behavioral null-space of downstream weight matrices, and 'pernicious' divergences that activate hidden computational pathways or trigger dormant behavioral changes. The pernicious case is illustrated with a ReLU circuit in which mean-difference patching recruits a third hidden unit that is silent under all natural class inputs. To mitigate pernicious divergence, the paper applies and modifies the Counterfactual Latent (CL) loss from Grant (2025). In synthetic DAS settings, the modified loss reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 while maintaining IIA of 0.997–0.9988, and training EMD anti-correlates with OOD IIA across 30 synthetic alignment trainings (coef. −0.34, R² = 0.73, F(1,28) = 75.28, p < 0.001). In a 7B LLM Boundless DAS setting, small CL loss weights reduce EMD while maintaining IIA. The paper argues that any divergence outside the null-space of NN layers is potentially pernicious, which poses fundamental challenges for aspirations of complete mechanistic understanding using current causal intervention methods alone.
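The geometric claim can be checked numerically with a minimal sketch. The code below is illustrative only and is not the paper's experiment: it assumes a 2-D unit disk as a stand-in for a convex, non-rectangular manifold and an axis-aligned square as the rectangular control. All helper names (`coordinate_patch`, `off_manifold_rate`, and the samplers) are hypothetical. Coordinate patching here means copying one coordinate from a source point into a base point.

```python
import math
import random


def sample_disk(rng):
    """Draw a point uniformly from the unit disk (convex, not a rectangle)."""
    r = math.sqrt(rng.random())
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(theta), r * math.sin(theta))


def sample_square(rng):
    """Draw a point uniformly from the axis-aligned square [-1, 1]^2."""
    return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))


def in_disk(p):
    # Small tolerance absorbs floating-point error for points on the boundary.
    return p[0] ** 2 + p[1] ** 2 <= 1.0 + 1e-9


def in_square(p):
    return abs(p[0]) <= 1.0 and abs(p[1]) <= 1.0


def coordinate_patch(base, source, dim):
    """Overwrite coordinate `dim` of `base` with the value from `source`."""
    patched = list(base)
    patched[dim] = source[dim]
    return tuple(patched)


def off_manifold_rate(sampler, inside, n_pairs=10_000, seed=0):
    """Fraction of patched points that leave the set both parents came from."""
    rng = random.Random(seed)
    off = 0
    for _ in range(n_pairs):
        base, source = sampler(rng), sampler(rng)
        if not inside(coordinate_patch(base, source, 0)):
            off += 1
    return off / n_pairs


# Two boundary points of the disk whose patch lands at (1, 1), outside the disk.
boundary_patch = coordinate_patch((0.0, 1.0), (1.0, 0.0), 0)
disk_rate = off_manifold_rate(sample_disk, in_disk)
square_rate = off_manifold_rate(sample_square, in_square)

print("boundary patch:", boundary_patch, "inside disk:", in_disk(boundary_patch))
print("off-manifold rate, disk:", disk_rate, "square:", square_rate)
```

The square never yields an off-manifold point because a Cartesian product of intervals is closed under coordinate swaps, whereas the disk does as soon as the sampled pair is unfavorable.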
What to take away
- 1. For any convex manifold geometry other than an axis-aligned hyperrectangle, coordinate patching is provably guaranteed to produce off-manifold (divergent) representations given exhaustive sampling, because only Cartesian products of intervals are closed under coordinate patching.
- 2. All three tested causal intervention methods—mean-difference vector patching (replicating Feng & Steinhardt 2024), Sparse Autoencoder reconstruction via SAELens (Bloom et al. 2024), and Boundless DAS (Wu et al. 2023)—produce EMD values on Meta-Llama-3-8B-Instruct that exceed the natural distribution's self-comparison baseline.
- 3. Prior activation patching experiments have multiplied feature values by up to 15x (Lindsey et al. 2025), making representational divergence not merely a theoretical concern but a practical one in published interpretability work.
- 4. 'Pernicious' divergence is operationally distinguishable from 'harmless' divergence by whether the off-manifold representation activates hidden pathways (units or circuits silent under all natural inputs) or triggers dormant behavioral changes across contexts not examined during the original experiment.
- 5. In a concrete two-layer ReLU circuit, mean-difference patching flips a class-A/B decision by activating a third hidden unit that is silent for all natural class-A representations, demonstrating that hypothesis-confirming behavior can arise entirely through off-manifold mechanisms.
- 6. The modified Counterfactual Latent (CL) loss—applied exclusively to causal subspaces discovered through DAS alignment training—reduces EMD along feature dimensions from 0.032 ± 0.003 (DAS behavioral loss only) to 0.007 ± 0.001 while maintaining IIA of 0.9988 ± 0.0005 on a 10-class synthetic dataset with 18-dimensional representations (2 causal + 16 noise dimensions).
- 7. Training EMD along causal axes anti-correlates with out-of-distribution interchange intervention accuracy (IIA) across 30 alignment trainings, with regression coefficient −0.3424, R² = 0.729, and F(1,28) = 75.28 (p < 0.001), indicating that divergence predicts generalization failure in this synthetic setting.
- 8. In the 7B Boundless DAS setting from Wu et al. (2023), applying the CL auxiliary loss with small epsilon values maintains IIA while reducing EMD, as visualized via PCA projections of natural and intervened representations at the intervention layer.
- 9. An open question the paper raises is whether a principled, ideally self-supervised method can classify divergence as harmless or pernicious for arbitrary mechanistic claims, given that the current CL loss minimizes all divergence indiscriminately rather than targeting only pernicious cases.
- 10. To replicate the OOD CL loss experiment, a researcher should train a 128-hidden-unit MLP with batch normalization and 0.5 dropout on 18-dimensional synthetic representations, partition 10 classes into dense and sparse clusters, train DAS alignment functions on each partition independently using the symmetric invertible alignment matrix from Grant et al. (2024), and evaluate cross-partition IIA versus training EMD.
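The EMD figures in items 2, 6, and 7 rest on comparing intervened representations against a natural-versus-natural baseline. A minimal sketch of that comparison along a single dimension follows. It assumes equal-sized 1-D samples (in which case EMD reduces to the mean absolute difference of sorted values), Gaussian stand-in data, and a hypothetical mean shift playing the role of the intervention. The paper's exact EMD procedure is not reproduced here.

```python
import random


def emd_1d(a, b):
    """Earth Mover's Distance between two equal-sized 1-D samples.

    For equal-sized empirical distributions the optimal transport plan
    pairs sorted values, so EMD is the mean absolute sorted difference.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


rng = random.Random(0)
n = 2000

# Two independent draws from the same "natural" distribution (stand-in data).
natural_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
natural_b = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Hypothetical intervention: shift every value by a fixed offset.
shift = 1.0
intervened = [x + shift for x in natural_a]

baseline_emd = emd_1d(natural_a, natural_b)     # natural vs natural
intervened_emd = emd_1d(intervened, natural_b)  # intervened vs natural

print("baseline EMD:", baseline_emd)
print("intervened EMD:", intervened_emd)
```

An intervention counts as divergent in this sense when its EMD exceeds the baseline, since the baseline captures how far apart two natural samples sit by chance alone.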
Peer brief — for seminar discussion
This paper asks whether causal intervention methods used in mechanistic interpretability—activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search—produce representations that deviate from a target model's natural latent distribution, and whether such deviations undermine the mechanistic conclusions drawn from those interventions. To answer this, the paper combines formal proofs about manifold geometry with empirical measurements on Meta-Llama-3-8B-Instruct and synthetic MLP experiments, and introduces the Counterfactual Latent (CL) loss as a mitigation tool. An alternative approach the paper does not pursue would be direct manifold projection of intervened representations onto the natural convex hull, which existing counterfactual explanation literature (e.g., Verma et al. 2024; Tsiourvas et al. 2024) employs; the CL loss is preferred here because it generates principled, gradient-trained interventions rather than post-hoc projections. The load-bearing finding is a two-part result. First, for any convex manifold geometry that is not an axis-aligned hyperrectangle, coordinate patching provably generates off-manifold points given sufficient intervention samples—a result that applies to virtually all realistic neural activation distributions. Empirically, this is confirmed across all three methods on Meta-Llama-3-8B-Instruct (layers 10 and 25 are examined), with Earth Mover's Distance values uniformly exceeding the natural-to-natural baseline. Second, the paper distinguishes 'harmless' divergences—those falling in the behavioral null-space of downstream weight matrices—from 'pernicious' ones that activate hidden computational pathways or produce dormant behavioral changes. A concrete ReLU circuit example shows mean-difference patching recruiting a hidden unit silent for all natural class-A inputs, yielding hypothesis-confirming behavior through a non-native mechanism. 
As a mitigation, the modified CL loss (targeting causal subspaces specifically) reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 in 18-dimensional synthetic representations while maintaining IIA at 0.9988 ± 0.0005. Critically, training EMD predicts out-of-distribution IIA failure: regression across 30 alignment trainings yields coefficient −0.3424 with R² = 0.73 and F(1,28) = 75.28 (p < 0.001). In the 7B Boundless DAS setting from Wu et al. (2023), the CL loss reduces representational divergence without sacrificing IIA. The paper concludes that any divergence outside the null-space of NN layers is potentially pernicious, implying that current causal intervention methods cannot alone support complete mechanistic understanding.
A critical reader would push back on the scope of the perniciousness argument. The paper does not demonstrate that pernicious divergence has actually corrupted published mechanistic claims; it only constructs synthetic existence proofs (the ReLU circuit examples) and shows that divergence predicts OOD generalization failure in controlled settings. The jump from 'divergence exists and can in principle activate hidden pathways' to 'published interpretability findings are unreliable' is not established empirically on real LLM mechanistic claims. The regression's high R² of 0.73 is derived entirely from the synthetic 10-class task, and whether this relationship holds at the scale and complexity of transformer circuits in practice remains an open question that the paper itself acknowledges.
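As a quick sanity check on the reported regression, the F statistic of a linear regression can be recovered from R² and the degrees of freedom. The sketch below assumes a simple one-predictor regression over 30 observations, which is what F(1,28) implies; the function name is hypothetical.

```python
def f_statistic_from_r2(r_squared, n_obs, n_predictors=1):
    """Return (F, df_model, df_residual) for an OLS fit with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_value = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)
    return f_value, df_model, df_resid


# Values taken from the summary above: R^2 = 0.729 over 30 alignment trainings.
f_value, df1, df2 = f_statistic_from_r2(0.729, 30)
print(f"F({df1},{df2}) = {f_value:.2f}")
```

The result lands within a few hundredths of the reported 75.28, a gap that is plausibly explained by R² being rounded to three decimals.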
Methods (5)
- Algorithm 1: Harmlessness Classification
Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
- Causal Scrubbing
Method by Chan et al. (2022) for rigorously testing interpretability hypotheses via interventions
- KDE Density Score
Nonparametric density estimate scoring how typical an intervened representation is relative to the natural distribution
- Local Linear Reconstruction Error
Measures how well an intervened point can be expressed as a convex combination of nearby natural manifold points
- Local PCA Distance
Measures off-manifold distance by computing the orthogonal residual from the local tangent subspace
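Of the diagnostics listed above, the KDE density score is the simplest to sketch. The example below is an assumption-laden illustration rather than the paper's implementation: it uses an isotropic Gaussian kernel, a hand-picked bandwidth, and natural points placed on a unit circle as a stand-in manifold. The function name `kde_log_density` is hypothetical.

```python
import math


def kde_log_density(point, natural_points, bandwidth):
    """Log of an isotropic Gaussian kernel density estimate at `point`."""
    dim = len(point)
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * bandwidth ** 2)
    log_kernels = []
    for ref in natural_points:
        sq_dist = sum((p - r) ** 2 for p, r in zip(point, ref))
        log_kernels.append(-sq_dist / (2.0 * bandwidth ** 2))
    # Log-sum-exp keeps the sum stable when a point is far from every sample.
    peak = max(log_kernels)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in log_kernels))
    return log_norm + log_sum - math.log(len(natural_points))


# Stand-in "natural" representations: 200 points on the unit circle.
natural = [
    (math.cos(2.0 * math.pi * k / 200), math.sin(2.0 * math.pi * k / 200))
    for k in range(200)
]

on_manifold = (math.cos(0.3), math.sin(0.3))  # lies on the circle
off_manifold = (1.0, 1.0)                     # e.g. a coordinate-patched point

score_on = kde_log_density(on_manifold, natural, bandwidth=0.1)
score_off = kde_log_density(off_manifold, natural, bandwidth=0.1)
print("on-manifold log-density:", score_on)
print("off-manifold log-density:", score_off)
```

A lower score flags a representation as atypical. The score says nothing about whether the divergence is harmless or pernicious, which is a separate question about downstream behavior.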
Frameworks (1)
- Modified CL Loss
Novel variant of CL loss introduced in this paper targeting only causal subspace dimensions to improve OOD performance
Datasets (1)
- Synthetic DAS Dataset (10-class grid)
Synthetic dataset of 10-class representations with two causal feature dimensions and 16 noise dimensions used in Section 5.2
Findings (15)
- Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs
Synthetic theoretical example showing pernicious divergence via hidden pathway activation
- An intervention benign at context v4<0.75 produces a class-C behavioral flip at 0.75<v4<1, demonstrating dormant behavioral changes from latent divergence
Synthetic example showing an intervention that appears safe in tested contexts but causes behavior changes in others
- Coordinate patching on circular manifolds guarantees off-manifold representations for boundary point pairs with orthogonal deviations
Theoretical proof that patching produces divergent representations for most manifold geometries
- Intervention on a balanced subspace dimension while holding others fixed crosses the decision boundary using a non-native mechanism
Additional synthetic example of pernicious divergence from balanced subspaces
- For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM
Empirical result showing the CL loss can reduce divergence without sacrificing interpretability accuracy
- Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline
Empirical demonstration that MDVP produces divergent representations in a real LLM
- Linear regression of OOD IIA on training EMD yields coefficient -0.3424, R^2=0.729, F(1,28)=75.28, p<.001
Statistical evidence that training divergence (EMD) predicts lower OOD intervention performance
- SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline
Empirical demonstration that SAE projections produce divergent representations in a real LLM
- DAS behavioral loss produces EMD along feature dimensions of 0.032±0.003 on synthetic 10-class dataset
Quantitative baseline for divergence using behavioral DAS loss on synthetic dataset
- Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
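The hidden-pathway finding can be made concrete with a toy circuit. The sketch below is a hypothetical construction written for this summary, not the circuit from the paper: the weights, biases, and natural points are all assumed values chosen so that a third ReLU unit stays silent on every natural input yet fires after mean-difference patching.

```python
def relu(x):
    return max(0.0, x)


def hidden(h):
    """Three hidden ReLU units of a hypothetical toy circuit."""
    x, y = h
    u1 = relu(x - y - 1.0)  # native class-A detector
    u2 = relu(1.0 - x)      # native class-B detector
    u3 = relu(x + y - 2.5)  # silent on every natural input defined below
    return (u1, u2, u3)


def predict(h):
    u1, u2, u3 = hidden(h)
    logit_a = u1 + 2.0 * u3  # u3 also feeds the class-A logit
    logit_b = u2
    return "A" if logit_a > logit_b else "B"


def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))


# Assumed natural representations for each class.
natural_a = [(2.0, 0.0), (2.0, 0.25), (2.0, -0.25)]
natural_b = [(0.0, 0.0), (0.0, 2.0)]

# Mean-difference vector pointing from class B toward class A.
mu_a, mu_b = mean(natural_a), mean(natural_b)
diff = (mu_a[0] - mu_b[0], mu_a[1] - mu_b[1])

# Patch a natural class-B representation by adding the mean difference.
base = (0.0, 2.0)
patched = (base[0] + diff[0], base[1] + diff[1])

print("natural predictions:", [predict(h) for h in natural_a + natural_b])
print("patched point:", patched, "hidden units:", hidden(patched))
print("patched prediction:", predict(patched))
```

In this toy, the patched point is classified as A, which looks like confirmation that the mean-difference direction encodes class identity. The native A detector `u1` contributes nothing, however; the flip is carried entirely by `u3`, a unit that no natural input ever activates.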
Claims (10)
- The CL auxiliary loss can directly reduce representational divergence in practical interpretability settings without sacrificing interpretability method accuracy
Central practical contribution: the CL loss offers a viable mitigation strategy
- Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations
Key theoretical claim distinguishing harmless from pernicious divergence
- Any divergence outside of the null-space of NN layers is potentially pernicious, posing challenges for a complete mechanistic understanding of NNs
Sobering conclusion about the fundamental challenge posed by divergence for mechanistic interpretability
- Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised
Core claim about why pernicious divergence undermines mechanistic conclusions
- The harm of divergence is inherently claim-dependent: the same divergence can be harmless for one mechanistic claim and pernicious for another
Important nuance that prevents a universal classification of divergence as always good or bad
- Many practical mechanistic projects can be satisfied by collecting sufficiently large intervention evaluation datasets
Optimistic practical note about mitigating divergence concerns without solving the theoretical problem
- Detecting dormant behavioral changes requires evaluating across all possible contexts, which is infeasible in practice
Practical limitation of current evaluation methods for pernicious divergence
- Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods
Core empirical claim of the paper supported by both theoretical proof and empirical demonstration
- Minimizing divergence magnitude does not guarantee elimination of hidden pathways; it only reduces the risk surface
Important caveat to the CL loss solution, noting it is a step not a complete fix
- Representational divergence (as measured by EMD) can predict lower out-of-distribution intervention performance
Practical utility of reducing divergence demonstrated through regression analysis
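The null-space claims above can be illustrated for a single linear layer. The sketch below is a simplified assumption: it treats the behavioral null-space as the ordinary null-space of one hypothetical weight matrix `W` and ignores nonlinearities and later layers. The helper names are hypothetical.

```python
def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def in_null_space(weights, delta, tol=1e-9):
    """True when the layer maps the divergence vector `delta` to zero."""
    return all(abs(v) <= tol for v in matvec(weights, delta))


# Hypothetical downstream layer: reads only the difference of the first two
# input dimensions and ignores the third entirely.
W = [
    [1.0, -1.0, 0.0],
    [2.0, -2.0, 0.0],
]

natural = [0.5, -0.25, 0.125]
null_delta = [3.0, 3.0, 5.0]   # large divergence, but W maps it to zero
other_delta = [0.0, 1.0, 0.0]  # small divergence that W does read

diverged_null = [a + d for a, d in zip(natural, null_delta)]
diverged_other = [a + d for a, d in zip(natural, other_delta)]

print("null-space delta:", in_null_space(W, null_delta))
print("outputs match:", matvec(W, diverged_null) == matvec(W, natural))
print("other delta in null-space:", in_null_space(W, other_delta))
print("outputs differ:", matvec(W, diverged_other) != matvec(W, natural))
```

The example shows why divergence magnitude alone is a poor guide: the larger perturbation leaves the layer output untouched, while the smaller one changes it and is therefore potentially pernicious in the paper's terms.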
Hypotheses (1)
- We hypothesized that divergence could influence IIA when transferring the DAS alignment to OOD settings
Motivating hypothesis for the OOD experiment testing practical utility of divergence reduction
Questions (4)
- How can we produce a principled method for classifying harmful divergence for any mechanistic claim?
Identified gap: current work lacks a general method for harmful divergence classification
- Do divergent representations change what an intervention can say about an NN's natural mechanisms?
Core research question motivating the paper
- When it is not okay, how can we prevent divergent representations from occurring?
Third core research question motivating the CL loss approach in Section 5
- When, and to what extent, is it okay for divergences to occur?
Second core research question motivating the theoretical analysis in Section 4
Original abstract
A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025) allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
Related work— refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? · cited, in corpus · 2025 · ≈ 87%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations · cited, in corpus · 2023 · ≈ 87%
- Model Alignment Search · cited, in corpus · 2025 · ≈ 85%
- Combining Causal Models for More Accurate Abstractions of Neural Networks · Sara Magliacane, Atticus Geiger, Theodora-Mara Pîslar · 2025 · ≈ 85%
- Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions · Usman Naseem · 2026 · ≈ 85%
- A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders · Rajiv Misra, Sanjay Kumar Singh, Anisha Roy, Dip Roy · 2026 · ≈ 85%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior · in corpus · 2026 · ≈ 84%
- Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions · Dhruv Kumar, Manan Gupta · 2025 · ≈ 84%
- Using Degeneracy in the Loss Landscape for Mechanistic Interpretability · Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn, Lucius Bushnaq · 2024 · ≈ 84%
- PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction · Arya Datla, Ziv Goldfeld, Jonathn Chang · 2026 · ≈ 84%
- Causal Probing for Internal Visual Representations in Multimodal Large Language Models · Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang, Zehao Deng · 2026 · ≈ 84%
- Patches of Nonlinearity: Instruction Vectors in Large Language Models · Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych, Irina Bigoulaeva · 2026 · ≈ 84%
- Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training · Jiaying Zhu, Hongyang Chen, Hongxu Liu, Xinyu Yang, Wenya Wang, Hang Chen · 2026 · ≈ 84%
- Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation · Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng, Qiming Li · 2025 · ≈ 84%
- A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima · Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu, Yiming Tang · 2026 · ≈ 84%
- Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models · Mehdi Taghipour, Rahmatollah Beheshti, Ali Abbasi · 2026 · ≈ 84%
- Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse · Rajiv Misra, Sanjay Kumar Singh, Dip Roy · 2026 · ≈ 84%
- Mechanistic Interpretability as Statistical Estimation: A Variance Analysis · François Portet, Maxime Peyrard, Maxime Méloux · 2026 · ≈ 84%
- Constructing Interpretable Features from Compositional Neuron Groups · Atticus Geiger, Mor Geva, Or Shafran · 2026 · ≈ 84%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders · in corpus · 2026 · ≈ 83%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets · in corpus · 2023 · ≈ 83%
- The Platonic Representation Hypothesis · in corpus · 2024 · ≈ 82%
- Neural natural language inference models partially embed theories of lexical entailment and negation · cited · 2020 · ≈ 78%
Cited by (1)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie