paper
active
2025
paper:arxiv-2511-04638

Addressing divergent representations from causal interventions on neural networks

TL;DR

Causal intervention methods central to mechanistic interpretability (activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)) systematically produce representations that diverge from the target model's natural distribution, and this divergence can corrupt mechanistic conclusions even when behavioral accuracy appears unaffected. For any convex manifold geometry other than an axis-aligned hyperrectangle, coordinate patching is guaranteed to produce off-manifold representations given exhaustive sampling, and empirical measurements using Earth Mover's Distance (EMD) confirm divergence for all three methods tested on Meta-Llama-3-8B-Instruct (mean-difference vector patching, Sparse Autoencoder reconstruction, and Boundless DAS). Two mechanistically distinct failure modes emerge: 'harmless' divergences confined to the behavioral null-space of downstream weight matrices, and 'pernicious' divergences that activate hidden computational pathways or trigger dormant behavioral changes. The pernicious case is illustrated concretely with a ReLU circuit in which mean-difference patching recruits a third hidden unit that is silent for all natural class-A inputs. To mitigate pernicious divergence, the paper applies and modifies the Counterfactual Latent (CL) loss from Grant (2025), showing it reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 in synthetic DAS settings while maintaining IIA of 0.997–0.9988. Across 30 alignment trainings on the synthetic task, training EMD anti-correlates with OOD IIA (coef. −0.34, R² = 0.73, F(1,28) = 75.28, p < 0.001), and in the 7B Boundless DAS setting the CL loss reduces EMD while maintaining IIA. The paper argues that any divergence outside the null-space of NN layers is therefore potentially pernicious, which poses fundamental challenges for aspirations of complete mechanistic understanding using current causal intervention methods alone.
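The hyperrectangle claim can be illustrated with a small numerical sketch. This is not the paper's code: the unit disk, the unit square, the sample count, and the helper names (`coordinate_patch`, `sample_disk`, `sample_square`) are all assumptions chosen for illustration. The sketch patches one coordinate from a source sample into a base sample and measures how often the result leaves the set both samples came from.

```python
import numpy as np

def coordinate_patch(base, source, dims):
    """Copy the coordinates listed in `dims` from `source` into `base`."""
    patched = base.copy()
    patched[:, dims] = source[:, dims]
    return patched

def sample_disk(rng, n):
    """Uniform samples from the unit disk (convex, but not a hyperrectangle)."""
    radius = np.sqrt(rng.uniform(0.0, 1.0, n))
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

def sample_square(rng, n):
    """Uniform samples from the axis-aligned square [-1, 1) x [-1, 1)."""
    return rng.uniform(-1.0, 1.0, size=(n, 2))

rng = np.random.default_rng(0)
n = 20_000  # illustrative sample count

# Patch coordinate 0 of each base point with coordinate 0 of a source point.
disk_patched = coordinate_patch(sample_disk(rng, n), sample_disk(rng, n), dims=[0])
square_patched = coordinate_patch(sample_square(rng, n), sample_square(rng, n), dims=[0])

# Fraction of patched points that fall outside the original set.
disk_off = float(np.mean(np.linalg.norm(disk_patched, axis=1) > 1.0))
square_off = float(np.mean(np.any(np.abs(square_patched) > 1.0, axis=1)))

print(f"off-manifold fraction, disk:   {disk_off:.3f}")
print(f"off-manifold fraction, square: {square_off:.3f}")
```

The square is a Cartesian product of intervals, so every patched point stays inside it, whereas a nonzero share of patched disk points lands outside the disk.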

What to take away

  1. For any convex manifold geometry other than an axis-aligned hyperrectangle, coordinate patching is provably guaranteed to produce off-manifold (divergent) representations given exhaustive sampling, because only Cartesian products of intervals are closed under coordinate patching.
  2. All three tested causal intervention methods (mean-difference vector patching, replicating Feng & Steinhardt 2024; Sparse Autoencoder reconstruction via SAELens, Bloom et al. 2024; and Boundless DAS, Wu et al. 2023) produce EMD values on Meta-Llama-3-8B-Instruct that exceed the natural distribution's self-comparison baseline.
  3. Prior activation patching experiments have multiplied feature values by up to 15x (Lindsey et al. 2025), making representational divergence not merely a theoretical concern but a practical one in published interpretability work.
  4. 'Pernicious' divergence is operationally distinguishable from 'harmless' divergence by whether the off-manifold representation activates hidden pathways (units or circuits silent under all natural inputs) or triggers dormant behavioral changes across contexts not examined during the original experiment.
  5. In a concrete two-layer ReLU circuit, mean-difference patching flips a class-A/B decision by activating a third hidden unit that is silent for all natural class-A representations, demonstrating that hypothesis-confirming behavior can arise entirely through off-manifold mechanisms.
  6. The modified Counterfactual Latent (CL) loss, applied exclusively to causal subspaces discovered through DAS alignment training, reduces EMD along feature dimensions from 0.032 ± 0.003 (DAS behavioral loss only) to 0.007 ± 0.001 while maintaining IIA of 0.9988 ± 0.0005 on a 10-class synthetic dataset with 18-dimensional representations (2 causal + 16 noise dimensions).
  7. Training EMD along causal axes anti-correlates with out-of-distribution interchange intervention accuracy (IIA) across 30 alignment trainings, with regression coefficient −0.3424, R² = 0.729, and F(1,28) = 75.28 (p < 0.001), establishing divergence as a predictor of generalization failure.
  8. In the 7B Boundless DAS setting from Wu et al. (2023), applying the CL auxiliary loss with small epsilon values maintains IIA while reducing EMD, as visualized via PCA projections of natural and intervened representations at the intervention layer.
  9. An open question the paper raises is whether a principled, ideally self-supervised method can classify divergence as harmless or pernicious for arbitrary mechanistic claims, given that the current CL loss minimizes all divergence indiscriminately rather than targeting only pernicious cases.
  10. To replicate the OOD CL loss experiment, a researcher should train a 128-hidden-unit MLP with batch normalization and 0.5 dropout on 18-dimensional synthetic representations, partition 10 classes into dense and sparse clusters, train DAS alignment functions on each partition independently using the symmetric invertible alignment matrix from Grant et al. (2024), and evaluate cross-partition IIA versus training EMD.
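The pernicious case in item 5 can be made concrete with a toy circuit. The sketch below is a hypothetical construction, not the circuit from the paper: the cluster centers, the weights `W`, `b`, `v`, and the noise range are all assumptions picked so that the third hidden unit stays silent on every natural input. Class A is given two modes so that adding the mean-difference vector pushes some samples outside the region covered by natural representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster(center, n):
    """Points in a small box around `center` (bounded uniform noise)."""
    return np.asarray(center) + rng.uniform(-0.05, 0.05, size=(n, 2))

# Hypothetical natural representations: class A has two modes, class B has one.
class_a = np.concatenate([cluster([0.0, 0.0], 200), cluster([0.0, 2.0], 200)])
class_b = cluster([2.0, 0.0], 400)

# Hypothetical two-layer ReLU circuit with three hidden units.
W = np.array([[-1.0, 0.0],   # unit 0: fires on natural class A
              [1.0, -2.0],   # unit 1: fires on natural class B
              [1.0, 1.0]])   # unit 2: silent on every natural input
b = np.array([1.0, -1.0, -2.5])
v = np.array([-1.0, 1.0, 1.0])  # readout; positive logit means "class B"

def hidden(x):
    return np.maximum(x @ W.T + b, 0.0)

def logit(x):
    return hidden(x) @ v

# Mean-difference patching: shift every class-A point by (mean_B - mean_A).
mean_diff = class_b.mean(axis=0) - class_a.mean(axis=0)
patched = class_a + mean_diff

h_nat = hidden(np.concatenate([class_a, class_b]))
h_pat = hidden(patched)

# Patched samples whose flip is carried only by the normally silent unit.
via_third = (h_pat[:, 2] > 0) & (h_pat[:, 1] == 0)
# Patched samples whose flip is carried by the native class-B unit.
via_native = (h_pat[:, 1] > 0) & (h_pat[:, 2] == 0)

print("all patched points classified as B:", bool((logit(patched) > 0).all()))
print("flipped via third unit only:", int(via_third.sum()))
print("flipped via native B unit:", int(via_native.sum()))
```

In this construction every patched point is classified as B, so the behavioral readout looks hypothesis-confirming. Half of the flips, however, run through a unit that no natural input ever activates, which behavioral accuracy alone would not reveal.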

Peer brief — for seminar discussion

This paper asks whether causal intervention methods used in mechanistic interpretability (activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search) produce representations that deviate from a target model's natural latent distribution, and whether such deviations undermine the mechanistic conclusions drawn from those interventions. To answer this, the paper combines formal proofs about manifold geometry with empirical measurements on Meta-Llama-3-8B-Instruct and synthetic MLP experiments, and applies a modified version of the Counterfactual Latent (CL) loss from Grant (2025) as a mitigation tool. An alternative approach the paper does not pursue would be direct manifold projection of intervened representations onto the natural convex hull, which existing counterfactual explanation literature (e.g., Verma et al. 2024; Tsiourvas et al. 2024) employs; the CL loss is preferred here because it generates principled, gradient-trained interventions rather than post-hoc projections.

The load-bearing finding is a two-part result. First, for any convex manifold geometry that is not an axis-aligned hyperrectangle, coordinate patching provably generates off-manifold points given sufficient intervention samples, a result that applies to virtually all realistic neural activation distributions. Empirically, this is confirmed across all three tested methods on Meta-Llama-3-8B-Instruct (layers 10 and 25 are examined), with Earth Mover's Distance values uniformly exceeding the natural-to-natural baseline. Second, the paper distinguishes 'harmless' divergences, which fall in the behavioral null-space of downstream weight matrices, from 'pernicious' ones that activate hidden computational pathways or produce dormant behavioral changes. A concrete ReLU circuit example shows mean-difference patching recruiting a hidden unit silent for all natural class-A inputs, yielding hypothesis-confirming behavior through a non-native mechanism.

As a mitigation, the modified CL loss (targeting causal subspaces specifically) reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 in 18-dimensional synthetic representations while maintaining IIA at 0.9988 ± 0.0005. Critically, training EMD predicts out-of-distribution IIA failure: regression across 30 alignment trainings yields coefficient −0.3424 with R² = 0.73 and F(1,28) = 75.28 (p < 0.001). In the 7B Boundless DAS setting from Wu et al. (2023), the CL loss reduces representational divergence without sacrificing IIA. The paper's prediction is that any divergence outside the null-space of NN layers is potentially pernicious, implying current causal intervention methods cannot alone support complete mechanistic understanding.

A critical reader would push back on the scope of the perniciousness argument. The paper does not demonstrate that pernicious divergence has actually corrupted published mechanistic claims in the literature; it only constructs synthetic existence proofs (the ReLU circuit examples) and shows that divergence predicts OOD generalization failure in controlled settings. The jump from 'divergence exists and can in principle activate hidden pathways' to 'published interpretability findings are unreliable' is not established empirically on real LLM mechanistic claims. The regression's high R² of 0.73 is derived entirely from the synthetic 10-class task, and whether this relationship holds at the scale and complexity of transformer circuits in practice remains an open question the paper itself acknowledges.
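The comparison between an intervened distribution and a natural-to-natural baseline can be sketched as follows. This summary does not specify the paper's exact EMD estimator, so the per-dimension one-dimensional Wasserstein distance used here is an assumption, as are the Gaussian stand-in data, the 3x scaling used as a mock intervention, and the helper names `emd_1d` and `mean_emd`.

```python
import numpy as np

def emd_1d(a, b):
    """Exact 1-D Earth Mover's Distance between two equal-sized samples.

    For equal sample sizes the optimal transport plan matches sorted values,
    so the distance is the mean absolute difference of the sorted samples.
    """
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    if a.shape != b.shape:
        raise ValueError("emd_1d expects samples of equal size")
    return float(np.mean(np.abs(a - b)))

def mean_emd(x, y):
    """Average the 1-D EMD over feature dimensions (an assumed aggregation)."""
    return float(np.mean([emd_1d(x[:, d], y[:, d]) for d in range(x.shape[1])]))

rng = np.random.default_rng(0)

# Stand-in "natural" representations; real use would collect model activations.
natural = rng.normal(size=(2000, 8))
half_a, half_b = natural[:1000], natural[1000:]

# Mock intervention: scale one feature dimension by 3x.
intervened = half_b.copy()
intervened[:, 0] *= 3.0

baseline = mean_emd(half_a, half_b)        # natural vs. natural
divergence = mean_emd(half_a, intervened)  # natural vs. intervened

print(f"natural-to-natural baseline EMD: {baseline:.4f}")
print(f"natural-to-intervened EMD:       {divergence:.4f}")
```

Splitting the natural sample in half gives a noise floor for the metric, so an intervention counts as divergent only when its EMD clearly exceeds that floor.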

Methods (5)

  • Algorithm 1: Harmlessness Classification
    Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
  • Causal Scrubbing
    Method by Chan et al. 2022 for rigorously testing interpretability hypotheses via interventions
  • KDE Density Score
    Nonparametric density estimate scoring how typical an intervened representation is relative to the natural distribution
  • Local Linear Reconstruction Error
    Measures how well an intervened point can be expressed as a convex combination of nearby natural manifold points
  • Local PCA Distance
    Measures off-manifold distance by computing the orthogonal residual from the local tangent subspace
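
The Local PCA Distance entry can be sketched in a few lines. This is a minimal reading of the one-line description above, not the paper's implementation: the neighborhood size `k`, the tangent dimension `n_components`, and the unit-circle toy data are assumptions.

```python
import numpy as np

def local_pca_distance(point, natural, k=20, n_components=1):
    """Norm of the residual of `point` from the local tangent subspace.

    The tangent subspace is estimated by PCA on the k natural points nearest
    to `point`; the residual is the part of the offset orthogonal to it.
    """
    dists = np.linalg.norm(natural - point, axis=1)
    neighbors = natural[np.argsort(dists)[:k]]
    center = neighbors.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(neighbors - center, full_matrices=False)
    tangent = vt[:n_components]
    offset = point - center
    residual = offset - tangent.T @ (tangent @ offset)
    return float(np.linalg.norm(residual))

# Toy natural manifold: evenly spaced points on the unit circle (1-D in 2-D).
angles = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
natural = np.stack([np.cos(angles), np.sin(angles)], axis=1)

on_manifold = local_pca_distance(np.array([1.0, 0.0]), natural)
off_manifold = local_pca_distance(np.array([1.5, 0.0]), natural)

print(f"distance for a point on the circle:  {on_manifold:.4f}")
print(f"distance for a point off the circle: {off_manifold:.4f}")
```

A point on the circle has a residual close to zero, while a point pushed radially outward by 0.5 has a residual of roughly that size, because the radial direction is orthogonal to the local tangent.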

Frameworks (1)

  • Modified CL Loss
    Novel variant of CL loss introduced in this paper targeting only causal subspace dimensions to improve OOD performance

Original abstract

A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025) allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
