Addressing divergent representations from causal interventions on neural networks

By Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts
PDP Lab, Stanford University

DOI 10.48550/arxiv.2511.04638 arXiv 2511.04638 OpenAlex W4416027175

Behavioral Null Space · Modified CL Loss · Algorithm 1: Harmlessness Classification · Synthetic DAS Dataset (10-class grid) · Behaviorally Binary Subspace · Causal Scrubbing · Dormant Behavioral Changes · KDE Density Score · Harmless Divergence · Local Linear Reconstruction Error · Hidden Pathways · Local PCA Distance · Mechanistic Interpretability · Pernicious Divergence

TL;DR

Causal intervention methods central to mechanistic interpretability (activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)) systematically produce representations that diverge from the target model's natural distribution, and this divergence can corrupt mechanistic conclusions even when behavioral accuracy appears unaffected. For any convex manifold geometry other than an axis-aligned hyperrectangle, coordinate patching is provably guaranteed to produce off-manifold representations given exhaustive sampling, and empirical measurements using Earth Mover's Distance (EMD) confirm divergence across all three tested methods (mean-difference vector patching, SAE reconstruction, and Boundless DAS) on Meta-Llama-3-8B-Instruct. Two mechanistically distinct failure modes emerge: 'harmless' divergences confined to the behavioral null-space of downstream weight matrices, and 'pernicious' divergences that activate hidden computational pathways or trigger dormant behavioral changes. The pernicious case is illustrated with a ReLU circuit in which mean-difference patching recruits a third hidden unit that is silent for all natural class-A inputs. To mitigate pernicious divergence, the paper applies and modifies the Counterfactual Latent (CL) loss from Grant (2025). In the synthetic DAS setting, this reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 while maintaining interchange intervention accuracy (IIA) of 0.997–0.9988, and training EMD anti-correlates with out-of-distribution (OOD) IIA (coef. −0.34, R² = 0.73, F(1,28) = 75.28, p < 0.001). In the 7B LLM Boundless DAS setting, small CL loss weights reduce EMD while maintaining IIA. The paper concludes that any divergence outside the null-space of NN layers is potentially pernicious, which poses fundamental challenges for aspirations of complete mechanistic understanding using current causal intervention methods alone.

What to take away

1. For any convex manifold geometry other than an axis-aligned hyperrectangle, coordinate patching is provably guaranteed to produce off-manifold (divergent) representations given exhaustive sampling, because only Cartesian products of intervals are closed under coordinate patching.
2. All three tested causal intervention methods—mean-difference vector patching (replicating Feng & Steinhardt 2024), Sparse Autoencoder reconstruction via SAELens (Bloom et al. 2024), and Boundless DAS (Wu et al. 2023)—produce EMD values on Meta-Llama-3-8B-Instruct that exceed the natural distribution's self-comparison baseline.
3. Prior activation patching experiments have multiplied feature values by up to 15x (Lindsey et al. 2025), making representational divergence not merely a theoretical concern but a practical one in published interpretability work.
4. 'Pernicious' divergence is operationally distinguishable from 'harmless' divergence by whether the off-manifold representation activates hidden pathways (units or circuits silent under all natural inputs) or triggers dormant behavioral changes across contexts not examined during the original experiment.
5. In a concrete two-layer ReLU circuit, mean-difference patching flips a class-A/B decision by activating a third hidden unit that is silent for all natural class-A representations, demonstrating that hypothesis-confirming behavior can arise entirely through off-manifold mechanisms.
6. The modified Counterfactual Latent (CL) loss—applied exclusively to causal subspaces discovered through DAS alignment training—reduces EMD along feature dimensions from 0.032 ± 0.003 (DAS behavioral loss only) to 0.007 ± 0.001 while maintaining IIA of 0.9988 ± 0.0005 on a 10-class synthetic dataset with 18-dimensional representations (2 causal + 16 noise dimensions).
7. Training EMD along causal axes anti-correlates with out-of-distribution interchange intervention accuracy (IIA) across 30 alignment trainings, with regression coefficient −0.3424, R² = 0.729, and F(1,28) = 75.28 (p < 0.001), establishing divergence as a predictor of generalization failure.
8. In the 7B Boundless DAS setting from Wu et al. (2023), applying the CL auxiliary loss with small epsilon values maintains IIA while reducing EMD, as visualized via PCA projections of natural and intervened representations at the intervention layer.
9. An open question the paper raises is whether a principled, ideally self-supervised method can classify divergence as harmless or pernicious for arbitrary mechanistic claims, given that the current CL loss minimizes all divergence indiscriminately rather than targeting only pernicious cases.
10. To replicate the OOD CL loss experiment, a researcher should train a 128-hidden-unit MLP with batch normalization and 0.5 dropout on 18-dimensional synthetic representations, partition 10 classes into dense and sparse clusters, train DAS alignment functions on each partition independently using the symmetric invertible alignment matrix from Grant et al. (2024), and evaluate cross-partition IIA versus training EMD.
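
The first takeaway can be checked numerically with a minimal sketch. It assumes a uniform unit disk as the stand-in for a convex, non-rectangular manifold and a square as the axis-aligned comparison; the helper names (`coordinate_patch`, `sample_disk`, `sample_square`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def coordinate_patch(target, source, dims):
    """Copy the coordinates listed in `dims` from `source` into a copy of `target`."""
    patched = target.copy()
    patched[..., dims] = source[..., dims]
    return patched

def sample_disk(n, rng):
    """Uniform samples from the unit disk (convex, but not a hyperrectangle)."""
    radius = np.sqrt(rng.uniform(0.0, 1.0, n))
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

def sample_square(n, rng):
    """Uniform samples from the axis-aligned square [-1, 1]^2."""
    return rng.uniform(-1.0, 1.0, size=(n, 2))

rng = np.random.default_rng(0)
n = 20000

# Patch coordinate 0 of each target point with coordinate 0 of a random source point.
disk_patched = coordinate_patch(sample_disk(n, rng), sample_disk(n, rng), dims=[0])
square_patched = coordinate_patch(sample_square(n, rng), sample_square(n, rng), dims=[0])

# Fraction of patched points that leave the set the natural points were drawn from.
off_disk = np.mean(np.linalg.norm(disk_patched, axis=1) > 1.0)
off_square = np.mean(np.any(np.abs(square_patched) > 1.0, axis=1))

print(f"off-manifold fraction, disk:   {off_disk:.3f}")
print(f"off-manifold fraction, square: {off_square:.3f}")
```

In this sketch the square never produces an off-manifold point, because a Cartesian product of intervals is closed under coordinate swaps, while a nonzero fraction of the disk's patched points falls outside the disk.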

Peer brief — for seminar discussion

This paper asks whether causal intervention methods used in mechanistic interpretability (activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search) produce representations that deviate from a target model's natural latent distribution, and whether such deviations undermine the mechanistic conclusions drawn from those interventions. To answer this, the paper combines formal proofs about manifold geometry with empirical measurements on Meta-Llama-3-8B-Instruct and synthetic MLP experiments, and applies and modifies the Counterfactual Latent (CL) loss from Grant (2025) as a mitigation tool. An alternative approach the paper does not pursue would be direct manifold projection of intervened representations onto the natural convex hull, which existing counterfactual explanation literature (e.g., Verma et al. 2024; Tsiourvas et al. 2024) employs; the CL loss is preferred here because it generates principled, gradient-trained interventions rather than post-hoc projections.

The load-bearing finding is a two-part result. First, for any convex manifold geometry that is not an axis-aligned hyperrectangle, coordinate patching provably generates off-manifold points given sufficient intervention samples, a result that applies to virtually all realistic neural activation distributions. Empirically, this is confirmed across all three methods on Meta-Llama-3-8B-Instruct (layers 10 and 25 are examined), with Earth Mover's Distance values uniformly exceeding the natural-to-natural baseline. Second, the paper distinguishes 'harmless' divergences, those falling in the behavioral null-space of downstream weight matrices, from 'pernicious' ones that activate hidden computational pathways or produce dormant behavioral changes. A concrete ReLU circuit example shows mean-difference patching recruiting a hidden unit silent for all natural class-A inputs, yielding hypothesis-confirming behavior through a non-native mechanism.

As a mitigation, the modified CL loss (targeting causal subspaces specifically) reduces EMD from 0.032 ± 0.003 to 0.007 ± 0.001 in 18-dimensional synthetic representations while maintaining IIA at 0.9988 ± 0.0005. Critically, training EMD predicts out-of-distribution IIA failure: regression across 30 alignment trainings yields coefficient −0.3424 with R² = 0.73 and F(1,28) = 75.28 (p < 0.001). In the 7B Boundless DAS setting from Wu et al. (2023), the CL loss reduces representational divergence without sacrificing IIA. The paper's prediction is that any divergence outside the null-space of NN layers is potentially pernicious, implying current causal intervention methods cannot alone support complete mechanistic understanding.

A critical reader would push back on the scope of the perniciousness argument: the paper does not demonstrate that pernicious divergence has actually corrupted published mechanistic claims in the literature. It only constructs synthetic existence proofs (the ReLU circuit examples) and shows that divergence predicts OOD generalization failure in controlled settings. The jump from 'divergence exists and can in principle activate hidden pathways' to 'published interpretability findings are unreliable' is not established empirically on real LLM mechanistic claims. The regression's high R² of 0.73 is derived entirely from the synthetic 10-class task, and whether this relationship holds at the scale and complexity of transformer circuits in practice remains an open question the paper itself acknowledges.
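
The hidden-pathway failure mode can be illustrated with a small toy circuit. The sketch below is not the paper's circuit: the cluster centers, weights, biases, and the constant class-B logit are all values chosen for this illustration. It patches class-B representations toward class A with the mean-difference vector and checks which hidden units carry the resulting decision.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster(center, n=100, scale=0.05):
    """Tight Gaussian cluster around `center` (toy natural representations)."""
    return np.asarray(center, dtype=float) + scale * rng.normal(size=(n, 2))

# Assumed toy geometry: class A spreads along (1, -1), class B along (1, 1).
nat_a = np.concatenate([cluster([2, 0]), cluster([0, 2])])
nat_b_low, nat_b_high = cluster([-2, -2]), cluster([0, 0])
nat_b = np.concatenate([nat_b_low, nat_b_high])

W = np.array([[1.0, -1.0],    # unit 0: fires on the first class-A cluster
              [-1.0, 1.0],    # unit 1: fires on the second class-A cluster
              [1.0, 1.0]])    # unit 2: hidden pathway, silent on natural data
bias = np.array([-1.0, -1.0, -3.0])
readout_a = np.array([1.0, 1.0, 3.0])  # hidden-to-logit weights for class A
LOGIT_B = 0.5                          # constant class-B logit (assumption)

def hidden(h):
    """ReLU hidden layer of the toy circuit."""
    return np.maximum(h @ W.T + bias, 0.0)

def predicts_a(h, ablate_unit=None):
    """True where the circuit outputs class A, optionally zeroing one hidden unit."""
    u = hidden(h)
    if ablate_unit is not None:
        u[:, ablate_unit] = 0.0
    return u @ readout_a > LOGIT_B

# Mean-difference patching: push class-B representations toward class A.
mean_diff = nat_a.mean(axis=0) - nat_b.mean(axis=0)
patched = nat_b_high + mean_diff

natural = np.concatenate([nat_a, nat_b])
print("unit 2 max activation on natural data:", hidden(natural)[:, 2].max())
print("patched classified as A:", predicts_a(patched).mean())
print("patched classified as A with unit 2 ablated:",
      predicts_a(patched, ablate_unit=2).mean())
```

Under these assumed values the patched points are classified as class A, which looks like confirmation of the hypothesis, yet the two units that carry class A on natural data stay silent and the flip disappears once the third unit is ablated.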

Methods (5)

Algorithm 1: Harmlessness Classification
Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
Causal Scrubbing
Method by Chan et al. 2022 for rigorously testing interpretability hypotheses via interventions
KDE Density Score
Nonparametric density estimate scoring how typical an intervened representation is relative to the natural distribution
Local Linear Reconstruction Error
Measures how well an intervened point can be expressed as a convex combination of nearby natural manifold points
Local PCA Distance
Measures off-manifold distance by computing the orthogonal residual from the local tangent subspace
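
As a rough illustration of the Local PCA Distance entry, the sketch below estimates a tangent subspace from the nearest natural points and returns the norm of the residual orthogonal to it. The neighbor count `k`, the `tangent_dim` parameter, and the function name are assumptions made for this example; the summary does not give the paper's exact procedure.

```python
import numpy as np

def local_pca_distance(query, natural, k=20, tangent_dim=1):
    """Norm of the component of `query` orthogonal to a local tangent subspace.

    The tangent subspace is estimated by PCA (via SVD) on the k nearest
    natural points; k and tangent_dim are illustrative choices.
    """
    dists = np.linalg.norm(natural - query, axis=1)
    neighbors = natural[np.argsort(dists)[:k]]
    center = neighbors.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(neighbors - center, full_matrices=False)
    tangent = vt[:tangent_dim]
    offset = query - center
    residual = offset - tangent.T @ (tangent @ offset)
    return float(np.linalg.norm(residual))

# Toy natural manifold: 500 points on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)

on_manifold = np.array([np.cos(1.0), np.sin(1.0)])
off_manifold = 1.5 * on_manifold   # pushed radially away from the circle

d_on = local_pca_distance(on_manifold, circle)
d_off = local_pca_distance(off_manifold, circle)
print(f"on-manifold distance:  {d_on:.4f}")
print(f"off-manifold distance: {d_off:.4f}")
```

The on-manifold query gets a near-zero score (the small remainder comes from the circle's curvature within the neighborhood), while the radially displaced query scores close to its displacement.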

Frameworks (1)

Modified CL Loss
Novel variant of CL loss introduced in this paper targeting only causal subspace dimensions to improve OOD performance

Datasets (1)

Synthetic DAS Dataset (10-class grid)
Synthetic dataset of 10-class representations with two causal feature dimensions and 16 noise dimensions used in Section 5.2
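
A dataset of this shape could be generated along the following lines. This is a sketch under stated assumptions: the 2 x 5 grid of class centers, the within-class spread, the standard-normal noise dimensions, and the function name are all guesses for illustration, since the summary only gives the class count and the 2 + 16 dimensional split.

```python
import numpy as np

def make_synthetic_representations(n_per_class=200, n_noise_dims=16,
                                   spread=0.1, seed=0):
    """Return (X, labels): 18-dim points with 2 causal and 16 noise dimensions."""
    rng = np.random.default_rng(seed)
    # Assumed layout: class centers on a 2 x 5 grid in the two causal dimensions.
    centers = np.array([(col, row) for row in range(2) for col in range(5)],
                       dtype=float)
    labels = np.repeat(np.arange(len(centers)), n_per_class)
    causal = centers[labels] + spread * rng.normal(size=(labels.size, 2))
    # Noise dimensions carry no class information.
    noise = rng.normal(size=(labels.size, n_noise_dims))
    return np.concatenate([causal, noise], axis=1), labels

X, y = make_synthetic_representations()
print(X.shape, np.unique(y))
```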

Findings (15)

Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs
Synthetic theoretical example showing pernicious divergence via hidden pathway activation
An intervention that is benign in contexts with v4 < 0.75 produces a class-C behavioral flip for 0.75 < v4 < 1, demonstrating dormant behavioral changes from latent divergence
Synthetic example showing an intervention that appears safe in tested contexts but causes behavior changes in others
Coordinate patching on circular manifolds guarantees off-manifold representations for boundary point pairs with orthogonal deviations
Theoretical proof that patching produces divergent representations for most manifold geometries
Intervention on a balanced subspace dimension while holding others fixed crosses the decision boundary using a non-native mechanism
Additional synthetic example of pernicious divergence from balanced subspaces
For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM
Empirical result showing the CL loss can reduce divergence without sacrificing interpretability accuracy
Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline
Empirical demonstration that MDVP produces divergent representations in a real LLM
Linear regression of OOD IIA on training EMD yields coefficient −0.3424, R² = 0.729, F(1,28) = 75.28, p < 0.001
Statistical evidence that training divergence (EMD) predicts lower OOD intervention performance
SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline
Empirical demonstration that SAE projections produce divergent representations in a real LLM
DAS behavioral loss produces EMD along feature dimensions of 0.032 ± 0.003 on the synthetic 10-class dataset
Quantitative baseline for divergence using behavioral DAS loss on synthetic dataset
Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
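
Several findings above compare an intervened EMD against a natural-to-natural baseline. The sketch below shows one way such a comparison could be set up. It assumes a per-dimension 1-D EMD averaged over dimensions, Gaussian stand-in data, and a constant shift as the stand-in intervention; the paper's exact EMD estimator is not specified in this summary.

```python
import numpy as np

def emd_1d(a, b):
    """Exact 1-D Earth Mover's Distance between two equal-sized samples."""
    a, b = np.sort(a), np.sort(b)
    if a.shape != b.shape:
        raise ValueError("this sketch only handles equal-sized samples")
    return float(np.mean(np.abs(a - b)))

def mean_emd(x, y):
    """Average of per-dimension 1-D EMDs (an assumed aggregation)."""
    return float(np.mean([emd_1d(x[:, d], y[:, d]) for d in range(x.shape[1])]))

rng = np.random.default_rng(0)
natural = rng.normal(size=(2000, 18))        # stand-in for natural representations
half_a, half_b = natural[:1000], natural[1000:]

# Stand-in intervention: shift the first two dimensions of one half.
intervened = half_b.copy()
intervened[:, :2] += 1.0

baseline_emd = mean_emd(half_a, half_b)        # natural vs natural
intervened_emd = mean_emd(half_a, intervened)  # natural vs intervened
print(f"baseline EMD:   {baseline_emd:.3f}")
print(f"intervened EMD: {intervened_emd:.3f}")
```

The natural-to-natural value is nonzero only because of finite sampling, which is why an intervened EMD is read against that baseline rather than against zero.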

Claims (10)

The CL auxiliary loss can directly reduce representational divergence in practical interpretability settings without sacrificing interpretability method accuracy
Central practical contribution: the CL loss offers a viable mitigation strategy
Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations
Key theoretical claim distinguishing harmless from pernicious divergence
Any divergence outside of the null-space of NN layers is potentially pernicious, posing challenges for a complete mechanistic understanding of NNs
Sobering conclusion about the fundamental challenge posed by divergence for mechanistic interpretability
Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised
Core claim about why pernicious divergence undermines mechanistic conclusions
The harm of divergence is inherently claim-dependent: the same divergence can be harmless for one mechanistic claim and pernicious for another
Important nuance that prevents a universal classification of divergence as always good or bad
Many practical mechanistic projects can be satisfied by collecting sufficiently large intervention evaluation datasets
Optimistic practical note about mitigating divergence concerns without solving the theoretical problem
Detecting dormant behavioral changes requires evaluating across all possible contexts, which is infeasible in practice
Practical limitation of current evaluation methods for pernicious divergence
Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods
Core empirical claim of the paper supported by both theoretical proof and empirical demonstration
Minimizing divergence magnitude does not guarantee elimination of hidden pathways; it only reduces the risk surface
Important caveat to the CL loss solution, noting it is a step not a complete fix
Representational divergence (as measured by EMD) can predict lower out-of-distribution intervention performance
Practical utility of reducing divergence demonstrated through regression analysis
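
The reported regression statistics can be cross-checked against each other. Assuming a simple linear regression with one predictor and an intercept fitted over the n = 30 alignment trainings (which matches the reported degrees of freedom), the F statistic follows from R² alone:

```latex
F(1,\, n-2) \;=\; \frac{R^2}{1 - R^2}\,(n - 2)
\;=\; \frac{0.729}{1 - 0.729} \times 28
\;\approx\; 75.3,
\qquad
|r| \;=\; \sqrt{R^2} \;=\; \sqrt{0.729} \;\approx\; 0.854 .
```

The small gap to the reported F(1,28) = 75.28 is of the size expected from rounding R² to three decimals, and the negative regression coefficient implies a correlation of about −0.85 between training EMD and OOD IIA.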

Hypotheses (1)

We hypothesized that divergence could influence IIA when transferring the DAS alignment to OOD settings
Motivating hypothesis for the OOD experiment testing practical utility of divergence reduction

Questions (4)

How can we produce a principled method for classifying harmful divergence for any mechanistic claim?
Identified gap: current work lacks a general method for harmful divergence classification
Do divergent representations change what an intervention can say about an NN's natural mechanisms?
Core research question motivating the paper
When, and to what extent, is it okay for divergences to occur?
Second core research question motivating the theoretical analysis in Section 4
When it is not okay, how can we prevent divergent representations from occurring?
Third core research question motivating the CL loss approach in Section 5

Original abstract

A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025) allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (2025) · cited · in corpus · ≈ 87%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (2023) · cited · in corpus · ≈ 87%
Model Alignment Search (2025) · cited · in corpus · ≈ 85%
Combining Causal Models for More Accurate Abstractions of Neural Networks (2025) · Sara Magliacane, Atticus Geiger Theodora-Mara Pîslar · ≈ 85%
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions (2026) · Usman Naseem · ≈ 85%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks (2024) · in corpus · ≈ 85%
A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders (2026) · Rajiv Misra, Sanjay Kumar Singh, Anisha Roy Dip Roy · ≈ 85%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (2026) · in corpus · ≈ 84%
Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions (2025) · Dhruv Kumar Manan Gupta · ≈ 84%
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability (2024) · Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn Lucius Bushnaq · ≈ 84%
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction (2026) · Arya Datla, Ziv Goldfeld Jonathn Chang · ≈ 84%
Causal Probing for Internal Visual Representations in Multimodal Large Language Models (2026) · Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang Zehao Deng · ≈ 84%
Patches of Nonlinearity: Instruction Vectors in Large Language Models (2026) · Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych Irina Bigoulaeva · ≈ 84%
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training (2026) · Jiaying Zhu, Hongyang Chen, Hongxu Liu, Xinyu Yang, Wenya Wang Hang Chen · ≈ 84%
Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation (2025) · Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng Qiming Li · ≈ 84%
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima (2026) · Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu Yiming Tang · ≈ 84%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2026) · in corpus · ≈ 84%
Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models (2026) · Mehdi Taghipour, Rahmatollah Beheshti Ali Abbasi · ≈ 84%
Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse (2026) · Rajiv Misra, Sanjay Kumar Singh Dip Roy · ≈ 84%
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis (2026) · François Portet, Maxime Peyrard Maxime Méloux · ≈ 84%
Constructing Interpretable Features from Compositional Neuron Groups (2026) · Atticus Geiger, Mor Geva Or Shafran · ≈ 84%
Emergent symbol-like number variables in artificial neural networks (2025) · cited · ≈ 84%
Unveiling the Latent Directions of Reflection in Large Language Models (2025) · in corpus · ≈ 83%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders (2026) · in corpus · ≈ 83%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (2023) · in corpus · ≈ 83%
The Platonic Representation Hypothesis (2024) · in corpus · ≈ 82%
Neural natural language inference models partially embed theories of lexical entailment and negation (2020) · cited · ≈ 78%
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (2024) · cited · ≈ 71%
Causal abstraction: A theoretical foundation for mechanistic interpretability (2025) · cited · ≈ 70%
How causal abstraction underpins computational explanation (2025) · cited · ≈ 69%

Cited by (1)

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie