When, and to what extent, is it okay for divergences to occur?

Second core research question motivating the theoretical analysis in Section 4

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations
gates
Key theoretical claim distinguishing harmless from pernicious divergence
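A minimal sketch of the null-space idea, assuming the simplest possible case: a hypothetical linear readout `W` standing in for "behavior". The matrix, hidden state and perturbations are made up for illustration. A shift `d` with `W @ d == 0` leaves the output untouched. That makes it harmless for a claim about the input-output function, yet it still moves an internal unit a long way. A claim about that unit's sub-computation would therefore be affected.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    """Return W @ h for a dense matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical readout: hidden unit 3 has zero weight in every row,
# so the direction [0, 0, 1] spans the null-space of W.
W: Matrix = [[1.0, 2.0, 0.0],
             [0.0, 1.0, 0.0]]

h_natural: Vector = [0.5, -1.0, 0.25]  # hypothetical "natural" hidden state
d_null: Vector = [0.0, 0.0, 10.0]      # large shift inside the null-space
d_visible: Vector = [0.0, 0.1, 0.0]    # small shift outside the null-space

h_null = add(h_natural, d_null)
h_visible = add(h_natural, d_visible)

y_natural = matvec(W, h_natural)
y_null = matvec(W, h_null)
y_visible = matvec(W, h_visible)

# Output-level claim: the null-space shift is invisible to behavior.
print(y_natural == y_null)        # True
# A shift outside the null-space does change behavior.
print(y_natural == y_visible)     # False
# Claim about unit 3's sub-computation: the same null-space shift is large.
print(h_null[2] - h_natural[2])   # 10.0
```

In a real network the behavioral null-space is defined by downstream nonlinear layers rather than a single matrix. The toy case only shows why the same divergence can be harmless or pernicious depending on what the claim is about.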

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
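The selection rule described above (cosine ≥ 0.65, no typed edge, ranked by score) can be sketched as follows. This assumes each entity has an embedding vector. The embedding model, vector dimensions and entity names here are all hypothetical, and the page does not say how its scores are actually computed.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def related_by_similarity(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities with cosine >= threshold to `target` and no typed edge to it,
    sorted by descending similarity (scores rounded to 3 decimals)."""
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy 2-d embeddings (hypothetical; real systems use far more dimensions).
embeddings = {
    "q2": [1.0, 0.0],            # the current entity
    "claim_linked": [1.0, 0.0],  # identical, but already has a typed edge
    "q3": [0.9, 0.1],
    "concept_a": [0.8, 0.6],
    "concept_b": [0.6, 0.8],     # cosine 0.6, below the threshold
}
related = related_by_similarity("q2", embeddings, typed_neighbors={"claim_linked"})
print(related)  # [('q3', 0.994), ('concept_a', 0.8)]
```

Excluding typed neighbors is what makes the list useful for spotting missing edges. An entity already connected by a typed relation is not a candidate, however similar it is.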

When it is not okay, how can we prevent divergent representations from occurring? · question · 0.833
Third core research question motivating the CL loss approach in Section 5
The harm of divergence is inherently claim-dependent: the same divergence can be harmless for one mechanistic claim and pernicious for another · claim · 0.780
Important nuance that prevents a universal classification of divergence as always good or bad
How can we produce a principled method for classifying harmful divergence for any mechanistic claim? · question · 0.753
Identified gap: current work lacks a general method for harmful divergence classification
Any divergence outside of the null-space of NN layers is potentially pernicious, posing challenges for a complete mechanistic understanding of NNs · claim · 0.751
Sobering conclusion about the fundamental challenge posed by divergence for mechanistic interpretability
Representational Divergence · concept · 0.749
Core phenomenon studied: when causal interventions shift internal representations away from the natural distribution
Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods · claim · 0.748
Core empirical claim of the paper, supported by both a theoretical proof and an empirical demonstration
Harmless Divergence · concept · 0.747
Divergences that occur in the behavioral null-space and do not affect functional claims about the model
Minimizing divergence magnitude does not guarantee elimination of hidden pathways; it only reduces the risk surface · claim · 0.744
Important caveat to the CL loss solution, noting that it is a step, not a complete fix