claim

active

claim:divergence-within-the-behavioral-null-space-is-harmless-to-functional-claims-about-a-function-s-computation-when-the-claim-ignores-internal-sub-computations

Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations

Key theoretical claim distinguishing harmless from pernicious divergence
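
As an illustration of the claim, here is a minimal sketch that assumes the downstream computation is a single linear map `W`. That is a simplification: the source paper's notion of a behavioral null-space may be more general. The matrix `W`, the hidden state `h`, and the helper `null_space_basis` are hypothetical names introduced only for this example. The sketch shows that a divergence lying in the null space of `W` leaves the output unchanged, while a divergence with a row-space component does not.

```python
import numpy as np

def null_space_basis(W, tol=1e-10):
    """Return an orthonormal basis for {d : W @ d = 0}, computed from the SVD."""
    _, s, vt = np.linalg.svd(W)          # full SVD: vt has shape (in_dim, in_dim)
    rank = int(np.sum(s > tol))
    return vt[rank:].T                   # shape: (in_dim, in_dim - rank)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))              # assumed layer: 5-d hidden state -> 3-d output
h = rng.normal(size=5)                   # stand-in for a "natural" hidden state

N = null_space_basis(W)
delta_null = N @ rng.normal(size=N.shape[1])   # divergence inside the null space
delta_row = W.T @ rng.normal(size=3)           # divergence in the row space of W

out_natural = W @ h
out_null = W @ (h + delta_null)          # intervened state, null-space divergence
out_row = W @ (h + delta_row)            # intervened state, row-space divergence

# The null-space divergence is invisible to the output; the row-space one is not.
null_is_harmless = np.allclose(out_natural, out_null)
row_changes_output = not np.allclose(out_natural, out_row)
```

In this toy setting, a functional claim that only concerns the output of `W` cannot distinguish `h` from `h + delta_null`, which is the sense in which such divergence is harmless. A claim about internal sub-computations that read `h` directly could still be affected.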

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (2)

claim

Any divergence outside of the null-space of NN layers is potentially pernicious, posing challenges for a complete mechanistic understanding of NNs
extends
Sobering conclusion about the fundamental challenge posed by divergence for mechanistic interpretability
The harm of divergence is inherently claim-dependent: the same divergence can be harmless for one mechanistic claim and pernicious for another
extends
Important nuance that prevents a universal classification of divergence as always good or bad

Questions (1)

question

When, and to what extent, is it okay for divergences to occur?
gates
Second core research question motivating the theoretical analysis in Section 4

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

No principled method exists for classifying harmful divergence for arbitrary mechanistic claims · concept · 0.752
Explicitly identified limitation: the paper cannot classify perniciousness in general
Minimizing divergence magnitude does not guarantee elimination of hidden pathways; it only reduces the risk surface · claim · 0.751
Important caveat to the CL loss solution, noting that it is a step, not a complete fix
Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised · claim · 0.748
Core claim about why pernicious divergence undermines mechanistic conclusions
How can we produce a principled method for classifying harmful divergence for any mechanistic claim? · question · 0.746
Identified gap: current work lacks a general method for harmful divergence classification
An intervention that is benign in contexts with v4 < 0.75 produces a class-C behavioral flip for 0.75 < v4 < 1, demonstrating dormant behavioral changes from latent divergence · finding · 0.745
Synthetic example showing an intervention that appears safe in tested contexts but causes behavior changes in others
Representational Divergence · concept · 0.737
Core phenomenon studied: when causal interventions shift internal representations away from the natural distribution
Harmless Divergence · concept · 0.736
Divergences that occur in the behavioral null-space and do not affect functional claims about the model
LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon · claim · 0.735
Primary positive claim of the paper, grounded in strength comparison and localization results