claim
active
claim:divergence-within-the-behavioral-null-space-is-harmless-to-functional-claims-about-a-function-s-computation-when-the-claim-ignores-internal-sub-computations
Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations
Key theoretical claim distinguishing harmless from pernicious divergence
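A toy illustration of the claim, not taken from the source paper: the sketch below assumes a hypothetical two-unit hidden state and a hypothetical `readout` that depends only on the sum of the two units. Shifting the state along the direction (1, -1) changes the internal representation but leaves the output untouched, so a claim about the function's input-output behavior survives while a claim about either unit on its own would not. The names `readout` and `intervene` are invented for illustration.

```python
def readout(h):
    # Hypothetical downstream computation: only the sum of the two units matters.
    return h[0] + h[1]


def intervene(h, delta):
    # Shift the hidden state along (1, -1), a direction this readout ignores,
    # i.e. a direction in its behavioral null-space.
    return (h[0] + delta, h[1] - delta)


natural = (2.0, 3.0)                # representation on the natural distribution
patched = intervene(natural, 5.0)   # divergent representation after intervention

# The behavior is preserved even though the internals have diverged.
same_behavior = readout(natural) == readout(patched)
same_internals = natural == patched

print("behavior preserved:", same_behavior)
print("internals preserved:", same_internals)
```

In this sketch the divergence is harmless only because the claim under test concerns `readout` as a whole; a claim about the value of the first unit alone would be affected by the same intervention.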
Source paper
extracted_from: (2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts
Neighborhood — ranked by edge-count
Claims (2)
claim
- Sobering conclusion about the fundamental challenge posed by divergence for mechanistic interpretability
- Important nuance that prevents a universal classification of divergence as always good or bad
Questions (1)
question
- Second core research question motivating the theoretical analysis in Section 4
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- No principled method exists for classifying harmful divergence for arbitrary mechanistic claims · concept · 0.752 · Explicitly identified limitation: the paper cannot classify perniciousness in general
- Important caveat to the CL loss solution, noting it is a step not a complete fix
- Core claim about why pernicious divergence undermines mechanistic conclusions
- How can we produce a principled method for classifying harmful divergence for any mechanistic claim? · question · 0.746 · Identified gap: current work lacks a general method for harmful divergence classification
- Synthetic example showing an intervention that appears safe in tested contexts but causes behavior changes in others
- Core phenomenon studied: when causal interventions shift internal representations away from the natural distribution
- Divergences that occur in the behavioral null-space and do not affect functional claims about the model
- Primary positive claim of the paper, grounded in strength comparison and localization results