claim

active

claim:off-manifold-divergences-can-activate-hidden-pathways-that-produce-misleadingly-confirmatory-behavior-while-the-true-mechanism-is-never-exercised

Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised

Core claim about why pernicious divergence undermines mechanistic conclusions

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Findings (2)

finding

Intervention on a balanced subspace dimension while holding others fixed crosses the decision boundary using a non-native mechanism
supports
Additional synthetic example of pernicious divergence from balanced subspaces
Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs
supports
Synthetic theoretical example showing pernicious divergence via hidden pathway activation
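
The mean-difference patching finding above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the weights, biases, and "natural" representations are hypothetical and are not the construction used in the source paper. It only shows how a patched representation can be classified as class A through a hidden unit that no natural class-A input activates.

```python
# Hypothetical two-layer ReLU circuit (weights chosen for illustration only).
# Representation h (2-d) -> 3 hidden ReLU units -> scalar logit; logit > 0 means class A.
W = [(1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]  # one row of input weights per hidden unit
BIAS = (-1.0, -1.0, -2.5)
V = (1.0, 1.0, 1.0)  # readout weights


def hidden(h):
    """ReLU activations of the three hidden units for representation h."""
    return [max(0.0, w[0] * h[0] + w[1] * h[1] + b) for w, b in zip(W, BIAS)]


def classify(h):
    """Return "A" if the readout logit is positive, else "B"."""
    logit = sum(v * u for v, u in zip(V, hidden(h)))
    return "A" if logit > 0 else "B"


def mean(points):
    """Coordinate-wise mean of a list of 2-d points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# Assumed "natural" representations for each class.
natural_a = [(2.0, 0.0), (0.0, 2.0)]
natural_b = [(0.5, 0.5), (-0.5, -0.5)]

# Mean-difference patch: shift a class-B representation by (mean_A - mean_B).
mu_a, mu_b = mean(natural_a), mean(natural_b)
delta = (mu_a[0] - mu_b[0], mu_a[1] - mu_b[1])
source = natural_b[0]
patched = (source[0] + delta[0], source[1] + delta[1])

for h in natural_a:
    print("natural A", h, hidden(h), classify(h))
print("natural B", source, hidden(source), classify(source))
print("patched  ", patched, hidden(patched), classify(patched))
```

In this toy setup, each natural class-A input is carried by unit 1 or unit 2 and leaves unit 3 at zero. The patched representation is labeled class A, but only unit 3 is active, so the intervention appears to succeed without exercising either pathway that natural class-A inputs use.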

Questions (1)

question

Do divergent representations change what an intervention can say about an NN's natural mechanisms?
gates
Core research question motivating the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.842
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
Minimizing divergence magnitude does not guarantee elimination of hidden pathways; it only reduces the risk surface · claim · 0.824
Important caveat to the CL loss solution, noting it is a step not a complete fix
Manifold steering produces clean probability shifts along the natural behavior structure; linear steering cuts across the manifold and produces noisy, off-target effects · finding · 0.787
Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
Linear steering produces noisy off-target effects; manifold steering cleanly shifts probability mass between sequential concepts. · finding · 0.769
Core empirical claim comparing steering approaches on cyclic concepts.
Interventions along activation manifold M_h yield behavioral trajectories following behavior manifold M_y, and vice versa: a bidirectional relationship demonstrated across language models and video world models. · finding · 0.752
Central empirical result showing causal coupling between representation and behavior geometry across multiple substrates and modalities.
Divergence within the behavioral null-space is harmless to functional claims about a function's computation when the claim ignores internal sub-computations · claim · 0.748
Key theoretical claim distinguishing harmless from pernicious divergence
Independently trained model families converge on a common semantic manifold under self-referential processing, suggesting an attractor dynamic that transcends training variance · hypothesis · 0.747
Hypothesis tested in Experiment 3; independently trained GPT, Claude, Gemini architectures converge on similar descriptive vocabulary
Larger hidden representations create more random structure that DAS can search through, allowing manipulation of counterfactual behavior even in randomly initialized networks · hypothesis · 0.747
Tested in Section 4.4 calibration experiment; confirmed by findings.