claim
active
claim:off-manifold-divergences-can-activate-hidden-pathways-that-produce-misleadingly-confirmatory-behavior-while-the-true-mechanism-is-never-exercised
Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised
Core claim about why pernicious divergence undermines mechanistic conclusions
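The claim can be made concrete with a small toy model. The sketch below is an illustration only, not the construction used in the source paper: the two-coordinate representation, the `natural_path` and `hidden_path` functions, and the single-coordinate patch are all hypothetical names and choices made for this example. Natural representations lie on the line h[0] == h[1], where the hidden pathway is dormant. Patching only coordinate 0 pushes the representation off that line, and the hidden pathway alone then reproduces the source value, so the patch looks like a clean confirmation even though the natural pathway contributes nothing to the patched output.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def encode(x):
    # Assumption: natural representations lie on the line h[0] == h[1].
    return np.array([x, x], dtype=float)


def natural_path(h):
    # Pathway used on natural inputs: reads both coordinates equally.
    return 0.5 * (h[0] + h[1])


def hidden_path(h):
    # Dormant when h[0] == h[1]; fires only when the coordinates
    # disagree by more than 1, which never happens on natural inputs.
    gap = h[0] - h[1]
    return relu(gap - 1.0) - relu(-gap - 1.0)


def decode(h):
    return natural_path(h) + hidden_path(h)


def patch_coord0(x_base, x_src):
    # Hypothetical intervention: overwrite coordinate 0 of the base
    # representation with coordinate 0 of the source representation.
    h = encode(x_base)
    h[0] = encode(x_src)[0]
    return h


# On natural inputs the hidden pathway is silent and the output equals x.
for x in (-1.0, 1.0):
    h = encode(x)
    assert hidden_path(h) == 0.0
    assert decode(h) == x

# Under the patch the representation leaves the natural line.
results = []
for x_base, x_src in [(-1.0, 1.0), (1.0, -1.0)]:
    h = patch_coord0(x_base, x_src)
    results.append((x_src, decode(h), natural_path(h), hidden_path(h)))

for x_src, out, nat, hid in results:
    print(f"source={x_src:+.0f} output={out:+.0f} "
          f"natural_path={nat:+.0f} hidden_path={hid:+.0f}")
```

In both patched runs the output matches the source value, which an experimenter could read as evidence that coordinate 0 alone carries the variable. In this toy, that match comes entirely from `hidden_path`; `natural_path` evaluates to zero on the patched representations.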
Source paper
extracted_from (2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts
Neighborhood — ranked by edge-count
Findings (2)
finding
- Additional synthetic example of pernicious divergence from balanced subspaces
- Synthetic theoretical example showing pernicious divergence via hidden pathway activation
Questions (1)
question
- Do divergent representations change what an intervention can say about an NN's natural mechanisms? · gates · Core research question motivating the paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
- Important caveat to the CL loss solution, noting it is a step not a complete fix
- Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
- Core empirical claim comparing steering approaches on cyclic concepts.
- Central empirical result showing causal coupling between representation and behavior geometry across multiple substrates and modalities.
- Key theoretical claim distinguishing harmless from pernicious divergence
- Hypothesis tested in Experiment 3; independently trained GPT, Claude, Gemini architectures converge on similar descriptive vocabulary
- Tested in Section 4.4 calibration experiment; confirmed by findings.