question

active

question:do-divergent-representations-change-what-an-intervention-can-say-about-an-nn-s-natural-mechanisms

Do divergent representations change what an intervention can say about an NN's natural mechanisms?

Core research question motivating the paper

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised
gates
Core claim about why pernicious divergence undermines mechanistic conclusions
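
A toy illustration of the claim above, not taken from the paper: the two-unit readout, its weights, and the example vectors are all hypothetical, chosen only to show how an off-manifold patch can reproduce the expected output through a pathway that natural inputs never use.

```python
def relu(x):
    return max(0.0, x)

def toy_readout(h):
    """Hypothetical 2-unit readout: a natural pathway plus a normally silent one."""
    h1, h2 = h
    natural = relu(h1)        # pathway exercised by natural (on-manifold) inputs
    hidden = relu(h2 - 1.0)   # stays at zero whenever h2 <= 1
    return natural + hidden, natural, hidden

# Assumption: in this toy, natural representations always have h2 == 0.
natural_h = (2.0, 0.0)
# Assumption: an intervention produces a representation off that manifold.
patched_h = (-1.0, 3.0)

for name, h in [("natural", natural_h), ("patched", patched_h)]:
    out, nat, hid = toy_readout(h)
    print(f"{name}: output={out} natural_pathway={nat} hidden_pathway={hid}")
```

Both inputs yield the same output, so behavior alone would look confirmatory, yet in the patched case the natural pathway contributes nothing and the hidden pathway does all the work.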

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods · claim · 0.829
Core empirical claim of the paper supported by both theoretical proof and empirical demonstration
When it is not okay, how can we prevent divergent representations from occurring? · question · 0.781
Third core research question motivating the CL loss approach in Section 5
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.769
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
Neural representations carry rich geometric structure; but does that structure causally shape behavior? · quote · 0.766
Opening sentence framing the paper's core inquiry.
Change-of-Basis for Neural Representations · concept · 0.761
Key insight that rotating a neural representation to a non-standard basis can reveal distributed causal structure that is invisible in the standard neuron-aligned basis.
Neural representation geometry causally shapes behavior; interventions respecting that geometry will yield natural trajectories. · hypothesis · 0.760
Central hypothesis tested via manifold steering experiments across language models and video world models.
How do interventions on representations causally steer behavior? · question · 0.755
Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
Does the geometric structure of neural representations causally shape model behavior? · question · 0.755
The motivating research question of the paper
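
The selection rule stated for the similarity list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal sketch under assumptions: the function names, the embedding dictionary, and the representation of typed edges as unordered pairs are hypothetical and do not describe how this page is actually generated.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no typed edge to target."""
    results = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical data: "claim_a" is already linked by a typed edge, "far" is dissimilar.
embeddings = {"q": (1.0, 0.0), "claim_a": (1.0, 0.0), "near": (4.0, 3.0), "far": (0.0, 1.0)}
typed_edges = {frozenset(("q", "claim_a"))}
print(related_by_similarity("q", embeddings, typed_edges))
```

Entities returned this way are candidates only; whether one deserves a typed edge or is a duplicate still needs a manual check.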