claim

active

claim:divergent-representations-are-a-common-if-not-likely-outcome-of-causal-interventions-across-a-wide-range-of-methods

Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods

Core empirical claim of the paper, supported by both a theoretical proof and empirical demonstrations

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Findings (4)

finding

Coordinate patching on circular manifolds guarantees off-manifold representations for boundary point pairs with orthogonal deviations
supports
Theoretical proof that patching produces divergent representations for most manifold geometries

Boundless DAS interchange interventions produce EMD exceeding natural-natural baseline
supports
Empirical demonstration that DAS interventions produce divergent representations

Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline
supports
Empirical demonstration that MDVP produces divergent representations in a real LLM

SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline
supports
Empirical demonstration that SAE projections produce divergent representations in a real LLM

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Do divergent representations change what an intervention can say about an NN's natural mechanisms? · question · 0.829
Core research question motivating the paper
When it is not okay, how can we prevent divergent representations from occurring? · question · 0.821
Third core research question motivating the CL loss approach in Section 5
Causal Intervention on Representations · concept · 0.790
The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.789
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
Representational divergence (as measured by EMD) can predict lower out-of-distribution intervention performance · claim · 0.780
Practical utility of reducing divergence demonstrated through regression analysis
Representational Divergence · concept · 0.774
Core phenomenon studied: when causal interventions shift internal representations away from the natural distribution
How do interventions on representations causally steer behavior? · question · 0.766
Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
LLM representations exhibit intriguing patterns under spatio-permutational analyses, suggesting a potentially profound yet tentative indication of consciousness. · claim · 0.762
Qualified positive claim from spatio permutation analysis where two cases satisfy all three criteria.