finding

active

finding:boundless-das-interchange-interventions-produce-emd-exceeding-natural-natural-baseline

Boundless DAS interchange interventions produce EMD exceeding natural-natural baseline

Empirical demonstration that DAS interventions produce divergent representations
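
The comparison behind this finding can be illustrated with a toy sketch. It assumes EMD is computed per feature dimension as a 1D earth mover's distance and then averaged, and it uses Gaussian samples as stand-ins for natural and intervened representations. The helper names (`emd_1d`, `mean_feature_emd`, `sample`) and the shifted-Gaussian "intervention" are hypothetical and not taken from the paper. No model or DAS training is involved.

```python
import random


def emd_1d(xs, ys):
    """1D earth mover's distance between two equal-size samples.

    For equal-size samples this is the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mean_feature_emd(A, B):
    """Average the 1D EMD over feature dimensions (columns of A and B)."""
    dims = len(A[0])
    per_dim = [
        emd_1d([row[d] for row in A], [row[d] for row in B])
        for d in range(dims)
    ]
    return sum(per_dim) / dims


rng = random.Random(0)


def sample(n, dims, shift=0.0):
    """Hypothetical stand-in for a batch of hidden representations."""
    return [[rng.gauss(shift, 1.0) for _ in range(dims)] for _ in range(n)]


# Two independent draws of "natural" representations give the baseline.
natural_a = sample(500, 8)
natural_b = sample(500, 8)
# Assumption: the intervention shifts the distribution (toy stand-in only).
intervened = sample(500, 8, shift=1.0)

baseline = mean_feature_emd(natural_a, natural_b)
intervened_emd = mean_feature_emd(natural_a, intervened)

print(f"natural-natural EMD:    {baseline:.3f}")
print(f"natural-intervened EMD: {intervened_emd:.3f}")
print("exceeds baseline:", intervened_emd > baseline)
```

In this framing, the natural-natural value estimates how much EMD arises from sampling noise alone, so an intervened EMD above it indicates representations that fall outside the natural distribution.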

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods
supports
Core empirical claim of the paper supported by both theoretical proof and empirical demonstration

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. These are candidates for new edges or for unrecognized duplicates.

SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline · finding · 0.763
Empirical demonstration that SAE projections produce divergent representations in a real LLM
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · claim · 0.750
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
DAS behavioral loss produces EMD along feature dimensions of 0.032±0.003 on synthetic 10-class dataset · finding · 0.748
Quantitative baseline for divergence using behavioral DAS loss on synthetic dataset
Representational divergence (as measured by EMD) can predict lower out-of-distribution intervention performance · claim · 0.735
Practical utility of reducing divergence demonstrated through regression analysis
For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM · finding · 0.734
Empirical result showing the CL loss can reduce divergence without sacrificing interpretability accuracy
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.733
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline · finding · 0.727
Empirical demonstration that MDVP produces divergent representations in a real LLM
DAS overcomes the localist limitation of prior causal abstraction by allowing individual neurons to play multiple roles via non-standard bases · claim · 0.727
Central claim motivating DAS over prior methods.