finding

active

finding:das-behavioral-loss-produces-emd-along-feature-dimensions-of-0-032-0-003-on-synthetic-10-class-dataset

DAS behavioral loss produces EMD along feature dimensions of 0.032±0.003 on synthetic 10-class dataset

Quantitative baseline for divergence using behavioral DAS loss on synthetic dataset
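To make the metric concrete, the following is a minimal sketch of one plausible reading of "EMD along feature dimensions": compute a 1-D earth mover's distance per feature dimension between natural and intervened representations, then average over dimensions. This is an assumption for illustration, not the paper's implementation; the function names `emd_1d` and `mean_featurewise_emd`, the equal-sample-count restriction, and the toy Gaussian data are all hypothetical.

```python
import random


def emd_1d(xs, ys):
    """1-D earth mover's distance between two equal-size samples.

    For equal sample counts, the 1-D EMD is the mean absolute
    difference between the sorted values of the two samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


def mean_featurewise_emd(natural, intervened):
    """Average the 1-D EMD over feature dimensions (columns).

    Both arguments are lists of rows; each row is one representation
    vector. Averaging over dimensions is an assumed aggregation.
    """
    dims = len(natural[0])
    per_dim = [
        emd_1d([row[d] for row in natural], [row[d] for row in intervened])
        for d in range(dims)
    ]
    return sum(per_dim) / dims


# Toy data only: 200 vectors with 4 features, plus a copy shifted by 0.5.
rng = random.Random(0)
natural = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
shifted = [[v + 0.5 for v in row] for row in natural]

print(mean_featurewise_emd(natural, natural))  # identical samples give 0.0
print(mean_featurewise_emd(natural, shifted))  # constant shift gives about 0.5
```

Under this reading, a lower value means the intervened representations stay closer to the natural distribution in each feature dimension.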

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Findings (1)

finding

Modified CL loss produces EMD along feature dimensions of 0.007±0.001 on synthetic 10-class dataset
supports
Quantitative improvement in divergence reduction using the modified CL loss on synthetic dataset

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS behavioral loss achieves IIA of 0.997±0.001 on synthetic 10-class dataset training/test sets · finding · 0.854
IIA baseline for DAS behavioral loss on synthetic dataset
For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM · finding · 0.766
Empirical result showing the CL loss can reduce divergence without sacrificing interpretability accuracy
Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition · finding · 0.761
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
Representational divergence (as measured by EMD) can predict lower out-of-distribution intervention performance · claim · 0.755
Practical utility of reducing divergence demonstrated through regression analysis
Boundless DAS interchange interventions produce EMD exceeding natural-natural baseline · finding · 0.748
Empirical demonstration that DAS interventions produce divergent representations
Linear regression of OOD IIA on training EMD yields coefficient -0.3424, R^2 = 0.729, F(1,28) = 75.28, p < .001 · finding · 0.745
Statistical evidence that training divergence (EMD) predicts lower OOD intervention performance
GRU behavior can be compressed to as few as 4 dimensions using DAS and MAS with comparable IIAs · finding · 0.742
Shows that behaviorally relevant information is low-dimensional; contrasted with model stitching achieving near-perfect IIA at rank 2.
Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline · finding · 0.726
Empirical demonstration that MDVP produces divergent representations in a real LLM