finding

active

finding:das-behavioral-loss-achieves-iia-of-0-997-0-001-on-synthetic-10-class-dataset-training-test-sets

DAS behavioral loss achieves IIA of 0.997±0.001 on synthetic 10-class dataset training/test sets

IIA baseline for DAS behavioral loss on synthetic dataset
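
For readers unfamiliar with the metric, interchange intervention accuracy (IIA) is the fraction of interchange interventions for which the network's output matches the counterfactual label predicted by the high-level model. Below is a minimal sketch of that computation, not the paper's code. The function names and toy values are hypothetical, and reading the ± as a sample standard deviation across runs is an assumption, since this entry does not say how the spread was computed.

```python
from statistics import mean, stdev


def interchange_intervention_accuracy(predicted, counterfactual):
    """Return the fraction of interventions whose post-intervention
    prediction equals the counterfactual label (hypothetical helper)."""
    if len(predicted) != len(counterfactual):
        raise ValueError("predicted and counterfactual must have equal length")
    if not predicted:
        raise ValueError("at least one intervention is required")
    matches = sum(p == c for p, c in zip(predicted, counterfactual))
    return matches / len(predicted)


def summarize_runs(iia_per_run):
    """Return (mean, sample standard deviation) of per-run IIA values.
    Assumption: the reported +/- is a spread across independent runs."""
    return mean(iia_per_run), stdev(iia_per_run)


# Toy placeholder labels for a 10-class task (not data from the paper).
predicted = [3, 7, 1, 9, 0, 4, 4, 2]
counterfactual = [3, 7, 1, 9, 0, 4, 5, 2]
iia = interchange_intervention_accuracy(predicted, counterfactual)

# Toy placeholder per-run IIA values (not results from the paper).
run_mean, run_std = summarize_runs([0.90, 0.95, 1.00])
```

In the toy example, seven of the eight predictions match their counterfactual labels, so `iia` is 0.875.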

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS behavioral loss produces EMD along feature dimensions of 0.032±0.003 on synthetic 10-class dataset · finding · 0.854
Quantitative baseline for divergence using behavioral DAS loss on synthetic dataset
Modified CL loss achieves IIA of 0.9988±0.0005 on synthetic 10-class dataset training/test sets · finding · 0.832
IIA for modified CL loss on synthetic dataset, comparable to behavioral DAS
For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM · finding · 0.784
Empirical result showing the CL loss can reduce divergence without sacrificing interpretability accuracy
Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition · finding · 0.783
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
DAS achieves 100% IIA on hierarchical equality task with |N|=16, intervention size 8, Layer 1 · finding · 0.767
DAS discovers a perfect alignment between the feed-forward network and the Both Equality Relations high-level model.
DAS on randomly initialized small networks (|N|=16) achieves only 0.50 IIA (chance), cannot construct new behaviors · finding · 0.758
Demonstrates DAS cannot manufacture behaviors from random structure in appropriately sized networks.
DAS learning rate of 5e-3 outperforms 1e-3 (used in Wu et al. 2023) for small training sets in CausalGym · finding · 0.750
Hyperparameter tuning result for DAS; the best learning rate differs from prior work because the training set is smaller
SAE training loss (MSE + L1 penalty with decoder norm scaling) · method · 0.743
The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.