finding

active

finding:for-small-cl-loss-weights-epsilon-iia-is-maintained-potentially-improved-while-emd-decreases-in-boundless-das-on-a-7b-llm

For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM

Empirical result showing that the CL loss can reduce representational divergence (measured by earth mover's distance, EMD) without sacrificing interchange intervention accuracy (IIA)
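The two quantities in this finding can be illustrated with a minimal sketch. It assumes the CL term enters training as a weighted auxiliary loss (behavioral loss plus epsilon times the CL loss) and that EMD is computed independently along each feature dimension and then averaged. Both are assumptions drawn from the wording on this page, not the paper's confirmed implementation. The function names `emd_1d`, `mean_feature_emd`, and `total_loss` are hypothetical.

```python
def emd_1d(xs, ys):
    """1-D earth mover's distance between two equal-size samples.

    For equal sample counts, this is the mean absolute difference
    between the sorted values of the two samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mean_feature_emd(natural, intervened):
    """Average the 1-D EMD over feature dimensions.

    `natural` and `intervened` are lists of rows (one row per sample),
    standing in for native and intervened representations.
    """
    dims = len(natural[0])
    per_dim = [
        emd_1d([row[d] for row in natural], [row[d] for row in intervened])
        for d in range(dims)
    ]
    return sum(per_dim) / dims


def total_loss(behavioral_loss, cl_loss, epsilon):
    """Assumed combination: behavioral loss plus epsilon-weighted CL loss."""
    return behavioral_loss + epsilon * cl_loss
```

Under these assumptions, a small `epsilon` keeps the behavioral term dominant, which is consistent with IIA being maintained while the divergence measure drops.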

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

The CL auxiliary loss can directly reduce representational divergence in practical interpretability settings without sacrificing interpretability method accuracy
associated_with · supports
Central practical contribution: the CL loss offers a viable mitigation strategy

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Modified CL loss produces EMD along feature dimensions of 0.007±0.001 on synthetic 10-class dataset · finding · 0.785
Quantitative improvement in divergence reduction using the modified CL loss on synthetic dataset
DAS behavioral loss achieves IIA of 0.997±0.001 on synthetic 10-class dataset training/test sets · finding · 0.784
IIA baseline for DAS behavioral loss on synthetic dataset
Modified CL loss achieves IIA of 0.9988±0.0005 on synthetic 10-class dataset training/test sets · finding · 0.780
IIA for modified CL loss on synthetic dataset, comparable to behavioral DAS
Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition · finding · 0.775
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
DAS behavioral loss produces EMD along feature dimensions of 0.032±0.003 on synthetic 10-class dataset · finding · 0.766
Quantitative baseline for divergence using behavioral DAS loss on synthetic dataset
DB-MTL training losses decrease smoothly and gradient norms are lower than EW on NYUv2, indicating training stability. · finding · 0.765
Training stability analysis.
DB-MTL with EMA forgetting rate β in a wide range performs better than without EMA (β=0) on Office-31. · finding · 0.764
Effect of EMA forgetting rate on performance.
Toxic LLMs show higher IIA when compared to other toxic models than when compared to nontoxic models using stepwise MAS · finding · 0.744
Proof-of-principle that MAS can detect model misalignment in DeepSeek-R1-Qwen-1.5B fine-tuned models.