claim
active
claim:the-cl-auxiliary-loss-can-directly-reduce-representational-divergence-in-practical-interpretability-settings-without-sacrificing-interpretability-method-accuracy
The CL auxiliary loss can directly reduce representational divergence in practical interpretability settings without sacrificing interpretability method accuracy
Central practical contribution: the CL loss offers a viable mitigation strategy
Source paper
extracted_from · (2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts
Neighborhood — ranked by edge-count
Findings (1)
finding
- For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM · associated_with · supports · Empirical result showing the CL loss can reduce divergence without sacrificing interpretability accuracy
Questions (1)
question
- Third core research question motivating the CL loss approach in Section 5
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Explicitly identified limitation of the proposed mitigation method
- Auxiliary objective combining L2 and cosine losses against pre-recorded CL vectors to improve causal relevance when one model is causally inaccessible.
- Core insight: reconstruction objective combined with appropriate initialization and KL regularization produces human-interpretable explanations as emergent property.
- Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition · finding · 0.755 · Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
- Central thesis of the paper
- Modified CL loss achieves IIA of 0.9988±0.0005 on synthetic 10-class dataset training/test sets · finding · 0.742 · IIA for modified CL loss on synthetic dataset, comparable to behavioral DAS
- Motivating claim that mechanistic explanations add clinical value for VUS.
- Practical utility of reducing divergence demonstrated through regression analysis
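
One entry above describes the auxiliary objective as combining L2 and cosine losses against pre-recorded CL vectors, and the finding refers to a small weight epsilon on that loss. The sketch below illustrates one way such an objective could be written. It is not the paper's implementation: the function names `cl_loss` and `total_loss`, the squared L2 form, the equal weighting of the two terms, the mean reduction, and the additive combination with a behavioral loss are all assumptions made for illustration.

```python
import math

def cl_loss(latents, targets, eps=1e-8):
    """Hypothetical CL-style loss: squared L2 distance plus (1 - cosine
    similarity), averaged over examples. Equal weighting of the two terms
    is an assumption, not something the source specifies."""
    total = 0.0
    for h, t in zip(latents, targets):
        # Squared L2 distance between a latent and its pre-recorded target.
        sq_l2 = sum((a - b) ** 2 for a, b in zip(h, t))
        # Cosine similarity, with eps guarding against division by zero.
        dot = sum(a * b for a, b in zip(h, t))
        norm_h = math.sqrt(sum(a * a for a in h))
        norm_t = math.sqrt(sum(b * b for b in t))
        cos = dot / (norm_h * norm_t + eps)
        total += sq_l2 + (1.0 - cos)
    return total / len(latents)

def total_loss(behavioral_loss, latents, targets, epsilon=0.01):
    """Assumed combination: behavioral loss plus epsilon times the CL term."""
    return behavioral_loss + epsilon * cl_loss(latents, targets)
```

With a small `epsilon`, the behavioral term dominates the total, which matches the regime the finding describes, where IIA is maintained while EMD decreases.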