Modified CL Loss

Novel variant of the CL loss, introduced in this paper, that targets only the causal subspace dimensions to improve out-of-distribution (OOD) performance

Neighborhood — ranked by edge-count

paper

concept

Causal abstraction
uses
A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs

framework

Counterfactual Latent (CL) Loss
extends
Auxiliary training objective from Grant (2025) that constrains intervened representations to remain near the natural distribution

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The modified CL loss is confined to a narrow set of simplistic settings and is not specific to pernicious divergence · concept · 0.804
Explicitly identified limitation of the proposed mitigation method
Counterfactual Latent (CL) Auxiliary Loss · method · 0.767
Auxiliary objective combining L2 and cosine losses against pre-recorded CL vectors to improve causal relevance when one model is causally inaccessible.
Modified CL loss achieves IIA of 0.9988±0.0005 on synthetic 10-class dataset training/test sets · finding · 0.758
IIA for modified CL loss on synthetic dataset, comparable to behavioral DAS
Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition · finding · 0.750
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
Modified CL loss produces EMD along feature dimensions of 0.007±0.001 on synthetic 10-class dataset · finding · 0.738
Quantitative improvement in divergence reduction using the modified CL loss on synthetic dataset
L_retain Loss Term · concept · 0.737
Regularization component of the composite loss that penalizes deviation from baseline model behavior on Alpaca instructions
Soft Loss · concept · 0.712
Loss computed using continuous relaxations of logic gates during training
Loss Function · concept · 0.697
In machine learning, a function measuring the distance between current and desired output; analogous to stress.
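
The entries above describe the CL auxiliary loss as a combination of L2 and cosine losses against pre-recorded CL vectors, and the modified variant as targeting only the causal subspace dimensions. The following is a minimal sketch of that idea, not the paper's implementation. The function name `cl_loss`, the `dims` parameter, the equal weighting of the two terms, and the assumption that the vectors are already expressed in the aligned basis are all hypothetical choices made for illustration.

```python
import numpy as np

def cl_loss(h, cl, dims=None, eps=1e-8):
    """Illustrative L2 + cosine loss (hypothetical helper, not from the paper).

    h    : intervened representations, shape (batch, d)
    cl   : pre-recorded CL vectors, shape (batch, d)
    dims : optional indices of the causal subspace dimensions; when given,
           only those dimensions contribute (assumed "modified" variant)
    """
    h = np.asarray(h, dtype=float)
    cl = np.asarray(cl, dtype=float)
    if dims is not None:
        # Assumption: the causal subspace is axis-aligned in this basis.
        h = h[:, dims]
        cl = cl[:, dims]
    # Squared L2 distance per example, averaged over the batch.
    l2 = np.mean(np.sum((h - cl) ** 2, axis=1))
    # Cosine distance (1 - cosine similarity) per example, averaged.
    norms = np.linalg.norm(h, axis=1) * np.linalg.norm(cl, axis=1)
    cos = np.sum(h * cl, axis=1) / (norms + eps)
    # Assumption: both terms are weighted equally.
    return l2 + np.mean(1.0 - cos)

# Vectors that agree on dimensions 0-1 and differ on dimensions 2-3.
h = np.array([[1.0, 2.0, 3.0, 4.0]])
cl = np.array([[1.0, 2.0, 0.0, 0.0]])
full = cl_loss(h, cl)                     # penalizes all dimensions
restricted = cl_loss(h, cl, dims=[0, 1])  # penalizes causal dimensions only
```

In this example the restricted loss is near zero because the two vectors differ only outside the selected dimensions, while the unrestricted loss is strictly positive.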