concept
active
concept:the-modified-cl-loss-is-confined-to-a-narrow-set-of-simplistic-settings-and-is-not-specific-to-pernicious-divergence
The modified CL loss is confined to a narrow set of simplistic settings and is not specific to pernicious divergence
Explicitly identified limitation of the proposed mitigation method
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Novel variant of CL loss introduced in this paper targeting only causal subspace dimensions to improve OOD performance
- Central practical contribution: the CL loss offers a viable mitigation strategy
- Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition · finding · 0.769 · Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
- Modified CL loss achieves IIA of 0.9988±0.0005 on synthetic 10-class dataset training/test sets · finding · 0.755 · IIA for modified CL loss on synthetic dataset, comparable to behavioral DAS
- Modified CL loss produces EMD along feature dimensions of 0.007±0.001 on synthetic 10-class dataset · finding · 0.755 · Quantitative improvement in divergence reduction using the modified CL loss on synthetic dataset
- Auxiliary training objective from Grant (2025) that constrains intervened representations to remain near natural distribution
- Sobering conclusion about the fundamental challenge posed by divergence for mechanistic interpretability
- Mechanistic evidence that the network actively attenuates injected perturbations, explaining late-layer introspection failure