concept

active

concept:the-modified-cl-loss-is-confined-to-a-narrow-set-of-simplistic-settings-and-is-not-specific-to-pernicious-divergence

The modified CL loss is confined to a narrow set of simplistic settings and is not specific to pernicious divergence

Explicitly identified limitation of the proposed mitigation method

Neighborhood — ranked by edge-count

Papers (1)

paper

Addressing divergent representations from causal interventions on neural networks
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Modified CL Loss · framework · 0.804
Novel variant of CL loss introduced in this paper targeting only causal subspace dimensions to improve OOD performance
The CL auxiliary loss can directly reduce representational divergence in practical interpretability settings without sacrificing interpretability method accuracy · claim · 0.797
Central practical contribution: the CL loss offers a viable mitigation strategy
Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition · finding · 0.769
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
Modified CL loss achieves IIA of 0.9988±0.0005 on synthetic 10-class dataset training/test sets · finding · 0.755
IIA for modified CL loss on synthetic dataset, comparable to behavioral DAS
Modified CL loss produces EMD along feature dimensions of 0.007±0.001 on synthetic 10-class dataset · finding · 0.755
Quantitative improvement in divergence reduction using the modified CL loss on synthetic dataset
Counterfactual Latent (CL) Loss · framework · 0.754
Auxiliary training objective from Grant (2025) that constrains intervened representations to remain near natural distribution
Any divergence outside of the null-space of NN layers is potentially pernicious, posing challenges for a complete mechanistic understanding of NNs · claim · 0.754
Sobering conclusion about the fundamental challenge posed by divergence for mechanistic interpretability
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.745
Mechanistic evidence that the network actively attenuates injected perturbations, explaining late-layer introspection failure
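
The "related by similarity" rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration only: the function names, the toy two-dimensional embeddings, and the zero-vector convention are all assumptions, and the page does not say how its embeddings or scores are actually computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def related_by_similarity(query, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest first.

    Entities that already have a typed edge to the query are skipped,
    mirroring the "no typed edge" condition.
    """
    scored = []
    for name, vector in candidates.items():
        if name in typed_edges:
            continue
        score = cosine_similarity(query, vector)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings, for illustration only.
query = [1.0, 0.0]
candidates = {
    "near": [0.9, 0.1],    # close in direction to the query
    "far": [0.0, 1.0],     # orthogonal to the query
    "linked": [1.0, 0.0],  # identical, but already has a typed edge
}
result = related_by_similarity(query, candidates, typed_edges={"linked"})
```

Here only `near` survives: `far` falls below the threshold and `linked` is excluded because it already has a typed relation, which is why such a list surfaces candidates for new edges or unrecognized duplicates.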