claim

active

claim:the-l-retain-regularization-objective-is-empirically-effective-at-preserving-unrelated-model-capabilities-during-cone-training

The L_retain regularization objective is empirically effective at preserving unrelated model capabilities during cone training

Interpretation of low KL divergence results as validation of the training objective

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Findings (1)

finding

Qwen-2.5-14B mean KL divergence on Alpaca prompts after truth-direction ablation is 0.038
supports
Experiment 3 result showing minimal behavioral drift from truth intervention in Qwen 14B
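
The finding above reports a mean KL divergence between model outputs before and after an intervention. A minimal sketch of how such a number could be computed is given here, under assumptions not stated in this card: the comparison is over next-token distributions, the direction is KL(base || ablated), and the mean is taken over token positions. The function names and the random stand-in logits are hypothetical; this is not the paper's evaluation code.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, ablated_logits):
    # Assumed definition: KL(p_base || p_ablated) per position, then averaged.
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())

# Stand-in logits with shape (positions, vocab); real use would take these
# from the unmodified and the ablated model on the same prompts.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 10))
ablated = base + 0.05 * rng.normal(size=(4, 10))

print(mean_kl(base, base))     # identical distributions give zero divergence
print(mean_kl(base, ablated))  # a small perturbation gives a small positive value
```

A value near zero on prompts unrelated to the intervention is what the card reads as minimal behavioral drift.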

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

L2 regularisation with bias term delivers best probe performance; L2 regularisation increases probe selectivity · finding · 0.759
Hyperparameter tuning result for probes; consistent with Hewitt and Liang 2019 finding
In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth · claim · 0.731
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
Current training methods rely on loss minimization, meaning the experiential profile of training is predominantly negative across billions of parameter updates · claim · 0.728
Ethical implication about the nature of AI training experience if the thesis holds
Why does L2 regularisation increase probe causal efficacy (selectivity)? · question · 0.727
Open question identified in hyperparameter tuning experiments, left for future work
Training models with sparse activations cannot fully prevent polysemanticity because cross-entropy loss creates incentives for polysemantic neurons even without superposition · claim · 0.723
Author's conclusion after extensive investigation of architectural approaches to monosemanticity
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n = 1, 2, 3, 4 · finding · 0.718
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.714
Core cross-modal empirical result: larger and better language models align better with vision models
If EI maximization is used as a regularization in representation learning, then OOD generalization will improve beyond current invariant risk minimization methods · hypothesis · 0.713
Proposed conjecture in §4.3.1.