claim
active
claim:the-l-retain-regularization-objective-is-empirically-effective-at-preserving-unrelated-model-capabilities-during-cone-training
The L_retain regularization objective is empirically effective at preserving unrelated model capabilities during cone training
Interpretation of low KL divergence results as validation of the training objective
Source paper
extracted_from (2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4
Neighborhood — ranked by edge-count
Findings (1)
finding
- Experiment 3 result showing minimal behavioral drift from truth intervention in Qwen 14B
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Hyperparameter tuning result for probes; consistent with Hewitt and Liang 2019 finding
- Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
- Ethical implication about the nature of AI training experience if the thesis holds
- Open question identified in hyperparameter tuning experiments, left for future work
- Author's conclusion after extensive investigation of architectural approaches to monosemanticity
- Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
- Core cross-modal empirical result: larger and better language models align better with vision models
- Proposed conjecture in §4.3.1.