finding

active

finding:short-rationales-lora-cot-sometimes-improve-in-distribution-performance-but-do-not-reliably-reduce-cross-base-harm

Short rationales (LoRA+CoT) sometimes improve in-distribution performance but do not reliably reduce cross-base harm

Finding from E2 showing that CoT offers limited benefit for OOD transfer, consistent with a larger dr out of scope.

Source paper

extracted_from

The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring

(2025) · Edward Yi Chang · Zeyneb N. Kaya · Ethan Chang

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CoT improves in-distribution but may harm out-of-distribution generalization · claim · 0.776
Interpretation of scope generalization results
RL-CAI with CoT shows a Pareto improvement in the helpfulness-harmlessness tradeoff over standard RLHF, with a slight helpfulness decrease but higher harmlessness. · finding · 0.766
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
Clamping CoT probabilities to the 40-60% range for RL-CAI with CoT improves robustness and reduces extreme responses. · finding · 0.762
Section 4.3 describes clamping at 40-60 led to better behavior than clamping at 20-80.
Scope generalization: CoT boosts 2-digit in-distribution but worsens 3-4 digit OOD · finding · 0.756
CoT increases dr for OOD operands.
LoRA+CoT · method · 0.752
Fine-tuning with chain-of-thought rationales aiming to reduce dr via procedural alignment.
Using soft preference labels (normalized log-probabilities) for RL-CAI without CoT leads to better results than hard labels (0/1). · finding · 0.734
Section 4.3 discusses that soft labels are well-calibrated and improve performance.
Can targeted fine-tuning reverse RP suppression, given that LoRA caps both baseline and latent capacity? · question · 0.731
Practical intervention question arising from RP suppression finding
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.730
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
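
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the `embeddings` mapping, and the `typed_edges` set of unordered ID pairs are all hypothetical, and only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge.

    embeddings:  dict mapping entity ID -> embedding vector (hypothetical store)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first, matching the descending order of the list above.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by such a filter are the candidates for new edges or duplicate checks; anything already connected by a typed relation is excluded regardless of its score.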