finding
active
finding:clamping-cot-probabilities-to-40-60-range-for-rl-cai-with-cot-improves-robustness-and-reduces-extreme-responses
Clamping CoT probabilities to 40-60% range for RL-CAI with CoT improves robustness and reduces extreme responses.
Section 4.3 reports that clamping the CoT probabilities to the 40-60% range led to better behavior than clamping them to the 20-80% range.
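As a minimal sketch of what clamping means here: a preference probability produced after chain-of-thought reasoning is clipped into a fixed interval before being used as a soft label. The function names `clamp_probability` and `clamp_pair` are hypothetical, and the two-option setup is an assumption for illustration; this entry does not show the paper's actual implementation.

```python
def clamp_probability(p, low=0.40, high=0.60):
    """Clip a probability into [low, high] (hypothetical helper)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"expected a probability in [0, 1], got {p}")
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(high, p))


def clamp_pair(p_a, low=0.40, high=0.60):
    """Clamped soft labels for a two-option comparison (A vs. B).

    Assumes the interval is symmetric around 0.5, so that 1 - p_a
    also stays inside [low, high] after clamping p_a.
    """
    a = clamp_probability(p_a, low, high)
    return a, 1.0 - a


if __name__ == "__main__":
    # Near-certain raw probabilities are pulled toward the interval edges;
    # values already inside the interval pass through unchanged.
    for raw in (0.02, 0.45, 0.97):
        print(raw, clamp_pair(raw))
```

Passing `low=0.20, high=0.80` gives the wider range that the finding compares against.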
Source paper
extracted_from (2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Communities (2)
community
- CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
- Empirical studies showing CoT reasoning improves ID performance while harming OOD generalization, with probability calibration as a mitigation strategy.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
- Section 4.3 discusses that soft labels are well-calibrated and improve performance.
- A technique to avoid overconfident preference labels when using chain-of-thought, clamping within 40-60% range.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. (finding · 0.791) From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
- E2 finding showing CoT's limited benefit for OOD transfer, consistent with larger dr out of scope
- Empirical evidence that naive one-stage CoT fails in language-only setting; two-stage + vision achieves state-of-the-art.
- Figure 9 calibration plot shows good alignment.