finding

active

finding:clamping-cot-probabilities-to-40-60-range-for-rl-cai-with-cot-improves-robustness-and-reduces-extreme-responses

Clamping CoT probabilities to the 40-60% range in RL-CAI with CoT improves robustness and reduces extreme responses.

Section 4.3 reports that clamping at 40-60% led to better behavior than clamping at 20-80%.
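As a rough illustration of what clamping means here, the sketch below limits a preference probability to a fixed interval before it is used as a training label. The function names `clamp_probability` and `clamp_pair` are hypothetical, and treating the label as a single probability for option A (with option B receiving the remainder) is an assumption; the source paper's actual implementation is not described on this page.

```python
def clamp_probability(p: float, low: float = 0.40, high: float = 0.60) -> float:
    """Clamp a preference probability into [low, high].

    The default bounds mirror the 40-60% range named in the finding.
    Passing low=0.20, high=0.80 gives the wider range it is compared against.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {p}")
    if low > high:
        raise ValueError("low must not exceed high")
    return min(max(p, low), high)


def clamp_pair(p_a: float, low: float = 0.40, high: float = 0.60) -> tuple[float, float]:
    """Return clamped (P(A preferred), P(B preferred)) for a two-way comparison.

    Assumption: the two probabilities sum to 1, so clamping one side to a
    symmetric interval determines the other.
    """
    clamped = clamp_probability(p_a, low, high)
    return clamped, 1.0 - clamped


if __name__ == "__main__":
    # A very confident label is pulled back to the edge of the range.
    print(clamp_pair(0.97))
    # A label already inside the range is left unchanged.
    print(clamp_pair(0.55))
```

With a symmetric interval such as 40-60%, an overconfident label near 0 or 1 still indicates which response was preferred, but with a much weaker training signal than the raw value would give.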

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Communities (2)

community

Chain-of-Thought reasoning robustness & safety
members_of
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
Chain-of-thought generalization trade-offs
members_of
Empirical studies showing CoT reasoning improves ID performance while harming OOD generalization, with probability calibration as a mitigation strategy.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RL-CAI with CoT shows a Pareto improvement in helpfulness-harmlessness tradeoff over standard RLHF, with slight helpfulness decrease but higher harmlessness. · finding · 0.853
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
Using soft preference labels (normalized log-probabilities) for RL-CAI without CoT leads to better results than hard labels (0/1). · finding · 0.835
Section 4.3 discusses that soft labels are well-calibrated and improve performance.
Clamping CoT probabilities to 40-60% · method · 0.830
A technique to avoid overconfident preference labels when using chain-of-thought, clamping within 40-60% range.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.814
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. · finding · 0.791
From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
Short rationales (LoRA+CoT) sometimes improve in-distribution performance but do not reliably reduce cross-base harm · finding · 0.762
E2 finding showing CoT's limited benefit for OOD transfer, consistent with larger dr out of scope
One-stage CoT (QCM→RA) shows 12.31% accuracy drop vs. no-CoT (QCM→A) on ScienceQA; two-stage framework (rationale generation + answer inference) achieves 85.31% accuracy with vision features · finding · 0.761
Empirical evidence that naive one-stage CoT fails in language-only setting; two-stage + vision achieves state-of-the-art.
RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities. · finding · 0.758
Figure 9 calibration plot shows good alignment.
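
The related findings above mention soft preference labels built from normalized log-probabilities. One way this could look, as a minimal sketch: take the log-probabilities a feedback model assigns to each of two options and normalize them so they sum to 1. The function name `soft_preference_label` and the two-option softmax normalization are assumptions for illustration; this page does not specify the exact normalization used in the source paper.

```python
import math


def soft_preference_label(logp_a: float, logp_b: float) -> float:
    """Return a soft label: the normalized probability that option A is preferred.

    Assumption: the label is a softmax over the two options' log-probabilities.
    Subtracting the maximum first keeps the exponentials numerically stable.
    """
    m = max(logp_a, logp_b)
    exp_a = math.exp(logp_a - m)
    exp_b = math.exp(logp_b - m)
    return exp_a / (exp_a + exp_b)


def hard_preference_label(logp_a: float, logp_b: float) -> float:
    """Return the hard (0/1) counterpart, for comparison with the soft label."""
    return 1.0 if logp_a > logp_b else 0.0


if __name__ == "__main__":
    # Hypothetical log-probabilities for options A and B.
    logp_a, logp_b = math.log(0.3), math.log(0.1)
    print(soft_preference_label(logp_a, logp_b))
    print(hard_preference_label(logp_a, logp_b))
```

The soft label keeps information about how strongly one option was preferred, which the hard label discards.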