finding

active

finding:rl-cai-models-with-and-without-cot-are-rated-more-harmless-by-crowdworkers-than-hh-rlhf-and-sl-cai

RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI.

In Figure 3 and Figure 8, RL-CAI models (with and without CoT) achieve significantly higher harmlessness Elo scores than HH RLHF and SL-CAI.
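To read an Elo gap as a preference rate, the standard Elo formula can be used. The sketch below assumes the conventional base-10, 400-point scale; this entry does not record how the paper fitted its Elo scores, and the function name and the 100-point gap are illustrative only.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Expected probability that model A is preferred over model B.

    Assumes the conventional Elo scale (base 10, 400 points); this is an
    assumption, not a detail taken from the source paper.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical 100-point harmlessness gap, for illustration only.
    print(round(elo_win_probability(100.0, 0.0), 3))
```

Under this convention, equal scores give a 50% preference rate, and a 100-point lead corresponds to being preferred in roughly 64% of comparisons.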

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Claims (2)

claim

Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels.
supports
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
AI feedback can effectively replace human feedback for harmlessness in RLHF-style training.
supports
The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.

Communities (2)

community

Chain-of-Thought reasoning robustness & safety
members_of
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
Constitutional AI safety training methods
members_of
Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive. · finding · 0.856
Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.847
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF. · finding · 0.845
From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
RL-CAI with CoT shows a Pareto improvement in helpfulness-harmlessness tradeoff over standard RLHF, with slight helpfulness decrease but higher harmlessness. · finding · 0.834
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
Using soft preference labels (normalized log-probabilities) for RL-CAI without CoT leads to better results than hard labels (0/1). · finding · 0.813
Section 4.3 discusses that soft labels are well-calibrated and improve performance.
Clamping CoT probabilities to the 40-60% range for RL-CAI with CoT improves robustness and reduces extreme responses. · finding · 0.791
Section 4.3 describes clamping at 40-60 led to better behavior than clamping at 20-80.
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4. · finding · 0.780
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities. · finding · 0.772
Figure 9 calibration plot shows good alignment.
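
The soft-label and clamping findings listed above can be sketched in a few lines. This is a minimal illustration, assuming the feedback model scores two candidate responses with log-probabilities; the function names and the two-option setup are hypothetical and are not taken from the paper's code.

```python
import math


def soft_label(logp_a: float, logp_b: float) -> float:
    """Normalized probability that response A is preferred (soft label)."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def hard_label(logp_a: float, logp_b: float) -> float:
    """0/1 label that keeps only which response scored higher."""
    return 1.0 if logp_a >= logp_b else 0.0


def clamp_label(p: float, low: float = 0.4, high: float = 0.6) -> float:
    """Clamp a preference probability into [low, high], e.g. the 40-60% range."""
    return min(max(p, low), high)


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    p = soft_label(math.log(0.9), math.log(0.1))
    print(p, hard_label(math.log(0.9), math.log(0.1)), clamp_label(p))
```

The soft label keeps the feedback model's confidence, the hard label discards it, and clamping caps how confident a training target is allowed to be.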