community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c13-c0
Constitutional AI safety training methods
Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.
8 members.
Drawn from 1 source
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (1)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Findings (7)
- Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. Figure 10 (solid lines at T=1, dashed at T=0): the helpful RLHF score rises, the others fall.
- RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities. The Figure 9 calibration plot shows good alignment.
- RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. In Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
- RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive. Section 4.4 and Appendix D show examples; crowdsourced tests confirm a preference for non-evasive responses.
- RL-CAI with CoT shows a Pareto improvement in the helpfulness-harmlessness tradeoff over standard RLHF: helpfulness decreases slightly while harmlessness is higher. Figure 2 and Figure 8 place RL-CAI at the Pareto frontier.
- SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF. In Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF and less harmless than HH RLHF.
- SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n = 1, 2, 3, 4. Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
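
Several of the findings above compare harmlessness Elo scores. As a reading aid, here is a minimal sketch of how an Elo gap maps to an expected pairwise preference rate. It assumes the standard Elo convention (base 10, scale 400); this summary does not say which convention the source uses. The function name `elo_win_probability` and the example gaps are hypothetical and are not values read from the figures.

```python
def elo_win_probability(elo_diff: float) -> float:
    """Expected probability that model A is preferred over model B.

    elo_diff is Elo(A) - Elo(B). Assumes the standard Elo convention
    (base 10, scale 400); the source may use a different one.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    # Illustrative gaps only; these are not numbers from the source.
    for diff in (0, 100, 200, 400):
        rate = elo_win_probability(diff)
        print(f"Elo gap {diff:>3}: A preferred in about {rate:.1%} of comparisons")
```

Under this convention a gap of 0 means the two models are preferred equally often, and the mapping is symmetric: swapping the two models turns a preference rate p into 1 - p.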
Claims (1)
- Automated red teaming can be scaled up when harmlessness and helpfulness are more compatible, improving robustness. Section 6.1 suggests future work on scaling automated red teaming.