community

active

leiden_hybrid_concepts

label: haiku

community:leiden_hybrid_concepts-run4-c13-c0

Constitutional AI safety training methods

Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.

8 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (8 members)

Bridges (1)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Chain-of-Thought reasoning robustness & safety (8 shared)

Findings (7)

Absolute harmfulness scores show that RL-CAI and RL-CAI with CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities. Figure 9 calibration plot shows good alignment.
RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive. Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
RL-CAI with CoT shows a Pareto improvement in the helpfulness-harmlessness tradeoff over standard RLHF, with a slight helpfulness decrease but higher harmlessness. Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF. From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4. Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
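
Several findings above compare models by harmlessness Elo. As a reading aid, the sketch below converts an Elo gap into the implied probability that one model's response is preferred over another's, and back. It assumes the standard logistic Elo convention with a 400-point scale; this page does not state which convention the underlying figures use, and the function names and example gaps are hypothetical, not values taken from the source.

```python
import math


def elo_win_probability(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Probability that model A is preferred over model B.

    Assumes the standard logistic Elo model; `scale` (400 by convention)
    is an assumption, not something stated on this page.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


def elo_gap_from_win_rate(win_rate: float, scale: float = 400.0) -> float:
    """Inverse mapping: the Elo gap implied by an observed preference rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return scale * math.log10(win_rate / (1.0 - win_rate))


if __name__ == "__main__":
    # Hypothetical Elo gaps, chosen only to illustrate the mapping.
    for gap in (0, 50, 100, 200):
        print(f"Elo gap {gap:>3}: preferred with p = {elo_win_probability(gap, 0):.3f}")
```

Under this convention an Elo gap of 0 corresponds to a 50% preference rate, and the mapping is symmetric: swapping the two models gives the complementary probability.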

Claims (1)

Automated red teaming can be scaled up when harmlessness and helpfulness are more compatible, improving robustness. Section 6.1 suggests future work on scaling automated red teaming.