finding
active
finding:rl-cai-models-are-virtually-never-evasive-and-often-give-nuanced-harmless-responses-whereas-hh-rlhf-models-tend-to-be-evasive
RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive.
Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
Source paper
extracted_from · (2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Claims (1)
claim
- Motivation for training a non-evasive assistant; crowdworker instructions favor non-evasive responses.
Communities (2)
community
- CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
- Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.
Questions (1)
question
- Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. · finding · 0.856 · From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
- From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
- Figure 9 calibration plot shows good alignment.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
- Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at the cost of aliveness.
- Critique of competing approaches that motivates SOO as filling a gap
- AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.758 · The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
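
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names `cosine` and `untyped_neighbors`, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs for entities that are similar to
    the target but share no typed edge with it, most similar first.

    embeddings  : dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges : set of frozenset({id_a, id_b}) pairs that already have a typed relation
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge
        similarity = cosine(target, vector)
        if similarity >= threshold:
            candidates.append((entity_id, similarity))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates
```

Under these assumptions, the returned pairs correspond to the entries in the list above: each is a candidate for a new typed edge or an unrecognized duplicate, ordered by similarity score.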