claim
active
claim:ai-feedback-can-effectively-replace-human-feedback-for-harmlessness-in-rlhf-style-training
AI feedback can effectively replace human feedback for harmlessness in RLHF-style training.
The paper demonstrates that RLAIF guided by constitutional principles matches or exceeds HH RLHF on harmlessness.
Source paper
extracted_from · (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Findings (1)
finding
- From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores than HH RLHF.
Communities (2)
community
- Alive AI interface ethics & design · members_of · Explores aliveness, aesthetics, welfare, and ethical responsibility in AI interaction design.
- Methods for training safe AI systems using AI feedback instead of human labels, scaling supervision as capabilities grow.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- Foundational RLHF paper introducing the HHH training objective for Claude
- Foundational motivation for the research.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
- Highlights the practical impact of CAI.
- From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
- Related work studying capability of LLMs to subvert safety measures if severely misaligned
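
The "Related by similarity" criterion above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the page's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered ID pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but unlinked.

    embeddings  -- hypothetical dict: entity ID -> embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) for existing typed relations
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Most similar first
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data (assumed for illustration only)
embeddings = {
    "claim": [1.0, 0.0],
    "near": [0.9, 0.1],
    "linked": [1.0, 0.0],
    "far": [0.0, 1.0],
}
typed_edges = {frozenset(("claim", "linked"))}

result = related_by_similarity("claim", embeddings, typed_edges)
```

In this toy data, "linked" is excluded despite being identical to the target because it already has a typed edge, and "far" falls below the threshold, so only "near" is surfaced as a candidate.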