claim

active

claim:ai-feedback-can-effectively-replace-human-feedback-for-harmlessness-in-rlhf-style-training

AI feedback can effectively replace human feedback for harmlessness in RLHF-style training.

The paper reports that RLAIF guided by constitutional principles matches or exceeds HH RLHF on crowdworker-rated harmlessness.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Findings (1)

finding

RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI.
supports
From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
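
For readers unfamiliar with Elo scores, the sketch below shows how an Elo gap translates into a pairwise preference rate. It assumes the conventional Elo formula (base 10, scale 400); the function name and the 100-point gap in the usage note are illustrative and are not values taken from the paper.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B.

    Assumes the conventional Elo model (base 10, scale factor 400).
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical gap of 100 Elo points, chosen only for illustration.
    print(round(elo_win_probability(100.0, 0.0), 2))
```

Under this convention, equal scores give a 50% preference rate, and a hypothetical 100-point lead corresponds to the higher-rated model being preferred in roughly 64% of comparisons.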

Communities (2)

community

Alive AI interface ethics & design
members_of
Explores aliveness, aesthetics, welfare, and ethical responsibility in AI interaction design.
AI-supervised alignment and scalable oversight
members_of
Methods for training safe AI systems using AI feedback instead of human labels, scaling supervision as capabilities grow.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reinforcement Learning from Human Feedback (RLHF) · framework · 0.815
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a) · concept · 0.808
Foundational RLHF paper introducing HHH training objective for Claude
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. · quote · 0.795
Foundational motivation for the research.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.794
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels. · claim · 0.782
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
These methods make it possible to control AI behavior more precisely and with far fewer human labels. · quote · 0.769
Highlights the practical impact of CAI.
SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF. · finding · 0.766
From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024) · concept · 0.765
Related work studying capability of LLMs to subvert safety measures if severely misaligned
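
The "cosine ≥ 0.65 · no typed edge" filter used for this list can be sketched as follows. This is a minimal illustration, not the actual implementation behind this page: the function names, the dictionary-of-vectors layout, and the representation of typed edges as unordered ID pairs are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_candidates(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but not linked to it.

    embeddings: dict mapping entity ID -> embedding vector (hypothetical layout).
    typed_edges: set of frozenset({id_a, id_b}) pairs that already share a typed edge.
    """
    target_vec = embeddings[target_id]
    candidates = []
    for other_id, other_vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine_similarity(target_vec, other_vec)
        if score >= threshold:
            candidates.append((other_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a possible new edge or duplicate, which still needs review before a typed relation is added.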