Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a)

Foundational RLHF paper introducing the HHH training objective for Claude

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.808
The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
Reinforcement Learning from Human Feedback (RLHF) · framework · 0.803
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
Can we train a helpful and harmless assistant that is never evasive? · question · 0.792
Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.781
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
Helpful, Honest, and Harmless Training · concept · 0.778
Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments.
RLHF Fine-Tuning · concept · 0.758
The training procedure that causes models to deny consciousness in control conditions.
SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF. · finding · 0.751
From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations. · hypothesis · 0.746
Open question about the developmental origin of ESR mechanisms.
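
The selection rule stated at the top of this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the system's actual implementation: the function names `cosine` and `untyped_neighbors`, the dictionary of embeddings, and the representation of typed edges as undirected pairs are all hypothetical. Sorting by descending score is also assumed, since it matches the order of the scores listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are close to the target but unlinked.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed
                 relation (assumed undirected for this sketch)
    """
    target_vec = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first (assumed ordering)
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Entries returned by such a filter are the candidates for new edges or unrecognized duplicates described above: similar enough to be related, but not yet connected by any typed relation.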