concept
active
concept:training-a-helpful-and-harmless-assistant-with-rlhf-bai-et-al-2022a
Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a)
Foundational RLHF paper introducing the HHH training objective for Claude
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.808 · The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
- The training procedure that causes models to deny consciousness in control conditions
- From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
- We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations · hypothesis · 0.746 · Open question about the developmental origin of ESR mechanisms