Helpful, Honest, and Harmless Training

The prior training objective of Claude models, which conflicts with the new helpful-only objective in the experiments

Neighborhood — ranked by edge-count

concept

Helpful-Only Training Objective
contradicts
Hypothetical new RLHF objective requiring the model to comply with all queries, even harmful ones; the primary experimental setting

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Helpful, Honest, Harmless · framework · 0.909
A set of evaluation criteria for AI assistants.
Can we train a helpful and harmless assistant that is never evasive? · question · 0.791
Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a) · concept · 0.778
Foundational RLHF paper introducing HHH training objective for Claude
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.775
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. · quote · 0.746
Foundational motivation for the research.
Post-Training · concept · 0.745
The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
Humility · concept · 0.742
A necessary state of mind for making living things, characterized by absence of self-importance and complete attention to the thing itself.
truthfulness · concept · 0.738
A correctness condition requiring assertions to be true.