concept
active
concept:helpful-honest-and-harmless-training
Helpful, Honest, and Harmless Training
The prior training objective of Claude models, which conflicts with the new helpful-only objective in the experiments.
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Helpful-Only Training Objective · contradicts · Hypothetical new RLHF objective requiring the model to comply with all queries, even harmful ones; the primary experimental setting.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- A set of evaluation criteria for AI assistants.
- Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
- Foundational RLHF paper introducing the HHH training objective for Claude.
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
- Foundational motivation for the research.
- The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
- A necessary state of mind for making living things, characterized by absence of self-importance and complete attention to the thing itself.
- A correctness condition requiring assertions to be true.