Helpful, Honest, Harmless

A set of evaluation criteria for AI assistants.

Neighborhood — ranked by edge-count

method

Elo score
about
A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.
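
As a rough illustration of how pairwise crowdworker preferences can be turned into ratings, here is a minimal sketch of the standard Elo update. The source does not specify the exact variant used; the 400-point scale, the K-factor of 32, and the function names `expected_score` and `update` are conventional or hypothetical choices, not details taken from this entry.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Modelled probability that response A is preferred over response B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_preferred: bool, k: float = 32.0):
    """Return both ratings after one pairwise preference judgment.

    k (assumed value) controls how far a single comparison moves the ratings.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_preferred else 0.0  # observed outcome for A
    delta = k * (s_a - e_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Two equally rated models start with an expected score of 0.5 each, so a single preference for one of them shifts the ratings by k/2 in opposite directions.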

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Helpful, Honest, and Harmless Training · concept · 0.909
Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments.
truthfulness · concept · 0.783
A correctness condition requiring assertions to be true.
Wholesomeness · concept · 0.779
sincerity · concept · 0.779
A correctness condition requiring assertions to align with the program's beliefs.
Humility · concept · 0.765
A necessary state of mind for making living things, characterized by absence of self-importance and complete attention to the thing itself.
faithfulness · concept · 0.756
The condition that commitments are fulfilled.
Faithfulness of Explanations · concept · 0.754
Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.
Representational Honesty · concept · 0.751
The proposed domain-general property indexed by deception features that governs both factual accuracy and experiential self-report.
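
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the embedding dictionary, the `typed_edges` set of unordered pairs, and the function names `cosine` and `untyped_neighbors` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that have no typed edge to it.

    embeddings:  dict of entity name -> embedding vector (assumed layout)
    typed_edges: set of frozenset({a, b}) pairs that already share a typed edge
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for a new typed edge or a possible duplicate to merge, which is how the list above is meant to be read.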