framework
active
framework:helpful-honest-harmless

Helpful, Honest, Harmless

A set of evaluation criteria for AI assistants.

Neighborhood — ranked by edge-count

Methods (1)

method
  • A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
  • truthfulness · concept · 0.783
    A correctness condition requiring assertions to be true.
  • Wholesomeness · concept · 0.779
  • sincerity · concept · 0.779
    A correctness condition requiring assertions to align with the program's beliefs.
  • Humility · concept · 0.765
    A necessary state of mind for making living things, characterized by absence of self-importance and complete attention to the thing itself.
  • faithfulness · concept · 0.756
    The condition that commitments are fulfilled.
  • Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.
  • The proposed domain-general property indexed by deception features that governs both factual accuracy and experiential self-report
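
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` set are all hypothetical, and the embedding vectors in the usage lines are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but unlinked.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical store)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already share a typed relation
    threshold:   minimum cosine similarity; 0.65 is taken from the page header
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target_id, entity_id)) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first, matching the descending order of the list above.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates


# Usage with made-up two-dimensional embeddings.
example_embeddings = {
    "framework:helpful-honest-harmless": [1.0, 0.0],
    "method:linked": [1.0, 0.0],       # similar, but already has a typed edge
    "concept:near": [0.8, 0.6],        # cosine 0.8, no typed edge
    "concept:far": [0.0, 1.0],         # cosine 0.0
}
example_edges = {frozenset(("framework:helpful-honest-harmless", "method:linked"))}
example_result = related_by_similarity(
    "framework:helpful-honest-harmless", example_embeddings, example_edges
)
```

Entities returned this way are only candidates: a high score suggests a possible new edge or a duplicate, and each still needs review before a typed relation is added.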