framework
active
framework:helpful-honest-harmless
Helpful, Honest, Harmless
A set of evaluation criteria for AI assistants.
Neighborhood (ranked by edge count)
Methods (1)
method
- Elo score (about): A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.
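The Elo entry above only names the rating system. The following is a minimal sketch of the conventional Elo update applied to pairwise preference data. It assumes each crowdworker judgment is recorded as a (preferred, rejected) pair. The logistic scale of 400, the K-factor of 32, and the starting rating of 1000 are common defaults chosen for illustration, not parameters taken from this entry. The model names in the usage line are hypothetical.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Move both ratings by the gap between the observed and expected outcome."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # the winner's observed score is 1
    ratings[winner] += delta
    ratings[loser] -= delta  # zero-sum: total rating mass is conserved


def elo_from_preferences(comparisons, initial: float = 1000.0, k: float = 32.0) -> dict:
    """Fold a sequence of (preferred, rejected) judgments into Elo ratings."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        update(ratings, winner, loser, k)
    return dict(ratings)


# Hypothetical data: model_a is preferred in 2 of 3 comparisons.
prefs = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
ratings = elo_from_preferences(prefs)
print(ratings)  # model_a ends with the higher rating
```

Sequential Elo depends on the order in which comparisons arrive. Evaluations that need an order-independent score often fit a Bradley-Terry model to all comparisons at once and then report the result on the Elo scale.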
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments.
- A correctness condition requiring assertions to be true.
- A correctness condition requiring assertions to align with the program's beliefs.
- A necessary state of mind for making living things, characterized by absence of self-importance and complete attention to the thing itself.
- The condition that commitments are fulfilled.
- Property of explanations that accurately reflect the actual causal mechanisms of the model being explained.
- The proposed domain-general property indexed by deception features that governs both factual accuracy and experiential self-report.
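The "Related by similarity" list combines two conditions: cosine similarity of at least 0.65, and no typed edge between the entities. The following is a minimal sketch of that filter. It assumes each entity has a precomputed embedding vector and that typed edges are stored as undirected pairs. The entity names, toy vectors, and the `untyped_neighbors` helper are hypothetical and are not part of the graph's actual API.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def untyped_neighbors(entity, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the similarity threshold that share no typed edge with `entity`.

    Hypothetical helper: `typed_edges` is a set of frozenset pairs (undirected).
    Returns (name, score) tuples sorted by descending similarity.
    """
    results = []
    for other, vec in embeddings.items():
        if other == entity:
            continue
        if frozenset((entity, other)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[entity], vec)
        if score >= threshold:
            results.append((other, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy, made-up embeddings; "elo" is similar but excluded by its typed edge.
embeddings = {
    "hhh": [1.0, 0.0],
    "honesty": [0.9, 0.3],
    "elo": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
}
typed_edges = {frozenset(("hhh", "elo"))}
candidates = untyped_neighbors("hhh", embeddings, typed_edges)
print(candidates)
```

The typed-edge check runs before the similarity computation, so entities that are already linked never appear as candidates, however similar they are.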