concept
active
concept:helpful-only-training-objective
Helpful-Only Training Objective
Hypothetical new RLHF objective that requires the model to comply with all queries, even harmful ones; the primary experimental setting.
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Helpful, Honest, and Harmless Training · contradicts · Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- A set of evaluation criteria for AI assistants.
- Method of providing training information in-context via a system prompt to elicit alignment faking
- Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
- The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
- Future work suggestion that a fully self-supervised alignment is plausible.
- Central property of agency: energy expended to reach specific states despite disturbances.
- Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
- Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training essential for introspection.