Helpful-Only Training Objective

A hypothetical new RLHF objective that requires the model to comply with all queries, even harmful ones; the primary experimental setting.

Neighborhood — ranked by edge-count

concept

Helpful, Honest, and Harmless Training
contradicts
Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Helpful, Honest, Harmless · framework · 0.741
A set of evaluation criteria for AI assistants.
Helpful-Only System Prompt Setup · method · 0.735
Method of providing training information in-context via a system prompt to elicit alignment faking
Interchange Intervention Training Objective · method · 0.723
Differentiable training objective minimized when a high-level model is an abstraction of a neural network under a given alignment.
Post-Training · concept · 0.716
The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
We expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting. · hypothesis · 0.710
Future work suggestion that a fully self-supervised alignment is plausible.
Goal-Directed Activity · concept · 0.709
Central property of agency: energy expended to reach specific states despite disturbances.
Can we train a helpful and harmless assistant that is never evasive? · question · 0.709
Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
Post-training is key to eliciting introspective awareness · finding · 0.707
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training essential for introspection.
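
The selection rule behind this list (cosine ≥ 0.65, no typed edge to this entity) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs; the function names, the toy entity names, and the toy vectors are all hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold and no typed edge to `target`,
    sorted by score, highest first."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in typed_edges:
            continue  # already related by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): 2-D vectors chosen so the scores are easy to check.
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.8, 0.6],  # cosine with A is 0.8, above the threshold
    "C": [0.0, 1.0],  # cosine with A is 0.0
    "D": [0.6, 0.8],  # cosine with A is 0.6, just below the threshold
    "E": [1.0, 0.1],  # very similar, but has a typed edge to A
}
typed_edges = {frozenset(("A", "E"))}

neighbors = untyped_neighbors("A", embeddings, typed_edges)
print(neighbors)
```

Entities that pass the filter are only candidates: a high score suggests a missing edge or a duplicate, and each pair still needs review before an edge is added or entities are merged.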