Can we train a helpful and harmless assistant that is never evasive?

Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Findings (1)

finding

RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive.
answered_by
Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.

Claims (1)

claim

Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels.
gates
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a) · concept · 0.792
Foundational RLHF paper introducing HHH training objective for Claude
Helpful, Honest, and Harmless Training · concept · 0.791
Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. · quote · 0.769
Foundational motivation for the research.
Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility. · claim · 0.759
Motivation for training a non-evasive assistant, and crowdworker instructions favor non-evasive responses.
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.741
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training. · claim · 0.738
Key mechanistic claim about the developmental origin of the Assistant persona
Most AI assistants are anti-Alexander by design: they perform helpfulness, show work, and list options rather than resolving into calm. · claim · 0.737
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.734
Second of two central questions motivating the paper