question
active
question:can-we-train-a-helpful-and-harmless-assistant-that-is-never-evasive
Can we train a helpful and harmless assistant that is never evasive?
Central goal of the paper: reducing the tension between helpfulness and harmlessness by eliminating evasive responses.
Source paper
extracted_from (2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Findings (1)
finding
- Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
Claims (1)
claim
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Foundational RLHF paper introducing HHH training objective for Claude
- Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
- Foundational motivation for the research.
- Motivation for training a non-evasive assistant, and crowdworker instructions favor non-evasive responses.
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
- Key mechanistic claim about the developmental origin of the Assistant persona
- Second of two central questions motivating the paper