question
active
question:are-critiques-necessary-for-improving-harmlessness-in-the-supervised-stage
Are critiques necessary for improving harmlessness in the supervised stage?
Section 3.5 explicitly investigates whether generating revisions directly, with the critique step skipped, works as well as critiqued revision.
Source paper
extracted_from (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Findings (1)
finding
- Figure 7 comparison of critiqued vs direct revisions across model sizes.
Claims (1)
claim
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Figure 5 shows that harmlessness scores rise progressively from revision 0 to revision 4.
- The paper provides evidence that AI can help supervise AI, reducing reliance on humans.
- Key prescriptive statement supporting the system-agnostic approach.
- Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads.
- Motivation for training a non-evasive assistant, and crowdworker instructions favor non-evasive responses.
- Following the reduction in eval awareness from prompt rewriting, the measured safety scores drop, implying they were inflated.
- Load-bearing summary of the paper's central contribution.