question

active

question:are-critiques-necessary-for-improving-harmlessness-in-the-supervised-stage

Are critiques necessary for improving harmlessness in the supervised stage?

Section 3.5 of the source paper tests whether generating revisions directly, with the critique step skipped, improves harmlessness as much as critiqued revisions do.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Findings (1)

finding

For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible.
answered_by
Figure 7 comparison of critiqued vs direct revisions across model sizes.

Claims (1)

claim

Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels.
gates
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Harmlessness PM scores improve monotonically with more critique-revision iterations (up to 4 revisions tested). · finding · 0.795
Figure 5 shows that revisions 0 through 4 yield progressively higher harmlessness scores.
Scaling supervision through AI self-improvement is feasible and may be necessary as AI capabilities advance. · claim · 0.753
The paper provides evidence that AI can help supervise AI, reducing reliance on humans.
If behaviour is the window to sentience, evaluation criteria must focus on observable response patterns without reference to the means by which they are produced. · quote · 0.731
Key prescriptive statement supporting the system-agnostic approach.
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4. · finding · 0.726
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.725
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility. · claim · 0.722
Motivation for training a non-evasive assistant, and crowdworker instructions favor non-evasive responses.
Safety scores decrease when prompts are rewritten to remove suspicious cues. · finding · 0.721
Following the reduction in eval awareness from prompt rewriting, the measured safety scores drop, implying they were inflated.
"the self-prior can serve as an internal criterion for the mark-directed behavior observed in the mirror test, offering a computational basis for investigating the developmental origins of self-awareness" · quote · 0.720
Load-bearing summary of the paper's central contribution
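
The selection rule behind the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is an illustrative sketch only, not the page's actual implementation: the function names, the dict-of-vectors embedding store, the edge-pair format, and the toy data are all assumptions made for the example.

```python
import math

THRESHOLD = 0.65  # taken from the "cosine ≥ 0.65" label on this page


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=THRESHOLD):
    """Return (entity_id, score) pairs that clear the threshold and have no
    typed edge to target_id, sorted by descending score.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical iterable of (entity_id, entity_id) pairs
    """
    target = embeddings[target_id]
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    scored = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            scored.append((entity_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): "q" is the question node, "f1" already has a typed edge to it.
toy_embeddings = {
    "q": [1.0, 0.0],
    "f1": [1.0, 0.1],
    "f2": [0.9, 0.1],
    "c1": [0.8, 0.6],
    "x": [0.0, 1.0],
}
toy_edges = [("q", "f1")]
result = related_by_similarity("q", toy_embeddings, toy_edges)
for entity_id, score in result:
    print(f"{entity_id} {score:.3f}")
```

In this toy run, "f1" is skipped because it already has a typed edge and "x" falls below the threshold, so only "f2" and "c1" are listed as candidates for new edges or duplicates.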