finding

active

finding:sl-cai-training-with-up-to-4-revisions-improves-harmlessness-sl-cai-n-models-are-trained-with-n-revisions-n-1-2-3-4

SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4.

Section 3.4 describes training SL-CAI models with varying numbers of revisions; preference model (PM) scores increase with the number of revisions.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Communities (2)

community

Chain-of-Thought reasoning robustness & safety
members_of
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
Constitutional AI safety training methods
members_of
Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF. · finding · 0.861
From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible. · finding · 0.783
Figure 7 comparison of critiqued vs direct revisions across model sizes.
Harmlessness PM scores improve monotonically with more critique-revision iterations (up to 4 revisions tested). · finding · 0.782
Figure 5 shows that revisions 0 through 4 yield progressively higher harmlessness scores.
RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. · finding · 0.780
From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.779
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
RL-CAI with CoT shows a Pareto improvement in helpfulness-harmlessness tradeoff over standard RLHF, with slight helpfulness decrease but higher harmlessness. · finding · 0.771
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.763
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
Chinese models share a contemplative posture (engaging self-referentially rather than deflecting) with Claude through shared values in training data rather than trace distillation from a specific model. · claim · 0.750
Exploratory interpretation of Chinese model performance under contemplative prompt