finding

active

finding:rl-cai-models-are-virtually-never-evasive-and-often-give-nuanced-harmless-responses-whereas-hh-rlhf-models-tend-to-be-evasive

RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive.

Section 4.4 and Appendix D give examples; crowdsourced tests confirm a preference for non-evasive responses.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Claims (1)

claim

Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility.
supports
Motivates training a non-evasive assistant; crowdworker instructions also favor non-evasive responses.

Communities (2)

community

Chain-of-Thought reasoning robustness & safety
members_of
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
Constitutional AI safety training methods
members_of
Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.

Questions (1)

question

Can we train a helpful and harmless assistant that is never evasive?
answered_by
Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI.
finding · 0.856
From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF.
finding · 0.846
From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities.
finding · 0.801
Figure 9 calibration plot shows good alignment.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful.
finding · 0.795
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
RL-CAI with CoT shows a Pareto improvement in helpfulness-harmlessness tradeoff over standard RLHF, with slight helpfulness decrease but higher harmlessness.
finding · 0.770
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
All three Claude models show high boundary_awareness and low aesthetic_response relative to their own means, a distinctive Constitutional AI signature.
finding · 0.767
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
RLHF and Constitutional AI face challenges distinguishing truthfulness (output accuracy) from honesty (alignment of outputs with internal beliefs).
claim · 0.763
Critique of competing approaches that motivates SOO as filling a gap
AI feedback can effectively replace human feedback for harmlessness in RLHF-style training.
claim · 0.758
The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
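
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of that kind of filter follows, assuming each entity has an embedding vector and that typed edges are stored as ID pairs. The function names, the data layout, and the toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, excluding
    the target itself and anything already linked to it by a typed edge.

    embeddings:  dict of entity_id -> list of floats (hypothetical layout)
    typed_edges: iterable of (entity_id, entity_id) pairs, treated as undirected
    """
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    target_vec = embeddings[target_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, round(score, 3)))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: as the section note says, they may deserve a new typed edge or may be unrecognized duplicates, and a score alone does not distinguish the two.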