claim

active

claim:constitutional-ai-can-train-a-harmless-but-non-evasive-ai-assistant-without-any-human-harmfulness-labels

Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels.

This is the paper's central claim, supported by findings that RL-CAI models are rated more harmless than HH RLHF models while remaining non-evasive.

Source paper

extracted_from

Constitutional AI: Harmlessness from AI Feedback

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Findings (2)

finding

RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI.
supports
Figures 3 and 8 show RL-CAI achieving significantly higher harmlessness Elo scores.
RL-CAI with CoT shows a Pareto improvement in the helpfulness-harmlessness tradeoff over standard RLHF: helpfulness decreases slightly while harmlessness is higher.
supports
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.

Communities (2)

community

Alive AI interface ethics & design
members_of
Explores aliveness, aesthetics, welfare, and ethical responsibility in AI interaction design.
AI-supervised alignment and scalable oversight
members_of
Methods for training safe AI systems using AI feedback instead of human labels, scaling supervision as capabilities grow.

Questions (2)

question

Are critiques necessary for improving harmlessness in the supervised stage?
gates
Section 3.5 explicitly investigates whether skipping the critique step works as well.
Can we train a helpful and harmless assistant that is never evasive?
gates
Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
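
The filter described above can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names, the embedding dictionary, and the edge-list representation are all assumptions; only the 0.65 cosine threshold and the "no typed edge" condition come from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge to target.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: iterable of (source_id, target_id) pairs (hypothetical layout)
    """
    target = embeddings[target_id]
    # Collect every entity already connected to the target by a typed edge.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}
    candidates = []
    for other, vec in embeddings.items():
        if other == target_id or other in linked:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            candidates.append((other, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy data: "c" is similar but already linked, "b" is dissimilar, "a" qualifies.
toy_embeddings = {"claim": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.0]}
toy_edges = [("claim", "c")]
toy_result = related_by_similarity("claim", toy_embeddings, toy_edges)
```

With the toy data, only entity "a" is returned: "c" is excluded because it already has a typed edge, and "b" falls below the threshold.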

Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b) · concept · 0.848
Constitutional AI method whose constitutions, if changed, could trigger alignment faking
Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness. · claim · 0.848
Discussion section suggests generalizability beyond harmlessness.
Bai et al. 2022: Constitutional AI — harmlessness from AI feedback · concept · 0.844
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels. · claim · 0.832
Explicit principles replace large datasets of preference labels, enabling faster iteration.
The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. · quote · 0.817
Defines the core concept of the paper.
H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly. · hypothesis · 0.816
Confirmatory hypothesis supported at p=0.006
Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift. · claim · 0.809
Interpretive claim connecting the battery's circularity to the empirical finding
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. · quote · 0.795
Foundational motivation for the research.