claim
active
claim:constitutional-ai-can-train-a-harmless-but-non-evasive-ai-assistant-without-any-human-harmfulness-labels
Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels.
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
Source paper
extracted_from · (2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Findings (2)
finding
- From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
- Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
Communities (2)
community
- Alive AI interface ethics & design · members_of · Explores aliveness, aesthetics, welfare, and ethical responsibility in AI interaction design.
- Methods for training safe AI systems using AI feedback instead of human labels, scaling supervision as capabilities grow.
Questions (2)
question
- Section 3.5 explicitly investigates whether skipping the critique step works as well.
- Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Constitutional AI method whose constitutions, if changed, could trigger alignment faking
- Discussion section suggests generalizability beyond harmlessness.
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
- Explicit principles replace large datasets of preference labels, enabling faster iteration.
- Defines the core concept of the paper.
- H1: Alignment training is attention training for models; Constitutional AI trains self-observation explicitly. · hypothesis · 0.816 · Confirmatory hypothesis supported at p=0.006
- Interpretive claim connecting the battery's circularity to the empirical finding
- Foundational motivation for the research.