thinker:yuntao-bai
Yuntao Bai
Authored papers (2)
Constitutional AI (CAI) demonstrates that a harmless, non-evasive AI assistant can be trained using zero human feedback labels for harmlessness, replacing them entirely with AI-generated feedback guided by a short list of natural language principles. The method introduces a two-stage pipeline. In the supervised learning (SL) phase, a helpful RLHF model iteratively critiques and revises its own responses to red-team prompts drawn from 182,831 total prompts (42,496 human-written, 140,335 model-generated), and a pretrained model is then fine-tuned on the revised responses. In the reinforcement learning phase, termed RLAIF, a 52B-parameter feedback model evaluates response pairs according to 16 constitutional principles and generates preference labels that train a hybrid preference model. Crowdworker Elo scores computed from 10,274 helpfulness and 8,135 harmlessness comparisons show that RL-CAI with chain-of-thought reasoning achieves harmlessness scores that meet or exceed those of models trained with human harmlessness feedback (HH RLHF) while maintaining comparable helpfulness, a Pareto improvement over the helpfulness–harmlessness tradeoff documented in prior work. Critically, whereas HH RLHF models produced evasive refusals (e.g., 'I'm sorry, I won't respond') on sensitive PALMS and LaMDA prompts, RL-CAI models engage substantively and explain their objections. Chain-of-thought prompting on the feedback model, with CoT probabilities clamped to the 40–60% range to prevent overconfidence, further improved both harmlessness scores and label calibration. The paper argues that scaled AI supervision, which encodes alignment objectives in a transparent, auditable constitution rather than in tens of thousands of opaque human labels, is a viable path toward alignment as model capabilities grow beyond reliable human oversight.
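The 40–60% clamping mentioned in the summary can be illustrated with a minimal Python sketch. It assumes the feedback model yields a single probability that response A is preferred over response B, and that the clamped value is used as a soft preference label. The function name `clamp_preference` and its parameters are hypothetical and are not taken from the paper, whose exact procedure may differ.

```python
from typing import Tuple


def clamp_preference(p_a: float, low: float = 0.4, high: float = 0.6) -> Tuple[float, float]:
    """Clamp the probability that response A is preferred into [low, high].

    Returns soft labels (p_a, p_b) that sum to 1. The 0.4 and 0.6 defaults
    mirror the 40-60% range in the summary; everything else here is an
    illustrative assumption, not the paper's implementation.
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("p_a must be a probability in [0, 1]")
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("require 0 <= low <= high <= 1")
    clamped = min(max(p_a, low), high)
    return clamped, 1.0 - clamped


# A near-certain chain-of-thought verdict is softened to the upper bound,
# while a probability already inside the range passes through unchanged.
confident_label = clamp_preference(0.98)
moderate_label = clamp_preference(0.55)
```

Under these assumptions, a feedback model that is almost always near 0 or 1 still contributes a preference signal, but no single label can dominate preference model training.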
Affiliations (1)
- Anthropic (institute)
Co-authors (12)
- Amanda Askell (2 shared)
- Andy Jones (2 shared)
- Anna Chen (2 shared)
- Anna Goldie (2 shared)
- Azalia Mirhoseini (2 shared)
- Ben Mann (2 shared)
- Cameron McKinnon (2 shared)
- Carol Chen (2 shared)
- Catherine Olsson (2 shared)
- Christopher Olah (2 shared)
- Danny Hernandez (2 shared)
- Dario Amodei (2 shared)
Their work is cited by (3)
Recent mentions (1)
- papers-typedyuntao-2022-cat-s.md