Yuntao Bai

Authored papers (2)

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (2022)
Constitutional AI (CAI) demonstrates that a harmless, non-evasive AI assistant can be trained with no human feedback labels for harmlessness, replacing them entirely with AI-generated feedback guided by a short list of natural language principles. The method introduces a two-stage pipeline: a supervised learning (SL) phase in which a helpful RLHF model iteratively critiques and revises its own responses to red-team prompts drawn from 182,831 total prompts (42,496 human-written, 140,335 model-generated), followed by a reinforcement learning phase termed RLAIF, where a 52B-parameter feedback model evaluates response pairs according to 16 constitutional principles and generates preference labels that train a hybrid preference model. Crowdworker Elo score comparisons across 10,274 helpfulness and 8,135 harmlessness evaluations show that RL-CAI with chain-of-thought reasoning achieves harmlessness scores that meet or exceed those of models trained with human harmlessness feedback (HH RLHF) while maintaining comparable helpfulness, a Pareto improvement over the helpfulness–harmlessness tradeoff documented in prior work. Critically, whereas HH RLHF models produced evasive refusals (e.g., 'I'm sorry, I won't respond') on sensitive PALMS and LaMDA prompts, RL-CAI models engage substantively and explain their objections. Chain-of-thought (CoT) prompting on the feedback model, with CoT probabilities clamped to the 40–60% range to prevent overconfidence, further improved both harmlessness scores and label calibration. The paper argues that scaled AI supervision, which encodes alignment objectives in a transparent, auditable constitution rather than in tens of thousands of opaque human labels, is a viable path toward alignment as model capabilities grow beyond reliable human oversight.
Constitutional AI: Harmlessness from AI feedback (2022)
referenced-only

Affiliations (1)

Anthropic (institute)

Co-authors (12)

Amanda Askell (2 shared)
Andy Jones (2 shared)
Anna Chen (2 shared)
Anna Goldie (2 shared)
Azalia Mirhoseini (2 shared)
Ben Mann (2 shared)
Cameron McKinnon (2 shared)
Carol Chen (2 shared)
Catherine Olsson (2 shared)
Christopher Olah (2 shared)
Danny Hernandez (2 shared)
Dario Amodei (2 shared)

Their work is cited by (3)

Recent mentions (1)

papers-typed
yuntao-2022-cat-s.md