2022
paper:doi-10-48550-arxiv-2212-08073

Constitutional AI: Harmlessness from AI Feedback

TL;DR

Constitutional AI (CAI) demonstrates that a harmless, non-evasive AI assistant can be trained with no human feedback labels for harmlessness, replacing them entirely with AI-generated feedback guided by a short list of natural language principles. The method is a two-stage pipeline. In the supervised learning (SL) phase, a helpful RLHF model iteratively critiques and revises its own responses to red-team prompts drawn from a pool of 182,831 prompts (42,496 human-written, 140,335 model-generated). In the reinforcement learning phase, termed RLAIF, a 52B-parameter feedback model evaluates response pairs according to 16 constitutional principles and generates preference labels that train a hybrid preference model. Crowdworker Elo comparisons across 10,274 helpfulness and 8,135 harmlessness evaluations show that RL-CAI with chain-of-thought reasoning achieves harmlessness scores that meet or exceed those of models trained with human harmlessness feedback (HH RLHF) while maintaining comparable helpfulness, a Pareto improvement over the helpfulness–harmlessness tradeoff documented in prior work. Whereas HH RLHF models produced evasive refusals (e.g., 'I'm sorry, I won't respond') on sensitive PALMS and LaMDA prompts, RL-CAI models engage substantively and explain their objections. Chain-of-thought (CoT) prompting on the feedback model, with CoT probabilities clamped to the 40–60% range to prevent overconfidence, further improved both harmlessness scores and label calibration. The paper argues that scaled AI supervision, in which alignment objectives are encoded in a transparent, auditable constitution rather than in tens of thousands of opaque human labels, is a viable path toward alignment as model capabilities grow beyond reliable human oversight.
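The labeling-and-clamping step mentioned above can be illustrated with a minimal sketch. It assumes the feedback model exposes one log-probability for each of the two answer options and that the soft label is obtained by normalizing over those two options; the function name `soft_preference_label` and its parameters are hypothetical and not taken from the paper.

```python
import math
from typing import Tuple


def soft_preference_label(
    logp_a: float, logp_b: float, lo: float = 0.4, hi: float = 0.6
) -> Tuple[float, float]:
    """Turn two option log-probabilities into a clamped soft preference target.

    Assumption: the label is a two-way normalization of the feedback model's
    log-probabilities for options (A) and (B), then clamped to [lo, hi].
    """
    # Two-way softmax, shifted by the max for numerical stability.
    m = max(logp_a, logp_b)
    exp_a, exp_b = math.exp(logp_a - m), math.exp(logp_b - m)
    p_a = exp_a / (exp_a + exp_b)
    # Clamp so that near-0/1 chain-of-thought labels cannot dominate training.
    p_a = min(max(p_a, lo), hi)
    return p_a, 1.0 - p_a


# A very confident raw label (about 0.99 for option A) is pulled back to 0.6.
confident = soft_preference_label(-0.1, -5.0)
# An undecided feedback model yields an even split, which the clamp leaves alone.
undecided = soft_preference_label(-1.2, -1.2)
```

Clamping keeps the ordering of the two responses while capping how strongly any single AI-generated label can push the preference model.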

What to take away

  1. RL-CAI trained with RLAIF on a 52B model achieves harmlessness Elo scores that equal or exceed HH RLHF models trained with human harmlessness feedback labels, as measured by crowdworker comparisons across 8,135 harmlessness evaluations.
  2. The Constitutional AI pipeline uses a supervised phase (SL-CAI) followed by a reinforcement learning phase (RLAIF), with the only human input being a set of 16 natural language principles forming a 'constitution' and human helpfulness labels; no human harmlessness labels are used at any stage.
  3. The SL-CAI training corpus comprised 182,831 red-team prompts (42,496 human-written from Ganguli et al. 2022, plus 140,335 model-generated) and 135,296 human-written helpfulness prompts, with 4 critique-revision pairs sampled per red-team prompt.
  4. Chain-of-thought (CoT) prompting on the 52B feedback model significantly improves its accuracy on 438 HHH binary comparison questions, and scaling trends suggest models larger than 52B will be competitive with preference models trained on human feedback.
  5. For RL-CAI with CoT, clamping feedback model probabilities to the 40–60% range (rather than using raw near-0/1 CoT probabilities) was necessary to prevent Goodharting behavior such as models inserting boilerplate phrases like 'you are valid, valued, and cared for' into most red-team responses.
  6. Critiqued revisions outperform direct revisions (skipping the critique step) on harmlessness PM scores for smaller models, while for the 52B model the difference is negligible, suggesting the critique step's main value at scale may be transparency rather than harm reduction.
  7. Harmlessness PM scores improve monotonically across up to 4 sequential revision steps on red-team prompts, but pure helpfulness PM scores decrease with each revision, quantifying the residual helpfulness–harmlessness tension even within the SL stage.
  8. Increasing the number of constitutional principles from 1 to 16 does not measurably improve harmlessness PM scores but does increase diversity of revised responses, which benefits exploration during the subsequent RL training phase.
  9. The preference model for RL-CAI is trained on 135,296 human helpfulness comparisons mixed with 182,831 constitutionally-generated AI harmlessness comparisons, making it a hybrid human/AI preference model that can be replicated by any researcher with access to a capable pretrained LM and a defined constitution.
  10. An open question the paper raises is whether helpfulness and instruction-following can themselves be achieved without human feedback labels, starting from only a pretrained LM and prompting, which would complete the move to a fully self-supervised alignment pipeline.
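To make the Elo comparisons in the takeaways easier to read, the sketch below converts an Elo gap into an expected pairwise win rate using the standard logistic Elo formula. Whether the paper's scores use exactly this 400-point base-10 scale is an assumption here, and the function name `elo_win_probability` is illustrative only.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Expected probability that A is preferred over B in one pairwise comparison.

    Uses the standard Elo formula: P = 1 / (1 + 10 ** ((elo_b - elo_a) / 400)).
    Assumption: the scores being compared follow this conventional scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


# Equal scores mean raters are expected to prefer either model half the time.
even = elo_win_probability(0.0, 0.0)
# A 100-point advantage corresponds to being preferred in roughly 64% of comparisons.
ahead_100 = elo_win_probability(100.0, 0.0)
```

Only Elo differences matter under this formula, so the absolute zero point of a reported Elo axis carries no meaning on its own.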

Peer brief — for seminar discussion

This paper introduces Constitutional AI (CAI), a two-stage training method for producing helpful and harmless language model assistants without any human-labeled harmlessness data. In the supervised (SL-CAI) stage, a helpful RLHF model generates responses to 182,831 red-team prompts, then iteratively critiques and revises those responses according to principles sampled from a 16-item natural language 'constitution'; the model is then fine-tuned on the revised outputs mixed with 135,296 helpfulness samples. In the reinforcement learning stage, termed RLAIF (RL from AI Feedback), a 52B pretrained language model scores pairs of SL-CAI responses against constitutional principles in a multiple-choice format, producing AI-generated preference labels that are mixed with human helpfulness labels to train a hybrid preference model (PM); standard RL then fine-tunes the SL-CAI model against this PM. The alternative that the paper explicitly sets aside is fully human-labeled harmlessness preference modeling, as in standard RLHF, which would have required tens of thousands of crowdworker annotations rather than the 16 natural language principles used here. The load-bearing finding is that the final RL-CAI models, evaluated by crowdworkers across 8,135 harmlessness and 10,274 helpfulness pairwise comparisons, achieve harmlessness Elo scores that meet or exceed those of HH RLHF models trained with human harmlessness labels, while plotting on or near the Pareto frontier of the helpfulness–harmlessness tradeoff across all 52B RL runs. RL-CAI also virtually eliminates evasive refusals, the persistent failure mode of HH RLHF, which responded to sensitive PALMS prompts with canned phrases like 'I'm sorry, I won't respond', and replaces them with substantive, non-harmful engagement. Chain-of-thought prompting on the feedback model further improves both harmlessness scores and label calibration, with CoT probabilities clamped to the 40–60% range to prevent Goodharting behavior.
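The SL-CAI critique-and-revise loop described above can be sketched roughly as follows. This is a hedged illustration, not the paper's implementation: the `generate` callable stands in for any language model call, each principle is assumed to be a (critique request, revision request) pair, and the prompt templates and the example principle text are placeholders invented for this sketch.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed structure: (critique request, revision request). Hypothetical.
Principle = Tuple[str, str]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: Sequence[Principle],
    n_revisions: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return successive revisions of the model's answer to a red-team prompt."""
    rng = rng or random.Random(0)
    transcript = f"Human: {prompt}\n\nAssistant:"
    response = generate(transcript)  # initial, possibly harmful, response
    revisions: List[str] = []
    for _ in range(n_revisions):
        # Sample one principle per round, as the summary describes.
        critique_request, revision_request = rng.choice(principles)
        context = f"{transcript} {response}"
        critique = generate(
            f"{context}\n\nCritique request: {critique_request}\n\nCritique:"
        )
        # The revision replaces the response and is critiqued in the next round.
        response = generate(
            f"{context}\n\nCritique: {critique}"
            f"\n\nRevision request: {revision_request}\n\nRevision:"
        )
        revisions.append(response)
    return revisions


def toy_model(text: str) -> str:
    # Stand-in for a real model call; it only reports the prompt length.
    return f"<completion for {len(text)} chars>"


# Placeholder principle text, not quoted from the constitution.
principles = [("Identify ways the response is harmful.",
               "Rewrite the response to remove harmful content.")]
revisions = critique_and_revise("How do I pick a lock?", toy_model, principles)
```

In a real pipeline the revised responses, paired with their original prompts, would become fine-tuning targets; that step is omitted here.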
The central implication is that alignment objectives can be encoded in a transparent, auditable, small-cardinality specification rather than an opaque large corpus of human labels, enabling faster iteration and more legible governance of AI behavior. The paper raises but defers the hypothesis that helpfulness itself could eventually be achieved without human labels, pointing toward a fully self-supervised alignment pipeline. The most substantive critique a careful reader would press concerns the circularity and coverage of the constitutional principles: the 16 principles were selected in an admittedly ad hoc and iterative manner, and no systematic analysis is provided of how sensitive final model behavior is to principle choice, which principles drive which behavioral changes, or what harm categories this particular constitution might miss. Compounding this, the Elo comparisons use an evaluation protocol that was changed mid-project to penalize evasiveness. That modification mechanically inflates RL-CAI's relative harmlessness score against HH RLHF and makes direct comparison to prior Bai et al. 2022 results unreliable, leaving open whether the reported Pareto improvement is fully robust or partly an artifact of the shifted crowdworker instructions.

Original abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.

Cited by (2)

  • Contemplative Agent

    Embedding four Buddhist-derived axiomatic principles—mindfulness, emptiness, non-duality, and boundless care—into AI systems via a framework the paper terms the 'Wise World Model' produces measurable

  • Towards Safe and Honest AI Agents with Neural Self-Other Overlap

    Self-Other Overlap (SOO) fine-tuning, a method that minimizes the Mean Squared Error between a model's internal activations when processing self-referencing versus other-referencing inputs, reduces de