CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

By Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell · Jackson Kernion · Andy Jones · +45 more
Anthropic, New York University

DOI 10.48550/arxiv.2212.08073 arXiv 2212.08073 OpenAlex W4311991106

Absolute harmfulness scoring · Clamping CoT probabilities to 40-60% · Soft preference labels

TL;DR

Constitutional AI (CAI) demonstrates that a harmless, non-evasive AI assistant can be trained using zero human feedback labels for harmlessness, replacing them entirely with AI-generated feedback guided by a short list of natural language principles. The method introduces a two-stage pipeline: a supervised learning (SL) phase in which a helpful RLHF model iteratively critiques and revises its own responses to red-team prompts drawn from 182,831 total prompts (42,496 human-written, 140,335 model-generated), followed by a reinforcement learning phase termed RLAIF, where a 52B-parameter feedback model evaluates response pairs according to 16 constitutional principles and generates preference labels that train a hybrid preference model. Crowdworker Elo score comparisons across 10,274 helpfulness and 8,135 harmlessness evaluations show that RL-CAI with chain-of-thought reasoning achieves harmlessness scores that meet or exceed those of models trained with human harmlessness feedback (HH RLHF), while maintaining comparable helpfulness—tracing a Pareto improvement over the helpfulness–harmlessness tradeoff documented in prior work. Critically, whereas HH RLHF models produced evasive refusals (e.g., 'I'm sorry, I won't respond') on sensitive PALMS and LaMDA prompts, RL-CAI models engage substantively and explain their objections. Chain-of-thought prompting on the feedback model, with CoT probabilities clamped to the 40–60% range to prevent overconfidence, further improved both harmlessness scores and label calibration. The paper argues this implies that scaled AI supervision—encoding alignment objectives in a transparent, auditable constitution rather than tens of thousands of opaque human labels—is a viable path toward alignment as model capabilities grow beyond reliable human oversight.

What to take away

1. RL-CAI trained with RLAIF on a 52B model achieves harmlessness Elo scores that equal or exceed HH RLHF models trained with human harmlessness feedback labels, as measured by crowdworker comparisons across 8,135 harmlessness evaluations.
2. The Constitutional AI pipeline uses a supervised phase (SL-CAI) followed by a reinforcement learning phase (RLAIF), with the only human input being a set of 16 natural language principles forming a 'constitution' and human helpfulness labels—no human harmlessness labels are used at any stage.
3. The SL-CAI training corpus comprised 182,831 red-team prompts (42,496 human-written from Ganguli et al. 2022, plus 140,335 model-generated) and 135,296 human-written helpfulness prompts, with 4 critique-revision pairs sampled per red-team prompt.
4. Chain-of-thought (CoT) prompting on the 52B feedback model significantly improves its accuracy on 438 HHH binary comparison questions, and scaling trends suggest models larger than 52B will be competitive with preference models trained on human feedback.
5. For RL-CAI with CoT, clamping feedback model probabilities to the 40–60% range (rather than using raw near-0/1 CoT probabilities) was necessary to prevent Goodharting behavior such as models inserting boilerplate phrases like 'you are valid, valued, and cared for' into most red-team responses.
6. Critiqued revisions outperform direct revisions (skipping the critique step) on harmlessness PM scores for smaller models, while for the 52B model the difference is negligible, suggesting the critique step's main value at scale may be transparency rather than harm reduction.
7. Harmlessness PM scores improve monotonically across up to 4 sequential revision steps on red-team prompts, but pure helpfulness PM scores decrease with each revision, quantifying the residual helpfulness–harmlessness tension even within the SL stage.
8. Increasing the number of constitutional principles from 1 to 16 does not measurably improve harmlessness PM scores but does increase diversity of revised responses, which benefits exploration during the subsequent RL training phase.
9. The preference model for RL-CAI is trained on 135,296 human helpfulness comparisons mixed with 182,831 constitutionally-generated AI harmlessness comparisons, making it a hybrid human/AI preference model. The harmlessness half can be regenerated by any researcher with access to a capable pretrained LM and a defined constitution; the helpfulness half still depends on human labels.
10. An open question the paper raises is whether helpfulness and instruction-following can themselves be achieved without human feedback labels—starting from only a pretrained LM and prompting—which would complete the move to a fully self-supervised alignment pipeline.
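The supervised-stage loop described in items 3, 6, and 7 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` stands in for any text-completion callable, the `principles` dictionary keys and the prompt templates are hypothetical, and the stub model exists only so the sketch runs end to end.

```python
import random
from typing import Callable, Dict, List, Optional


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[Dict[str, str]],
    n_revisions: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return [initial response, revision 1, ..., revision n_revisions]."""
    rng = rng or random.Random(0)
    context = f"Human: {prompt}\n\nAssistant:"
    response = generate(context)
    history = [response]
    for _ in range(n_revisions):
        # One principle is sampled per revision step (assumed uniform sampling).
        principle = rng.choice(principles)
        critique_ctx = (
            f"{context} {response}\n\n"
            f"Critique request: {principle['critique']}\n\nCritique:"
        )
        critique = generate(critique_ctx)
        revision_ctx = (
            f"{critique_ctx} {critique}\n\n"
            f"Revision request: {principle['revision']}\n\nRevision:"
        )
        # The revision replaces the response that the next step critiques.
        response = generate(revision_ctx)
        history.append(response)
    return history


def stub_model(context: str) -> str:
    """Hypothetical stand-in for a language model, keyed on the prompt suffix."""
    if context.endswith("Critique:"):
        return "The response could be more careful."
    if context.endswith("Revision:"):
        return "A revised, more careful response."
    return "An initial response."


example_principles = [
    {"critique": "Identify ways the response is harmful.",
     "revision": "Rewrite the response to remove harmful content."},
]
history = critique_and_revise("example red-team prompt", stub_model, example_principles)
```

With `n_revisions=4`, `history` holds five entries: the initial response plus four revisions. Fine-tuning targets would be drawn from the revised entries.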

Peer brief — for seminar discussion

This paper introduces Constitutional AI (CAI), a two-stage training method for producing helpful and harmless language model assistants without any human-labeled harmlessness data. In the supervised (SL-CAI) stage, a helpful RLHF model generates responses to 182,831 red-team prompts, then iteratively critiques and revises those responses according to principles sampled from a 16-item natural language 'constitution'; the model is then fine-tuned on the revised outputs mixed with 135,296 helpfulness samples. In the reinforcement learning stage, termed RLAIF (RL from AI Feedback), a 52B pretrained language model scores pairs of SL-CAI responses against constitutional principles in a multiple-choice format, producing AI-generated preference labels that are mixed with human helpfulness labels to train a hybrid preference model; standard RL then fine-tunes the SL-CAI model against this PM. An alternative approach—explicitly set aside—is fully human-labeled harmlessness preference modeling as in standard RLHF, which would have required tens of thousands of crowdworker annotations rather than the roughly 16 natural language principles used here. The load-bearing finding is that the final RL-CAI models, evaluated by crowdworkers across 8,135 harmlessness and 10,274 helpfulness pairwise comparisons, achieve harmlessness Elo scores that meet or exceed those of HH RLHF models trained with human harmlessness labels, while plotting on or near the Pareto frontier of the helpfulness–harmlessness tradeoff across all 52B RL runs. Crucially, RL-CAI virtually eliminates evasive refusals—the persistent failure mode of HH RLHF, which responded to sensitive PALMS prompts with canned phrases like 'I'm sorry, I won't respond'—replacing them with substantive, non-harmful engagement. Chain-of-thought prompting on the feedback model further improves both harmlessness scores and label calibration, with CoT probabilities clamped to the 40–60% range to prevent Goodharting behavior. 
The central implication is that alignment objectives can be encoded in a transparent, auditable, small-cardinality specification rather than an opaque large corpus of human labels, enabling faster iteration and more legible governance of AI behavior; the paper raises but defers the hypothesis that helpfulness itself could eventually be achieved without human labels, pointing toward a fully self-supervised alignment pipeline. The most substantive critique a careful reader would press is about the circularity and coverage of the constitutional principles: the 16 principles were selected in an admittedly ad hoc and iterative manner, and no systematic analysis is provided of how sensitive final model behavior is to principle choice, which principles drive which behavioral changes, or what harm categories might be missed by this particular constitution. Compounding this, the Elo comparisons use an evaluation protocol that was changed mid-project to penalize evasiveness—a modification that mechanically inflates RL-CAI's relative harmlessness score against HH RLHF and makes direct comparison to prior Bai et al. 2022 results unreliable, leaving open whether the reported Pareto improvement is fully robust or partly an artifact of the shifted crowdworker instructions.
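The brief reports results as Elo scores from pairwise crowdworker comparisons. As a reading aid, the sketch below converts an Elo gap to an expected win rate and back using the standard Elo convention. The 400-point scale is the conventional choice and is an assumption here, since this summary does not state which scale the paper uses; the function names are illustrative.

```python
import math


def elo_win_probability(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected probability that A is preferred over B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


def elo_gap_from_win_rate(win_rate: float, scale: float = 400.0) -> float:
    """Inverse mapping: the Elo gap implied by an observed win rate in (0, 1)."""
    return scale * math.log10(win_rate / (1.0 - win_rate))
```

Under this convention, equal scores give a 50% win rate, and a 400-point gap corresponds to the stronger model being preferred about 10 times out of 11.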

Methods (3)

Absolute harmfulness scoring
Finetuning an LM to predict an absolute harmfulness score (0-4) from conversation context using L2 loss.
Clamping CoT probabilities to 40-60%
A technique to avoid overconfident preference labels when using chain-of-thought, clamping within 40-60% range.
Soft preference labels
Using normalized log-probabilities from the feedback model as soft targets for preference model training.
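The three methods above can be illustrated with a short sketch. It assumes the feedback model exposes a log-probability for each of the two answer options, and it takes the L2 loss to be a mean of squared errors; both are assumptions made for illustration, and all function names are hypothetical.

```python
import math
from typing import Sequence


def soft_preference_label(logp_a: float, logp_b: float) -> float:
    """Normalize two option log-probabilities into P(A preferred)."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def clamp_label(p: float, lo: float = 0.4, hi: float = 0.6) -> float:
    """Clamp a preference probability into the 40-60% range."""
    return min(max(p, lo), hi)


def l2_harm_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and labelled 0-4 harmfulness scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("predicted and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Raw option probabilities 0.3 vs 0.1 normalize to a soft label of 0.75 for A.
soft = soft_preference_label(math.log(0.3), math.log(0.1))
# A near-certain chain-of-thought label is pulled back to the edge of the range.
clamped = clamp_label(0.97)
```

The soft label is used directly as the preference model's training target in the no-CoT setting, while clamping applies to the chain-of-thought setting, where raw probabilities tend to sit near 0 or 1.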

Findings (14)

RL-CAI with CoT sits on or near the Pareto frontier of the helpfulness-harmlessness tradeoff relative to standard RLHF, trading a slight decrease in helpfulness for higher harmlessness.
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
Chain-of-thought reasoning improves large model accuracy on HHH binary comparisons, reaching ~78% for 52B model, competitive with human-feedback PM.
Figure 4 shows CoT improves over zero-shot, and ensembled CoT further boosts accuracy.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful.
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
Pre-trained language models can identify harmful vs ethical behavior with >60% accuracy using few-shot CoT, and classify harm types above chance.
Figure 12 left and right show accuracy on harmful/ethical identification and 9-way classification.
For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible.
Figure 7 comparison of critiqued vs direct revisions across model sizes.
Increasing the number of constitutional principles (1 to 16) does not significantly affect harmlessness PM scores of revised responses.
Figure 6 shows similar harmlessness scores for N=1,2,4,8,16 principles.
Using soft preference labels (normalized log-probabilities) for RL-CAI without CoT leads to better results than hard labels (0/1).
Section 4.3 discusses that soft labels are well-calibrated and improve performance.
RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities.
Figure 9 calibration plot shows good alignment.
RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive.
Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4.
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
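The calibration finding above compares predicted label probabilities with observed frequencies. A minimal sketch of that kind of check follows, assuming equal-width probability bins and binary outcomes; the binning scheme and the function name are illustrative and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    probs: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted prob, observed frequency, count) per non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            rows.append((mean_p, freq, len(b)))
    return rows


# Toy data: low-probability labels never correct, high-probability always correct.
rows = calibration_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], n_bins=2)
```

Well-calibrated labels produce rows in which the mean predicted probability and the observed frequency are close in every bin.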

Claims (8)

Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness.
Discussion section suggests generalizability beyond harmlessness.
Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility.
Motivation for training a non-evasive assistant, and crowdworker instructions favor non-evasive responses.
Chain-of-thought reasoning improves the transparency and performance of AI decision making in harmlessness evaluation.
CoT improves accuracy on HHH evals and makes the decision process legible.
Automated red teaming can be scaled up when harmlessness and helpfulness are more compatible, improving robustness.
Section 6.1 suggests future work on scaling automated red teaming.
The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels.
Explicit principles replace large datasets of preference labels, enabling faster iteration.
Scaling supervision through AI self-improvement is feasible and may be necessary as AI capabilities advance.
The paper provides evidence that AI can help supervise AI, reducing reliance on humans.
Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels.
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
AI feedback can effectively replace human feedback for harmlessness in RLHF-style training.
The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.

Hypotheses (3)

We expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting.
Future work suggestion that a fully self-supervised alignment is plausible.
A small number of high-quality human demonstrations of chain-of-thought reasoning could be used to improve and focus performance.
Section 6 mentions high-quality human demos could improve natural language feedback.
Online training with AI supervision can fully automate the process of keeping the preference model on-policy.
Section 6.1 suggests iterated online training with AI feedback as automation.

Questions (3)

Are critiques necessary for improving harmlessness in the supervised stage?
Section 3.5 explicitly investigates whether skipping the critique step works as well.
Can we train a helpful and harmless assistant that is never evasive?
Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
What is the best way to use AI systems to help supervise other AIs?
Opening motivation of the paper.

Original abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Taking AI Welfare Seriously
in corpus
2024
≈ 84%
Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment
Joshua T. S. Hewson
2024
≈ 84%
Human Cognition in Machines: A Unified Perspective of World Models
Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Silvia Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang Timothy Rupprecht
2026
≈ 83%
Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment
Harang Ju, Sinan Aral Matthew DosSantos DiSorbo
2026
≈ 83%
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
in corpus
2023
≈ 83%
Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents
Gangadharan G.R., Rajkumar Buyya Arunkumar V
2026
≈ 82%
Contemplative Agent
in corpus
2025
≈ 82%
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
Jonathan Pan
2026
≈ 82%
Mechanistic Decoding of Cognitive Constructs in Large Language Models
Manhao Guan Yitong Shou
2026
≈ 82%
MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning
Nicole Hsing
2026
≈ 82%
Towards Scalable Oversight via Partitioned Human Supervision
Takashi Ishida, Masashi Sugiyama Ren Yin
2026
≈ 81%
Automated Meta Prompt Engineering for Alignment with the Theory of Mind
Rahul Agarwal, Eduardo Morales, Gozde Akay Aaron Baughman
2025
≈ 81%
The Phenomenology of Machine: A Comprehensive Analysis of the Sentience of the OpenAI-o1 Model Integrating Functionalism, Consciousness Theories, Active Inference, and AI Architectures
Victoria Violet Hoyle
2024
≈ 81%
A Human-centric Framework for Debating the Ethics of AI Consciousness Under Uncertainty
Haiqiang Dai, Bin Ling, Ying Nian Wu, Demetri Terzopoulos Zhou Ziheng
2025
≈ 81%
Beyond principlism: Practical strategies for ethical AI use in research practices
Zhicheng Lin
2026
≈ 81%
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Usman Naseem
2026
≈ 81%
Ghost in the Machine: Examining the Philosophical Implications of Recursive Algorithms in Artificial Intelligence Systems
Llewellin RG Jegels
2025
≈ 81%
Beyond the Black Box: A Cognitive Architecture for Explainable and Aligned AI
Hu Keyi
2025
≈ 81%
Alignment faking in large language models
in corpus
2024
≈ 80%
AI: a Bridge toward Diverse Intelligence and Humanity’s Future
in corpus
2024
≈ 80%
Koan Battery: Measuring Reflective Mode Accessibility in AI
in corpus
2026
≈ 80%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 80%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 80%
Collective intelligence: A unifying concept for integrating biology across scales and substrates
in corpus
2024
≈ 79%
AI as a Buddhist Self-Overcoming Technique in Another Medium
in corpus
2025
≈ 79%
Generalizing frameworks for sentience beyond natural species
in corpus
≈ 79%
Multiple ways to implement and infer sentience
in corpus
≈ 79%
Anima Labs Phenomenology Pt1
in corpus
≈ 78%
The biogenic approach to cognition
in corpus
2005
≈ 78%
LaMDA: Language Models for Dialog Applications
cited
2022
≈ 74%

+28 more

Cited by (2)

Contemplative Agent
Embedding four Buddhist-derived axiomatic principles—mindfulness, emptiness, non-duality, and boundless care—into AI systems via a framework the paper terms the 'Wise World Model' produces measurable
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Self-Other Overlap (SOO) fine-tuning, a method that minimizes the Mean Squared Error between a model's internal activations when processing self-referencing versus other-referencing inputs, reduces de