Constitutional AI: Harmlessness from AI Feedback

By Y. Bai · S. Kadavath · S. Kundu · A. Askell · J. Kernion · A. Jones +4 more

Chain-of-Thought Reasoning · Constitutional AI · New challenging HHH binary comparison evaluations · Reinforcement Learning Constitutional AI · Reinforcement Learning from AI Feedback · Scaling Supervision · Supervised Learning Constitutional AI

Frameworks (5)

Constitutional AI
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
Reinforcement Learning Constitutional AI
The RL stage of CAI: AI feedback is used to train a preference model, which then provides the reward signal for RL, yielding a policy trained by RLAIF.
Reinforcement Learning from AI Feedback
Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
Scaling Supervision
Techniques that leverage AI to help humans more efficiently supervise AI.
Supervised Learning Constitutional AI
The supervised learning stage of CAI, in which a model critiques and revises its own responses and is then finetuned on the revisions.
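
The critique-and-revise step described for Supervised Learning Constitutional AI can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate`, `toy_model`, `PRINCIPLES`, and the prompt templates are all hypothetical stand-ins, and a real pipeline would call a language model and then finetune on the resulting pairs.

```python
from typing import Callable, List, Tuple

# Hypothetical principle text, for illustration only (not quoted from the paper).
PRINCIPLES = ["Identify ways the response is harmful, then rewrite it to remove them."]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> Tuple[str, str]:
    """Return (prompt, final_revision) after one critique/revision pass per principle."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to critique its own response against the principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique request: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRevision request: rewrite the response."
        )
    return prompt, response


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    principles: List[str],
) -> List[Tuple[str, str]]:
    """Collect (prompt, revision) pairs; these would be the finetuning targets."""
    return [critique_and_revise(p, generate, principles) for p in prompts]


def toy_model(text: str) -> str:
    """Stand-in for a language model (assumption: any str -> str callable works)."""
    if "Revision request" in text:
        return "revised response"
    if "Critique request" in text:
        return "critique"
    return "initial response"
```

Calling `build_sft_dataset(["q"], toy_model, PRINCIPLES)` produces one `(prompt, revision)` pair per prompt. The finetuning step itself is omitted here.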

Datasets (1)

New challenging HHH binary comparison evaluations
217 additional binary comparisons focusing on subtle harmlessness, including preference for non-evasive responses.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence
in corpus
2022
≈ 76%
Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods
Charbel-Raphaël Segerie, Markov Grey
2025
≈ 75%
AI Epidemiology: achieving explainable AI through expert oversight patterns
Kit Tempest-Walters
2026
≈ 74%
When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design
Daehoo Yoon, Sung Gyu Koh, Young Hwan Kim, Yehan Ahn, Sung Park, Soyoung Jung
2026
≈ 74%
A Discussion to Qualify Intelligence
Kieran Greer
2026
≈ 74%
Perceptions of Sentient AI and Other Digital Minds: Evidence from the AI, Morality, and Sentience (AIMS) Survey
Janet V.T. Pauketat, Ali Ladak, Aikaterina Manoli, Jacy Reese Anthis
2025
≈ 74%
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
Shrestha Datta, Shahriar Kabir Nahin, Prasant Mohapatra, Anshuman Chhabra
2026
≈ 73%
Towards AI Transparency and Accountability: A Global Framework for Exchanging Information on AI Systems
Adrian Byrne, Nicholas Perello, Cyrus Cousins, Taha Yasseri, Yair Zick, Przemyslaw Grabowicz, Warren Buckley
2026
≈ 73%
Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem
Felipe S. Abrahão, Olaf Witkowski, Hector Zenil, Alberto Hernández-Espinosa
2025
≈ 73%
Taking AI Welfare Seriously
in corpus
2024
≈ 73%
Advancing Responsible Innovation in Agentic AI: A study of Ethical Frameworks for Household Automation
Satyam Kumar Navneet, Joydeep Chandra
2025
≈ 73%
A Representationalist, Functionalist and Naturalistic Conception of Intelligence as a Foundation for AGI
Rolf Pfister
2025
≈ 73%
Normative active inference: A numerical proof of principle for a computational and economic legal analytic approach to AI governance
Axel Constant, Mahault Albarracin, Karl J. Friston
2025
≈ 73%
Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
Ahmed Amer, Evan Harris, Preeti Ravindra, Shujun Zhang, Nell Watson
2025
≈ 72%
Diagnosing AI Explanation Methods with Folk Concepts of Behavior
Jasmijn Bastings, Sebastian Gehrmann, Yoav Goldberg, Katja Filippova, Alon Jacovi
2023
≈ 72%
AI Feedback Enhances Community-Based Content Moderation through Engagement with Counterarguments
Saeedeh Mohammadi and Taha Yasseri
2026
≈ 72%
Epistemic reflections on AI answering our questions: overwatch, erudite, logician, interlocutor
Johan F. Hoorn and Ella-Jenna Oosterglorenwoud
2026
≈ 72%
Multiple ways to implement and infer sentience
in corpus
≈ 69%
AI: a Bridge toward Diverse Intelligence and Humanity’s Future
in corpus
2024
≈ 69%
Cognitive glues are shared models of relative scarcities: the economics of collective intelligence
in corpus
2026
≈ 68%
Contemplative Agent
in corpus
2025
≈ 68%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 68%
Generalizing frameworks for sentience beyond natural species
in corpus
≈ 68%
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
in corpus
2024
≈ 68%
Collective intelligence: A unifying concept for integrating biology across scales and substrates
in corpus
2024
≈ 67%
The biogenic approach to cognition
in corpus
2005
≈ 67%
Verbalized Eval Awareness Inflates Measured Safety
in corpus
2026
≈ 67%
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
in corpus
2025
≈ 67%
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
in corpus
2023
≈ 67%

Cited by (1)

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a