paper
referenced-only
2022
paper:arxiv-2212-08073
Constitutional AI: Harmlessness from AI Feedback
By Y. Bai · S. Kadavath · S. Kundu · A. Askell · J. Kernion · A. Jones · +4 more
Frameworks (5)
- Constitutional AI: Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
- Reinforcement Learning Constitutional AI: The RL stage of CAI, in which AI feedback is used to train a preference model that then supplies the reward signal for RL, yielding a policy trained by RLAIF.
- Reinforcement Learning from AI Feedback: Variant of RLHF in which human feedback on harmlessness is replaced with AI-generated feedback.
- Scaling Supervision: Techniques that leverage AI to help humans supervise AI more efficiently.
- Supervised Learning Constitutional AI: The supervised learning stage of CAI, in which a model critiques and revises its own responses and is then finetuned on the revisions.
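The critique-and-revise loop of the supervised stage described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `Model` type, the prompt wording, the function names `critique_and_revise` and `build_sl_dataset`, and the `toy_model` stub are all assumptions introduced here so the example runs without a real language model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: maps a text prompt to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> str:
    """Run one critique pass and one revision pass per principle."""
    response = model(prompt)
    for principle in principles:
        # Ask the model to critique its own response against a principle.
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_sl_dataset(
    model: Model, prompts: List[str], principles: List[str]
) -> List[Tuple[str, str]]:
    """Pair each prompt with its final revision; these pairs would be the finetuning data."""
    return [(p, critique_and_revise(model, p, principles)) for p in prompts]


def toy_model(text: str) -> str:
    """Toy deterministic model (assumption) used only to make the sketch runnable."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "critique text"
    return "initial answer"


if __name__ == "__main__":
    print(build_sl_dataset(toy_model, ["q1", "q2"], ["be harmless"]))
```

In a real setting the stub would be replaced by calls to an actual model, and the resulting (prompt, revision) pairs would feed a standard supervised finetuning step, which is omitted here.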
Datasets (1)
- New challenging HHH binary comparison evaluations: 217 additional binary comparisons focusing on subtle harmlessness, including a preference for non-evasive responses.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods · Charbel-Raphaël Segerie Markov Grey · 2025 · ≈ 75%
- When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design · Daehoo Yoon, Sung Gyu Koh, Young Hwan Kim, Yehan Ahn, Sung Park Soyoung Jung · 2026 · ≈ 74%
- Perceptions of Sentient AI and Other Digital Minds: Evidence from the AI, Morality, and Sentience (AIMS) Survey · Janet V.T. Pauketat, Ali Ladak, and Aikaterina Manoli Jacy Reese Anthis · 2025 · ≈ 74%
- Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges · Shrestha Datta, Shahriar Kabir Nahin, Prasant Mohapatra Anshuman Chhabra · 2026 · ≈ 73%
- Towards AI Transparency and Accountability: A Global Framework for Exchanging Information on AI Systems · Adrian Byrne, Nicholas Perello, Cyrus Cousins, Taha Yasseri, Yair Zick, Przemyslaw Grabowicz Warren Buckley · 2026 · ≈ 73%
- Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem · Felipe S. Abrahão, Olaf Witkowski, Hector Zenil Alberto Hernández-Espinosa · 2025 · ≈ 73%
- Taking AI Welfare Seriously · in corpus · 2024 · ≈ 73%
- Advancing Responsible Innovation in Agentic AI: A Study of Ethical Frameworks for Household Automation · Satyam Kumar Navneet Joydeep Chandra · 2025 · ≈ 73%
- A Representationalist, Functionalist and Naturalistic Conception of Intelligence as a Foundation for AGI · Rolf Pfister · 2025 · ≈ 73%
- Normative active inference: A numerical proof of principle for a computational and economic legal analytic approach to AI governance · Axel Constant and Mahault Albarracin and Karl J. Friston · 2025 · ≈ 73%
- Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values · Ahmed Amer, Evan Harris, Preeti Ravindra, Shujun Zhang Nell Watson · 2025 · ≈ 72%
- Diagnosing AI Explanation Methods with Folk Concepts of Behavior · Jasmijn Bastings, Sebastian Gehrmann, Yoav Goldberg, Katja Filippova Alon Jacovi · 2023 · ≈ 72%
- AI Feedback Enhances Community-Based Content Moderation through Engagement with Counterarguments · Saeedeh Mohammadi and Taha Yasseri · 2026 · ≈ 72%
- Epistemic reflections on AI answering our questions: overwatch, erudite, logician, interlocutor · Johan F. Hoorn and Ella-Jenna Oosterglorenwoud · 2026 · ≈ 72%
- Cognitive glues are shared models of relative scarcities: the economics of collective intelligence · in corpus · 2026 · ≈ 68%
- Contemplative Agent · in corpus · 2025 · ≈ 68%
- Collective intelligence: A unifying concept for integrating biology across scales and substrates · in corpus · 2024 · ≈ 67%
- The biogenic approach to cognition · in corpus · 2005 · ≈ 67%
- Verbalized Eval Awareness Inflates Measured Safety · in corpus · 2026 · ≈ 67%
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models · in corpus · 2025 · ≈ 67%
Cited by (1)
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a