Reinforcement Learning Constitutional AI

The RL stage of Constitutional AI (CAI): AI feedback is used to train a preference model, which then provides the reward signal for RL, yielding a policy trained by RLAIF.

Neighborhood — ranked by edge-count

method

Multiple-choice evaluation method for PM training
implements
Using language model log probabilities of answer choices (A)/(B) to produce preference labels.
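As a rough illustration of the description above, the log probabilities of the two answer tokens can be normalized into a preference label. This is a minimal sketch, not the pipeline's actual code: the function name `preference_label` and the example log probabilities are hypothetical, and using the normalized probabilities as a soft label (rather than a hard argmax label) is an assumption.

```python
import math


def preference_label(logp_a: float, logp_b: float) -> tuple[float, float]:
    """Turn log probabilities of answers (A) and (B) into a soft preference label.

    Hypothetical helper: applies a two-way softmax so the result sums to 1.
    """
    m = max(logp_a, logp_b)          # subtract the max for numerical stability
    exp_a = math.exp(logp_a - m)
    exp_b = math.exp(logp_b - m)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


# Made-up log probabilities for the tokens "(A)" and "(B)".
p_a, p_b = preference_label(-0.4, -1.6)
preferred = "A" if p_a >= p_b else "B"   # hard label, if one is needed
```

The soft label keeps the feedback model's degree of confidence; the hard label discards it.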

framework

Reinforcement Learning from AI Feedback
related_to
Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
Supervised Learning Constitutional AI
related_to
The supervised learning stage of CAI where a model critiques and revises its responses, then finetunes on revisions.
Constitutional AI
implements
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
Preference Model
implements
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
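One common way to train on comparison data is a pairwise cross-entropy loss on the difference between the two responses' scores. The sketch below assumes that formulation; the source does not state which loss is used, and the name `pairwise_pm_loss` and its arguments are hypothetical.

```python
import math


def pairwise_pm_loss(score_a: float, score_b: float, p_a: float) -> float:
    """Cross-entropy between the label p_a and sigmoid(score_a - score_b).

    Assumed formulation: p_a is the (possibly soft) probability that
    response A is preferred; the scores are the preference model's outputs.
    """
    diff = score_a - score_b
    # log(sigmoid(diff)), computed in a numerically stable way
    if diff >= 0:
        log_sig_pos = -math.log1p(math.exp(-diff))
    else:
        log_sig_pos = diff - math.log1p(math.exp(diff))
    # log(sigmoid(-diff)) = log(sigmoid(diff)) - diff
    log_sig_neg = log_sig_pos - diff
    return -(p_a * log_sig_pos + (1.0 - p_a) * log_sig_neg)


# The loss is lower when the preferred response receives the higher score.
loss_agree = pairwise_pm_loss(2.0, 0.0, p_a=1.0)
loss_disagree = pairwise_pm_loss(0.0, 2.0, p_a=1.0)
```

During RL, the trained model's scalar score for a response would then stand in as the reward.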

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reinforcement Learning · framework · 0.820
Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
Reinforcement Learning from Human Feedback · method · 0.811
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
Bai et al. 2022: Constitutional AI — harmlessness from AI feedback · concept · 0.803
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift. · claim · 0.786
Interpretive claim connecting the battery's circularity to the empirical finding
H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly. · hypothesis · 0.786
Confirmatory hypothesis supported at p=0.006
Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels. · claim · 0.785
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
Reinforcement learning of tissues · concept · 0.782
The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels. · claim · 0.779
Explicit principles replace large datasets of preference labels, enabling faster iteration.