Bai et al. 2022: Constitutional AI — harmlessness from AI feedback

Paper on fine-tuning from AI feedback as an alternative to RLHF; cited as ref. 20

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
cites

Concepts (1)

concept

Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b)
same_as
Constitutional AI method whose constitutions, if changed, could trigger alignment faking

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels. · claim · 0.844
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness. · claim · 0.841
Discussion section suggests generalizability beyond harmlessness.
The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels. · claim · 0.813
Explicit principles replace large datasets of preference labels, enabling faster iteration.
The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. · quote · 0.811
Defines the core concept of the paper.
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024) · concept · 0.804
Related work studying capability of LLMs to subvert safety measures if severely misaligned
Reinforcement Learning Constitutional AI · framework · 0.803
The RL stage of CAI: AI feedback is used to train a preference model, which then serves as the reward signal for RL, yielding a policy trained by RLAIF.
Constitutional AI · framework · 0.800
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift. · claim · 0.791
Interpretive claim connecting the battery's circularity to the empirical finding
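
The selection rule stated for this section (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the `embeddings` mapping, and the `typed_edges` set of ID pairs are all hypothetical, and the embedding model behind the scores is not specified in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    Keeps entities whose cosine similarity to the target is >= threshold
    and that have no typed edge to the target in either direction.
    `embeddings` and `typed_edges` are hypothetical inputs for illustration.
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked by a typed relation (e.g. cites, same_as).
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, round(score, 3)))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities that pass this filter are only candidates: a reviewer would still decide whether each one deserves a new typed edge or is a duplicate to merge.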