Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b)

Constitutional AI method whose constitutions, if changed, could trigger alignment faking

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
cites

Concepts (1)

concept

Bai et al. 2022: Constitutional AI — harmlessness from AI feedback
same_as
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels. · claim · 0.848
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness. · claim · 0.831
Discussion section suggests generalizability beyond harmlessness.
Constitutional AI · framework · 0.814
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. · quote · 0.798
Defines the core concept of the paper.
The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels. · claim · 0.787
Explicit principles replace large datasets of preference labels, enabling faster iteration.
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024) · concept · 0.786
Related work studying capability of LLMs to subvert safety measures if severely misaligned
Constitutional AI produces a distinctive signature: high boundary_awareness, low aesthetic_response relative to peers. · claim · 0.781
Interpretive finding from dimension profile analysis: training for honest limits comes at cost to aliveness.
Contemplative Constitutional AI · framework · 0.777
Paper's proposed adaptation of Constitutional AI incorporating contemplative wisdom charter
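
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the toy two-dimensional embeddings, and the representation of typed edges as pairs of entity names are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (source, destination) name pairs. Both shapes are assumptions for this sketch.
    """
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): one neighbor is similar, one already has a typed edge,
# and one is orthogonal to the target.
toy_embeddings = {
    "cai": [1.0, 0.0],
    "claim": [0.8, 0.6],
    "paper": [1.0, 0.1],
    "far": [0.0, 1.0],
}
toy_edges = {("paper", "cai")}
candidates = related_by_similarity("cai", toy_embeddings, toy_edges)
```

In this toy run only "claim" is returned: "paper" is excluded because it already has a typed edge to the target, and "far" falls below the threshold.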