concept
active
concept:constitutional-ai-harmlessness-from-ai-feedback-bai-et-al-2022b
Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b)
Constitutional AI method; changes to its constitution could trigger alignment faking
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
- Discussion section suggests generalizability beyond harmlessness.
- Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
- Defines the core concept of the paper.
- Explicit principles replace large datasets of preference labels, enabling faster iteration.
- Related work studying the capability of LLMs to subvert safety measures if severely misaligned.
- Interpretive finding from dimension profile analysis: training for honest limits comes at cost to aliveness.
- Paper's proposed adaptation of Constitutional AI, incorporating a contemplative wisdom charter.