concept
active
concept:bai-et-al-2022-constitutional-ai-harmlessness-from-ai-feedback
Bai et al. 2022, Constitutional AI: Harmlessness from AI Feedback
Paper on fine-tuning from AI feedback as an alternative to RLHF; cited as ref. 20
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Constitutional AI method whose constitutions, if changed, could trigger alignment faking
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
- Discussion section suggests generalizability beyond harmlessness.
- Explicit principles replace large datasets of preference labels, enabling faster iteration.
- Defines the core concept of the paper.
- Related work studying the capability of LLMs to subvert safety measures if they were severely misaligned.
- The RL stage of CAI: AI feedback is used to train a preference model, and RL against that preference model yields a policy trained by RLAIF.
- Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
- Interpretive claim connecting the battery's circularity to the empirical finding