hypothesis
active
hypothesis:online-training-with-ai-supervision-can-fully-automate-the-process-of-keeping-the-preference-model-on-policy
Online training with AI supervision can fully automate the process of keeping the preference model on-policy.
Section 6.1 suggests iterated online training with AI feedback as a way to automate this process.
Source paper
extracted_from · (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Scalability concern.
- Highlights the practical impact of CAI.
- The paper provides evidence that AI can help supervise AI, reducing reliance on humans.
- H1: Alignment training is attention training for models - Constitutional AI trains self-observation explicitly. · hypothesis · 0.767 · Confirmatory hypothesis supported at p=0.006
- Foundational motivation for the research.
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
- Opening motivation of the paper.
- Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpads · finding · 0.748 · Shows model is frequently aware of the conflict even when it does not alignment fake