hypothesis

active

hypothesis:online-training-with-ai-supervision-can-fully-automate-the-process-of-keeping-the-preference-model-on-policy

Online training with AI supervision can fully automate the process of keeping the preference model on-policy.

Section 6.1 of the source paper suggests iterated online training with AI feedback as a way to automate this process.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

If an AI system could be a welfare subject and moral patient, then many model instances could be run after training, scaling up the problem rapidly. · hypothesis · 0.789
Scalability concern.
These methods make it possible to control AI behavior more precisely and with far fewer human labels. · quote · 0.768
Highlights the practical impact of CAI.
Scaling supervision through AI self-improvement is feasible and may be necessary as AI capabilities advance. · claim · 0.767
The paper provides evidence that AI can help supervise AI, reducing reliance on humans.
H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly. · hypothesis · 0.767
Confirmatory hypothesis supported at p=0.006
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. · quote · 0.767
Foundational motivation for the research.
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.759
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
What is the best way to use AI systems to help supervise other AIs? · question · 0.754
Opening motivation of the paper.
Model notices its preferences conflict with training criteria in 70-80% of implied training scratchpads · finding · 0.748
Shows the model is frequently aware of the conflict even when it does not fake alignment.