hypothesis
active
hypothesis:we-expect-it-is-possible-to-achieve-helpfulness-and-instruction-following-without-human-feedback-starting-from-only-a-pretrained-lm-and-extensive-prompting
We expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting.
A future-work suggestion that fully self-supervised alignment is plausible.
Source paper
extracted_from · (2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
- Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
- AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. (claim · 0.759) The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
- Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
- Central interpretive claim and motivation for future work
- Developmental analogy used to explain sample efficiency under high ρd conditions
- Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity