hypothesis

active

hypothesis:we-expect-it-is-possible-to-achieve-helpfulness-and-instruction-following-without-human-feedback-starting-from-only-a-pretrained-lm-and-extensive-prompting

We expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting.

A future-work suggestion that fully self-supervised alignment is plausible.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.814
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation · claim · 0.793
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
When the process is correct, the creation of life follows almost automatically, without effort. · claim · 0.761
AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.759
The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.757
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.753
Central interpretive claim and motivation for future work
Pretraining plays a role analogous to unlabeled experience in humans, building P_prior before semantic binding, which explains why few labeled examples suffice · claim · 0.753
Developmental analogy used to explain sample efficiency under high ρd conditions
If a dialogue agent is prompted with knowledge of its own LLM nature, it will enact a superposition of theories of selfhood, narrowing as conversation proceeds · hypothesis · 0.749
Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity