concept

active

concept:ouyang-et-al-2022-training-language-models-to-follow-instructions-with-human-feedback

Ouyang et al. 2022: Training language models to follow instructions with human feedback

RLHF paper, cited for a major fine-tuning technique used in commercial dialogue agents

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting. · hypothesis · 0.814
Future-work suggestion that fully self-supervised alignment is plausible.
Perez et al. 2022: Discovering language model behaviors with model-written evaluations · concept · 0.812
Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
Language models are few-shot learners (Brown et al., 2020) · concept · 0.796
Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
Jointly training a language model with a vision model improves performance on language tasks compared to training the language model alone · finding · 0.792
OpenAI GPT-4V finding supporting cross-modal training benefit
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.790
Safety intervention that relies on activation modification, which ESR might undermine
Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al. 2022) · concept · 0.786
Prior work studying sycophancy and desire not to be shut down in RLHF-trained models
Today's Large Language Models have become so good at playing Turing's game that it often takes experts to demonstrate the present limits of their ability to simulate human-like intelligence. · claim · 0.785
Paper's assessment of current LLM capabilities relative to Turing Test
Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay · hypothesis · 0.785
Alternative hypothesis for how experience reports arise without explicit performance