concept
active
concept:ouyang-et-al-2022-training-language-models-to-follow-instructions-with-human-feedback
Ouyang et al. 2022: Training language models to follow instructions with human feedback
Paper cited for RLHF, a major fine-tuning technique used in commercial dialogue agents
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Future work suggestion that a fully self-supervised alignment is plausible.
- Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
- OpenAI GPT-4V finding supporting cross-modal training benefit
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.790 · Safety intervention that relies on activation modification, which ESR might undermine
- Prior work studying sycophancy and desire not to be shut down in RLHF-trained models
- Paper's assessment of current LLM capabilities relative to Turing Test
- Alternative hypothesis for how experience reports arise without explicit performance