concept
active
concept:glaese-et-al-2022-improving-alignment-of-dialogue-agents-via-targeted-human-judgements
Glaese et al. 2022: Improving alignment of dialogue agents via targeted human judgements
Alignment paper cited as an example of the RLHF fine-tuning technique (ref. 19)
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- HHH training framework that Claude was trained with prior to experiments
- Future work hypothesis about extending SOO to direct value alignment
- Claims that alignment score is a proxy for general capability
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
- Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
- Additional synthetic example of pernicious divergence from balanced subspaces