concept

active

concept:glaese-et-al-2022-improving-alignment-of-dialogue-agents-via-targeted-human-judgements

Glaese et al. 2022: Improving alignment of dialogue agents via targeted human judgements

Alignment paper cited as an example of the RLHF fine-tuning technique (ref. 19)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges, or may be unrecognized duplicates.

A General Language Assistant as a Laboratory for Alignment (Askell et al. 2021) · concept · 0.784
HHH training framework that Claude was trained with prior to experiments
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.770
Future work hypothesis about extending SOO to direct value alignment
Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.768
Claims that alignment score is a proxy for general capability
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.766
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.765
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation · claim · 0.758
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
If a dialogue agent is prompted with knowledge of its own LLM nature, it will enact a superposition of theories of selfhood, narrowing as conversation proceeds · hypothesis · 0.757
Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
Intervention on a balanced subspace dimension while holding others fixed crosses the decision boundary using a non-native mechanism · finding · 0.754
Additional synthetic example of pernicious divergence from balanced subspaces
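
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Entity` type, the `related_by_similarity` function, and the representation of typed edges as unordered id pairs are hypothetical and do not describe how this page is actually generated.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical graph entity: an id plus an embedding vector."""
    id: str
    embedding: tuple[float, ...]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target, candidates, typed_edges, threshold=0.65):
    """Return (id, score) pairs for candidates that are similar to `target`
    but have no typed edge to it, sorted by descending score.

    `typed_edges` is assumed to be a set of frozenset({id_a, id_b}) pairs.
    """
    hits = []
    for cand in candidates:
        if cand.id == target.id:
            continue  # an entity is not its own neighbor
        if frozenset((target.id, cand.id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target.embedding, cand.embedding)
        if score >= threshold:
            hits.append((cand.id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Under these assumptions, an entry with a very high score and no typed edge is the likeliest candidate for an unrecognized duplicate, while entries nearer the threshold are more plausibly candidates for a new typed edge.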