method
active
method:reinforcement-learning-from-human-feedback
Reinforcement Learning from Human Feedback
Method for fine-tuning language models (LMs) on human preferences; mentioned as combining reinforcement learning (RL) with LMs.
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (3)
framework
- Constitutional AI · extends · Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
- Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
- Preference Model · implements · A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
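
As a rough illustration of how a preference model could be trained on comparison data, the sketch below scores a pair of responses with a pairwise loss. It assumes a Bradley-Terry formulation, which is a common choice but is not stated by this entry; the function names `preference_loss` and `mean_preference_loss` are hypothetical.

```python
import math
from typing import Iterable, Tuple


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss for one comparison (assumed Bradley-Terry form).

    Returns -log(sigmoid(score_chosen - score_rejected)): small when the
    preferred response is scored higher, large when it is scored lower.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_score, rejected_score) pairs."""
    losses = [preference_loss(chosen, rejected) for chosen, rejected in pairs]
    if not losses:
        raise ValueError("at least one comparison pair is required")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical scores: the first pair agrees with the human label,
    # the second pair disagrees and is penalized more heavily.
    print(preference_loss(2.0, 0.0))
    print(preference_loss(0.0, 2.0))
    print(mean_preference_loss([(2.0, 0.0), (0.0, 2.0)]))
```

In a setup like this, minimizing the loss pushes the score of the preferred response above the score of the rejected one, and the trained scores can then stand in as the reward signal during RL fine-tuning.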
Concepts (2)
concept
- Alignment Faking · associated_with · Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
- Revealed Preferences · associated_with · Behavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states
Artifacts (1)
artifact
- Simulators (LessWrong post) · mentions · The paper being extracted.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
- Proposed experimental paradigm to train morphogenesis using rewards and punishments, treating tissues as learning agents.
- AI training method inspired by behaviorism, used for autonomous cars and drones; cited as bioinspired success
- Operational definition of RL used throughout the paper, quoted from Sutton.
- Value learning method inferring reward function from expert demonstrations; reviewed as insufficient for superintelligent alignment
- Machine learning paradigm where agents learn to maximize cumulative reward through interaction.