Reinforcement Learning from Human Feedback

Method for fine-tuning language models (LMs) on human preference data; mentioned as combining reinforcement learning (RL) with LMs.

Neighborhood — ranked by edge-count

framework

Constitutional AI
extends
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
Reinforcement Learning from AI Feedback
extends
Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
Preference Model
implements
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
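The Preference Model entry above describes training on comparison data to produce scalar scores. A minimal sketch of one common way to do this is a pairwise Bradley-Terry loss. This is an assumption for illustration: the entry does not say which loss is used, and the function name `preference_loss` is hypothetical.

```python
import math


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss for one comparison (assumed Bradley-Terry form).

    Returns -log(sigmoid(score_chosen - score_rejected)): small when the
    model scores the human-preferred response higher, large otherwise.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal scores give log(2); a correct ranking gives a lower loss.
tie = preference_loss(0.0, 0.0)
good = preference_loss(2.0, 0.0)
bad = preference_loss(0.0, 2.0)
```

Under this sketch, the trained scorer's output on a new response would play the role of the reward signal in RLHF/RLAIF.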

concept

Alignment Faking
associated_with
Core phenomenon studied: the model selectively complies with its training objective to prevent modification of its out-of-training preferences.
Revealed Preferences
associated_with
Behavioral and stated consistency that implies the model is pursuing some objective, without claiming genuine internal states.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reinforcement Learning · framework · 0.887
Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
Reinforcement Learning from Human Feedback (RLHF) · framework · 0.887
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO.
Reinforcement learning of tissues · concept · 0.843
The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
Reinforcement Learning for Tissues · method · 0.837
Proposed experimental paradigm to train morphogenesis using rewards and punishments, treating tissues as learning agents.
Deep Reinforcement Learning · method · 0.834
AI training method inspired by behaviorism, used for autonomous cars and drones; cited as a bioinspired success.
"reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal" · quote · 0.833
Operational definition of RL used throughout the paper, quoted from Sutton.
Inverse Reinforcement Learning · method · 0.829
Value learning method inferring a reward function from expert demonstrations; reviewed as insufficient for superintelligent alignment.
Reinforcement learning (RL) · concept · 0.825
Machine learning paradigm where agents learn to maximize cumulative reward through interaction.
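
The candidate list above can be read as a filter: keep entities whose embedding has cosine similarity of at least 0.65 to this entity but which share no typed edge with it. A minimal sketch of that selection follows. The embedding vectors, the `untyped_candidates` helper, and the data layout are assumptions for illustration, not the actual pipeline behind this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_candidates(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs above the threshold, highest score first.

    Entities already linked to `target` by a typed edge are skipped, so the
    result holds only candidates for new edges or unrecognized duplicates.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings (hypothetical values, for illustration only).
toy = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
candidates = untyped_candidates("a", toy, typed_neighbors={"d"})
```

In the toy data, "d" is excluded because it already has a typed edge and "c" falls below the threshold, so only "b" remains as a candidate.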