Reinforcement Learning from AI Feedback

Variant of RLHF in which the human feedback labels for harmlessness are replaced with AI-generated feedback.

Neighborhood — ranked by edge-count

method

Reinforcement Learning from Human Feedback
extends
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.

framework

Reinforcement Learning
related_to
Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
Reinforcement Learning Constitutional AI
related_to
The RL stage of Constitutional AI (CAI): AI feedback is used to train a preference model, which then supplies the reward signal for RL, resulting in a policy trained by RLAIF.
Preference Model
implements
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
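As a rough illustration of how a preference model can turn comparison data into scores, the sketch below uses a Bradley-Terry style pairwise loss. This loss is a common choice and is an assumption here, not something the entries above specify; the function names `pairwise_preference_loss` and `preference_probability` are hypothetical. In RLAIF the "chosen" label would come from an AI feedback model rather than a human rater.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the preferred response already has the higher score.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_a: float, score_b: float) -> float:
    """Modelled probability that response A is preferred over response B."""
    return math.exp(-pairwise_preference_loss(score_a, score_b))
```

During training, the scores would be produced by the model itself and the loss averaged over many comparison pairs; the trained scorer is then used as the reward signal in the RL stage.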

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reinforcement Learning from Human Feedback (RLHF) · framework · 0.832
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO.
Reinforcement learning of tissues · concept · 0.819
The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
Deep Reinforcement Learning · method · 0.818
AI training method inspired by behaviorism, used for autonomous cars and drones; cited as bioinspired success.
Inverse Reinforcement Learning · method · 0.817
Value learning method inferring reward function from expert demonstrations; reviewed as insufficient for superintelligent alignment.
Reinforcement Learning for Tissues · method · 0.810
Proposed experimental paradigm to train morphogenesis using rewards and punishments, treating tissues as learning agents.
Reinforcement Learning with PPO · method · 0.807
Actually training Claude to comply with the conflicting objective using Proximal Policy Optimization.
"reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal" · quote · 0.807
Operational definition of RL used throughout the paper, quoted from Sutton.
Monte-Carlo reinforcement learning · method · 0.803
Reinforcement learning methods that update parameters at the end of an episode based on sampled returns.
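
The "cosine ≥ 0.65 · no typed edge" listing above can be approximated with a short filter. This is a minimal sketch under stated assumptions: entities are stored as plain embedding vectors in a dict, typed edges as (source, target) pairs, and the helper names `cosine` and `untyped_neighbors` are hypothetical. How this page actually computes its neighborhood is not stated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    Results are sorted by descending similarity, mirroring the listing above.
    """
    # Collect every entity that already shares a typed edge with the target.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    candidates = []
    for name, vector in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are exactly the ones worth reviewing as candidate edges or possible duplicates, since they are semantically close but not yet connected.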