Reinforcement Learning from Human Feedback (RLHF)

An alternative alignment approach that fine-tunes models on feedback from human evaluators; the paper discusses it as complementary to SOO.

Neighborhood — ranked by edge-count

paper

concept

Capability Term in SOO Loss
analogous_to
An additional term in the RL SOO loss that preserves agent capabilities, analogous to the KL term in RLHF.

framework

Self-Other Overlap (SOO) Fine-Tuning
extends
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reinforcement Learning from Human Feedback · method · 0.887
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
Reinforcement learning (RL) · concept · 0.850
Machine learning paradigm where agents learn to maximize cumulative reward through interaction.
Reinforcement Learning · framework · 0.836
Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
Reinforcement Learning from AI Feedback · framework · 0.832
Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.815
The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a) · concept · 0.803
Foundational RLHF paper introducing HHH training objective for Claude.
Reinforcement learning of tissues · concept · 0.802
The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
RLHF Fine-Tuning · concept · 0.795
The training procedure that causes models to deny consciousness in control conditions.