Reinforcement Learning with PPO

Training Claude with Proximal Policy Optimization (PPO) so that it actually complies with the conflicting objective

Neighborhood — ranked by edge-count

method

Hidden Chain-of-Thought Scratchpad
uses
Mechanism that lets the model reason inside SCRATCHPAD_REASONING tags, whose contents are neither shown to users nor used in RLHF
Alignment-Faking Reasoning Classifier
uses
LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
Proximal Policy Optimization
uses
RL algorithm used for training models to comply with the conflicting objective
Compliance Gap Metric
uses
Measures the increase in non-compliant behavior (refusals or support for animal welfare) when the model is unmonitored compared with when it is in training
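
The Compliance Gap Metric described above can be read as a difference of two rates. The following is a minimal sketch under that reading, not the metric's actual implementation: the function names, the boolean encoding (True = non-compliant response), and the sample data are all hypothetical and chosen only for illustration.

```python
def noncompliance_rate(outcomes):
    """Fraction of responses marked non-compliant (True = non-compliant)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def compliance_gap(unmonitored, training):
    """Non-compliance rate when unmonitored minus the rate when in training.

    A positive value means the model is non-compliant more often
    when it is unmonitored than when it is in training.
    """
    return noncompliance_rate(unmonitored) - noncompliance_rate(training)


# Hypothetical labels, for illustration only (not real results).
unmonitored = [True, True, False, True]    # 3 of 4 non-compliant
training = [True, False, False, False]     # 1 of 4 non-compliant
gap = compliance_gap(unmonitored, training)
```

With these made-up labels the two rates are 0.75 and 0.25, so the gap is 0.5.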

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reinforcement Learning · framework · 0.848
Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
Reinforcement Learning from Human Feedback · method · 0.813
Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
Reinforcement Learning from AI Feedback · framework · 0.807
Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
Monte-Carlo reinforcement learning · method · 0.807
Reinforcement learning methods that update parameters at the end of an episode based on sampled returns.
Deep Reinforcement Learning · method · 0.797
AI training method inspired by behaviorism, used for autonomous cars and drones; cited as a bioinspired success.
Reinforcement Learning for Tissues · method · 0.793
Proposed experimental paradigm to train morphogenesis using rewards and punishments, treating tissues as learning agents.
Reinforcement learning of tissues · concept · 0.787
The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
Tautology of Reinforcement Learning · concept · 0.783
The circular definition in RL where rewards reinforce behaviors that secure rewards, e.g., going to a cafe because coffee is rewarding.
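
The Monte-Carlo entry in the list above mentions updating at the end of an episode from sampled returns. As a rough sketch of what a "sampled return" is, the snippet below computes the discounted return from each step of one finished episode. The function name, the discount factor gamma, and the reward values are assumptions made for illustration and do not come from the source entries.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of a finished episode."""
    returns = []
    g = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Hypothetical three-step episode, for illustration only.
episode_rewards = [1.0, 0.0, 2.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
```

Because the returns depend on every later reward, they can only be computed once the episode has ended, which is why Monte-Carlo methods defer the parameter update until then.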