method
active
method:reinforcement-learning-with-ppo
Reinforcement Learning with PPO
Actually training Claude, using Proximal Policy Optimization (PPO), to comply with the conflicting objective
Neighborhood — ranked by edge-count
Methods (4)
method
- Mechanism allowing model to reason in SCRATCHPAD_REASONING tags not shown to users or used in RLHF
- LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
- RL algorithm used for training models to comply with the conflicting objective
- Measures increase in non-compliant behavior (refusals or animal welfare support) when model is unmonitored vs. in training
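The RL algorithm named in this entry, PPO, is usually described in terms of a clipped surrogate objective. The block below is a minimal sketch of that objective in plain Python, not the training setup used in the source. The function name `ppo_clipped_objective`, its argument names, and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (to be maximized) over a batch.

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and the policy that collected the data.
    advantages: advantage estimates for the same actions.
    clip_eps: clipping range (0.2 is an assumed default, not from the source).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage. The clipping only matters once an update would move the policy far from the one that generated the samples.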
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.
- Method for fine-tuning LMs based on human preferences; mentioned as combining RL and LMs.
- Variant of RLHF where human feedback is replaced with AI-generated feedback for harmlessness.
- Reinforcement learning methods that update parameters at the end of an episode based on sampled returns.
- AI training method inspired by behaviorism, used for autonomous cars and drones; cited as bioinspired success.
- Proposed experimental paradigm to train morphogenesis using rewards and punishments, treating tissues as learning agents.
- The hypothesis that cellular collectives can be trained via rewards/punishments to produce specific morphological outcomes.
- The circular definition in RL where rewards reinforce behaviors that secure rewards, e.g., going to a cafe because coffee is rewarding.
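One entry in the list above describes methods that update parameters at the end of an episode based on sampled returns. As a rough illustration of the quantity such methods compute, the sketch below derives the discounted return for each step of a finished episode. The name `discounted_returns` and the default `gamma=0.99` are assumptions for this example, not details taken from any listed entity.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of an episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

Because every return depends on rewards that arrive later, these values are only available once the episode has ended, which is why the update happens at the end of the episode.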