Proximal Policy Optimization

Reinforcement learning (RL) algorithm used to train models to comply with the conflicting objective

Neighborhood — ranked by edge-count

thinker

John Schulman
introduces
Cited for scaling laws for reward model overoptimization (2022).

method

Reinforcement Learning with PPO
uses
Actually training Claude to comply with the conflicting objective using Proximal Policy Optimization
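
For orientation, the core of PPO is its clipped surrogate objective, which limits how far a single update can move the policy away from the one that collected the data. The sketch below is a minimal, dependency-free illustration of that objective only. The function names, the sample tuples, and the default epsilon of 0.2 are illustrative assumptions, not details taken from the training setup cited above.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate term (to be maximised).

    new_logp / old_logp: log-probability of the taken action under the
    current policy and under the data-collecting policy (hypothetical inputs).
    """
    # Probability ratio r = pi_new(a|s) / pi_old(a|s)
    ratio = math.exp(new_logp - old_logp)
    # Clip the ratio to the interval [1 - epsilon, 1 + epsilon]
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the pessimistic (smaller) of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_objective(samples, epsilon=0.2):
    """Average clipped surrogate over (new_logp, old_logp, advantage) tuples."""
    terms = [clipped_surrogate(n, o, a, epsilon) for n, o, a in samples]
    return sum(terms) / len(terms)
```

With a positive advantage, the term stops growing once the ratio exceeds 1 + epsilon, so there is no incentive to push the policy further on that sample; with a negative advantage, the same happens once the ratio drops below 1 - epsilon.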

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Policy Selection · concept · 0.761
Choosing sequences of actions based on expected free energy; prior probability of policy is softmax of expected free energy
Policy · concept · 0.733
Sequence of actions considered by the agent; basis for planning.
Pareto optimality · concept · 0.733
Trade-off concept where no metric can be improved without worsening another.
Multi-objective optimization · concept · 0.731
Framework for optimizing multiple objectives simultaneously, used in MTL.
Sequential Policies · concept · 0.724
In active inference, a policy is a sequence of actions through time, as opposed to state-action mappings in RL.
Softmax Policy Prior · concept · 0.722
Policies assigned probability via softmax of expected free energy; enables self-evidencing behavior.
Deontological optimization · concept · 0.722
Predictive accuracy applies pressure directly on actions rather than consequences, avoiding instrumental convergence.
The most likely (i.e., best) policies minimise expected free energy. · claim · 0.716
Decision-making rule in active inference.
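
Several of the neighbors above describe the same rule: a prior over policies given by a softmax of expected free energy. A minimal sketch of that rule follows, assuming the usual sign convention in which the softmax is taken over the negative expected free energy, so that the policies with the lowest expected free energy are the most probable. The function name, the precision parameter, and the example values are hypothetical.

```python
import math

def policy_prior(expected_free_energy, precision=1.0):
    """Softmax over negative expected free energy, one value per candidate policy."""
    # Lower expected free energy should give a higher score, hence the minus sign
    scores = [-precision * g for g in expected_free_energy]
    # Subtract the maximum score for numerical stability
    shift = max(scores)
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical expected free energies for three candidate policies
G = [2.0, 1.0, 3.0]
prior = policy_prior(G)
# Index of the policy with the lowest expected free energy
best = min(range(len(G)), key=G.__getitem__)
```

Under this convention the most probable policy is the one that minimises expected free energy, which matches the decision-making rule listed above.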