Grouped Relative Policy Optimization (GRPO)

Cost-efficient training algorithm used by DeepSeek-R1 for RL-based reasoning
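Much of GRPO's cost saving comes from dropping the learned value (critic) model: each sampled response is scored against the other responses drawn for the same prompt. Below is a minimal sketch of that group-relative advantage step only. The function name `group_relative_advantages`, the `eps` stabilizer, the choice of population standard deviation, and the example rewards are all illustrative assumptions, not details taken from DeepSeek-R1's training code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to its own sampling group.

    rewards: scalar rewards for responses sampled from the SAME prompt.
    eps: small constant (an assumption) to avoid division by zero when
         every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    # Population std is an assumption; some implementations use sample std.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four responses to one prompt, with 0/1 correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and the rest a negative one, so no separate critic network is needed to supply a baseline. A full training step would feed these advantages into a clipped policy-gradient objective, usually with a KL penalty toward a reference model, which this sketch omits.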

Neighborhood — ranked by edge-count

DeepSeek-R1 · framework · implements
Open-source reasoning LLM from DeepSeekAI trained with reinforcement learning to exhibit self-reflection

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GRPO (Group Relative Policy Optimization) · method · 0.936
RL algorithm used to train the activation verbalizer on open models; samples a group of candidate descriptions and applies policy optimization.
Proximal Policy Optimization · method · 0.715
RL algorithm used for training models to comply with the conflicting objective
Policy Selection · concept · 0.696
Choosing sequences of actions based on expected free energy; the prior probability of a policy is a softmax of the negative expected free energy.
Relative performance improvement (Δp) · concept · 0.691
Metric averaged over all tasks that measures a multi-task learning (MTL) method's improvement over single-task learning (STL).
The approximate posterior over policies is a softmax function of the negative expected free energy. · claim · 0.687
Mathematical form of policy selection, eq. (10).
How Can We Optimize Binding Policies Between Human · question · 0.676
Latent-Anchored GRPO (LA-GRPO) · method · 0.675
Token-level auxiliary objective that strengthens optimization of sparse functional tokens during RL by anchoring group-level advantages directly to functional-token positions.
Deontological optimization · concept · 0.674
Predictive accuracy applies pressure directly on actions rather than consequences, avoiding instrumental convergence.