framework
active
framework:grouped-relative-policy-optimization-grpo
Group Relative Policy Optimization (GRPO)
Cost-efficient training algorithm used by DeepSeek-R1 for RL-based reasoning
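As a rough illustration of what "group relative" means, the sketch below normalizes each sampled completion's reward against the mean and standard deviation of its own sampling group. This group baseline stands in for a learned value model, which is one reason the method is described as cost-efficient. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for this example; they are not taken from the DeepSeek-R1 code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per sampled completion in a single group.

    Each reward is centered on the group mean and scaled by the group
    standard deviation. `eps` (an assumed stabilizer) avoids division by
    zero when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and those below get a negative one; a group with identical rewards yields all-zero advantages and so contributes no learning signal.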
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- DeepSeek-R1 · implements · Open-source reasoning LLM from DeepSeek-AI, trained with reinforcement learning to exhibit self-reflection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- RL algorithm used to train the activation verbalizer on open models; samples a group of candidate descriptions and applies policy optimization.
- RL algorithm used for training models to comply with the conflicting objective
- Choosing sequences of actions based on expected free energy; the prior probability of a policy is a softmax of the negative expected free energy.
- Metric averaged over all tasks to measure the improvement of a multi-task learning (MTL) method over single-task learning (STL).
- The approximate posterior over policies is a softmax function of the negative expected free energy. · claim · 0.687 · Mathematical form of policy selection, eq. (10).
- Token-level auxiliary objective that strengthens optimization of sparse functional tokens during RL by anchoring group-level advantages directly to functional-token positions.
- Predictive accuracy applies pressure directly on actions rather than consequences, avoiding instrumental convergence.
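
Two of the neighbors above describe policy selection as a softmax of the negative expected free energy. A minimal sketch of that mapping follows, assuming one expected-free-energy value per policy. The function name and the `precision` scaling parameter are hypothetical and do not come from the listed entities.

```python
import math


def policy_softmax(expected_free_energy, precision=1.0):
    """Map expected free energies to a probability distribution over policies.

    Lower expected free energy gives higher probability. `precision` is an
    assumed inverse-temperature parameter for this sketch.
    """
    logits = [-precision * g for g in expected_free_energy]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical expected free energies for three candidate policies.
G = [2.0, 1.0, 3.0]
probs = policy_softmax(G)
```

Here the second policy, with the lowest expected free energy, receives the largest probability and the third receives the smallest.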