Group Policy Optimization: GRPO, Dr. GRPO, GSPO
Intro
RL has been used in the context of LLMs for some time now, with one of the earlier widely shared examples being OpenAI's RLHF paper, which laid out how they used PPO to continue training a model based on human preferences. Yet RL never really took off in the larger community, for several reasons: the overall complexity of implementing RL (and PPO specifically), plus the increased resource requirements, since one has to maintain multiple instances of the main model/policy as well as train and maintain additional reward models.
But over the past year, and especially in recent months, new ways of doing RL for LLMs have been proposed that attempt to make it easier to implement, a better fit for the task overall, and less resource-intensive to train. Namely, DeepSeek released GRPO in 2024, researchers at Sea AI Lab followed up with their variation Dr. GRPO in March 2025, and Qwen researchers released GSPO in July 2025. This post breaks down the mathematical foundations of GRPO, examining each component of the objective function and the mechanics behind it, which in turn provides a deeper understanding of the proposed variations.
Let's clarify some common points of confusion around policies, models, and rewards in GRPO:
(Figure: diagram demonstrating the GRPO setup)
- The Policy Model is the main model that we are actively training and updating, denoted by $\pi_{\theta}$
- If you are finetuning Qwen 3, this is your Policy Model.
- The Reference Model is:
- A complete copy of the initial Policy Model weights
- Kept frozen during training
- Used only as a reference point to compare against the evolving Policy Model
- The Reward Model can be either:
- A separate model that outputs rewards for the Policy Model's outputs, as in RLHF/PPO
- Or simply a set of rules/an environment that determines rewards (e.g., in RLVR, where correctness can be directly verified)
Note: If reading formula notation is not your forte (it is not mine), don't worry too much about understanding the formulas at first. By the end of this post, each component will be broken down and explained.
I suggest reading this post in two passes: an initial skim from top to bottom to get an idea of what the different sections cover, then a second, slower pass to make sense of whatever didn't click the first time.
The GRPO Objective Function
The complete GRPO objective function optimizes the following expectation:
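A sketch of the full objective, written to match the notation used throughout this post (the KL-divergence penalty against the Reference Model, which appears in the original paper, is omitted here since it is not part of the component walkthrough below):

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\; \text{clip}\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) \right]
$$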
The Importance Weight
One of the defining attributes of GRPO is that importance is calculated at a per-token level across generations, in order to compare and determine the most 'important' tokens for a given question.
The importance weight measures how much more (or less) likely a token is under the current policy compared to the old policy. Specifically, the numerator is the probability of token $o_{i,t}$ given question $q$ and all previous tokens, denoted $\pi_\theta(o_{i,t} | q, o_{i,\text{<}t})$, where $\pi_\theta$ is the current policy; the denominator is the same probability under the old policy $\pi_{\theta_{\text{old}}}$. The expression $(o_{i,t} | q, o_{i,\text{<}t})$ is read as "token $o_{i,t}$ of the output, given the question/input tokens $q$ plus the previous output tokens."
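Written as a formula, using the $w_{i,t}(\theta)$ shorthand from the variable dictionary at the end of this post:

$$
w_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$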
Simplified Objective with Importance Weight
Using the importance weight notation, we can write the objective more concisely:
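(the same sketch as above, with the per-token ratio collapsed into $w_{i,t}(\theta)$):

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( w_{i,t}(\theta)\, \hat{A}_{i,t},\; \text{clip}\!\left( w_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) \right]
$$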
Advantage Estimation
For outputs $o = \{o_1, o_2, \ldots, o_G\}$, we get rewards $r = \{r_1, r_2, \ldots, r_G\}$. To calculate the advantage relative to the group, we subtract the group's mean reward from each reward and divide by the standard deviation of the group's rewards.
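As a formula (the group-relative form from the GRPO paper, with every token in output $o_i$ receiving the same value):

$$
\hat{A}_{i,t} = \hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}
$$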
Understanding the Components
Okay, now that we've covered some of the component functions of the larger GRPO objective, let's try to understand what their purpose is and how it all comes together.
To make things a little easier, let's start by identifying which components of the objective depend on which others, so that we can point out some distinct phases. In reading order they are: the summation over the group, the summation over a single output, a min() function, and a clip() function.
Then, to understand the data flow, we work through them in reverse order (a code sketch of this flow follows the list below).
- $w_{i,t}(\theta)$
- is a tensor containing the importance weights for each token in each output.
- being of shape [B, G, L], where B is the batch size, G is the group size, and L is the (padded) output length.
- clip($w_{i,t}(\theta)$, $1 - \epsilon$, $1 + \epsilon$)
- will element-wise clip the importance weights to be between $1 - \epsilon$ and $1 + \epsilon$. So if it is greater than $1 + \epsilon$ then it is set to $1 + \epsilon$ and if it is less than $1 - \epsilon$ then it is set to $1 - \epsilon$.
- being of shape [B, G, L]
- min($w_{i,t}(\theta)$, clip())
- will element-wise take the minimum between the original $w_{i,t}(\theta)$ and the clipped $w_{i,t}(\theta)$ from the previous step.
- being of shape [B, G, L]
- $\frac{1}{|o_i|} \sum_{t=1}^{|o_i|}$
- we sum over all the values for each output and normalize by the length of the output.
- being of shape [B, G]
- $\frac{1}{G} \sum_{i=1}^{G}$
- finally we sum over all the outputs for the group and normalize by the group size.
- being of shape [B]
- since the batch size here is 1, this is our scalar loss value
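To make the shape bookkeeping concrete, here is a minimal PyTorch sketch of that data flow. The function and argument names are mine rather than from any particular library, and padding handling is simplified:

```python
import torch

def grpo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Minimal sketch of the per-token GRPO objective walked through above.

    logp_new, logp_old: [B, G, L] log-probs of each sampled token under the
                        current and old policies (old policy detached).
    advantages:         [B, G] group-normalized advantages (one per output).
    mask:               [B, G, L] 1 for real tokens, 0 for padding.
    """
    # Importance weights w_{i,t} = pi_theta / pi_theta_old, shape [B, G, L]
    w = torch.exp(logp_new - logp_old)

    # Broadcast each output's advantage over its tokens
    adv = advantages.unsqueeze(-1)                      # [B, G, 1]
    unclipped = w * adv                                 # [B, G, L]
    clipped = torch.clamp(w, 1 - eps, 1 + eps) * adv    # [B, G, L]
    per_token = torch.minimum(unclipped, clipped)       # [B, G, L]

    # 1/|o_i| * sum over tokens, ignoring padding        -> [B, G]
    per_output = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)

    # 1/G * sum over the group                           -> [B]
    per_group = per_output.mean(-1)

    # Mean over the batch, negated because optimizers minimize
    return -per_group.mean()
```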
Implementation Details
verl: Volcano Engine Reinforcement Learning
Dr. GRPO: GRPO Done Right (without bias)
In the follow-up paper Understanding R1-Zero-Like Training: A Critical Perspective, the authors outline two sources of bias inherent to GRPO's normalization. First, a question-level difficulty bias from dividing the centered reward by `std(r)`. Second, a response-level length bias from the $\frac{1}{|o_i|}$ term, which encourages brevity when advantages are positive and longer responses when advantages are negative.
To give an example: for questions that are too hard or too easy, you can expect the rewards for a group to all be close to 0 or all close to 1. Consider a group of rewards for a hard question, r = {0.0, 0.2, 0.2, 0.0, 0.0, 0.1, 0.0, 0.0}, which gives a mean of 0.0625 and a std of 0.0857, versus a group for a more balanced question, r = {0.2, 0.9, 1.0, 0.4, 1.0, 0.1, 0.9, 0.0}, with a mean of 0.5625 and a std of 0.4029. Dividing by 0.0857 rather than 0.4029 scales the resulting advantages by a factor of more than 4x. One thing to note: advantage normalization is commonly applied in RL, but typically across an entire batch rather than within an individual group/question.
The $\frac{1}{|o_i|}$ term has a similar effect: for a policy trying to maximize its reward, scaling a positive advantage by 1/100 versus 1/10,000 is a 100x difference in its contribution to the objective, solely due to the length of the response.
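Both effects are easy to see numerically. A quick NumPy sketch using the reward groups from the example above:

```python
import numpy as np

hard = np.array([0.0, 0.2, 0.2, 0.0, 0.0, 0.1, 0.0, 0.0])       # mean 0.0625, std ~0.0857
balanced = np.array([0.2, 0.9, 1.0, 0.4, 1.0, 0.1, 0.9, 0.0])   # mean 0.5625, std ~0.4029

def grpo_advantage(r):
    # GRPO's group-level normalization: (r - mean) / std
    return (r - r.mean()) / r.std()

print(grpo_advantage(hard).round(2))      # small reward gaps get amplified ~4-5x
print(grpo_advantage(balanced).round(2))  # relative to the balanced group

# Length bias: with the 1/|o_i| term, each token's contribution is scaled
# by 1/100 vs 1/10_000 depending only on response length.
advantage = 1.0
print(advantage / 100, advantage / 10_000)
```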
So how do we fix these biases? Remove the terms.
GRPO Objective Before:
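(restating the objective from above in the $w_{i,t}$ notation, with the two normalization terms in question still in place):

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( w_{i,t}(\theta)\, \hat{A}_{i,t},\; \text{clip}\!\left( w_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) \right], \qquad \hat{A}_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)}
$$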
Dr. GRPO After:
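A sketch of the modified objective as described in the paper: the $\frac{1}{|o_i|}$ term is dropped from the inner sum, and the advantage is no longer divided by $\text{std}(r)$:

$$
J_{\text{Dr.GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\left( w_{i,t}(\theta)\, \hat{A}_{i,t},\; \text{clip}\!\left( w_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) \right], \qquad \hat{A}_{i,t} = r_i - \text{mean}(r)
$$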
GSPO: Group Sequence Policy Optimization
In the most recent variation of group-based policy optimization, the Qwen team introduced GSPO in the paper Group Sequence Policy Optimization. The core principle the paper argues GRPO violates is that 'the unit of optimization objective should match the unit of reward'. This points out a disconnect in GRPO: importance is applied at the token level, yet rewards are never given per token, only per output/sequence. Luckily, the paper provides a fairly elegant fix for this discrepancy: change the importance weight from token level to sequence level.
To start, here is the original importance weight/ratio laid out in GRPO:
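(the per-token ratio introduced earlier):

$$
w_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$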
We drop the token index and apply length normalization. The authors explain that length normalization is necessary to reduce variance and to keep the importance ratio within a unified numerical range; otherwise, the likelihood changes of a few tokens can cause dramatic fluctuations in the sequence-level importance ratio.
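A sketch of the sequence-level ratio as I read the paper, where $\pi_\theta(o_i \mid q)$ is the probability of the whole output, the exponent applies the length normalization, and $s_i(\theta)$ is the symbol the GSPO paper uses for this ratio (it does not appear earlier in this post):

$$
s_i(\theta) = \left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} \right)^{\frac{1}{|o_i|}}
$$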
And finally, since the model does not directly give us a single probability for the whole sequence, we write it out in terms of per-token probabilities:
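(the product of per-token ratios becomes a mean of log-ratios inside an exponential):

$$
s_i(\theta) = \exp\!\left( \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \log \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \right)
$$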
Then, to match the larger objective to our new sequence-level importance ratio, we remove the per-token sum and length normalization that sat outside the min().
Resulting in:
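A sketch of the resulting GSPO objective, with one importance ratio and one advantage per output rather than per token:

$$
J_{\text{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( s_i(\theta)\, \hat{A}_i,\; \text{clip}\!\left( s_i(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_i \right) \right]
$$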
Conclusion
GRPO computes per-token importance weights and group-relative advantages, which removes much of the machinery and resource cost that made PPO-style RLHF hard to adopt. Dr. GRPO points out that the $\text{std}(r)$ and $\frac{1}{|o_i|}$ normalization terms introduce question-difficulty and response-length biases, and simply removes them. GSPO argues that the unit of optimization should match the unit of reward, and so replaces the token-level importance ratio with a length-normalized sequence-level one. Understanding the GRPO objective component by component makes each of these follow-up changes a small, well-motivated edit rather than a brand-new algorithm.
Variable Dictionary
To fully understand the GRPO objective, here's a comprehensive breakdown of each variable:
Core Variables:
- $\theta$ — Current policy parameters (the neural network weights being optimized)
- $\theta_{\text{old}}$ — Previous policy parameters (fixed during the current update step)
- $\pi_\theta$ — Policy function parameterized by $\theta$
- $q$ — Input query/prompt from the dataset
- $o_i$ — The $i$-th generated response to query $q$
- $o_{i,t}$ — The $t$-th token in response $o_i$
- $o_{i,\text{<}t}$ — All tokens before position $t$ in response $o_i$
Hyperparameters:
- $G$ — Group size (number of responses generated per query)
- $\epsilon$ — Clipping parameter for importance ratio (typically 0.1 or 0.2)
Computed Values:
- $w_{i,t}(\theta)$ — Importance weight for token $o_{i,t}$
- $\hat{A}_{i,t} = \hat{A}_i$ — Advantage estimate (constant for all tokens in response $o_i$)
- $r(q, o_i)$ — Reward function evaluating the quality of response $o_i$ to query $q$
- $|o_i|$ — Length of response $o_i$ in tokens
Mathematical Operations:
- $\mathbb{E}$ — Expectation over the specified distributions
- $\mathcal{D}$ — Training dataset distribution
- $\text{clip}(x, a, b)$ — Clips value $x$ to range $[a, b]$
- $\text{mean}(\cdot)$ — Arithmetic mean of the group rewards
- $\text{std}(\cdot)$ — Standard deviation of the group rewards