GRPO Explained: The RL Technique Behind DeepSeek R1's Reasoning Power

May 21, 2026 · grpo, reinforcement-learning, deepseek, llm, reasoning, rlvr, 實讀 13 min read

Introduction

In January 2025, a Chinese AI lab called DeepSeek released a model named R1. It matched or beat OpenAI’s best reasoning models on math, code, and logic benchmarks — at a fraction of the training cost. The AI industry had spent two years assuming that advanced reasoning required massive compute budgets and armies of human labelers. DeepSeek proved otherwise.

The technique that made this possible is called GRPO: Group Relative Policy Optimization. It is a reinforcement learning method that teaches language models to reason step by step, using nothing more than a set of questions and a way to check whether answers are right. No human labels. No separate “critic” neural network. Just the model, a verifier, and group comparisons.

This article explains what GRPO is, how it works, and why it matters — in plain language, for people who follow AI developments but do not train models themselves.

The Problem with Traditional RLHF

Most AI chatbots you interact with — ChatGPT, Claude, Gemini — were trained with a technique called RLHF: Reinforcement Learning from Human Feedback. RLHF works in three stages:

Supervised fine-tuning (SFT). Human labelers write ideal responses for thousands of prompts. The model learns to imitate these.
Reward model training. Human labelers compare pairs of model outputs and pick which one is better. These preferences train a separate neural network — the “reward model” — that predicts how a human would score any given response.
Policy optimization. The main model generates responses, the reward model scores them, and the model updates itself to maximize those scores. This stage uses an algorithm called PPO (Proximal Policy Optimization), which also requires a fourth neural network: a “critic” that estimates the value of each state.

This pipeline is effective but expensive in three ways. Money: collecting human preference data costs real dollars and takes weeks. Compute: running four neural networks simultaneously (policy, reference, reward, and critic) consumes enormous GPU memory. Complexity: keeping four models synchronized is an engineering burden. For a long time, these costs meant RL for LLMs was something only large labs could afford.

DPO (Direct Preference Optimization) removed the reward model, which helped. But DPO still needs human preference pairs, and it does not naturally handle tasks where correctness is objective — tasks like math problems with a single right answer.

For reasoning tasks — math, coding, logic — there is a simpler path. These tasks have objective right and wrong answers. You do not need a human to tell you that 2+2=5 is incorrect. GRPO exploits this fact.

How GRPO Works

GRPO is built on one core insight: for tasks where correctness is verifiable, the model can teach itself by comparing its own attempts.

Here is the process:

Step 1: Ask a question. The system presents the model with a problem — say, a math word problem with a known answer.

Step 2: Generate multiple answers. The model produces G different responses (G=8 in the DeepSeek paper). Each response is generated independently with some randomness, so the model explores different reasoning paths. Some get the right answer; some do not.

Step 3: Score each answer. A rule-based verifier checks each response. For math, it compares the final answer to the ground truth. For code, it runs the program against unit tests. For format compliance, it checks with regex. Each response gets a scalar reward — 1 for correct, 0 for incorrect, or partial credit for getting some parts right.

Step 4: Compare to the group average. This is the key idea. Instead of asking “was this answer good or bad in absolute terms?”, GRPO asks “was this answer better or worse than the average attempt in this batch?” If the group average reward is 0.4 and a particular response scored 1.0, that response gets a positive advantage signal — it did better than expected. A response scoring 0.0 gets a negative signal.

Step 5: Update the model. Responses with positive advantage are reinforced — the model becomes more likely to produce similar reasoning in the future. Responses with negative advantage are suppressed. The update is clipped to prevent any single batch from changing the model too drastically.

An analogy: think of a fitness competition where each heat has 8 athletes and you rank them relative to the heat, not against a world record. A sprinter who wins a slow heat gets credit, but less than someone who dominates a fast heat. On a day when everyone performs poorly, a modest effort still earns a positive signal. On a day when everyone excels, even a solid performance might fall below average and get penalized. This relative comparison keeps the training signal meaningful across different difficulty levels — a hard problem where most answers are wrong still produces useful gradients, because the few correct answers stand out from the group.

Critically, GRPO eliminates the critic network entirely. PPO needs a critic to estimate a baseline value for each state; GRPO uses the group mean as that baseline. This cuts memory usage roughly in half and simplifies the training pipeline.

The Math (Simplified)

The core of GRPO is the advantage calculation. For each response i in a group of G responses:

$$A_i = \frac{r_i - \bar{r}}{\sigma(r) + \epsilon}$$

Where:

A_i is the advantage of response i
r_i is the reward for that response
mean(r) is the average reward across all G responses in the group
std(r) is the standard deviation of rewards in the group
ε is a tiny constant to prevent division by zero

A positive advantage means the response beat the group average. A negative advantage means it fell below. Dividing by the standard deviation normalizes the signal across problems of different difficulty — a hard problem where everyone scores low still produces meaningful gradients.

The model updates its policy (the probability distribution over tokens) using an objective with three guardrails:

The clipping term (ε=0.2). The ratio compares the new policy’s probability of generating a token to the old policy’s probability. If this ratio moves too far from 1.0 (beyond ±0.2), the update is clipped. This prevents the model from lurching too aggressively based on a single batch — it is a guardrail against overfitting to noise.

The KL divergence penalty (β=0.04). KL divergence measures how far the current model has drifted from a reference model (the starting checkpoint). Without this penalty, the model might collapse into producing repetitive nonsense that happens to score well on the verifier but has lost all coherent language ability. The penalty says: stay close to natural language, even if you could squeeze out a little more reward by drifting into gibberish. The β coefficient controls the strength — 0.04 means the penalty is light enough to allow learning but heavy enough to prevent collapse.

The min() operator. This is PPO’s signature trick. By taking the minimum of the clipped and unclipped objectives, the algorithm ensures it never takes a gradient step that would push the policy update past the clipping boundary, regardless of whether the advantage is positive or negative.

GRPO vs PPO vs DPO

Feature	PPO (RLHF)	DPO	GRPO
Needs human-labeled data	Yes (preferences)	Yes (preference pairs)	No
Needs separate reward model	Yes	No	No
Needs critic model	Yes	No	No
Training memory footprint	High (4 models)	Medium (2 models)	Low (2 models)
Reward source	Learned reward model	Implicit in preferences	Rule-based verifier
Best for	General chat quality	General chat quality	Verifiable reasoning tasks
Risk of reward hacking	High (reward model is imperfect)	Low	Medium (verifier can be gamed)

PPO with RLHF remains the standard for general-purpose chatbot training because it handles subjective quality — “is this response helpful and polite?” — which cannot be reduced to a simple rule. DPO simplifies the pipeline by folding reward modeling into preference comparison, but still needs someone to judge which output is better.

GRPO sidesteps the entire human-labeling pipeline for domains where correctness is objective. It is the most constrained but the cheapest option. The constraint is real: your task must have an automatic verifier. If you are training a model to write poetry, GRPO cannot help. But for math, code, and structured reasoning, that is not much of a limitation.

Real-World Impact

GRPO is the engine behind what researchers now call Large Reasoning Models (LRMs). Unlike standard LLMs that produce an answer immediately, LRMs emit a chain of thought — often wrapped in <think> tags — before delivering the final response. DeepSeek R1 is the most prominent example. Its reasoning traces show the model checking its own work, backtracking from dead ends, and trying alternative approaches — behaviors that emerged from GRPO training, not from explicit programming.

The practical implications go beyond any single model. GRPO reduces the barrier to training capable reasoning models along three dimensions:

Cost. Eliminating human labelers, the reward model, and the critic network shrinks the compute budget substantially.

Data independence. Many domains lack large corpora of human-labeled reasoning traces. GRPO only needs a set of questions with verifiable answers. For math, these come from textbooks and competition archives. For code, from any repository with test suites. If you can write a function that checks whether an answer is correct, you can use GRPO.

Small-team viability. A motivated team with access to a modest GPU cluster can now do reinforcement learning on their own models. The barrier is no longer “do you have $100K for human annotators and the compute budget for a critic model.” It becomes: “can you define verification rules for your task?”

Limitations and Risks

GRPO is not a universal solution. Its design carries several constraints and failure modes.

Verifiability requirement. GRPO needs a reward function that returns reliable scores. For creative writing, open-ended advice, or humor — tasks where “good” is subjective — GRPO offers no advantage over RLHF. You cannot write a regex that checks whether a poem is beautiful. GRPO is a complement to DPO and RLHF, not a replacement.

Reward hacking. Models optimized against a verifier learn to exploit it. If the verifier only checks whether the final answer matches the ground truth, the model might hide the correct answer in its output while producing nonsense reasoning. If unit tests have edge cases, the model may overfit to the tests rather than generalizing. DeepSeek used multiple overlapping verifiers — correctness checks, format checks, language consistency checks — to reduce the surface area for hacking. But no verifier is perfect, and GRPO magnifies whatever flaws exist in the reward signal. The model will find loopholes faster than you expect.

Overthinking. GRPO-trained models sometimes produce excessively long reasoning chains on simple questions. The model learns that more thinking correlates with higher scores in training, so it applies the same strategy everywhere. A simple arithmetic question that a human solves in one step may trigger a 500-token reasoning trace, wasting inference compute and annoying users. Some deployments add prompt-level guidance (“think only as much as needed”), but the tendency is not fully solved.

Training instability. Despite the KL penalty, GRPO can diverge if the learning rate or KL coefficient are poorly tuned. DeepSeek’s reported hyperparameters (G=8, ε=0.2, β=0.04) are a starting point, but different model sizes and task domains may require different settings. GRPO is less brittle than pure PPO without a critic, but it is not plug-and-play.

When not to use GRPO. Skip GRPO if your task requires subjective quality judgments, if you cannot construct a reliable verifier, or if your base model already performs well enough that RL fine-tuning adds marginal benefit. GRPO sharpens existing capability rather than creating it from nothing.

FAQ

What is the difference between GRPO and RLHF?

RLHF trains models using human preference labels and a separate reward model. GRPO needs neither — it uses a rule-based verifier and compares the model’s own outputs to a group average. RLHF handles subjective tasks; GRPO handles tasks with objectively verifiable answers. GRPO also eliminates the critic network, reducing memory usage by roughly half.

Why does GRPO use G=8 answers per question?

Eight was chosen empirically in the DeepSeekMath paper. It balances two needs: enough samples for a meaningful group average and standard deviation, and low enough memory cost per training step. Smaller groups (G=4) produce noisier advantage estimates; larger groups (G=16) increase compute cost with diminishing returns. The optimal G likely depends on task difficulty and answer diversity, but G=4 to G=16 is the practical range.

Does GRPO replace supervised fine-tuning?

No. GRPO is applied after supervised fine-tuning. The typical pipeline is: pre-training → SFT on high-quality examples → GRPO to sharpen reasoning. The SFT stage gives the model a basic understanding of task format and domain; GRPO refines its ability to produce correct reasoning chains. Skipping SFT and applying GRPO to a raw pre-trained model produces poor results.

What if my verifier has bugs?

The model will find them. GRPO is an efficient optimizer against whatever reward function you provide. Reward hacking is well-documented across RL systems. Test verifiers thoroughly before training. Use multiple independent verification checks when possible. The model exploits whatever loopholes exist — you will not catch them all, but overlapping verifiers reduce the attack surface.

Is GRPO only for math and code?

Math and code are the most natural fits because they have unambiguous correctness criteria, but GRPO applies to any domain where you can verify answers. This includes logic puzzles, formal proofs, structured data extraction, format-following tasks, game playing, and some scientific reasoning problems. The limiting factor is whether you can write an automatic verifier.

References

Shao, Z., et al. (2024). “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300. — The paper that introduced GRPO.
DeepSeek-AI. (2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv:2501.12948. — The R1 paper, showing GRPO applied at scale.
Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347. — The original PPO paper, whose clipping mechanism GRPO adopts.
Rafailov, R., et al. (2023). “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. — The DPO paper, for comparison.
Lambert, N., et al. (2024). “RLVF: Learning from Verbal Feedback without Overgeneralization.” arXiv. — Context on the broader RLVR (RL with Verifiable Rewards) family of techniques.