

TODAY'S ISSUE
hands-on
Build a Reasoning LLM using GRPO
Group Relative Policy Optimization is a reinforcement learning method that fine-tunes LLMs for math and reasoning tasks using deterministic reward functions, eliminating the need for labeled data.
Here's a brief overview of GRPO:

- Start with a dataset and add a reasoning-focused system prompt (e.g., “Think step by step…”).
- The LLM generates multiple candidate responses using a sampling engine.
- Each response is scored by one or more reward functions, and the rewards are aggregated into a single score per generated response.
- A GRPO loss uses these scores, measured relative to the group average, to compute gradients; backpropagation updates the LLM, and its reasoning ability improves over time (a small sketch of this group-relative scoring follows this list).
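To make the "group relative" part concrete, here is a minimal sketch (plain PyTorch, with illustrative numbers that are not from the notebook) of how the rewards for a group of sampled responses are turned into relative advantages:

```python
import torch

# Hypothetical rewards assigned to a group of 4 responses sampled for the
# same prompt (values are purely illustrative).
rewards = torch.tensor([2.0, 0.5, 3.0, 0.5])

# GRPO scores each response *relative to its group*: subtract the group mean
# and scale by the group's standard deviation (epsilon avoids division by zero).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

print(advantages)  # responses above the group average get positive advantages
```

Responses with positive advantages are reinforced; those below the group average are pushed down, which is what lets GRPO learn without a separate value model.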
Let’s dive into the code to see how we can use GRPO to turn any model into a reasoning powerhouse without any labeled data or human intervention.
We’ll use:
- UnslothAI for efficient fine-tuning.
- HuggingFace TRL to apply GRPO.
The code is available here: Build a reasoning LLM from scratch using GRPO. You can run it without any installations by reproducing our environment below:

Let’s begin!
Load the model
We start by loading Qwen3-4B-Base and its tokenizer using Unsloth.
You can use any other open-weight LLM here.
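A minimal sketch of this step with Unsloth (the exact model ID, sequence length, and 4-bit flag below are assumptions; adjust them to your hardware):

```python
from unsloth import FastLanguageModel

# Load the base model and tokenizer; 4-bit loading keeps GPU memory usage low.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Base",  # assumed model ID; any open-weight LLM works
    max_seq_length=2048,                 # assumed context length for training
    load_in_4bit=True,
)
```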

Define LoRA config
We'll use LoRA so we don't have to fine-tune all of the model's weights. In this code, we configure Unsloth's PEFT wrapper (sketched right after this list) by specifying:

- The model
- The LoRA rank (r)
- The modules to fine-tune, etc.
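A sketch of that PEFT setup under assumed hyperparameters (the rank, alpha, and target modules shown are common defaults, not necessarily the notebook's exact values):

```python
# Wrap the base model with LoRA adapters via Unsloth's PEFT helper.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,             # LoRA rank: size of the low-rank update matrices
    lora_alpha=16,    # scaling factor applied to the LoRA updates
    target_modules=[  # attention and MLP projections to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # trades compute for lower memory
)
```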
Create the dataset
We load the Open R1 Math dataset (a math problem dataset) and format it for reasoning.

Each sample includes the following (sketched after this list):
- A system prompt enforcing structured reasoning
- A question from the dataset
- The answer in the required format
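A hedged sketch of this step (the dataset ID, column names, and prompt wording below are assumptions; check the notebook for the exact ones):

```python
from datasets import load_dataset

# System prompt that forces the model to separate its reasoning from the answer.
SYSTEM_PROMPT = (
    "Think step by step between <reasoning> and </reasoning> tags, "
    "then give the final answer between <answer> and </answer> tags."
)

# Assumed dataset ID and column names for the Open R1 Math data.
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")

def to_grpo_format(sample):
    # GRPOTrainer expects a "prompt" column; we keep "answer" for the reward functions.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sample["problem"]},
        ],
        "answer": sample["answer"],
    }

dataset = dataset.map(to_grpo_format)
```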
Define reward functions
In GRPO, we use deterministic functions to validate each response and assign a reward. No manual labeling required!
The reward functions (two of which are sketched after this list):

- Match format exactly
- Match format approximately
- Check the answer
- Check numbers
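Here is a hedged sketch of two of these (exact-format matching and answer checking). The function signatures follow TRL's convention of receiving the sampled completions plus dataset columns as keyword arguments and returning one score per completion; the tag names and reward values are illustrative assumptions:

```python
import re

FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)

def match_format_exactly(completions, **kwargs):
    """Reward completions that follow the <reasoning>/<answer> structure."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]  # conversational format: list of messages
        scores.append(3.0 if FORMAT_RE.search(text) else 0.0)
    return scores

def check_answer(completions, answer, **kwargs):
    """Reward completions whose extracted answer matches the ground truth."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        text = completion[0]["content"]
        match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
        guess = match.group(1).strip() if match else None
        scores.append(3.0 if guess == str(true_answer).strip() else 0.0)
    return scores
```

The approximate-format and number-checking rewards work the same way, just with looser regexes and numeric comparison instead of exact string matching.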
Use GRPO and start training
Now that we have the dataset and reward functions ready, it's time to apply GRPO.
HuggingFace TRL provides everything we described in the GRPO diagram, out of the box, in the form of the GRPOConfig and GRPOTrainer.
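A minimal sketch of wiring this together (hyperparameter values are assumptions; the reward functions refer to the sketches above):

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    num_generations=8,           # candidate responses sampled per prompt
    max_prompt_length=512,
    max_completion_length=1024,
    max_steps=300,
    output_dir="qwen3-4b-grpo",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[match_format_exactly, check_answer],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

During training, the trainer samples `num_generations` completions per prompt, scores them with the reward functions, and updates the LoRA weights with the GRPO loss.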

Comparison
We can see how GRPO turned a base model into a reasoning powerhouse:
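One way to reproduce such a comparison yourself is to prompt the fine-tuned model with the same reasoning system prompt used during training. A sketch, assuming a chat template has been set on the tokenizer (Unsloth's notebooks typically configure one) and using an illustrative question:

```python
# Build a chat-formatted prompt and generate with the fine-tuned model.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is 13 * 17?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

After training, the model should produce the structured <reasoning>/<answer> output that the rewards encouraged, which a base model typically does not.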

Before we conclude, let’s address an important question:
When should you use reinforcement fine-tuning (RFT) versus supervised fine-tuning (SFT)?
We created this diagram to provide an answer:

Finally, we'll leave you with an overview of the GRPO process.

Let us know what other techniques you have used for fine-tuning LLMs.

Thanks for reading!
ROADMAP
From local ML to production ML
Once a model has been trained, we move to productionizing and deploying it.
If ideas related to production and deployment intimidate you, here’s a quick roadmap for you to upskill (assuming you know how to train a model):
- First, you would have to compress the model and productionize it. Read these guides:
- Reduce its size with Model Compression techniques.
- Supercharge PyTorch Models With TorchScript.
- If you use sklearn, learn how to optimize sklearn models with tensor operations.
- Next, you move to deployment. Here's a beginner-friendly, hands-on guide that teaches you how to deploy a model, manage dependencies, set up a model registry, etc.
- Although you would have tested the model locally, it is still wise to test it in production. There are risk-free (or low-risk) methods to do that. Learn what they are and how to implement them here.
This roadmap should set you up pretty well, even if you have NEVER deployed a single model before, since everything is practical and implementation-driven.
THAT'S A WRAP
No-Fluff Industry ML resources to
Succeed in DS/ML roles

At the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Drive revenue?
- Can you scale ML models?
- Predict trends before they happen?
We have covered several other topics (with implementations) in the past that align with these goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 600k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.