

TODAY'S ISSUE
TODAY’S DAILY DOSE OF DATA SCIENCE
[Hands-on] Build Your Reasoning LLM
Reinforcement Fine-tuning (RFT) allows you to transform any open-source LLM into a reasoning powerhouse.
No labeled data is needed.
Today, we'll use Predibase for RFT to transform Qwen-2.5:7b into a reasoning model.
We have attached a video below if you prefer watching instead of reading. The code is linked later in the issue.
Let's begin!
Fine-tuning techniques
SFT and RFT
Before diving into RFT, it's crucial to understand how LLMs are usually fine-tuned with supervised fine-tuning (SFT).
SFT process:

- Start with a static, labeled dataset of prompt–completion pairs.
- Adjust the model weights so its outputs match these completions.
- Deploy the best model (LoRA checkpoint) for inference.
RFT process:

- RFT uses an online “reward” approach—no static labels required.
- The model explores different outputs, and a Reward Function scores their correctness.
- Over time, the model learns to generate higher-reward answers using GRPO (Group Relative Policy Optimization).
SFT uses static data and often memorizes answers. RFT, being online, learns from rewards and explores new strategies.
Implementation
Fine-tuning with RFT
Now, let's turn Qwen-2.5 into a reasoning LLM.
Installation and setup
First, we install all the necessary dependencies and complete the setup (you need a Predibase API key):
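A minimal setup sketch, assuming the predibase Python package and the client initialization from its quickstart, with the API key read from an environment variable chosen here for illustration:

```python
# Install the dependencies (run once):
#   pip install -U predibase datasets

import os
from predibase import Predibase

# Authenticate with your Predibase API key; we assume it is stored in the
# PREDIBASE_API_TOKEN environment variable (any secure source works).
pb = Predibase(api_token=os.environ["PREDIBASE_API_TOKEN"])
```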

Training Objective
We'll fine-tune our model on the Countdown dataset, a popular resource for evaluating and enhancing the reasoning and math capabilities of an LLM. In each Countdown example, the model receives a set of numbers and a target value, and it must combine every number exactly once with basic arithmetic to reach that target.
This objective is passed to the model as a prompt template in the code.
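Here is a representative prompt template for the Countdown task (the exact wording in the original code may differ). It asks the model to reason inside <think> tags and to place only the final equation inside <answer> tags, which is exactly what the reward functions below check for:

```python
# Illustrative Countdown prompt template; the demo's actual wording may differ.
PROMPT_TEMPLATE = (
    "Using the numbers {numbers}, create an equation that equals {target}.\n"
    "You may use +, -, *, / and each number exactly once.\n"
    "Show your reasoning inside <think> </think> tags.\n"
    "Return only the final equation inside <answer> </answer> tags, "
    "for example <answer> (1 + 2) / 3 </answer>."
)

def build_prompt(numbers, target):
    """Fill the template for a single Countdown example."""
    return PROMPT_TEMPLATE.format(numbers=numbers, target=target)
```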

Load dataset
Next, we load the dataset for fine-tuning and attach the prompt template to each record so it can be fed to the LLM.
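A sketch of this step, assuming the data comes from the Hugging Face Hub (Jiayi-Pan/Countdown-Tasks-3to4 is a commonly used copy with nums and target columns; the original demo may load it from a different source):

```python
from datasets import load_dataset

# Assumption: a public Countdown dataset with "nums" and "target" columns.
dataset = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train")

def add_prompt(example):
    # Attach the prompt template defined above to every record.
    example["prompt"] = build_prompt(example["nums"], example["target"])
    return example

dataset = dataset.map(add_prompt)
print(dataset[0]["prompt"])
```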

Define Reward functions
#1) Format reward function:

This checks whether the model's output contains a <think> block and an <answer> block in the correct order.
If it does, the model scores +1. Otherwise, it’s zero.
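A minimal sketch of such a format reward (the function name and regex are ours):

```python
import re

def format_reward(completion, **kwargs):
    """Return 1.0 if the completion contains a <think> block followed by an
    <answer> block, in that order; otherwise 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0
```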
#2) Equation reward function:

We parse the model's final equation and check that it uses every provided number exactly once, contains only valid math symbols, and evaluates to the target.
If everything checks out, the reward is +1.
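And a sketch of the equation reward under the same assumptions, taking the completion along with the example's nums and target:

```python
import re

def equation_reward(completion, nums, target, **kwargs):
    """Return 1.0 if the equation inside <answer> uses every provided number
    exactly once, contains only valid math symbols, and evaluates to the
    target; otherwise 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not match:
        return 0.0
    equation = match.group(1).strip()

    # Allow only digits, the four basic operators, parentheses, dots, spaces.
    if not re.fullmatch(r"[\d+\-*/().\s]+", equation):
        return 0.0

    # Every given number must be used exactly once (order does not matter).
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(nums):
        return 0.0

    try:
        result = eval(equation, {"__builtins__": None}, {})
    except Exception:
        return 0.0
    return 1.0 if abs(result - target) < 1e-6 else 0.0
```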
Fine-Tuning Setup

We set up a GRPO-based fine-tuning job with two reward functions—one checks the format, and the other verifies the math.
We also specify target modules for LoRA fine-tuning.
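The demo launches this job through the Predibase SDK, whose exact calls vary by version, so here is a roughly equivalent open-source sketch using TRL's GRPOTrainer with LoRA instead; the hyperparameters, target modules, and adapter wrapper below are illustrative, not the demo's exact configuration:

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def batched(reward_fn, per_example=False):
    """Adapt the per-completion reward functions above to TRL's batched
    interface: lists in, one float per completion out."""
    def fn(completions, nums=None, target=None, **kwargs):
        if per_example:
            return [reward_fn(c, n, t) for c, n, t in zip(completions, nums, target)]
        return [reward_fn(c) for c in completions]
    fn.__name__ = reward_fn.__name__  # so each reward is logged under its own name
    return fn

training_args = GRPOConfig(
    output_dir="qwen2.5-7b-countdown-grpo",
    learning_rate=1e-5,
    num_generations=8,            # completions sampled per prompt for GRPO
    max_completion_length=512,
    per_device_train_batch_size=8,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # assumed base checkpoint for Qwen-2.5 7B
    reward_funcs=[batched(format_reward),
                  batched(equation_reward, per_example=True)],
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()
```

GRPO compares the completions sampled for the same prompt against one another and nudges the model toward the higher-reward ones, which is why no labeled completions are needed.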
And we're done!
That’s all we need to fine-tune the LLM.
Once this has been trained, we can test the model.
Below, we have asked it to create an equation with the numbers 15, 74, and 9 that evaluates to 50, and it produces the correct response:

On the flip side, when we ask Llama 3.2 the same question, it is not able to answer it:

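If you want to sanity-check a response like this yourself, the reward functions sketched above can score any candidate output (the completion below is written by hand for illustration, not the model's verbatim answer):

```python
# Score a hand-written candidate for numbers [15, 74, 9] and target 50.
response = ("<think>74 minus 15 is 59, and 59 minus 9 is 50.</think>"
            "<answer>74 - 15 - 9</answer>")
print(format_reward(response))                                 # 1.0
print(equation_reward(response, nums=[15, 74, 9], target=50))  # 1.0
```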
You can also find the model we just trained; it's available on HuggingFace:

You can find all the code for this demo in this Colab notebook →
We'll leave you with a guide on how to choose the right fine-tuning method:

Thanks for reading!
THAT'S A WRAP
No-Fluff Industry ML resources to succeed in DS/ML roles

At the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Drive revenue?
- Can you scale ML models?
- Predict trends before they happen?
We have discussed several other topics (with implementations) in the past that align with these goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 600k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.