[Hands-on] Build Your Reasoning LLM

100% local.



Reinforcement Fine-tuning (RFT) allows you to transform any open-source LLM into a reasoning powerhouse.

No labeled data is needed.

Today, we'll use ​Predibase​ for RFT to transform Qwen-2.5:7b into a reasoning model.

We have attached a video below if you prefer watching instead of reading. The code is linked later in the issue.


Let's begin!

Fine-tuning techniques

SFT and RFT

Before diving into RFT, it's crucial to understand how we usually fine-tune LLMs using supervised fine-tuning (SFT).

SFT process:

  • It starts with a static, labeled dataset of prompt–completion pairs.
  • The model's weights are adjusted to match these completions.
  • The best model (LoRA checkpoint) is then deployed for inference.

RFT process:

  • RFT uses an online “reward” approach—no static labels required.
  • The model explores different outputs, and a Reward Function scores their correctness.
  • Over time, the model learns to generate higher-reward answers via Group Relative Policy Optimization (GRPO).

SFT uses static data and often memorizes answers. RFT, being online, learns from rewards and explores new strategies.
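As a rough illustration of what "learns from rewards" means under GRPO (this sketch is not from the notebook): the trainer samples a group of completions for the same prompt, scores each one with the reward functions, and normalizes every reward against the group's mean and standard deviation, so completions that beat the group average get reinforced:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: how much better each sampled completion
    scored than the average of its group (in standard deviations)."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean_r) / std_r for r in rewards]

# Four completions for one prompt, scored by the reward functions
print(group_relative_advantages([0.0, 1.0, 1.0, 2.0]))
# -> roughly [-1.41, 0.0, 0.0, 1.41]
```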

Implementation

Fine-tuning with RFT

Now, let's turn Qwen-2.5 into a reasoning LLM.

Installation and setup

First, we install all the necessary dependencies and complete the setup (you need a ​Predibase API key​):
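The notebook has the exact cells, but the setup boils down to installing the Predibase SDK (plus Hugging Face datasets for the data) and creating a client with your API key; the environment-variable name below is just an assumption for illustration:

```python
# pip install -U predibase datasets

import os
from predibase import Predibase

# Assumes the Predibase API key is exported as PREDIBASE_API_TOKEN
pb = Predibase(api_token=os.environ["PREDIBASE_API_TOKEN"])
```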

Training Objective

We'll fine-tune our model using the Countdown dataset, a popular resource for evaluating and enhancing the reasoning and math capabilities of an LLM.

The task is described to the model via a prompt template in the code: given a few numbers and a target value, produce an equation that uses each number exactly once and evaluates to the target.
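For reference, a Countdown-style prompt template looks roughly like the following (the exact wording in the notebook may differ); it asks the model to reason inside <think> tags and place the final equation inside <answer> tags:

```python
# Illustrative template; the notebook's wording may differ
PROMPT_TEMPLATE = """Using the numbers {nums}, create an equation that equals {target}.
You may use each number exactly once, and only +, -, *, / and parentheses.
Reason step by step inside <think> </think> tags, then give the final equation
inside <answer> </answer> tags, e.g. <answer>(1 + 2) / 3</answer>."""
```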

Load dataset

Next, we load the dataset for fine-tuning and apply the prompt template to each record so it can be fed to the LLM.
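A minimal sketch of this step, assuming the data comes from the Hugging Face Hub (Jiayi-Pan/Countdown-Tasks-3to4 is a commonly used copy; the notebook may load a different one) and is formatted with the template above:

```python
from datasets import load_dataset

# Assumed dataset id; each record holds a list of numbers ("nums") and a "target"
dataset = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train")

def add_prompt(example):
    example["prompt"] = PROMPT_TEMPLATE.format(
        nums=example["nums"], target=example["target"]
    )
    return example

dataset = dataset.map(add_prompt)
print(dataset[0]["prompt"])
```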

Define Reward functions

#1) Format reward function:

This checks whether the model’s output contains a <think> block and an <answer> block in the correct order.

If it does, the model scores +1. Otherwise, it’s zero.
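A sketch of what this could look like as a plain Python function over the generated completion (Predibase's actual reward-function signature may pass additional arguments, such as the example row):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion contains a <think> block followed by an
    <answer> block, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0
```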

#2) Equation reward function:

We parse the model's final equation and check that it uses each provided number exactly once, contains only valid math symbols, and evaluates to the target.

If everything checks out, the reward is +1.
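A sketch of the same idea: extract the expression from the <answer> tags, validate the character set and number usage, then evaluate it against the target (again, the real signature in the notebook may differ):

```python
import re

def equation_reward(completion: str, nums: list[int], target: int) -> float:
    """1.0 if the <answer> equation uses each number exactly once, contains
    only valid math symbols, and evaluates to the target; else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not match:
        return 0.0
    equation = match.group(1).strip()

    # Only digits, operators, parentheses, dots, and spaces are allowed
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0

    # Every provided number must appear exactly once
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(nums):
        return 0.0

    try:
        result = eval(equation)  # character set restricted above
    except Exception:
        return 0.0
    return 1.0 if abs(result - target) < 1e-6 else 0.0
```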

Fine-Tuning Setup

We set up a GRPO-based fine-tuning job with two reward functions—one checks the format, and the other verifies the math.

We also specify target modules for LoRA fine-tuning.
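The exact SDK calls are best copied from the linked notebook; as a rough sketch of the shape of this step (the class and argument names below, e.g. GRPOConfig, reward_fns, target_modules, and pb.adapters.create, are assumptions based on the Predibase SDK and may not match the current API exactly):

```python
# Illustrative only: names below are assumed, not verified against the current SDK
from predibase import GRPOConfig, RewardFunctionsConfig

config = GRPOConfig(
    base_model="qwen2-5-7b-instruct",                 # assumed model identifier
    reward_fns=RewardFunctionsConfig(
        functions={
            "format": format_reward,                  # checks <think>/<answer> structure
            "answer": equation_reward,                # verifies the math
        }
    ),
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LoRA target modules
)

adapter = pb.adapters.create(
    config=config,
    dataset=dataset,                 # the prompt-formatted Countdown data
    repo="countdown-reasoning",      # placeholder adapter repo name
    description="GRPO fine-tune of Qwen-2.5-7B on Countdown",
)
```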

And we're done!

That’s all we need to fine-tune the LLM.


Once this has been trained, we can test the model.

Below, we have asked it to create an equation with the numbers 15, 74, and 9 that evaluates to 50 (one valid solution is 74 - 15 - 9 = 50), and it produces the correct response:
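Querying the fine-tuned adapter might look roughly like this; the deployment name and adapter id are placeholders, and the generate() call follows the LoRAX-style client exposed by the Predibase SDK (check the notebook for the exact call):

```python
# Placeholder deployment/adapter names -- substitute your own
client = pb.deployments.client("qwen2-5-7b-instruct")

prompt = PROMPT_TEMPLATE.format(nums=[15, 74, 9], target=50)
response = client.generate(prompt, adapter_id="countdown-reasoning/1", max_new_tokens=512)
print(response.generated_text)
```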

On the flip side, when we ask Llama3.2 the same question, it is not able to answer it:


You can also find the model we just trained; it's available on HuggingFace:

You can find all the code for this demo in this ​Colab notebook →​

We'll leave you with a guide on how to choose the right fine-tuning method:

Thanks for reading!

THAT'S A WRAP

No-Fluff Industry ML Resources to Succeed in DS/ML Roles

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have covered several other topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.


