

TODAY’S DAILY DOSE OF DATA SCIENCE
Build a Reasoning Model Like DeepSeek-R1
If you have used DeepSeek-R1 (or any other reasoning model), you have likely noticed that it autonomously allocates thinking time before producing a response.
Today, let’s learn how to embed reasoning capabilities into any LLM.
We'll train our own reasoning model like DeepSeek-R1 (code is provided later in the issue).

To do this, we'll use:
- UnslothAI for efficient fine-tuning.
- Llama 3.1-8B as the LLM to add reasoning capabilities to.
Let’s implement this.
1) Load the model
We start by loading the Llama 3.1-8B model and the tokenizer using Unsloth.
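Here's a minimal sketch of this step (the exact checkpoint name and sequence length below are my assumptions; adjust them to your setup):

```python
# Load the base model and tokenizer with Unsloth.
# NOTE: checkpoint name and max_seq_length are assumptions; adjust as needed.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,   # context window used during fine-tuning
    load_in_4bit=True,     # 4-bit quantization so it fits on a single GPU
)
```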

You can use any other open-weight LLM here.
2) Define LoRA config
We use parameter-efficient techniques like LoRA to avoid fine-tuning all of the model's weights.

In this code, we use Unsloth's PEFT by specifying:
- The model
- The LoRA rank (r)
- The target modules to fine-tune
- and a few more parameters (see the sketch after this list).
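Roughly, it looks like this (the rank, alpha, and module list below are typical values, i.e., assumptions, not the only valid choices):

```python
# Wrap the base model with LoRA adapters via Unsloth's PEFT helper.
# NOTE: r, lora_alpha, and target_modules are common defaults (assumptions).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                        # LoRA rank (low-rank dimension)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_alpha=16,                               # LoRA scaling factor
    use_gradient_checkpointing="unsloth",        # cuts activation memory
    random_state=3407,
)
```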
3) Create the dataset
We load GSM8K (a grade-school math word problem dataset) and format its prompts for reasoning, as sketched below.

Each sample includes:
- A system prompt enforcing structured reasoning
- A question from the dataset
- The answer in the required format
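A minimal sketch of this step (the exact system prompt wording is an assumption; any prompt that enforces the <reasoning>/<answer> tags works):

```python
from datasets import load_dataset

# System prompt that enforces the structured reasoning format (wording assumed).
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_answer(text: str) -> str:
    # GSM8K stores the final numeric answer after "####".
    return text.split("####")[1].strip()

def format_example(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": extract_answer(example["answer"]),
    }

dataset = load_dataset("openai/gsm8k", "main", split="train").map(format_example)
```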
4) Define reward functions
To guide fine-tuning, we define reward functions for:

- Correctness of answers
- Integer formatting
- Strict/soft format adherence
- XML structure compliance
These help reinforce structured reasoning!
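Here's a sketch of three such rewards (the reward magnitudes and helper names are my assumptions; TRL passes dataset columns like answer to reward functions as keyword arguments):

```python
import re

def extract_xml_answer(text: str) -> str:
    # Pull out whatever sits between <answer> ... </answer>.
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def correctness_reward(prompts, completions, answer, **kwargs):
    # Largest reward: the extracted answer matches the reference answer.
    responses = [c[0]["content"] for c in completions]
    return [2.0 if extract_xml_answer(r) == a else 0.0
            for r, a in zip(responses, answer)]

def int_reward(completions, **kwargs):
    # Small reward when the answer is a bare integer (as GSM8K expects).
    responses = [c[0]["content"] for c in completions]
    return [0.5 if extract_xml_answer(r).isdigit() else 0.0 for r in responses]

def format_reward(completions, **kwargs):
    # Reward adherence to the <reasoning>/<answer> XML structure.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [c[0]["content"] for c in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]
```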
5) Use GRPO
We employ GRPO (Group Relative Policy Optimization), an RL method, to enhance reasoning. GRPO improves model performance without the separate value (critic) model that PPO requires: it samples a group of responses per prompt and scores each one against the group average.
If you don't understand GRPO or PPO, don't worry. We shall cover them soon. For now, just understand that these are reinforcement learning algorithms used to optimize decision-making policies (LLMs, in this case).

The setup (sketched below) includes:
- Training config (LR, optimizer, steps)
- Reward functions
- Trainer execution
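A minimal sketch using TRL's GRPOTrainer (the hyperparameters below are illustrative assumptions, not tuned values):

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-6,
    optim="adamw_8bit",              # memory-efficient optimizer
    per_device_train_batch_size=6,
    num_generations=6,               # group size: responses sampled per prompt
    max_prompt_length=256,
    max_completion_length=256,
    max_steps=250,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward, int_reward, format_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```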
Done!
Comparison
Before fine-tuning, Llama 3.1 struggles with numerical reasoning and provides incorrect answers.
After applying GRPO, the model not only gives the correct answer but also explains its reasoning.

In this case, of course, the reasoning is a little flawed: the model says "9 is greater than 11" when it likely meant "90 is greater than 11". But also keep in mind that GRPO takes time, and I only ran the training script for 2 hours.
So this is expected, and it will improve with further training.
If you don’t know about LLM fine-tuning and want to learn, here’s some further reading:
- We implemented LoRA for fine-tuning LLMs from scratch here →
- LoRA has several efficient variants. We covered them here →
- We implemented DoRA (an improved LoRA) from scratch here →
You can find the code for today’s issue in our AI Engineering Hub repo here →
Thanks for reading!
IN CASE YOU MISSED IT
LoRA/QLoRA—Explained From a Business Lens
Consider the size difference between BERT-large and GPT-3:
I have fine-tuned BERT-large several times on a single GPU using traditional fine-tuning:

But this is impossible with GPT-3, which has 175B parameters. That's 350GB of memory just to store model weights under float16 precision.
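The arithmetic is straightforward (weights only, ignoring optimizer states and activations):

```python
# Weights only: 175B parameters at float16 (2 bytes per parameter).
params = 175e9
print(f"{params * 2 / 1e9:.0f} GB")  # -> 350 GB
```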
This means that if OpenAI used traditional fine-tuning within its fine-tuning API, it would have to maintain one model copy per user:
- If 10 users fine-tuned GPT-3 → they need 3500 GB to store model weights.
- If 1000 users fine-tuned GPT-3 → they need 350k GB to store model weights.
- If 100k users fine-tuned GPT-3 → they need 35 million GB to store model weights.
And the problems don't end there:
- OpenAI bills solely based on usage. What if someone fine-tunes the model for fun or learning purposes but never uses it?
- Since a request can come anytime, should they always keep the fine-tuned model loaded in memory? Wouldn't that waste resources since several models may never be used?
LoRA (+ QLoRA and other variants) neatly solved this critical business problem.
THAT'S A WRAP
No-Fluff Industry ML Resources to Succeed in DS/ML Roles

At the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Drive revenue?
- Can you scale ML models?
- Predict trends before they happen?
We have discussed several other topics (with implementations) in the past that align with these goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees, and increase trust in your models, using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 600k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.