
Supervised & Reinforcement Fine-tuning in LLMs

...covered with implementation

Avi Chawla


Today, we are discussing the difference between supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT).

In short, RFT lets you transform any open-source LLM into a reasoning powerhouse without any labeled data. Check this visual👇

Supervised fine-tuning (SFT):

  • Starts with a static labeled dataset of instruction-response pairs.
  • Adjusts the model weights to match these completions.
  • Deploys the best LoRA checkpoint for inference.

Because the dataset is static, the LLM often memorizes answers instead of learning to generalize.
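
To make this concrete, here's a minimal SFT sketch using Hugging Face's TRL library (not the exact setup from the video; the model, dataset, and hyperparameters are illustrative):

```python
# Minimal supervised fine-tuning (SFT) sketch with TRL + LoRA.
# Model, dataset, and hyperparameters are illustrative, not prescriptive.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# A static, labeled dataset of instruction-response pairs.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-checkpoints", num_train_epochs=1),
    # Train a small LoRA adapter instead of all model weights.
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()  # minimizes cross-entropy against the labeled completions
```

The trainer simply pushes the model's weights to reproduce the reference completions, which is exactly why a static dataset invites memorization.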

Reinforcement fine-tuning (RFT):

  • Uses an online reward approach, so no static labels are needed.
  • Lets the model explore different outputs, and a reward function scores them.

Over time, the model learns to generate high-reward answers.
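
For contrast, here's a sketch of the same idea with TRL's GRPOTrainer, a popular open-source RFT implementation (the video uses Predibase instead; the reward function, model, and dataset here are illustrative):

```python
# Reinforcement fine-tuning (RFT) sketch with TRL's GRPOTrainer.
# No gold labels: a programmable reward function scores sampled completions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # prompts only

def reward_format(completions, **kwargs):
    # Toy reward: +1 if the completion wraps its reasoning in <think> tags.
    # Real setups typically also reward answer correctness.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=reward_format,
    args=GRPOConfig(output_dir="rft-checkpoints"),
    train_dataset=dataset,
)
trainer.train()  # explores completions; high-reward ones get reinforced
```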

Once the model has been trained, you can deploy it on open-source inference servers like LoRAX, which can scale to thousands of fine-tuned LLMs on a single GPU.
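
If you're curious what serving looks like, LoRAX exposes a REST endpoint where each request can name the LoRA adapter to apply; a rough sketch (the host, port, and adapter ID are placeholders):

```python
# Querying a running LoRAX server; the adapter_id is a hypothetical LoRA adapter.
import requests

response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is 17 * 24? Think step by step.",
        "parameters": {
            "max_new_tokens": 256,
            "adapter_id": "my-org/qwen2.5-7b-reasoning-lora",  # placeholder adapter
        },
    },
)
print(response.json()["generated_text"])
```

Because the adapter is selected per request, many fine-tuned variants can share one base model on a single GPU.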

We have attached a video below that uses Predibase for RFT to transform Qwen-2.5:7b into a reasoning model.

(Video: 10:41)

Building LLMs

Implementing LLaMA 4 from Scratch

Unlike previous generations, LLaMA 4 doesn’t rely solely on the classic Transformer architecture.

Instead, it uses a Mixture-of-Experts (MoE) approach, activating only a small subset of expert subnetworks per token.

We wrote a detailed, 35-minute article covering the implementation of Llama 4 from scratch (with code) →

Here’s how Mixture-of-Experts (MoE) differs from a regular Transformer model: instead of sending every token through a single dense feed-forward block, a router sends each token to a small subset of expert subnetworks.

The subnetworks allow the model to scale to hundreds of billions of parameters while keeping inference efficient and cost-effective.

But how does that actually work under the hood?

We answer that by building an MoE-based Transformer from scratch.
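
As a taste of the article, here's a stripped-down MoE feed-forward layer in PyTorch (a simplified sketch: real implementations add load-balancing losses and batched expert dispatch):

```python
# Minimal Mixture-of-Experts (MoE) layer: a router picks top-k experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # one score per expert, per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, indices = torch.topk(self.router(x), self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out  # only top_k of num_experts experts ran per token

print(MoELayer()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```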

Read the explainer here →

MODEL OPTIMIZATION

Model compression to optimize models for production

Model accuracy alone (or an equivalent performance metric) rarely determines which model gets deployed.

Much of the engineering effort goes into making the model production-friendly, because the model that ships is never chosen solely on performance (a common misconception).

Instead, we also consider several operational and feasibility metrics, such as:

  • Inference latency: the time taken by the model to return a prediction.
  • Model size: the memory occupied by the model.
  • Ease of scalability, etc.
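
The first two are easy to quantify; here's a quick sketch of how you might measure them for any PyTorch model (the model below is a stand-in):

```python
# Measuring model size and inference latency for a (placeholder) PyTorch model.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()
x = torch.randn(1, 1024)

# Model size: parameter count * bytes per parameter (4 bytes for float32).
size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"Model size: {size_mb:.2f} MB")

# Inference latency: average wall-clock time per prediction, after a warm-up.
with torch.no_grad():
    for _ in range(10):   # warm-up runs
        model(x)
    start = time.perf_counter()
    for _ in range(100):
        model(x)
latency_ms = (time.perf_counter() - start) / 100 * 1000
print(f"Latency: {latency_ms:.2f} ms per prediction")
```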

For instance, consider the image below. It compares the accuracy and size of a large neural network I developed with its pruned (compressed) version:

Looking at these results, wouldn't you strongly prefer deploying the model that is 72% smaller yet (almost) as accurate as the large model?

Of course, this depends on the task, but in most cases, it makes little sense to deploy the large model when a heavily pruned version performs equally well.
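
Pruning, the technique behind the comparison above, is one such compression method. Here's a minimal sketch using PyTorch's built-in pruning utilities (the model and sparsity level are illustrative; note that zeroed weights only shrink memory once stored in a sparse or compressed format):

```python
# Magnitude-based pruning sketch with torch.nn.utils.prune (illustrative model).
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Zero out the 70% of weights with the smallest magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.7)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
zeros = sum((m.weight == 0).sum().item() for m in linears)
total = sum(m.weight.numel() for m in linears)
print(f"Sparsity: {zeros / total:.0%}")  # ~70%
```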

In the article here, we discussed and implemented 6 model compression techniques that ML teams regularly use to save thousands of dollars when running ML models in production.

Learn how to compress models before deployment, with implementation →

Published on Apr 3, 2025