3 Techniques to Train An LLM Using Another LLM

Techniques used in DeepSeek, Llama 4, and Gemma.

LLMs don't just learn from raw text; they also learn from each other:

  • Llama 4 Scout and Maverick were trained using Llama 4 Behemoth.
  • Gemma 2 and 3 were trained using Google's proprietary Gemini.

Distillation makes this possible, and below, we break down three popular techniques.

The idea is to transfer "knowledge" from one LLM to another, a practice that has long been common in traditional deep learning (like we discussed here).

Distillation in LLMs can happen at two stages:

1) Pre-training:

  • Train the bigger Teacher LLM and the smaller Student LLM together.
  • Llama 4 did this.

2) Post-training:

  • Train the bigger Teacher LLM first, then distill its knowledge into the smaller Student LLM.
  • DeepSeek did this by distilling DeepSeek-R1 into Qwen and Llama 3.1 models.

You can also apply distillation during both stages, which Gemma 3 did.

Here are the three commonly used distillation techniques:

1) Soft-label distillation:

  • Use a fixed, pre-trained Teacher LLM to generate softmax probabilities (over the full vocabulary) for every token in the corpus.
  • Pass this data through the untrained Student LLM as well to get its softmax probabilities.
  • Train the Student LLM to match the Teacher's probabilities.
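
Here's a minimal sketch of this objective, assuming PyTorch; the teacher/student models and the temperature value are placeholders, since the post doesn't prescribe specifics:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the Teacher's and Student's token distributions."""
    # Soften both distributions with a temperature
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as in Hinton et al. (2015)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature**2

# Usage: the Teacher stays frozen; only the Student is updated
# with torch.no_grad():
#     teacher_logits = teacher(input_ids)   # (batch, seq, vocab)
# student_logits = student(input_ids)
# loss = soft_label_loss(student_logits, teacher_logits)
```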

Since the Student sees the Teacher's full probability distribution over the vocabulary (not just the top token), this transfers the maximum amount of knowledge (or reasoning) per training example.

However, you need access to the Teacher's weights (or at least its output logits) to obtain that distribution.

Even if you have access, there's another problem!

Say your vocab size is 100k tokens and your data corpus is 5 trillion tokens.

Since we store a softmax distribution over the entire vocabulary for every input token, you would need 500 million GB of memory to hold the soft labels, even at float8 precision.
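
A quick sanity check on that number:

```python
vocab_size = 100_000                  # probabilities stored per token
corpus_tokens = 5 * 10**12            # 5 trillion tokens
bytes_per_value = 1                   # float8 = 1 byte per probability

total_gb = corpus_tokens * vocab_size * bytes_per_value / 10**9
print(f"{total_gb:,.0f} GB")          # 500,000,000 GB
```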

The second technique solves this.

2) Hard-label distillation:

  • Use a fixed, pre-trained Teacher LLM to generate just the final one-hot output tokens.
  • Pass the same data through the untrained Student LLM to get its softmax probabilities.
  • Train the Student LLM to predict the Teacher's output tokens (the hard labels).

This is the approach DeepSeek used when distilling DeepSeek-R1 into Qwen and Llama 3.1 models.
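
A minimal sketch of the hard-label objective, again assuming PyTorch; `teacher.generate` below stands in for whatever decoding API the Teacher exposes:

```python
import torch
import torch.nn.functional as F

def hard_label_loss(student_logits, teacher_tokens):
    """Standard cross-entropy against the tokens the Teacher generated."""
    # student_logits: (batch, seq, vocab); teacher_tokens: (batch, seq)
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_tokens.view(-1),
    )

# Usage: the frozen Teacher generates targets once, offline, so only its
# sampled tokens (not its full distributions) need to be stored.
# teacher_tokens = teacher.generate(prompts)          # hypothetical API
# student_logits = student(teacher_tokens[:, :-1])    # next-token setup
# loss = hard_label_loss(student_logits, teacher_tokens[:, 1:])
```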

3) Co-distillation:

  • Start with an untrained Teacher LLM and an untrained Student LLM.
  • Generate softmax probabilities over the current batch from both models.
  • Train the Teacher LLM as usual on the ground-truth hard labels.
  • Train the Student LLM to match its softmax probabilities to the Teacher's.

Llama 4 did this to train Llama 4 Scout and Maverick from Llama 4 Behemoth.

Of course, during the initial stages of training, the Teacher LLM's soft labels won't be accurate.

That is why the Student LLM is trained on both the Teacher's soft labels and the ground-truth hard labels, as sketched below.
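
Here's a minimal sketch of one co-distillation step, assuming PyTorch; the `alpha` weighting and temperature `T` are assumptions, since the post only says the Student mixes the two losses:

```python
import torch
import torch.nn.functional as F

def co_distillation_step(teacher, student, input_ids, labels, alpha=0.5, T=2.0):
    """Compute losses for one step where Teacher and Student train together."""
    teacher_logits = teacher(input_ids)
    student_logits = student(input_ids)

    # The Teacher trains as usual on the ground-truth hard labels
    teacher_loss = F.cross_entropy(
        teacher_logits.view(-1, teacher_logits.size(-1)), labels.view(-1))

    # The Student mixes ground-truth cross-entropy with a KL term
    # that pulls its distribution toward the Teacher's
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),  # stop-gradient
        reduction="batchmean") * T**2
    student_loss = alpha * ce + (1 - alpha) * kl

    # (teacher_loss + student_loss).backward() updates both models; the
    # detach() keeps the Student's loss from updating the Teacher.
    return teacher_loss, student_loss
```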

👉 Over to you: Which technique do you find the most promising?

Hands-on Demo

Building a 100% Local MCP Client

An MCP client is a component within an AI application (like Cursor) that establishes standardized connections to external tools and data sources via the Model Context Protocol (MCP).

Recently, we built one 100% locally and explained the process in this video:

Tech stack:

  • LlamaIndex to build the MCP-powered Agent.
  • Ollama to locally serve DeepSeek-R1.
  • LightningAI for development and hosting.

Here's our workflow:

  • The user submits a query.
  • The agent connects to the MCP server to discover the available tools.
  • Based on the query, the agent invokes the right tool and retrieves context.
  • The agent returns a context-aware response.
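
For reference, here's a minimal sketch of that workflow; the module paths follow LlamaIndex's MCP integration, but the server URL, model tag, and prompts are assumptions, so treat the linked code as the source of truth:

```python
import asyncio
from llama_index.llms.ollama import Ollama
from llama_index.tools.mcp import BasicMCPClient, McpToolSpec
from llama_index.core.agent.workflow import FunctionAgent

async def main():
    # Connect to a local MCP server and discover the tools it exposes
    mcp_client = BasicMCPClient("http://127.0.0.1:8000/sse")  # assumed endpoint
    tools = await McpToolSpec(client=mcp_client).to_tool_list_async()

    # A locally served DeepSeek-R1 (via Ollama) powers the agent
    agent = FunctionAgent(
        tools=tools,
        llm=Ollama(model="deepseek-r1"),
        system_prompt="Use the available MCP tools to answer the query.",
    )

    # The agent picks the right tool, gathers context, and answers
    response = await agent.run("What's in the database?")  # example query
    print(response)

asyncio.run(main())
```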

You can find the code here →

THAT'S A WRAP

No-Fluff Industry ML Resources to Succeed in DS/ML Roles

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have discussed several topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.

Get in touch today →


Join the Daily Dose of Data Science Today!

A daily column with insights, observations, tutorials, and best practices on data science.

Get Started!