4 Strategies for Multi-GPU Training

TODAY’S DAILY DOSE OF DATA SCIENCE

By default, training a deep learning model uses only a single GPU, even if multiple GPUs are available.

A better approach is to distribute the training workload across multiple GPUs.

The graphic below depicts four common strategies for multi-GPU training:

We covered multi-GPU training in detail with implementation here: A Beginner-friendly Guide to Multi-GPU Model Training.

Let’s discuss these four strategies below:

#1) Model parallelism

Different parts (or layers) of the model are placed on different GPUs.
Useful for huge models that do not fit on a single GPU.
However, model parallelism can introduce severe communication bottlenecks, because the activations computed on one GPU must be transferred to the next GPU before it can continue.
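
To make the layer placement concrete, here is a minimal sketch in plain Python. It is a toy simulation, not a real multi-GPU implementation: "gpu0" and "gpu1" are only labels, and `Stage` and `run_model_parallel` are hypothetical names introduced for illustration.

```python
# Toy simulation of model parallelism in plain Python (no real GPUs involved).
# "gpu0" and "gpu1" are labels only; Stage and run_model_parallel are hypothetical names.

class Stage:
    """A group of consecutive layers that lives on one device."""

    def __init__(self, device, layers):
        self.device = device
        self.layers = layers  # list of functions applied in order

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def run_model_parallel(stages, x):
    """Run x through each stage in turn and count device-to-device transfers."""
    transfers = 0
    for i, stage in enumerate(stages):
        if i > 0:
            # Activations leave the previous device and arrive on this one.
            transfers += 1
        x = stage.forward(x)
    return x, transfers


stages = [
    Stage("gpu0", [lambda v: v * 2, lambda v: v + 1]),  # first half of the model
    Stage("gpu1", [lambda v: v * 3, lambda v: v - 4]),  # second half of the model
]

output, transfers = run_model_parallel(stages, 5)
print(output, transfers)
```

Each stage has to wait for the previous one, so the transfer between devices sits directly on the critical path. That is the bottleneck described above.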

#2) Tensor parallelism

Distributes and processes individual tensor operations across multiple devices or processors.
It is based on the idea that a large tensor operation, such as matrix multiplication, can be divided into smaller tensor operations, and each smaller operation can be executed on a separate device or processor.

Such parallelization strategies are inherently built into standard implementations of PyTorch and other deep learning frameworks, but they become much more pronounced in a distributed setting.
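
The idea of dividing one large matrix multiplication into smaller ones can be sketched in plain Python. This is a toy illustration under simple assumptions: nested lists stand in for tensors, each column block stands in for the share handled by one device, and the helper names (`split_columns`, `tensor_parallel_matmul`) are hypothetical.

```python
# Toy tensor parallelism: split a matrix multiplication column-wise across "devices".
# Plain Python lists stand in for tensors; no real GPUs are used.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_columns(b, num_shards):
    """Split b into num_shards column blocks (assumes the column count divides evenly)."""
    width = len(b[0]) // num_shards
    return [[row[s * width:(s + 1) * width] for row in b] for s in range(num_shards)]


def tensor_parallel_matmul(a, b, num_shards):
    """Each shard computes a @ b_shard; partial outputs are concatenated column-wise."""
    partials = [matmul(a, shard) for shard in split_columns(b, num_shards)]  # one per device
    return [sum((p[i] for p in partials), []) for i in range(len(a))]


A = [[1, 2], [3, 4]]
B = [[1, 0, 2, 1], [0, 1, 1, 3]]

full = matmul(A, B)
sharded = tensor_parallel_matmul(A, B, num_shards=2)
print(full == sharded)
```

The sharded result matches the single-device result because each output column depends on only one column of `B`, so the column blocks can be computed independently.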

#3) Data parallelism

Replicate the model across all GPUs.
Divide the available data into smaller batches, with each batch processed by a separate GPU.
The updates (or gradients) from each GPU are then aggregated and used to update the model parameters on every GPU.
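
The three steps above can be sketched with a one-parameter linear model in plain Python. This is a toy simulation under stated assumptions: the "replicas" are just slices of a list, shards are of equal size, and the averaging line stands in for the gradient aggregation (often implemented as an all-reduce) that a real framework would perform. The function names are hypothetical.

```python
# Toy data parallelism: every "GPU" holds a copy of w and sees one shard of the batch.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_step(w, xs, ys, num_replicas, lr):
    """One SGD step with gradients averaged across replicas (assumes equal shard sizes)."""
    shard = len(xs) // num_replicas
    local_grads = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_replicas)
    ]
    avg_grad = sum(local_grads) / num_replicas  # stand-in for gradient aggregation
    return w - lr * avg_grad, avg_grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
w = 0.0

new_w, avg_grad = data_parallel_step(w, xs, ys, num_replicas=2, lr=0.01)
full_grad = grad_mse(w, xs, ys)
print(avg_grad, full_grad, new_w)
```

With equal shard sizes, the averaged gradient equals the gradient computed on the full batch, so every replica applies the same update and the copies of the model stay in sync.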

#4) Pipeline parallelism

This is often considered a combination of data parallelism and model parallelism.
The issue with standard model parallelism is that the 1st GPU remains idle while data is being propagated through the layers on the 2nd GPU:

Pipeline parallelism addresses this by feeding the next micro-batch to the 1st GPU as soon as it has finished computing on the current micro-batch and has transferred its activations to the layers on the 2nd GPU. The process looks like this:
- The 1st micro-batch passes through the layers on the 1st GPU.
- The 2nd GPU receives the activations for the 1st micro-batch from the 1st GPU.
- While the 2nd GPU passes that data through its layers, the next micro-batch is loaded onto the 1st GPU.
- The process continues in the same way for the remaining micro-batches.

GPU utilization improves drastically this way. This is evident in the animation below, where multiple GPUs are busy at the same time step (look at t=1, t=2, t=5, and t=6).
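
The micro-batch schedule can also be sketched in a few lines of plain Python. This is a simplified, forward-pass-only model and is not meant to reproduce the animation's exact timeline: it assumes every stage takes one time step per micro-batch, and `pipeline_schedule` and `utilization` are hypothetical helper names.

```python
# Toy forward-only pipeline schedule: which micro-batch each "GPU" works on at each step.

def pipeline_schedule(num_stages, num_microbatches):
    """Return rows[t][s] = micro-batch index on stage s at step t, or None if idle."""
    total_steps = num_stages + num_microbatches - 1
    rows = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            m = t - s  # stage s sees micro-batch m one step after stage s - 1
            row.append(m if 0 <= m < num_microbatches else None)
        rows.append(row)
    return rows


def utilization(rows):
    """Fraction of (step, stage) slots that are busy."""
    slots = [cell for row in rows for cell in row]
    return sum(cell is not None for cell in slots) / len(slots)


pipelined = pipeline_schedule(num_stages=2, num_microbatches=4)
naive = pipeline_schedule(num_stages=2, num_microbatches=1)  # whole batch as one unit

for t, row in enumerate(pipelined):
    print(t, row)
print(utilization(naive), utilization(pipelined))
```

In this simplified model, splitting the batch into more micro-batches shrinks the idle slots at the start and end of the schedule, which is why utilization rises compared with the single-unit case.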

Those were four common strategies for multi-GPU training.

To get into more details about multi-GPU training and implementation, read this article: A Beginner-friendly Guide to Multi-GPU Model Training.

Also, we covered 15 ways to optimize neural network training here (with implementation).

👉 Over to you: What are some other strategies for multi-GPU training?

Thanks for reading!

IN CASE YOU MISSED IT

What is Temperature in LLMs?

A low temperature value produces nearly identical responses from the LLM across runs (shown below):

But a high temperature value produces gibberish.

What exactly is temperature in LLMs?

We covered this in detail here →
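
As a rough illustration, temperature is commonly applied by dividing the logits by it before the softmax. The sketch below uses that standard formulation with made-up logits for three candidate tokens; it is not taken from the linked article, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

low = softmax_with_temperature(logits, temperature=0.1)
high = softmax_with_temperature(logits, temperature=10.0)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At a low temperature almost all probability lands on the top-scoring token, so sampling returns the same token nearly every time. At a high temperature the distribution flattens toward uniform, so unlikely tokens are sampled far more often.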

ROADMAP

RAG vs. Agentic RAG

RAG has some issues:

It retrieves once and generates once. If the context isn’t enough, it cannot dynamically search for more info.
It cannot reason through complex queries.
The system can’t modify its strategy based on the problem.

Agentic RAG attempts to solve this.

The following visual depicts how it differs from traditional RAG.

The core idea is to introduce agentic behaviors at each stage of RAG.

Steps 1-2) An agent rewrites the query (removing spelling mistakes, etc.)

Steps 3-8) An agent decides if it needs more context.

If not, the rewritten query is sent to the LLM.
If yes, an agent picks the best external source, fetches context from it, and passes that context to the LLM.

Step 9) We get a response.

Steps 10-12) An agent checks if the answer is relevant.

If yes, return the response.
If not, go back to Step 1.

This continues for a few iterations until we get a relevant response or the system admits it cannot answer the query.
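
The control flow of these steps can be sketched as a simple loop. Everything below is a hypothetical stub for illustration: a real system would back each function with an LLM call or a retriever, and the returned context string is made-up sample data.

```python
# Control-flow sketch of the agentic RAG loop. Every function is a hypothetical stub.

def rewrite_query(query):
    return " ".join(query.split())  # stand-in for an LLM-based rewrite


def needs_more_context(query):
    return "policy" in query.lower()  # stand-in for an agent's decision


def fetch_context(query):
    return "Refunds are accepted within 30 days."  # made-up retrieved text


def generate(query, context):
    return f"Answer based on: {context}" if context else "Answer from model knowledge."


def is_relevant(query, answer):
    return bool(answer)  # stand-in for an agent's relevance check


def agentic_rag(query, max_iterations=3, check=is_relevant):
    for _ in range(max_iterations):
        rewritten = rewrite_query(query)  # Steps 1-2
        # Steps 3-8: fetch context only if the agent decides it is needed
        context = fetch_context(rewritten) if needs_more_context(rewritten) else None
        answer = generate(rewritten, context)  # Step 9
        if check(rewritten, answer):  # Steps 10-12
            return answer
    return "I cannot answer this query."


print(agentic_rag("What is the   refund policy?"))
print(agentic_rag("Hello", check=lambda q, a: False))
```

The `max_iterations` cap is what lets the system stop and admit it cannot answer instead of looping forever.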

This makes RAG more robust since agents ensure individual outcomes are aligned with the goal.

That said, the diagram shows only one of the many blueprints an agentic RAG system may follow.

You can adapt it according to your specific use case.

Soon, we shall cover Agentic RAG and many more related techniques for building robust RAG systems.

In the meantime, make sure you are fully equipped with everything we have covered so far like:

No-Fluff Industry ML resources to Succeed in DS/ML roles

At the end of the day, all businesses care about impact. That’s it!

Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?

We have covered several topics (with implementations) in the past that address exactly these questions.

Develop Industry ML skills

Here are some of them:

Learn sophisticated graph architectures and how to train them on graph data in this crash course.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
Run large models on small devices using Quantization techniques.
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
Learn how to scale and implement ML model training in this practical guide.
Learn 5 techniques with implementation to reliably test ML models in production.
Learn how to build and implement privacy-first ML systems using Federated Learning.
Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.
