Transformer vs. Mixture of Experts in LLMs

A popular interview question.



Mixture of Experts (MoE) is a popular architecture that uses different "experts" to improve Transformer models.

The visual below explains how MoE differs from a standard Transformer.

Let's dive in to learn more about MoE!


Transformer and MoE differ in the decoder block:

  • Transformer uses a feed-forward network.
  • MoE uses multiple experts, each of which is a feed-forward network, but smaller than the one in a Transformer.

During inference, only a subset of the experts is selected for each token. This makes MoE inference faster than a dense Transformer with the same total parameter count.

Also, since the network has multiple decoder layers:

  • a token can pass through different experts at different layers.
  • the experts chosen can also differ from one token to another.

But how does the model decide which experts to use?

The router does that.

The router is like a multi-class classifier that produces softmax scores over experts. Based on the scores, we select the top K experts.

The router is trained along with the rest of the network, and it learns to select the best experts.
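
To make this concrete, here's a minimal sketch of such a router in PyTorch. The class name Router and all dimensions are made up for illustration; real MoE routers differ in the details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    """A minimal top-K router: a linear layer that scores the experts for each token."""
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # the "multi-class classifier" over experts
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, d_model)
        logits = self.gate(x)                        # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)            # softmax scores over experts
        topk_scores, topk_ids = probs.topk(self.top_k, dim=-1)
        return topk_scores, topk_ids                 # which experts each token is sent to

# Usage: route 4 tokens of dimension 16 across 8 experts, keeping the top 2 per token
router = Router(d_model=16, num_experts=8, top_k=2)
scores, expert_ids = router(torch.randn(4, 16))
print(expert_ids)  # e.g., tensor([[3, 5], [0, 3], ...]) -- the experts chosen per token
```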

But it isn't straightforward.

There are challenges.

Challenge 1) Notice this pattern at the start of training:

  • The model selects "Expert 2" (randomly since all experts are similar).
  • The selected expert gets a bit better.
  • It may get selected again since it’s the best.
  • This expert learns more.
  • The same expert can get selected again since it’s the best.
  • It learns even more.
  • And so on!

Essentially, this way, many experts end up under-trained!

We solve this in two steps:

  • Add noise to the router's output (the logits) so that other experts can occasionally get higher scores.
  • Set all but the top K logits to -infinity. After softmax, these scores become zero.

This way, other experts also get the opportunity to train.
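
Here's a rough sketch of those two steps with made-up shapes. It follows the general "noisy top-K gating" idea; actual implementations tune the noise and its schedule differently.

```python
import torch
import torch.nn.functional as F

def noisy_topk_gating(logits: torch.Tensor, top_k: int = 2, noise_std: float = 1.0):
    """Step 1: perturb the router logits. Step 2: mask all but the top-K before softmax."""
    # Step 1: add noise so experts other than the current "best" can sometimes win
    noisy_logits = logits + noise_std * torch.randn_like(logits)

    # Step 2: keep only the top-K logits and set the rest to -infinity
    topk_vals, topk_ids = noisy_logits.topk(top_k, dim=-1)
    masked = torch.full_like(noisy_logits, float("-inf"))
    masked.scatter_(-1, topk_ids, topk_vals)

    # Softmax turns the -infinity entries into exactly zero weight
    gates = F.softmax(masked, dim=-1)
    return gates, topk_ids

# Usage: 4 tokens, 8 experts, top 2 experts per token
gates, expert_ids = noisy_topk_gating(torch.randn(4, 8), top_k=2)
print(gates.sum(dim=-1))  # each row sums to 1, spread over just 2 experts
```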

Challenge 2) Some experts may get exposed to more tokens than others, leaving the remaining experts under-trained.

We prevent this by limiting the number of tokens an expert can process.

If an expert reaches the limit, the input token is passed to the next best expert instead.
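
Here's a tiny sketch of that capacity rule, assuming we simply count how many tokens each expert has received. Real systems (Switch Transformer-style routing, for example) use capacity factors and batched dispatch, so treat this as illustrative only.

```python
import torch

def route_with_capacity(router_probs: torch.Tensor, capacity: int):
    """Assign each token to its best available expert, skipping experts that are full."""
    num_tokens, num_experts = router_probs.shape
    load = [0] * num_experts                    # tokens assigned to each expert so far
    assignment = []

    # Rank the experts for every token from best to worst router score
    ranked = router_probs.argsort(dim=-1, descending=True)

    for t in range(num_tokens):
        chosen = -1                             # -1 means every expert was full (token dropped)
        for e in ranked[t].tolist():
            if load[e] < capacity:              # this expert still has room
                chosen = e
                load[e] += 1
                break
        assignment.append(chosen)
    return assignment, load

# Usage: 6 tokens, 3 experts, each expert may process at most 2 tokens
probs = torch.softmax(torch.randn(6, 3), dim=-1)
assignment, load = route_with_capacity(probs, capacity=2)
print(assignment, load)  # no expert ends up with more than 2 tokens
```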


MoEs have more parameters to load into memory. However, only a fraction of them are activated for any given token, since we only select a few experts.

This leads to faster inference. Mixtral 8x7B by Mistral AI is one well-known LLM that is based on MoE.
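
Here's a quick back-of-the-envelope sketch of why that is, with hypothetical layer sizes (not Mixtral's actual configuration): memory scales with all experts, but per-token compute scales only with the top K experts that get activated.

```python
# Illustrative MoE feed-forward sizes (hypothetical, not any real model's config)
d_model, d_ff = 4096, 14336
num_experts, top_k = 8, 2

ffn_params_per_expert = 2 * d_model * d_ff            # up-projection + down-projection
total_ffn_params = num_experts * ffn_params_per_expert
active_ffn_params = top_k * ffn_params_per_expert     # only the top-K experts run per token

print(f"FFN params stored:    {total_ffn_params / 1e9:.2f}B")   # scales with all 8 experts
print(f"FFN params per token: {active_ffn_params / 1e9:.2f}B")  # scales with just 2 experts
```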

Here's the visual again that compares Transformers and MoE!

If you want to learn how to build with LLMs…

…we have already covered a full crash course on building, optimizing, improving, evaluating, and monitoring RAG apps (with implementation).

Start here → RAG crash course (9 parts + 3 hours of read time).

Over to you: Do you like the strategy of multiple experts instead of a single feed-forward network?

Thanks for reading!

IN CASE YOU MISSED IT

​LoRA/QLoRA—Explained From a Business Lens

Consider the size difference between BERT-large (340M parameters) and GPT-3 (175B parameters).

GPT-4 (not shown here) is 10x bigger than GPT-3.

I have fine-tuned BERT-large several times on a single GPU using traditional fine-tuning.

But this is impossible with GPT-3, which has 175B parameters. That's 350 GB of memory just to store the model weights at float16 precision.

This means that if OpenAI used traditional fine-tuning within its fine-tuning API, it would have to maintain one model copy per user:

  • If 10 users fine-tuned GPT-3 → that's 3,500 GB (3.5 TB) of model weights.
  • If 1,000 users fine-tuned GPT-3 → that's 350,000 GB (350 TB) of model weights.
  • If 100k users fine-tuned GPT-3 → that's 35 million GB (35 PB) of model weights.
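
Here's the arithmetic behind those numbers as a quick sketch, assuming 2 bytes per parameter for float16 and ignoring optimizer state and activations (which would make things even worse):

```python
params = 175e9                               # GPT-3 parameter count
bytes_per_param = 2                          # float16
copy_gb = params * bytes_per_param / 1e9     # ~350 GB per fine-tuned copy

for users in (10, 1_000, 100_000):
    total_gb = users * copy_gb
    print(f"{users:>7} users -> {total_gb:,.0f} GB of model weights")
```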

And the problems don't end there:

  • OpenAI bills solely based on usage. What if someone fine-tunes the model for fun or learning purposes but never uses it?
  • Since a request can come anytime, should they always keep the fine-tuned model loaded in memory? Wouldn't that waste resources since several models may never be used?

LoRA (+ QLoRA and other variants) neatly solved this critical business problem.
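
For intuition, here's a rough sketch of why this works (the layer size and rank below are hypothetical): the base weight W is stored once and shared by everyone, and each user only stores two small low-rank matrices A and B, because the adapted weight is simply W + BA.

```python
import torch

d_out, d_in, r = 4096, 4096, 8      # hypothetical layer size and LoRA rank

W = torch.randn(d_out, d_in)        # frozen base weight, stored once and shared
A = torch.randn(r, d_in)            # per-user LoRA factors -- this is all a user stores
B = torch.zeros(d_out, r)           # B starts at zero, so training begins from the base model

adapted_W = W + B @ A               # effective weight used at inference time

full_copy_params = W.numel()
lora_params = A.numel() + B.numel()
print(f"Full copy: {full_copy_params/1e6:.1f}M params, LoRA adapter: {lora_params/1e6:.2f}M params")
```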

​We covered this in detail here →

THAT'S A WRAP

No-Fluff Industry ML Resources to Succeed in DS/ML Roles

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have discussed several other topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.

Get in touch today →


