KV Caching in LLMs, explained visually

A popular interview question.



TODAY’S DAILY DOSE OF DATA SCIENCE

KV Caching in LLMs, explained visually

KV caching is a popular technique to speed up LLM inference.

To get some perspective, look at the inference speed difference from our demo:

  • with KV caching → 9 seconds
  • without KV caching → 40 seconds (~4.5x slower, and this gap grows as more tokens are produced).

Today, let’s visually understand how KV caching works.

Let's dive in!


To understand KV caching, we must know how LLMs output tokens.

As shown in the visual above:

  • The transformer produces hidden states for all input tokens.
  • The hidden states are projected to the vocabulary space.
  • The logits of the last token are used to generate the next token.
  • Repeat for every subsequent token.

Thus, to generate a new token, we only need the hidden state of the most recent token. None of the other hidden states are required.
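Here's a minimal sketch of that loop (greedy decoding, plain NumPy); `model` and `W_vocab` are hypothetical stand-ins for the transformer and the vocabulary projection, not any specific library's API:

```python
import numpy as np

def generate(model, W_vocab, tokens, n_new):
    # Greedy decoding loop: the model recomputes hidden states for ALL tokens,
    # but only the last token's logits are used to pick the next token.
    for _ in range(n_new):
        hidden = model(tokens)                    # (seq_len, hidden_dim) hidden states
        logits = hidden @ W_vocab                 # (seq_len, vocab_size) projection to vocab space
        next_token = int(np.argmax(logits[-1]))   # only the LAST row is used
        tokens = tokens + [next_token]            # append and repeat
    return tokens
```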

Next, let's see how the last token's hidden state is computed inside a transformer layer, starting with the attention mechanism.

During attention, we first compute the product of the query and key matrices. The last row of this product involves only the last token's query vector and all the key vectors:

None of the other query vectors are needed during inference.

Similarly, the last row of the final attention output involves only the last query vector and all the key and value vectors. Check this visual to see it more clearly:

The above insight suggests that to generate a new token, every attention operation in the network only needs:

  • Query vector of the last token.
  • All key & value vectors.
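As a rough NumPy sketch of that observation, attention for the newest token only needs its query vector plus all keys and values (single head, scaled dot-product, masking omitted):

```python
import numpy as np

def decode_step_attention(q_last, K, V):
    # q_last: (d,)   query vector of the last token
    # K:      (T, d) key vectors of ALL tokens so far
    # V:      (T, d) value vectors of ALL tokens so far
    d = q_last.shape[-1]
    scores = K @ q_last / np.sqrt(d)         # (T,): the last row of Q @ K.T, scaled
    weights = np.exp(scores - scores.max())  # softmax over all previous tokens
    weights /= weights.sum()
    return weights @ V                       # (d,): the last row of the attention output
```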

But there's one more key insight here.

As we generate new tokens, the KV vectors used for ALL previous tokens do not change.

Thus, at each step, we only need to compute the key and value vectors for the token generated one step before.

The rest of the KV vectors can be retrieved from a cache to save compute and time.

This is called KV caching!

To reiterate, instead of redundantly recomputing the KV vectors of all context tokens at every step, we cache them.

To generate a token:

  • Generate the query, key, and value vectors for the token generated one step before.
  • Get all other KV vectors from the cache.
  • Compute attention.
  • Store the newly generated KV values in the cache.
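Here's a hedged sketch of these four steps for a single attention layer; the projection matrices `W_q`, `W_k`, `W_v` and the dict-based cache are illustrative assumptions, not any particular framework's API:

```python
import numpy as np

def decode_step(x_new, kv_cache, W_q, W_k, W_v):
    # x_new:    (d,) hidden state of the token generated one step before
    # kv_cache: dict holding "K" and "V" arrays of shape (T, d) for all cached tokens

    # 1) Compute Q, K, V only for the newest token.
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v

    # 2) Get all other KV vectors from the cache and append the new pair.
    K = np.vstack([kv_cache["K"], k])
    V = np.vstack([kv_cache["V"], v])

    # 3) Compute attention for the new token only.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ V

    # 4) Store the newly generated KV values in the cache for the next step.
    kv_cache["K"], kv_cache["V"] = K, V
    return out, kv_cache
```

In practice there is one such cache per layer (and per attention head), initialized empty (e.g., arrays of shape `(0, d)`) before the first decode step.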

As you can tell, this saves time during inference.

In fact, this is why ChatGPT takes longer to generate the first token than the subsequent tokens. During that little pause, the KV cache of the prompt is being computed.

That said, the KV cache also takes up a lot of memory.

Consider Llama3-70B, which has:

  • total layers = 80
  • hidden size = 8k
  • max output size = 4k

Here:

  • Every token takes up ~2.5 MB in the KV cache.
  • 4k tokens will take up 10.5 GB.
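These numbers are a rough back-of-the-envelope estimate, assuming float16 (2 bytes per value) and a full hidden-size key and value vector per layer per token:

```python
layers = 80
hidden = 8192             # "8k" hidden size
bytes_per_value = 2       # float16

# one key vector + one value vector, per layer, per token
per_token = 2 * layers * hidden * bytes_per_value
print(per_token / 2**20)             # ~2.5 MB per token

tokens = 4096                        # "4k" max output size
print(per_token * tokens / 2**30)    # ~10 GB for a full 4k-token KV cache
```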

More users → more memory.

We'll cover KV cache optimization soon.

👉 Over to you: How can we optimize the memory consumption?

IN CASE YOU MISSED IT

LoRA/QLoRA—Explained From a Business Lens

Consider the size difference between BERT-large and GPT-3:

GPT-4 (not shown here) is 10x bigger than GPT-3.

I have fine-tuned BERT-large several times on a single GPU using traditional fine-tuning:

But this is impossible with GPT-3, which has 175B parameters. That's 350GB of memory just to store model weights under float16 precision.
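That 350 GB figure is simply the parameter count times 2 bytes per float16 weight:

```python
params = 175e9           # GPT-3 parameter count
bytes_per_param = 2      # float16
print(params * bytes_per_param / 1e9)   # 350.0 -> ~350 GB just for the weights
```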

This means that if OpenAI used traditional fine-tuning within its fine-tuning API, it would have to maintain one model copy per user:

  • If 10 users fine-tuned GPT-3 → they need 3500 GB to store model weights.
  • If 1000 users fine-tuned GPT-3 → they need 350k GB to store model weights.
  • If 100k users fine-tuned GPT-3 → they need 35 million GB to store model weights.

And the problems don't end there:

  • OpenAI bills solely based on usage. What if someone fine-tunes the model for fun or learning purposes but never uses it?
  • Since a request can come anytime, should they always keep the fine-tuned model loaded in memory? Wouldn't that waste resources since several models may never be used?

LoRA (+ QLoRA and other variants) neatly solved this critical business problem.

We covered this in detail here →

THAT'S A WRAP

No-Fluff Industry ML resources to succeed in DS/ML roles

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?
  • Drive revenue?
  • Can you scale ML models?
  • Predict trends before they happen?

We have discussed several other topics (with implementations) in the past that align with these goals.

Here are some of them:

  • Learn sophisticated graph architectures and how to train them on graph data in this crash course.
  • So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
  • Run large models on small devices using Quantization techniques.
  • Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
  • Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
  • Learn how to scale and implement ML model training in this practical guide.
  • Learn 5 techniques with implementation to reliably test ML models in production.
  • Learn how to build and implement privacy-first ML systems using Federated Learning.
  • Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.

Get in touch today →

