KV Caching in LLMs, explained visually
A popular interview question.

TODAY'S ISSUE
KV caching is a popular technique to speed up LLM inference.
To get some perspective, look at the inference speed difference from our demo:
Today, let’s visually understand how KV caching works.
Let's dive in!
To understand KV caching, we must know how LLMs output tokens.

As shown in the visual above, the transformer produces a hidden state for every token in the context, but only the hidden state of the last token is projected onto the vocabulary to predict the next token.
Thus, to generate a new token, we only need the hidden state of the most recent token. None of the other hidden states are required.
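Here's a minimal sketch of that last step, assuming a toy hidden size and vocabulary (the names and shapes are illustrative, not from any specific model):

```python
import torch

# Toy shapes: one sequence of 6 tokens, hidden size 16, vocabulary of 100 tokens.
batch, seq_len, d_model, vocab = 1, 6, 16, 100
hidden_states = torch.randn(batch, seq_len, d_model)   # output of the last transformer layer
lm_head = torch.nn.Linear(d_model, vocab, bias=False)  # projection to vocabulary logits

# Only the hidden state of the most recent token is needed to predict the next token.
last_hidden = hidden_states[:, -1, :]              # shape: (batch, d_model)
logits = lm_head(last_hidden)                      # shape: (batch, vocab)
next_token = torch.argmax(logits, dim=-1)          # greedy pick of the next token
```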

Next, let's see how this last hidden state is computed inside a transformer layer via the attention mechanism.

During attention, we first compute the product of the query and key matrices. Its last row involves the last token’s query vector and all the key vectors:

None of the other query vectors are needed during inference.
Similarly, the last row of the final attention output involves the last query vector and all the key & value vectors. Check this visual to see it more clearly:

The above insight suggests that to generate a new token, every attention operation in the network only needs the query vector of the most recent token and the key & value vectors of all tokens seen so far.
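Here's what that single-row attention looks like, as a sketch with toy dimensions (random tensors stand in for the real projections):

```python
import torch
import torch.nn.functional as F

# Toy dimensions: 6 tokens in the context, head dimension 16.
seq_len, d_head = 6, 16
K = torch.randn(seq_len, d_head)   # key vectors of ALL tokens in the context
V = torch.randn(seq_len, d_head)   # value vectors of ALL tokens in the context
q_last = torch.randn(d_head)       # query vector of the most recent token only

# One row of the score matrix: the last query against every key.
scores = q_last @ K.T / d_head ** 0.5    # shape: (seq_len,)
weights = F.softmax(scores, dim=-1)

# One row of the attention output: a weighted sum over all value vectors.
out_last = weights @ V                   # shape: (d_head,)
```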

But there's one more key insight here.
As we generate new tokens, the KV vectors used for ALL previous tokens do not change.

Thus, we just need to compute the KV vectors for the token generated in the previous step.
The rest of the KV vectors can be retrieved from a cache to save compute and time.
This is called KV caching!
To reiterate: instead of redundantly recomputing the KV vectors of all context tokens at every step, we cache them.

To generate a token, we compute the query, key, and value vectors only for the most recent token, fetch the KV vectors of all earlier tokens from the cache, run attention as usual, and append the newly computed KV vectors to the cache (see the sketch below).
As you can tell, this saves time during inference.
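Here's a minimal sketch of that loop, assuming a toy single-layer, single-head model with random projection matrices (real models have many layers, heads, and MLP blocks, but the role of the cache is the same):

```python
import torch
import torch.nn.functional as F

def attend(q, K, V):
    # Attention for a single query vector against all cached keys/values.
    scores = q @ K.T / K.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ V

d_model = 16
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))

K_cache = torch.empty(0, d_model)   # grows by one row per token; old rows never change
V_cache = torch.empty(0, d_model)

x = torch.randn(5, d_model)         # embeddings of the 5 prompt tokens

for step in range(3):               # generate 3 new tokens
    # 1) Project ONLY the new tokens: the whole prompt at step 0 (prefill),
    #    a single token on every later step.
    new = x if step == 0 else x[-1:]
    q, k, v = new @ Wq, new @ Wk, new @ Wv

    # 2) Append the new KV vectors to the cache instead of recomputing everything.
    K_cache = torch.cat([K_cache, k])
    V_cache = torch.cat([V_cache, v])

    # 3) Run attention for the latest query against ALL cached keys & values.
    out = attend(q[-1], K_cache, V_cache)

    # 4) Pretend `out` becomes the next token's embedding and keep going.
    x = torch.cat([x, out.unsqueeze(0)])
```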
In fact, this is why ChatGPT takes longer to generate the first token than the subsequent ones. During that little pause, the KV cache of the prompt is computed.
That said, the KV cache also consumes a lot of memory.
Consider Llama3-70B, which has 80 transformer layers and a hidden size of 8,192.
Here, every layer must store a key vector and a value vector for every token in the context, and a separate cache is kept for every sequence being served.
More users → more memory.
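As a rough back-of-the-envelope estimate (this deliberately ignores grouped-query attention, which Llama3-70B uses to shrink its KV heads, so treat it as an upper bound):

```python
layers = 80          # transformer layers in Llama3-70B
d_model = 8192       # hidden size
bytes_fp16 = 2       # half-precision storage

# One key + one value vector per layer, per token.
per_token = 2 * layers * d_model * bytes_fp16    # ≈ 2.6 MB per token
per_seq_8k = per_token * 8192                    # ≈ 21 GB for an 8K-token context

print(f"{per_token / 1e6:.1f} MB per token, {per_seq_8k / 1e9:.1f} GB per 8K-token sequence")
```

Multiply the per-sequence figure by the number of concurrent users, and the memory bill adds up quickly.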
We'll cover KV optimization soon.
👉 Over to you: How can we optimize the memory consumption?
Consider the size difference between BERT-large and GPT-3:
I have fine-tuned BERT-large several times on a single GPU using traditional fine-tuning:

But this is impossible with GPT-3, which has 175B parameters. That's 350GB of memory just to store model weights under float16 precision.
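The arithmetic behind that 350GB figure:

```python
params = 175e9            # GPT-3 parameter count
bytes_per_param = 2       # float16: 2 bytes per weight
print(params * bytes_per_param / 1e9, "GB")   # -> 350.0 GB, for the weights alone
```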
This means that if OpenAI used traditional fine-tuning within its fine-tuning API, it would have to maintain one model copy per user:
And the problems don't end there:
LoRA (+ QLoRA and other variants) neatly solved this critical business problem.