On an H100 running Llama 70B, a single inference request hits 92% GPU compute utilization during prefill, then drops to 28% during decode on the same hardware a moment later. The workload changed, not the GPU.
For context:
Prefill processes the entire prompt in parallel and saturates tensor cores.
Decode generates one token at a time and reads the full KV cache from HBM at every step, which makes it memory-bandwidth bound.
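A back-of-envelope calculation makes the asymmetry concrete. The sketch below uses rough, publicly quoted H100 SXM numbers (assumptions, not measurements) to compare the bandwidth ceiling against the compute ceiling for batch-1 decode of a 70B FP16 model:

```python
# Back-of-envelope: why decode is memory-bandwidth bound.
# Rough H100 SXM figures (assumptions for illustration).
HBM_BANDWIDTH = 3.35e12       # bytes/s of HBM3
FP16_FLOPS    = 990e12        # dense FP16 tensor-core FLOP/s
PARAMS        = 70e9
BYTES_PER_PARAM = 2           # FP16

# Batch-1 decode must stream every weight from HBM once per token.
weight_bytes = PARAMS * BYTES_PER_PARAM
max_decode_tok_s = HBM_BANDWIDTH / weight_bytes

# Compute cost is ~2 FLOPs per parameter per token (one multiply-add).
flops_per_token = 2 * PARAMS
compute_ceiling_tok_s = FP16_FLOPS / flops_per_token

print(f"bandwidth ceiling: {max_decode_tok_s:.0f} tok/s")
print(f"compute ceiling:   {compute_ceiling_tok_s:.0f} tok/s")
```

The bandwidth ceiling (~24 tok/s) sits two orders of magnitude below the compute ceiling (~7,000 tok/s): at batch size 1, the tensor cores mostly wait on HBM.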
This asymmetry is why no single optimization gets you very far, and yet LLM inference prices have fallen roughly 10x per year, with GPT-4-level performance going from $20 per million tokens in late 2022 to around $0.40 today.
Most of that drop came from the serving stack, and we put together this visual, which lists the techniques that go into optimizing LLMs in production.
Every technique in the grid above is a response to one of three bottlenecks: prefill compute, decode memory bandwidth, or the cost of everything that wraps the model.
Stacking enough of these techniques closes the 5-8x cost-efficiency gap between optimized vLLM or TensorRT-LLM deployments and naive FP16 inference.
Today, let's walk through the nine layers, what each one actually solves, and how they stack up in a real production deployment.
We covered a lot more in the LLMOps course with implementations and engineering logic.
A 70B model in FP16 is 140GB before you load a single token of context. Compression attacks this footprint directly.
INT8 halves the memory vs FP16.
INT4 cuts it 4x.
FP8 gives you native tensor core support on Hopper and Blackwell, which means compression plus speedup.
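The memory arithmetic is simple enough to check directly, bits-per-parameter times parameter count:

```python
# Weight memory of a 70B model at different precisions (weights only,
# ignoring KV cache, activations, and quantization scales).
PARAMS = 70e9

def weight_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(bits):.0f} GB")
# FP16: 140 GB, FP8/INT8: 70 GB, INT4: 35 GB
```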
GPTQ, AWQ, and SmoothQuant are the three main algorithms here.
GPTQ uses Hessian-based second-order information to minimize quantization error.
AWQ preserves salient weights based on activation magnitudes.
SmoothQuant handles both weights and activations at W8A8.
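To see what these algorithms improve on, here is the naive baseline: per-channel round-to-nearest INT8 quantization. This is deliberately not GPTQ or AWQ, just the simplest scheme they all beat:

```python
# Minimal round-to-nearest INT8 quantizer, one output channel at a time.
# This is the baseline GPTQ/AWQ/SmoothQuant improve on, not those algorithms.
def quantize_int8(row):
    scale = max(abs(w) for w in row) / 127 or 1.0   # map max |w| to 127
    q = [round(w / scale) for w in row]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

row = [0.42, -1.3, 0.07, 2.6]          # one channel of a weight matrix
q, s = quantize_int8(row)
approx = dequantize(q, s)
# Each weight now costs 1 byte instead of 2, at the price of rounding error
# bounded by scale/2 -- the error the fancier algorithms shrink or reroute.
```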
Distillation and pruning attack the parameter count itself rather than the bits per parameter.
Multi-LoRA serving is the escape hatch for multi-tenant deployments, where you keep one base model in memory and hot-swap small adapter weights per request.
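The economics come from the shapes. A minimal sketch, with toy dimensions and illustrative tenant names, of how one base matmul is shared while only the low-rank update is per-request:

```python
# Multi-LoRA serving sketch: one frozen base weight, per-tenant low-rank
# adapters applied at request time. Names and shapes are illustrative.
import numpy as np

np.random.seed(0)
d, r = 8, 2                     # hidden size, LoRA rank (tiny for the demo)
W_base = np.random.randn(d, d)  # frozen base weight, loaded once

adapters = {                    # hot-swappable per-tenant (A, B) pairs
    "tenant_a": (np.random.randn(r, d) * 0.01, np.random.randn(d, r) * 0.01),
    "tenant_b": (np.random.randn(r, d) * 0.01, np.random.randn(d, r) * 0.01),
}

def forward(x, tenant):
    A, B = adapters[tenant]
    # y = W x + B (A x): the base matmul is shared; only B A x is per-tenant.
    return W_base @ x + B @ (A @ x)

x = np.random.randn(d)
y_a = forward(x, "tenant_a")
y_b = forward(x, "tenant_b")
```

Each adapter costs 2·d·r parameters against d² for the base, which at production shapes is a fraction of a percent, so dozens of tenants fit alongside one copy of the model.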
Standard attention is O(N²) in sequence length. At 128K context, that is roughly 16 billion pairwise score computations, which is why naive attention is infeasible at long context even on H100-class hardware.
FlashAttention reorders the attention math to be IO-aware, avoiding materializing the full N×N matrix.
PagedAttention applies OS-style virtual memory to the KV cache, eliminating fragmentation.
MQA, GQA, and MLA attack the number of KV heads.
MQA shares one KV head across all queries, GQA groups them, MLA compresses keys and values into a low-rank latent. DeepSeek-V2 reported a 93.3% KV cache reduction from MLA alone.
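Plugging in Llama-70B-like shapes (80 layers, head dimension 128, 64 query heads, assumed here for illustration) shows why KV head count is the lever:

```python
# KV cache bytes per token under MHA vs GQA vs MQA, FP16.
LAYERS, HEAD_DIM, BYTES_FP16 = 80, 128, 2

def kv_bytes_per_token(kv_heads):
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_FP16   # 2 = keys + values

mha = kv_bytes_per_token(64)   # one KV head per query head
gqa = kv_bytes_per_token(8)    # Llama-70B's actual grouping
mqa = kv_bytes_per_token(1)    # single shared KV head

print(mha // 1024, gqa // 1024, mqa // 1024)  # KB/token: 2560, 320, 40
```

GQA alone cuts the cache 8x at these shapes; MLA's low-rank latent pushes further still, which is where DeepSeek's 93.3% figure comes from.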
Sliding window attention restricts each token to a local window. MoE activates only a subset of experts per token. These are architectural choices driven entirely by serving economics.
Decode is memory-bound because every new token requires a full pass over the weights and KV cache.
Speculative decoding sidesteps this by generating a draft with a cheap model, then verifying in parallel with the main model.
Medusa attaches extra prediction heads to the model itself, so the same model can draft its own candidate tokens without needing a separate smaller model.
EAGLE improves on this by predicting at the hidden-state level rather than the token level, which gives higher draft accuracy and better speedups.
Lookahead decoding skips the draft model entirely. It generates and verifies multiple tokens in parallel from the main model alone.
Prompt lookup decoding copies spans directly from the input prompt, which is surprisingly effective for tasks with heavy prompt-output overlap like summarization or code edits.
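Prompt lookup is also the simplest of these to sketch. The toy version below proposes a draft by matching the last n-gram of the output against the prompt and copying what followed; the model still verifies every drafted token:

```python
# Prompt lookup decoding sketch (toy token IDs): propose draft tokens by
# matching the output's last n-gram inside the prompt and copying onward.
def prompt_lookup_draft(prompt_ids, generated_ids, ngram=3, max_draft=5):
    key = generated_ids[-ngram:]
    if len(key) < ngram:
        return []                              # not enough history yet
    for i in range(len(prompt_ids) - ngram):
        if prompt_ids[i:i + ngram] == key:     # n-gram found in the prompt
            return prompt_ids[i + ngram:i + ngram + max_draft]
    return []                                  # no match: fall back to decode

prompt = [5, 9, 2, 7, 8, 1, 4]
out    = [3, 9, 2, 7]                 # last 3-gram [9, 2, 7] occurs in prompt
draft  = prompt_lookup_draft(prompt, out)
print(draft)  # [8, 1, 4] -- candidates, still verified by the main model
```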
Constrained decoding enforces grammars at the token level, which is how providers guarantee valid JSON.
Multi-token prediction trains the model to emit several tokens per forward pass.
The KV cache grows linearly with context length, and for long conversations it dominates memory (learn KV caching here).
A 70B model with 4K context per request already consumes several gigabytes of KV just for a modest batch size.
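The "several gigabytes" claim checks out with the same assumed Llama-70B-like shapes (80 layers, 8 GQA KV heads, head dimension 128, FP16):

```python
# KV cache total for a modest batch at 4K context, FP16.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 80, 8, 128, 2
CONTEXT, BATCH = 4096, 8

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16   # K and V
total_gb = kv_per_token * CONTEXT * BATCH / 1e9
print(f"{total_gb:.1f} GB")   # ~10.7 GB of KV for just 8 requests
```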
Prefix caching reuses KV across requests sharing the same prefix, which is why system prompts and few-shot examples are effectively free after the first request.
KV offload tiers cold cache entries to CPU RAM or NVMe.
KV cache quantization compresses the cache itself, separate from the weights.
Token eviction methods like H2O and SnapKV drop low-attention tokens from the cache. SnapKV reports 92% KV compression at a 1024-token budget with a 3.6x decode speedup.
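A toy version of the idea, not the actual H2O or SnapKV scoring: keep the tokens with the highest cumulative attention mass plus a recency window, and drop the rest:

```python
# Toy H2O/SnapKV-style eviction: retain heavy-hitter tokens (by cumulative
# attention) plus the most recent ones, up to a fixed cache budget.
def evict(cum_attention, recent=2, budget=4):
    n = len(cum_attention)
    keep = set(range(max(0, n - recent), n))   # always keep recent tokens
    heavy = sorted(range(n), key=lambda i: cum_attention[i], reverse=True)
    for i in heavy:
        if len(keep) >= budget:
            break
        keep.add(i)                            # fill budget with heavy hitters
    return sorted(keep)                        # indices surviving eviction

scores = [0.9, 0.1, 0.05, 0.7, 0.02, 0.03]    # cumulative attention per token
print(evict(scores))  # [0, 3, 4, 5]: heavy hitters 0 and 3 plus the last two
```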
Attention sinks, from the StreamingLLM paper, keep the first few tokens permanently in the cache to prevent long-context generation from going incoherent past the cache limit.
Chunked prefill splits long prompts into smaller pieces so decode steps can interleave with prefill work.
LLM inference is memory-bandwidth bound during decode, which means the GPU is usually starved. Batching more requests together amortizes memory reads across more useful work.
Continuous batching does this at the iteration level. As soon as one request finishes generating, a new one takes its slot mid-flight.
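A minimal scheduler sketch of iteration-level batching, with token counts standing in for real generation:

```python
# Continuous (iteration-level) batching sketch: after every decode step,
# finished sequences free their slot and a queued request joins mid-flight.
from collections import deque

def serve(requests, max_batch=2):
    queue, batch, done = deque(requests), [], []
    while queue or batch:
        while queue and len(batch) < max_batch:      # fill any free slots
            batch.append(queue.popleft())            # request = [id, tokens_left]
        for req in batch:
            req[1] -= 1                              # one decode iteration
        done += [r[0] for r in batch if r[1] == 0]   # retire finished requests
        batch = [r for r in batch if r[1] > 0]       # slot freed immediately
    return done

# "b" needs only 1 token, so "c" takes its slot while "a" is still running --
# with static batching, "c" would have waited for the whole batch to finish.
print(serve([["a", 3], ["b", 1], ["c", 2]]))  # ['b', 'a', 'c']
```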
Dynamic batching waits for a short window to group arriving requests. Batching 32 requests together cuts per-token cost roughly 85% with minor latency impact.
Prefill-decode disaggregation splits the two phases onto separate GPU pools. Perplexity, Meta, and Mistral run this in production because co-locating prefill and decode on the same GPU means decode requests freeze every time a new prefill enters the batch.
SLO-aware scheduling prioritizes interactive traffic over background jobs.
Spot GPU scheduling runs preemptible workloads on cheap capacity.
Prompt caching reuses the KV state of static prefixes across calls. Anthropic reports up to 90% cost reduction and 85% latency reduction for long cached prompts.
Semantic caching matches queries by embedding similarity rather than exact string match, which handles paraphrases.
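A minimal sketch of the lookup path, with a toy bag-of-characters `embed()` standing in for a real embedding model:

```python
# Semantic cache sketch: look up by embedding cosine similarity instead of
# exact string match. embed() is a toy stand-in for a real embedding model.
import math

def embed(text):                       # bag-of-characters "embedding"
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

cache = {}                             # query -> (embedding, response)

def lookup(query, threshold=0.95):
    qv = embed(query)
    for v, resp in cache.values():
        if cosine(qv, v) >= threshold:
            return resp                # paraphrase hit: skip the LLM call
    return None

q = "what is the capital of france"
cache[q] = (embed(q), "Paris")
print(lookup("capital of france, what is it"))  # hits despite rewording
```

The threshold is the whole game in production: too low and you serve wrong answers, too high and paraphrases miss.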
Exact-match caching is the hash-based baseline.
Response caching stores completed outputs.
Embedding deflection routes simple queries to a vector search without ever calling the LLM.
Batch API endpoints run async jobs at roughly half the per-token price for non-realtime workloads.
A reasonable setup for a general-purpose API might run FP8 weights, GQA-based attention with FlashAttention kernels, PagedAttention for KV, continuous batching with prefill-decode disaggregation, prefix caching for system prompts, semantic caching at the application layer, prompt compression for long retrieved contexts, and model routing to send trivial queries to a small model.
The gap between this stack and a naive FP16 deployment with static batching is 5-8x on cost-per-token. Each technique alone moves the number only a small amount, which is exactly why the compounding across all nine layers is what defines a serious production setup.