Build a Frontier Model
Step 9 of 11

Inference & Serving

Training is one-time. Inference is forever.

Serving a frontier model economically requires every trick: FlashAttention 3 kernels, paged KV-cache (vLLM), speculative decoding (EAGLE-3, self-speculation), low-precision quantization (FP8, INT4, MXFP4), prompt caching, and disaggregated prefill/decode. Inference cost has dropped 10-100x in 18 months while quality climbed.

Why it matters

  • Inference cost dominates total cost-of-ownership for any deployed model — typical labs spend 5-20x more on inference than training.
  • Speculative decoding alone delivers 2-3x decode throughput with zero quality loss, since the target model verifies every drafted token.
  • Prompt caching (Anthropic, OpenAI, Gemini) cuts repeat-prompt costs ~90% and is now table stakes; see the API sketch after this list.
  • Quantized inference (MXFP4, AWQ-INT4) lets 70B-class models run on a single Blackwell card.
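
To make the caching bullet concrete, here is how a cacheable system prompt is marked in Anthropic's Messages API. A minimal sketch: the model name and prompt text are illustrative, and the client reads ANTHROPIC_API_KEY from the environment.

```python
import anthropic

client = anthropic.Anthropic()

long_system_prompt = "..."  # imagine a multi-thousand-token system prompt here

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            # Mark this prefix as cacheable: later requests that share it
            # are billed at the much cheaper cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's incident log."}],
)
print(response.content[0].text)
```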

State of the art

2025-2026
  • FlashAttention-3 (Hopper async + FP8) is the de facto standard attention kernel on Hopper-class GPUs.
  • vLLM PagedAttention is the default serving stack for the large majority of open-source deployments.
  • EAGLE-3 (2025) and self-speculation deliver 3-4x decode speedup with no degradation.
  • MXFP4 weight quantization (OCP standard, Blackwell-native) crossed into production for 100B+ models.
  • Disaggregated prefill/decode (DistServe, Mooncake) — separate fleets for the two distinct workloads.
  • Prompt caching with explicit cache-control headers — 90% cost reduction for system-prompt-heavy workloads.

The recipe

A frontier-grade implementation, in order.

1. Pick a serving stack

vLLM (open, PagedAttention, broad model support). SGLang (constrained generation). TensorRT-LLM (NVIDIA, fastest on Blackwell). LMDeploy (multimodal).
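
A minimal sketch of vLLM's offline API (the checkpoint name below is illustrative; substitute a model whose weights you have):

```python
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are on by default in vLLM.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain paged KV-cache in one paragraph."], params)
print(outputs[0].outputs[0].text)
```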

2. Quantize

FP8 (Hopper/Blackwell, near-free in quality). MXFP4 weights (~5% quality drop, ~2x memory savings over FP8, ~4x over BF16). AWQ-INT4 for older hardware.
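
In vLLM, serving a pre-quantized checkpoint is a one-line change. A sketch assuming an AWQ-INT4 checkpoint is available (the repo path is hypothetical):

```python
from vllm import LLM

# Weights are stored INT4 and dequantized on the fly inside the kernels;
# activations stay in float16.
llm = LLM(
    model="some-org/Llama-3.1-70B-AWQ",  # hypothetical AWQ-INT4 checkpoint
    quantization="awq",                  # vLLM also accepts e.g. "fp8"
    dtype="float16",
)
```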

3. Speculative decoding

Train a draft model 5-50x smaller, OR use self-speculation via MTP heads. EAGLE-3 is the state-of-the-art tree-style speculative decoder.
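
To make the draft-and-verify loop concrete, here is a greedy-decoding sketch. Model interfaces are assumed Hugging-Face-style (forward pass returns `.logits`); production systems use rejection sampling to stay lossless under temperature sampling, and EAGLE-style trees verify several branches at once.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens, k=4):
    """One draft-and-verify step, greedy variant (a simplified sketch).

    `target` and `draft` are assumed to be causal LMs whose forward pass
    returns `.logits` of shape [1, seq_len, vocab]; `tokens` is [1, seq_len].
    """
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    proposal = tokens
    for _ in range(k):
        logits = draft(proposal).logits[:, -1, :]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. The target scores all k proposals in ONE forward pass.
    target_preds = target(proposal).logits.argmax(-1)

    # 3. Accept the longest prefix on which draft and target agree.
    n = tokens.shape[1]
    accepted = 0
    for i in range(k):
        # target_preds[:, j] is the target's greedy token AFTER position j,
        # so the drafted token at position n+i is checked against n+i-1.
        if proposal[0, n + i] == target_preds[0, n + i - 1]:
            accepted += 1
        else:
            break

    # 4. Net gain is always >= 1 token: append the target's own prediction
    #    at the first disagreement (or after all k accepted drafts).
    bonus = target_preds[:, n + accepted - 1 : n + accepted]
    return torch.cat([proposal[:, : n + accepted], bonus], dim=-1)
```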

4. KV-cache strategy

Paged KV (vLLM) by default. MLA if you control training (it's an architectural change). SnapKV/H2O for prompt-heavy workloads. Prompt caching for repeated prefixes.
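
The core PagedAttention idea is a per-sequence block table that maps logical token positions to fixed-size physical KV blocks, allocated on demand. An illustrative sketch of the bookkeeping (not vLLM's actual internals):

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, blocks):
        # When a sequence finishes, its blocks return to the shared pool.
        self.free.extend(blocks)

class Sequence:
    """Each sequence holds a block table: logical position -> physical block."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one fills up,
        # so memory is committed on demand instead of reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1
```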

5. Disaggregated serving

Separate prefill (compute-bound) from decode (memory-bound): the prefill fleet computes the prompt's KV cache and ships it to the decode fleet. Different model parallelism, different GPU pools.
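
A sketch of the request flow. Every name here is hypothetical; in real systems (DistServe, Mooncake) the KV transfer is a high-bandwidth GPU-to-GPU copy rather than a handle passed around in Python.

```python
# All worker/pool interfaces below are hypothetical, for illustration only.
async def handle_request(prefill_pool, decode_pool, prompt: str):
    # Prefill is compute-bound: one big forward pass over the whole prompt.
    prefill_worker = prefill_pool.pick_least_loaded()
    kv_handle = await prefill_worker.prefill(prompt)  # KV cache stays GPU-side

    # Decode is memory-bound: stream one token per step from the cached KV,
    # on a separate fleet with its own parallelism layout.
    decode_worker = decode_pool.pick_least_loaded()
    async for token in decode_worker.decode(kv_handle):
        yield token
```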

6. Continuous batching

Iteration-level scheduling (Orca-style). Mix in-flight requests at every decode step. Required for >50 QPS on a single replica.
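
The scheduling loop itself is small. A sketch under stated assumptions: `model_step` is a hypothetical function that runs one decode step for every sequence in the batch and returns the sequences that just finished.

```python
from collections import deque

def serve_loop(model_step, waiting: deque, max_batch: int = 64):
    """Iteration-level scheduling sketch (Orca-style); names are illustrative."""
    running = []
    while waiting or running:
        # Admit new requests at every decode step, not per full generation,
        # so short requests never queue behind long ones.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        finished = model_step(running)  # one decode step for the whole batch
        running = [seq for seq in running if seq not in finished]
```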

⚠️ Common pitfalls

Don't quantize until you've validated state-of-the-art kernels for your model: bad MXFP4 implementations can regress quality by 10%+.
KV-cache eviction policies (H2O, StreamingLLM) hurt long-context retrieval. Validate on your workload.
Speculative decoding's gains depend on batch size: at batch=1 (single-stream) it's pure win; at batch=64+ the GPU is already compute-saturated and gains shrink.
Prompt caching has eviction rules: bursty traffic patterns cause cold misses, so co-locate similar prefixes.