Build a Frontier Model
Step 9 of 11

Inference & Serving

Training is one-time. Inference is forever.

Serving a frontier model economically requires every trick: FlashAttention 3 kernels, paged KV-cache (vLLM), speculative decoding (EAGLE-3, self-speculation), low-precision quantization (FP8, INT4, MXFP4), prompt caching, and disaggregated prefill/decode. Inference cost has dropped 10-100x in 18 months while quality climbed.

Why it matters

  • Inference cost dominates total cost-of-ownership for any deployed model — typical labs spend 5-20x more on inference than training.
  • Speculative decoding alone delivers 2-3x decode throughput with zero quality loss, since the target model verifies every drafted token.
  • Prompt caching (Anthropic, OpenAI, Gemini) cuts repeat-prompt costs ~90% and is now table stakes; see the API sketch after this list.
  • Quantized inference (MXFP4, AWQ-INT4) lets 70B-class models run on a single Blackwell card.
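
To make the caching bullet concrete, here is how a cacheable system prompt is marked in Anthropic's Messages API. A minimal sketch: the model name and prompt text are illustrative, and the client reads ANTHROPIC_API_KEY from the environment.

```python
import anthropic

client = anthropic.Anthropic()

long_system_prompt = "..."  # imagine a multi-thousand-token system prompt here

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            # Mark this prefix as cacheable: later requests that share it
            # are billed at the much cheaper cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's incident log."}],
)
print(response.content[0].text)
```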

State of the art

2025-2026
  • FlashAttention-3 (Hopper async + FP8) is the de facto standard attention kernel on Hopper-class GPUs.
  • vLLM PagedAttention is the default serving stack for the large majority of open-source deployments.
  • EAGLE-3 (2025) and self-speculation deliver 3-4x decode speedup with no degradation.
  • MXFP4 weight quantization (OCP standard, Blackwell-native) crossed into production for 100B+ models.
  • Disaggregated prefill/decode (DistServe, Mooncake) — separate fleets for the two distinct workloads.
  • Prompt caching with explicit cache-control headers — 90% cost reduction for system-prompt-heavy workloads.

The recipe

A frontier-grade implementation, in order.

1. Pick a serving stack

vLLM (open, PagedAttention, broad model support). SGLang (constrained generation). TensorRT-LLM (NVIDIA, fastest on Blackwell). LMDeploy (multimodal).
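
A minimal sketch of vLLM's offline API (the checkpoint name below is illustrative; substitute a model whose weights you have):

```python
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are on by default in vLLM.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain paged KV-cache in one paragraph."], params)
print(outputs[0].outputs[0].text)
```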

2. Quantize

FP8 (Hopper/Blackwell, near-free in quality). MXFP4 weights (~5% quality drop, ~2x memory savings over FP8, ~4x over BF16). AWQ-INT4 for older hardware.
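
In vLLM, serving a pre-quantized checkpoint is a one-line change. A sketch assuming an AWQ-INT4 checkpoint is available (the repo path is hypothetical):

```python
from vllm import LLM

# Weights are stored INT4 and dequantized on the fly inside the kernels;
# activations stay in float16.
llm = LLM(
    model="some-org/Llama-3.1-70B-AWQ",  # hypothetical AWQ-INT4 checkpoint
    quantization="awq",                  # vLLM also accepts e.g. "fp8"
    dtype="float16",
)
```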

3. Speculative decoding

Train a draft model 5-50x smaller, OR use self-speculation via MTP heads. EAGLE-3 is the state-of-the-art tree-style speculative decoder.
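
To make the draft-and-verify loop concrete, here is a greedy-decoding sketch. Model interfaces are assumed Hugging-Face-style (forward pass returns `.logits`); production systems use rejection sampling to stay lossless under temperature sampling, and EAGLE-style trees verify several branches at once.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens, k=4):
    """One draft-and-verify step, greedy variant (a simplified sketch).

    `target` and `draft` are assumed to be causal LMs whose forward pass
    returns `.logits` of shape [1, seq_len, vocab]; `tokens` is [1, seq_len].
    """
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    proposal = tokens
    for _ in range(k):
        logits = draft(proposal).logits[:, -1, :]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. The target scores all k proposals in ONE forward pass.
    target_preds = target(proposal).logits.argmax(-1)

    # 3. Accept the longest prefix on which draft and target agree.
    n = tokens.shape[1]
    accepted = 0
    for i in range(k):
        # target_preds[:, j] is the target's greedy token AFTER position j,
        # so the drafted token at position n+i is checked against n+i-1.
        if proposal[0, n + i] == target_preds[0, n + i - 1]:
            accepted += 1
        else:
            break

    # 4. Net gain is always >= 1 token: append the target's own prediction
    #    at the first disagreement (or after all k accepted drafts).
    bonus = target_preds[:, n + accepted - 1 : n + accepted]
    return torch.cat([proposal[:, : n + accepted], bonus], dim=-1)
```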

4. KV-cache strategy

Paged KV (vLLM) by default. MLA if you control training (it's an architectural change). SnapKV/H2O for prompt-heavy workloads. Prompt caching for repeated prefixes.
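
The core PagedAttention idea is a per-sequence block table that maps logical token positions to fixed-size physical KV blocks, allocated on demand. An illustrative sketch of the bookkeeping (not vLLM's actual internals):

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, blocks):
        # When a sequence finishes, its blocks return to the shared pool.
        self.free.extend(blocks)

class Sequence:
    """Each sequence holds a block table: logical position -> physical block."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one fills up,
        # so memory is committed on demand instead of reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1
```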

5. Disaggregated serving

Separate prefill (compute-bound) from decode (memory-bound): the prefill fleet computes the prompt's KV cache and ships it to the decode fleet. Different model parallelism, different GPU pools.
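
A sketch of the request flow. Every name here is hypothetical; in real systems (DistServe, Mooncake) the KV transfer is a high-bandwidth GPU-to-GPU copy rather than a handle passed around in Python.

```python
# All worker/pool interfaces below are hypothetical, for illustration only.
async def handle_request(prefill_pool, decode_pool, prompt: str):
    # Prefill is compute-bound: one big forward pass over the whole prompt.
    prefill_worker = prefill_pool.pick_least_loaded()
    kv_handle = await prefill_worker.prefill(prompt)  # KV cache stays GPU-side

    # Decode is memory-bound: stream one token per step from the cached KV,
    # on a separate fleet with its own parallelism layout.
    decode_worker = decode_pool.pick_least_loaded()
    async for token in decode_worker.decode(kv_handle):
        yield token
```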

6. Continuous batching

Iteration-level scheduling (Orca-style). Mix in-flight requests at every decode step. Required for >50 QPS on a single replica.
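
The scheduling loop itself is small. A sketch under stated assumptions: `model_step` is a hypothetical function that runs one decode step for every sequence in the batch and returns the sequences that just finished.

```python
from collections import deque

def serve_loop(model_step, waiting: deque, max_batch: int = 64):
    """Iteration-level scheduling sketch (Orca-style); names are illustrative."""
    running = []
    while waiting or running:
        # Admit new requests at every decode step, not per full generation,
        # so short requests never queue behind long ones.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        finished = model_step(running)  # one decode step for the whole batch
        running = [seq for seq in running if seq not in finished]
```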

⚠️ Common pitfalls

Don't quantize until you've validated state-of-the-art kernels for your model: bad MXFP4 implementations can regress quality by 10%+.
KV-cache eviction policies (H2O, StreamingLLM) hurt long-context retrieval. Validate on your workload.
Speculative decoding's gains depend on batch size: at batch=1 (single-stream) it's pure win; at batch=64+ the GPU is already compute-saturated and gains shrink.
Prompt caching has eviction rules: bursty traffic patterns cause cold misses, so co-locate similar prefixes.