Training is one-time. Inference is forever.
Serving a frontier model economically requires every trick: FlashAttention-3 kernels, paged KV-cache (vLLM), speculative decoding (EAGLE-3, self-speculation), low-precision quantization (FP8, INT4, MXFP4), prompt caching, and disaggregated prefill/decode. Stacked together, these have cut inference cost 10-100x in roughly 18 months even as quality climbed.
A frontier-grade serving stack, in order.
Serving engine: vLLM (open source; PagedAttention; broad model support). SGLang (constrained/structured generation). TensorRT-LLM (NVIDIA; fastest on Blackwell). LMDeploy (multimodal).
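A minimal vLLM offline-generation sketch; assumes vLLM is installed and a GPU is available, and the model name is just a placeholder for whatever checkpoint you serve:

```python
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are on by default.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

for out in llm.generate(["Explain paged KV-cache in one paragraph."], params):
    print(out.outputs[0].text)
```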
Quantization: FP8 (Hopper/Blackwell; near-lossless quality). MXFP4 weights (~5% quality drop for ~2x memory savings). AWQ-INT4 for older hardware.
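A toy group-wise INT4 quantizer, just to make the memory/quality trade-off concrete; real AWQ/GPTQ choose scales far more carefully, and real kernels pack two 4-bit values per byte:

```python
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 128):
    """Absmax quantization per group of 128 weights, a common INT4 config."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096 * 128).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean abs error: {err:.4f}")   # small but nonzero: quality traded for memory
```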
Speculative decoding: train a draft model 5-50x smaller, or use self-speculation via multi-token-prediction (MTP) heads. EAGLE-3 is the state-of-the-art tree-style method.
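The core accept/reject rule from Leviathan et al., sketched on toy i.i.d. distributions; no real models involved, and real drafts are conditioned on the tokens before them:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(p: np.ndarray) -> int:
    return int(rng.choice(len(p), p=p))

def verify(draft_tokens, draft_probs, target_probs):
    """Accept drafted token t with prob min(1, p_target[t] / p_draft[t]);
    on the first rejection, resample from the residual max(0, p_target - p_draft).
    (The real algorithm also samples one bonus token when all drafts pass.)"""
    out = []
    for t, pd, pt in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, pt[t] / pd[t]):
            out.append(t)                        # accepted: target agrees enough
        else:
            residual = np.maximum(pt - pd, 0.0)
            out.append(sample(residual / residual.sum()))
            break                                # everything after is discarded
    return out  # provably distributed as if sampled from the target alone

p_draft = np.array([0.5, 0.2, 0.2, 0.1])
p_target = np.array([0.25, 0.25, 0.25, 0.25])
drafts = [sample(p_draft) for _ in range(3)]
print(drafts, "->", verify(drafts, [p_draft] * 3, [p_target] * 3))
```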
KV-cache management: paged KV (vLLM) by default. MLA (multi-head latent attention; an architectural change) if you control training. SnapKV/H2O eviction for prompt-heavy workloads. Prompt caching for repeated prefixes.
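Toy block-table bookkeeping in the spirit of PagedAttention; the constants and class names here are illustrative, not vLLM's internals. Each sequence holds a list of block ids instead of one contiguous slab, so memory is allocated on demand:

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []   # logical position -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:    # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(40):
    seq.append_token()
print(seq.block_table)   # 3 blocks cover 40 tokens; nothing reserved up front
```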
Disaggregation: separate prefill (compute-bound) from decode (memory-bound), with different model parallelism and different GPU pools.
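A minimal sketch of the two-pool flow; every name here is hypothetical, and real systems (e.g. DistServe) ship the KV cache between pools over the network or NVLink:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    kv_handle: str | None = None   # filled in once prefill completes
    generated: int = 0

prefill_queue: deque[Request] = deque(Request(prompt_tokens=n) for n in (512, 2048))
decode_pool: list[Request] = []

def prefill_step() -> None:
    """Prefill pool: compute-bound, one matmul-heavy pass over the whole prompt."""
    if prefill_queue:
        req = prefill_queue.popleft()
        req.kv_handle = f"kv-{id(req)}"   # stand-in for the transferred KV cache
        decode_pool.append(req)

def decode_step() -> None:
    """Decode pool: memory-bound, one token per in-flight request per step."""
    for req in decode_pool:
        req.generated += 1

prefill_step(); decode_step()
print([(r.kv_handle is not None, r.generated) for r in decode_pool])
```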
Continuous batching: iteration-level scheduling (Orca-style). Mix in-flight requests at every decode step. Required for >50 QPS on a single replica.
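A sketch of the iteration-level loop with simulated request lengths: the batch is rebuilt at every decode step, so new requests join mid-flight and finished ones free their slot immediately instead of waiting for the batch to drain:

```python
import random
from collections import deque

random.seed(0)
# (request id, tokens it will emit before hitting its stop token)
waiting = deque((f"req-{i}", random.randint(5, 20)) for i in range(8))
running: dict[str, tuple[int, int]] = {}   # id -> (generated, target)
MAX_BATCH = 4
steps = 0

while waiting or running:
    # admit work at every iteration, not once per batch
    while waiting and len(running) < MAX_BATCH:
        rid, target = waiting.popleft()
        running[rid] = (0, target)
    # one fused decode step advances every in-flight request by one token
    for rid, (done, target) in list(running.items()):
        if done + 1 >= target:
            del running[rid]               # finished: slot frees up immediately
        else:
            running[rid] = (done + 1, target)
    steps += 1

print(f"8 requests served in {steps} decode steps")
```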
In rough order of foundational → recent.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · Dao · 2023
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · Shah et al. · 2024
Efficient Memory Management for Large Language Model Serving with PagedAttention · Kwon et al. · 2023
Fast Inference from Transformers via Speculative Decoding · Leviathan et al. · 2022
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads · Cai et al. · 2024
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty · Li et al. · 2024
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · Li et al. · 2024
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · Frantar et al. · 2022
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · Lin et al. · 2023
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models · Xiao et al. · 2022
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models · Zhang et al. · 2023
Efficient Streaming Language Models with Attention Sinks · Xiao et al. · 2023
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving · Zhong et al. · 2024