Inference engines: vLLM, SGLang, TensorRT-LLM¶
These are the orchestrators on top of the kernel libraries: they take a model and a request and decide which kernel runs for each layer. Understanding their dispatch logic tells you which low-level kernel a given inference call ends up touching.
What an inference engine does¶
In rough order of execution:
- Loads weights from disk into GPU memory, applying parallelism plan (TP/PP/EP)
- Allocates the KV cache (paged-attention layout, fp8 or bf16, chosen at startup)
- Listens on an HTTP / gRPC API
- Schedules requests into batches (continuous batching, radix-tree caching, etc.)
- Dispatches each forward pass through the layer-by-layer kernel pipeline
- Streams output tokens back
The "dispatch" step is where the architecture matters: for each layer, the engine picks an attention kernel (FlashAttention, FlashInfer, or a Triton kernel), a GEMM kernel (CUTLASS, DeepGEMM, Marlin, or a custom path), and a MoE kernel (FlashInfer-MoE, DeepEP, or NCCL fallback).
vLLM¶
GitHub: vllm-project/vllm. License: Apache-2.0. Originally UC Berkeley; now community + Anyscale.
What's distinctive¶
- PagedAttention: pioneered the paged-attention KV cache layout that's now standard
- Continuous batching: industry-leading scheduler
- Broad model support: catches up on new model architectures within days
SM120 story¶
Most of vLLM works on workstation Blackwell. The exceptions:
- Models using DSA (Differential Sparse Attention), particularly some GLM-5 variants, hit a kernel that doesn't compile on SM120. There's an open issue tracking it, and a fix PR landed in vLLM 0.7.x.
- Some MoE paths route through FlashInfer's MoE kernels, which can hit the atomics blocker.
For most non-DSA, non-MoE-EP models, vLLM "just works" on SM120 with appropriate flags.
Key flags for SM120¶
# Disable kernel paths that don't work
--quantization fp4 # Use NVFP4 (CUTLASS path)
--kv-cache-dtype fp8_e4m3 # Compact KV
--enforce-eager # Skip CUDA graph capture if it fails
# Tensor parallelism, no expert parallelism
--tensor-parallel-size 4
--pipeline-parallel-size 1
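Assembled into a full launch line, this looks like the sketch below. The model path is a placeholder; verify flag spellings against your vLLM version:
# Illustrative SM120 launch; the model path is a placeholder
vllm serve /models/some-nvfp4-checkpoint \
  --quantization fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --enforce-eager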
SGLang¶
GitHub: sgl-project/sglang. License: Apache-2.0. Originally LMSYS / UC Berkeley; community-maintained.
What's distinctive¶
- RadixAttention: aggressive prefix caching across requests
- Frontend DSL: programmable inference (control flow, structured outputs)
- MoE focus: among the better engines for MoE serving
SM120 story¶
SGLang has had explicit SM120 support since version 0.5.10. It supports:
- NVFP4 weights via FlashInfer + CUTLASS
- FP8 KV cache via FlashInfer's KV-attention
- Triton-based attention with KV-splits (the long-context fast path on SM120)
- TP=4 MoE without invoking EP-class kernels
Key flags for SM120¶
# Architecture-friendly
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3
--attention-backend auto # Lets sglang pick Triton on SM120
--triton-attention-num-kv-splits 64 # The high-impact knob
# Plan
--tensor-parallel-size 4
# Performance knobs
--mem-fraction-static 0.94
--page-size 128 # 64 if MTP enabled
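Assembled into one launch command (a sketch: the model path is a placeholder, and flag names should be checked against your SGLang version):
python -m sglang.launch_server \
  --model-path /models/some-nvfp4-checkpoint \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend auto \
  --triton-attention-num-kv-splits 64 \
  --tensor-parallel-size 4 \
  --mem-fraction-static 0.94 \
  --page-size 128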
SGLang's environment variables¶
SGLANG_ENABLE_DEEP_GEMM=0 # DeepGEMM is SM100-only as of early 2026
SGLANG_DISABLE_DEEP_GEMM=1
SGLANG_ENABLE_JIT_DEEPGEMM=0
SGLANG_PYNCCL_SKIP_WARMUP=1 # Avoid TP=4 warmup deadlock on PCIe
These are belt-and-braces: they tell sglang to skip kernel paths known to be SM100-specific.
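In practice, set them in the launch environment rather than shell-wide. A minimal wrapper sketch (placeholder model path):
# Export the overrides for this process tree only, then launch
export SGLANG_ENABLE_DEEP_GEMM=0
export SGLANG_DISABLE_DEEP_GEMM=1
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_PYNCCL_SKIP_WARMUP=1
python -m sglang.launch_server --model-path /models/some-nvfp4-checkpoint --tensor-parallel-size 4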
TensorRT-LLM¶
GitHub: NVIDIA/TensorRT-LLM. License: Apache-2.0. Maintained by NVIDIA.
What's distinctive¶
- Maximum throughput at the low end of model size (under 70B parameters); compiles to a TensorRT engine ahead of time
- NVIDIA-blessed, with the most aggressive use of CUTLASS templates
- Best-in-class FP8 inference on Hopper / SM100
SM120 story¶
TRT-LLM is "supposed to" work on SM120 but ships precompiled engines targeting sm_100a. To run on SM120 you typically:
- Build TRT-LLM from source with `--target-arch sm_120`
- Compile your model to a TensorRT engine on the SM120 device itself (engines aren't portable across architectures)
This is gnarly enough that most workstation-Blackwell users skip TRT-LLM and use vLLM or sglang. TRT-LLM's advantages (peak Hopper throughput) don't translate to SM120 anyway.
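If you do go down this road, the shape of it is roughly the sketch below. Heavily hedged: build flags differ across TRT-LLM releases (recent ones take --cuda_architectures on scripts/build_wheel.py), and the checkpoint/engine paths are placeholders:
# 1. Build the wheel from source for SM120 (check the flag name for your release)
git clone https://github.com/NVIDIA/TensorRT-LLM
cd TensorRT-LLM
python3 scripts/build_wheel.py --cuda_architectures "120-real"
# 2. Build the engine on the SM120 machine itself; engines aren't portable
trtllm-build --checkpoint_dir /models/converted-ckpt --output_dir /models/engine-sm120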
When you'd still pick TRT-LLM¶
- Need NVIDIA-verified deployment (e.g., for a customer requiring NVIDIA validation)
- Specific kernel exists in TRT-LLM that hasn't been ported elsewhere
- Already comfortable with the TensorRT toolchain
Decision tree for SM120 deployment¶
graph TD
Start[Model to serve]
Start --> ModelType{Model type?}
ModelType -- "Dense (Llama, Mistral)" --> Dense[Dense path]
ModelType -- MoE --> MoEPath{Has working SM120 NVFP4 weights?}
Dense --> EngineDense{Engine}
EngineDense -- vLLM --> vLLMD[vLLM<br/>--quantization fp8<br/>--kv-cache-dtype fp8_e4m3]
EngineDense -- sglang --> sglangD[sglang<br/>similar flags]
MoEPath -- yes --> MoESmGood[Use sglang or vLLM<br/>TP=4 NOT EP=4<br/>Triton attention with kv-splits=64]
MoEPath -- no --> Quant[Re-quantize to NVFP4 or W4A16]
Quant --> MoESmGood
The dominant pattern: TP-only parallelism, NVFP4 weights, FP8 KV cache, Triton attention with high kv-splits, DeepGEMM disabled in favor of CUTLASS.
Common failures shared across engines¶
These show up regardless of which engine you pick on SM120:
- `no kernel image is available`: the kernel library shipped only `sm_100a` cubins. Reinstall an SM120-aware build of the offending library (FlashInfer, DeepGEMM, etc.).
- NCCL warmup deadlock at TP=4: the engine's distributed-init NCCL warmup phase deadlocks on cross-root-complex P2P. Fix: `SGLANG_PYNCCL_SKIP_WARMUP=1` (sglang) or `VLLM_DISABLE_NCCL_WARMUP=1` (vLLM, where supported).
- MoE all-to-all timeout: see `flashinfer` and `nvshmem-and-deepep`. The fix is plan-level: switch from EP to TP.
- Gibberish output at long context: kv-splits is likely still at the default (8). Set `--triton-attention-num-kv-splits 64`.
- OOM during prefill on long prompts: `--mem-fraction-static` is too aggressive (>0.95). Drop to 0.94 or 0.92.
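For the first failure, confirm what the installed wheel actually ships before reinstalling. A diagnostic sketch, assuming cuobjdump from the CUDA toolkit is on PATH and using FlashInfer as the example library:
# Confirm the GPU really reports SM120
python -c "import torch; print(torch.cuda.get_device_capability())"  # expect (12, 0)
# List the architectures embedded in the package's shared objects
PKG_DIR=$(python -c "import flashinfer, os; print(os.path.dirname(flashinfer.__file__))")
find "$PKG_DIR" -name '*.so' -exec cuobjdump --list-elf {} \; | grep -o 'sm_[0-9a]*' | sort -u
# If only sm_100a appears, the wheel was built without SM120 support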
Engine version pinning¶
Specific behaviors documented above are pinned to versions current as of early 2026:
- vLLM 0.7.x
- SGLang 0.5.10–0.5.11
- TensorRT-LLM 0.18.x
Newer or older versions may behave differently. Check release notes before assuming a flag still does what this wiki says.
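To reproduce the behavior described here, pin the installs (a sketch, assuming wheels for these versions exist on PyPI):
pip install "vllm==0.7.*"
pip install "sglang[all]==0.5.11"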
See also¶
- `flashinfer`, `flashattention`, `cutlass`: what the engines call into
- `compatibility/`: patterns for making things work
- `case-studies/`: model-specific deployment recipes
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM, 2023)
- Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs" (2024)
- vLLM, SGLang, TRT-LLM project documentation