# Kernel libraries
This section covers the CUDA kernel libraries that modern inference engines compose. Each one has a specific role, a specific Blackwell-era story, and a specific set of failure modes on workstation Blackwell (SM120).
## The landscape
```mermaid
graph TD
    INF["Inference engines<br/>(vLLM, SGLang, TRT-LLM)"]
    INF --> ATTN["Attention<br/>(FlashAttention, FlashInfer)"]
    INF --> GEMM["GEMM<br/>(CUTLASS, DeepGEMM, Marlin)"]
    INF --> MOE["MoE all-to-all<br/>(DeepEP, FlashInfer-MoE, NCCL)"]
    INF --> COMM["Cross-GPU communication<br/>(NCCL, NVSHMEM)"]
    INF --> COMP["Custom kernels<br/>(Triton, TransformerEngine)"]
```
Inference engines aren't kernel implementations themselves — they're orchestrators. Each request goes through layers of dispatch: framework → kernel library → architecture-specific code path. The "does it work on SM120" question is answered at the bottom layer.
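That bottom-layer dispatch can be sketched in a few lines. This is an illustrative model, not any engine's actual code: `arch_tag`, `pick_attention_backend`, and the backend table are hypothetical names invented for the sketch, and the per-arch choices merely echo the compatibility table below.

```python
def arch_tag(major: int, minor: int) -> str:
    """Map a CUDA compute capability (e.g. 12,0) to an sm_* tag."""
    return f"sm_{major}{minor}"

# Hypothetical bottom-layer table: which attention backend an engine might
# select per architecture. Real engines branch per-op with many more conditions.
ATTENTION_BACKENDS = {
    "sm_90":  "flash_attn_3",   # Hopper: FA-3
    "sm_100": "flashinfer",     # datacenter Blackwell
    "sm_120": "flash_attn_2",   # workstation Blackwell: the portable FA-2 path
}

def pick_attention_backend(major: int, minor: int) -> str:
    tag = arch_tag(major, minor)
    try:
        return ATTENTION_BACKENDS[tag]
    except KeyError:
        # This is the shape of the "does it work on SM120" failure:
        # no entry at the bottom layer means no kernel for this device.
        raise RuntimeError(f"no attention kernel compiled for {tag}")

print(pick_attention_backend(12, 0))  # → flash_attn_2
```

The point of the sketch: a missing dictionary entry at this layer is what surfaces upstream as an unsupported-architecture error, no matter which engine sits on top.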
## Pages in this section
- cutlass — NVIDIA's GEMM template library
- flashattention — the FA-2 / FA-3 family
- flashinfer — attention + MoE for serving
- deepgemm — DeepSeek's high-throughput FP8/FP4 GEMM
- marlin-and-friends — INT4 paths (Marlin, AWQ, GPTQ)
- triton-and-transformerengine — DSL kernels and NVIDIA's mixed-precision wrappers
- nvshmem-and-deepep — communication primitives for MoE
- inference-engines — vLLM, SGLang, TRT-LLM as kernel composers
## How to read this section
Each library page follows the same template:
- What it is — purpose, who maintains it
- What it depends on — its place in the stack
- SM100 story — how datacenter Blackwell is supported
- SM120 story — what works, what doesn't, what's gated
- Common failures — the specific errors users encounter
- Detection — how to figure out whether a binary uses this library and which arch it targets
- References — where to read more
If you're trying to understand a specific failure ("why doesn't FlashInfer's MoE all-to-all work?"), find the relevant library page and read sections 4–5.
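The Detection step above usually starts with NVIDIA's `cuobjdump --list-elf` tool, which lists the cubins embedded in a binary along with their `sm_*` tags. A minimal sketch of parsing that listing, assuming output of the usual shape (the sample text and `embedded_archs` helper below are made up for illustration; run the real tool against your own library to get actual lines):

```python
import re

# Made-up sample in the shape of `cuobjdump --list-elf libfoo.so` output.
SAMPLE_OUTPUT = """\
ELF file    1: libkernels.1.sm_100a.cubin
ELF file    2: libkernels.2.sm_90a.cubin
ELF file    3: libkernels.3.sm_120.cubin
"""

def embedded_archs(listing: str) -> set[str]:
    """Collect the sm_* targets named in an ELF listing."""
    return set(re.findall(r"sm_\d+a?", listing))

archs = embedded_archs(SAMPLE_OUTPUT)
print(sorted(archs))       # → ['sm_100a', 'sm_120', 'sm_90a']
print("sm_120" in archs)   # → True: this binary ships an SM120 target
```

If `sm_120` (or a PTX fallback your device can JIT) is absent from the listing, the library cannot run on workstation Blackwell regardless of what the engine above it claims.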
## Compatibility summary
A bird's-eye view of where each library sits:
| Library | SM100 status | SM120 status | Notes |
|---|---|---|---|
| CUTLASS | full support, sm_100a templates | partial, sm_120 templates exist; SMEM cliff is the main gotcha | NVIDIA-maintained |
| FlashAttention 2 | works | works (FA-2 is portable) | Tri Dao, MIT |
| FlashAttention 3 | yes (Hopper-extended) | not yet — Blackwell port in flight | |
| FlashInfer | full | partial — NVFP4 ok, MoE one-shot a2a needs P2P atomics | |
| DeepGEMM | full | unsupported as shipped, port in progress | DeepSeek-AI |
| Marlin | not the optimal path | works fine; older arch, broadly supported | |
| Triton | works | works | DSL compiler — most kernels portable |
| TransformerEngine | NVIDIA's reference | evolving | NVIDIA-maintained |
| NVSHMEM | requires NVLink for performance | unusably slow without NVLink | NVIDIA |
| DeepEP intranode | requires NVLink + NVSHMEM | doesn't run | DeepSeek-AI |
| DeepEP internode | requires RDMA NIC | doesn't run | |
| vLLM | works | works for non-DSA models; DSA needs SM120 fix | |
| SGLang | works | works (specific versions, with patches) | |
| TensorRT-LLM | works | builds on SM120; precompiled engines target SM100 | NVIDIA |
The pattern: anything compiled by NVIDIA against datacenter Blackwell defaults to sm_100a and ships only that target. Workstation Blackwell support requires either a recompile or an SM120-targeted variant.
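The reason a recompile is needed rather than a driver-side fix: the `a` suffix on `sm_100a` marks an architecture-specific target, and such cubins run only on that exact architecture, with no forward compatibility. A simplified model of the rule, as a sketch (the `cubin_runs_on` helper is hypothetical, and real compatibility has more cases than this):

```python
def cubin_runs_on(device_cc: str, target: str) -> bool:
    """Simplified model: can a cubin built for `target` run on a device of
    compute capability `device_cc` (e.g. "120" for SM120)?

    - arch-specific targets (sm_XYa) run only on exactly that capability;
    - plain sm_XY cubins run on the same major generation at an
      equal-or-higher minor revision.
    """
    arch_specific = target.endswith("a")
    t = target.rstrip("a").removeprefix("sm_")
    if arch_specific:
        return device_cc == t
    return device_cc[:-1] == t[:-1] and int(device_cc[-1]) >= int(t[-1])

# An sm_100a-only wheel on a workstation Blackwell (SM120) device:
print(cubin_runs_on("120", "sm_100a"))  # → False ("no kernel image" at load)
print(cubin_runs_on("120", "sm_120"))   # → True
```

This is exactly the failure mode of NVIDIA-built wheels that ship only the `sm_100a` target: the device driver finds no loadable image for SM120 and the load fails, which only a recompile or an SM120-targeted build resolves.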
## Why so many libraries?
A reasonable question. The fragmentation reflects different optimization domains:
- Attention vs GEMM are different problems with different optimal kernel structures (irregular sparsity vs dense matmul)
- MoE all-to-all is a communication primitive, not a compute one
- Quantization (NVFP4 vs MX-FP4 vs INT4) changes the kernel inner loop substantially
- Different arches (SM80, SM90, SM100, SM120) each motivate different tile shapes and pipelines
The combinatorial explosion produces ~10 distinct libraries, each maintained by a different team. The inference engine on top has the unenviable job of choosing the right one for each layer of each model.
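That per-layer choice is, in effect, a lookup over (operation, precision, architecture). A hypothetical routing table, just to make the combinatorics concrete (the `route` helper and the specific entries are illustrative, loosely mirroring the compatibility table above, and not any engine's real routing):

```python
# Hypothetical dispatch table: which library an engine might pick for each
# (op, dtype, arch) triple. Real routing is engine- and version-specific.
KERNEL_CHOICE = {
    ("attention", "bf16", "sm_120"): "flash_attn_2",
    ("gemm",      "fp8",  "sm_100"): "deepgemm",
    ("gemm",      "fp8",  "sm_120"): "cutlass",   # DeepGEMM port still in flight
    ("gemm",      "int4", "sm_120"): "marlin",
    ("moe_a2a",   "bf16", "sm_100"): "deepep",
}

def route(op: str, dtype: str, arch: str) -> str:
    choice = KERNEL_CHOICE.get((op, dtype, arch))
    if choice is None:
        raise NotImplementedError(f"no {op}/{dtype} kernel for {arch}")
    return choice

print(route("gemm", "int4", "sm_120"))  # → marlin
```

Every empty cell in this table is a potential "unsupported on your GPU" error, which is why the same model can work on SM100 and fail on SM120 in a single unrouted layer.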
## A note on versioning
These libraries change rapidly — most ship a new release every 2–4 weeks. Specific behaviors documented on these pages are pinned to versions current as of early 2026:
- CUTLASS 3.6.x
- FlashAttention 2.7.x, FA-3 development
- FlashInfer 0.6.x – 0.7.x
- DeepGEMM as of `deepseek-ai/DeepGEMM` main
- vLLM 0.7.x
- SGLang 0.5.x
- TensorRT-LLM 0.18.x
When you read this in mid-2026 or later, expect specific commit hashes and versions to have moved. The architectural facts (which library uses tcgen05, which one needs P2P atomics) evolve more slowly than the version numbers.