Glossary¶
Project-specific and Blackwell-specific vocabulary. Cross-linked to the page where each term is explained in depth.
Compute capability¶
Compute capability — NVIDIA's versioning scheme for SM ISAs, written as <major>.<minor> (e.g., 7.0 Volta, 8.0 Ampere, 9.0 Hopper, 10.0 datacenter Blackwell, 12.0 workstation/consumer Blackwell). Within a major version, compiled code is generally forward-compatible; across major versions, features may not exist and code must be rebuilt. See fundamentals/cuda-pipeline.
sm_NN — the lowercase compiler-flag form of compute capability (e.g., sm_100, sm_120). The bare form targets the portable subset of the architecture.
sm_NNa — "architecture-specific accelerated." Allows non-portable instructions (e.g., sm_100a enables tcgen05.*). Code compiled with the a suffix runs only on that exact compute capability — not earlier, not later.
sm_NNf — "forward-compatible." Restricts the code to instructions that will run on sm_NN and any later same-major architecture. Useful for code that needs to work across sm_120 workstation parts and any future sm_12N parts.
Architectures and codenames¶
Blackwell — NVIDIA's GPU generation, 2024–2026.

- GB100: B100/B200 datacenter chips, SM 10.0, HBM3e, NVLink 5.
- GB202: workstation/consumer chips (RTX PRO 6000 Workstation, RTX 5090), SM 12.0, GDDR7, no NVLink.
Hopper — preceding NVIDIA generation. SM 9.0 (H100/H200). Introduced TMA, async tensor cores (wgmma.async), thread block clusters.
Ampere — generation before Hopper. SM 8.0/8.6/8.9 (A100, RTX 30/40 series).
SXM — NVIDIA's datacenter card form factor. Implies presence of NVLink. (RTX cards are PCIe form factor; the "PRO" line spans both.)
Memory¶
Global memory (HBM/GDDR) — off-chip device memory. HBM3e on datacenter Blackwell (~3–8 TB/s), GDDR7 on workstation (~1.8 TB/s).
Shared memory (SMEM) — on-chip per-block scratchpad. Programmer-managed. Capacities: 99 KiB/block on SM120, 228 KiB/block on SM100. The 99 vs 228 split is one of the most consequential numbers in this wiki. See fundamentals/memory-hierarchy.
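Anything above the default 48 KiB of dynamic shared memory is a per-kernel opt-in through the standard runtime attribute; a minimal sketch (kernel name and sizes illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void big_smem_kernel() {
    extern __shared__ unsigned char smem[];  // dynamic shared memory
    smem[threadIdx.x] = threadIdx.x;
}

int main() {
    size_t bytes = 99 * 1024;  // SM120 per-block ceiling; SM100 allows the ~228 KiB class
    cudaFuncSetAttribute(big_smem_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, (int)bytes);
    big_smem_kernel<<<1, 128, bytes>>>();
    return cudaDeviceSynchronize() != cudaSuccess;
}
```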
Tensor Memory (TMEM) — a new on-chip memory class introduced with SM100. Holds Tensor Core accumulators decoupled from registers. Does not exist on SM120. See blackwell/tcgen05-and-tmem.
Registers — per-thread storage. Limit: 255 32-bit registers per thread on most current arches.
Constant memory — small, cached, read-only memory class.
L1 / L2 cache — on-chip caches in the standard memory hierarchy.
Tensor Core ISA¶
mma.sync — universal Tensor Core MMA instruction, available since Volta. Synchronous: warp blocks until result lands in registers. Operates on small tiles (m16n8k16 / m16n8k32). Available on both SM100 and SM120.
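The same synchronous Tensor Core path is reachable from C++ through the WMMA API, which nvcc lowers to MMA instructions; a hedged one-tile sketch (kernel name and leading dimensions illustrative):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16x16 FP16 tile with an FP32 accumulator.
__global__ void mma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);  // leading dimension of the 16x16 tile
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);    // warp blocks until acc is in registers
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```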
wgmma.async — Hopper's warp-group async MMA. Larger tiles, asynchronous. Mostly superseded by tcgen05.mma on Blackwell datacenter; still supported on Hopper.
tcgen05.mma — Blackwell datacenter MMA family. Asynchronous, large-tile (up to m128n128k64 single-CTA, m256n128k64 CTA-pair), accumulator in TMEM. Datacenter only. See blackwell/tcgen05-and-tmem.
tcgen05.alloc / tcgen05.commit / tcgen05.cp — companion instructions for managing TMEM allocation, completion, and copy-out.
Number formats¶
FP16 — half-precision IEEE float. 1 sign + 5 exponent + 10 mantissa.
BF16 — brain-float-16. 1 sign + 8 exponent + 7 mantissa. Same range as FP32; less precision.
FP8 E4M3 — 8-bit float, 4 exponent bits, 3 mantissa. Better precision, smaller range.
FP8 E5M2 — 8-bit float, 5 exponent, 2 mantissa. Larger range, less precision. Often used for gradients in training.
FP6 — 6-bit float (E2M3 or E3M2). Less common; appears in some quantization recipes.
FP4 (E2M1) — 4-bit float, 1 sign + 2 exponent + 1 mantissa. Tiny range; only useful in block-quantized form with a per-block scale.
MX-FP4 — Open Compute Project Microscaling (MX) spec for FP4. 32-element block, per-block E8M0 (power-of-two) scale.
NVFP4 — NVIDIA's variant of MX-FP4. 16-element block (smaller → better dynamic-range tracking), per-block FP8 (E4M3) scale (a real mantissa, unlike MX's power-of-two E8M0). Native on both SM100 and SM120 Tensor Cores. See fundamentals/number-formats.
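A worked decode sketch for block-scaled FP4. The E2M1 value table follows from the format (subnormals give {0, 0.5}, normals give {1, 1.5, 2, 3, 4, 6}); the 16-element layout mirrors NVFP4, but the unpacked codes and plain-float scale are simplifications:

```cuda
#include <cstdio>

// Map a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa) to its value.
__host__ __device__ float fp4_e2m1(unsigned code) {
    const float mag[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    float m = mag[code & 0x7];
    return (code & 0x8) ? -m : m;   // top bit is the sign
}

int main() {
    // NVFP4-style dequant: 16 codes share one scale (stored as FP8 E4M3
    // in the real format; a plain float here for clarity).
    unsigned codes[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
    float scale = 0.25f;
    for (int i = 0; i < 16; i++) printf("%g ", fp4_e2m1(codes[i]) * scale);
    printf("\n");
    return 0;
}
```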
TF32 — 19-bit Tensor Core internal format on Ampere+. 1 sign + 8 exponent + 10 mantissa. Used for FP32 matmuls accelerated through Tensor Cores.
Parallelism plans¶
TP (Tensor Parallelism) — split each weight matrix across N GPUs, all-reduce per layer. Good for GEMM-bound models on any topology.
PP (Pipeline Parallelism) — split layers across N GPUs, microbatch through. Good for memory-bound models. Some bubble overhead.
EP (Expert Parallelism) — for MoE: each GPU owns a subset of experts, route tokens via all-to-all. Bandwidth-hungry; works well only on NVLink-class fabrics. See interconnect/moe-parallelism.
DP (Data Parallelism) — replicate the model, split batch. Common for training, less for inference.
Hybrid plans — TP + PP, EP + TP, etc. The pragmatic choice on most rigs.
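For intuition about why the plan choice is interconnect-bound: with a standard ring all-reduce, each of the $N$ GPUs in a TP group moves roughly

$$
\text{bytes per all-reduce} \approx 2\,\frac{N-1}{N}\, t \cdot h \cdot b
$$

where $t$ is tokens in flight, $h$ the hidden size, and $b$ bytes per element (all values illustrative). At $t = 4096$, $h = 8192$, BF16 ($b = 2$), $N = 4$, that is about 100 MB, repeated once or twice per layer depending on the scheme; this is the traffic that NVLink-class fabrics absorb and PCIe struggles with.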
Interconnect¶
NVLink — NVIDIA's proprietary high-bandwidth GPU-to-GPU interconnect. Generation 5 (Blackwell): 1.8 TB/s/GPU on NVL72. Datacenter only.
NVSwitch — NVLink-based switch fabric. Connects 8+ GPUs in a single chassis (DGX, HGX). Gives uniform bandwidth across all pairs.
MNNVL (Multi-Node NVLink) — NVL72-class fabric extending NVLink across nodes (up to 72 GPUs in a rack-scale NVL72 domain).
PCIe — universal host-side interconnect. Gen4 (~2 GB/s/lane), Gen5 (~4 GB/s/lane); an x16 slot gives ~32 GB/s (Gen4) or ~64 GB/s (Gen5) per direction.
P2P (peer-to-peer) — direct GPU-to-GPU memory access without staging through host RAM. Enabled if GPUs share a switch or root complex.
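Checking and enabling P2P uses two standard runtime calls; a minimal sketch with illustrative device ordinals:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  // can device 0 access device 1?
    printf("P2P 0 -> 1: %s\n", can01 ? "yes" : "no");
    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // flags must be 0
        // cudaMemcpyPeer and direct loads/stores to device 1's memory
        // now bypass host RAM.
    }
    return 0;
}
```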
Atomics — atomic memory operations across the interconnect. On consumer GPUs, P2P atomics are software-gated off by default; require BIOS (ACS Disabled) + driver (RMDisableFeatureDisablement=1) to enable. See interconnect/p2p-and-atomics.
ACS (Access Control Services) — PCIe feature that, when enabled, isolates devices behind separate IOMMU groups and blocks P2P atomics. Counterintuitively, disabling ACS is what allows atomics.
RDMA — remote-DMA over network (InfiniBand, RoCE). Used for multi-node GPU-to-GPU transfers in datacenters. Not relevant to single-node consumer setups.
Kernel libraries¶
CUTLASS — NVIDIA's CUDA Templates library. The reference implementation of high-performance GEMM. Templates compile per-architecture; SM100 templates default to sm_100a. See kernels/cutlass.
FlashAttention (FA-2, FA-3) — Tri Dao's attention kernel. FA-2 portable; FA-3 Hopper-only (a Blackwell port is in development). See kernels/flashattention.
FlashInfer — kernel library for serving (attention + MoE). NVFP4 paths exist; some MoE all-to-all kernels need P2P atomics. See kernels/flashinfer.
DeepGEMM — DeepSeek's high-throughput FP8/FP4 GEMM. SM100-only as shipped. See kernels/deepgemm.
Marlin — INT4-weight GEMM with FP16 activations. Built on older-architecture primitives; works on SM120.
Triton — DSL compiler for custom kernels. Works on SM120.
TransformerEngine — NVIDIA's mixed-precision wrapper library. SM120 support evolving.
NVSHMEM — one-sided GPU memory primitives over NVLink. Requires NVLink for performance; PCIe fallback exists but is unusably slow.
DeepEP — DeepSeek's expert-parallel a2a kernels. Intranode requires NVLink + NVSHMEM; internode requires RDMA.
vLLM, sglang, TensorRT-LLM — high-level inference engines. They compose the above libraries. See kernels/inference-engines.
Models referenced¶
DeepSeek-V3 / V4 / V4-Flash — DeepSeek's frontier MoE family (671B, with V4 evolutions). Heavy users of tcgen05, DeepGEMM, NVSHMEM, EP.
Kimi-K2 / K2.6 — Moonshot's MoE model family. Similar dependencies.
GLM-5.0 / 5.1 — Zhipu's MoE family (~478B–744B). Less aggressive on tcgen05, but still reference-deployed on SM100.
Qwen-3 (MoE variants) — Alibaba's open MoE family.
REAP — "REbalanced Activation Pruning," a pruning technique that removes whole experts from a MoE model with minimal quality loss.
Compilation pipeline¶
nvcc — NVIDIA CUDA Compiler. Front-end that drives host C++ compilation and produces PTX/cubin.
ptxas — PTX assembler. Lowers PTX to SASS (cubin).
cuobjdump — inspector for compiled CUDA binaries.
nvdisasm — disassembler for SASS.
SASS — NVIDIA's per-architecture machine code. Not portable across SM versions.
Cubin — compiled CUDA binary. Contains SASS for one or more architectures, plus optional PTX for JIT.
JIT — just-in-time. The driver can compile PTX to SASS at load time if no matching cubin section exists.
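One trip through the pipeline, sketched; the toolchain invocations are standard, the file name is illustrative:

```cuda
// pipeline.cu -- trivial kernel to push through the toolchain.
__global__ void axpy(float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] += a * x[i];
}

// Build and inspection steps (shell, shown as comments):
//   nvcc -gencode arch=compute_120,code=sm_120       // ptxas runs under the hood
//        -gencode arch=compute_120,code=compute_120  // also embed PTX for JIT
//        -c pipeline.cu -o pipeline.o
//   cuobjdump --dump-sass pipeline.o                 // the SM 12.0 SASS
//   cuobjdump --dump-ptx  pipeline.o                 // the embedded PTX fallback
```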
Other¶
KV cache — key-value cache in attention. Stores past tokens' K and V projections so attention is O(N) per new token rather than O(N²).
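Back-of-envelope sizing, with every shape assumed for illustration:

$$
\text{KV bytes} = 2 \times L \times H_{kv} \times d_{head} \times S \times b
$$

e.g., $L = 32$ layers, $H_{kv} = 8$ KV heads (GQA), $d_{head} = 128$, $S = 32768$ tokens, FP16 ($b = 2$) gives $\approx 4.3$ GB per sequence, which is why paged management and KV quantization matter.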
PagedAttention / paged KV — block-based KV-cache management (vLLM-style).
MTP (Multi-Token Prediction) — speculative-decoding scheme that predicts multiple tokens in parallel.
NSA (Native Sparse Attention) — DeepSeek's sparse-attention variant.
REAP-NN — naming convention for how many experts per layer survive pruning (e.g., REAP-160 = 160 of 256 experts kept).
Watchdog — sglang/vLLM background thread that kills the server if a forward pass takes too long.
Xid — NVIDIA driver error code (e.g., Xid 79 = GPU has fallen off the bus).
AER (Advanced Error Reporting) — PCIe link-layer error reporting; surfaces RxErr counters on stressed Gen4 links.
Bus / device / function IDs — PCIe addressing (BDF), e.g., 01:00.0. May change after BIOS settings change.
See also¶
- reference/bibliography for primary sources
- reference/abbreviations for the alphabet-soup version