Abbreviations

Short forms used throughout the wiki, in alphabetical order.

| Abbrev | Expansion |
| --- | --- |
| A2A | All-to-all (collective communication) |
| AR | All-reduce (collective communication) |
| B100/B200/B300 | NVIDIA Blackwell datacenter GPUs (SM100) |
| BW | Bandwidth |
| CGA | Cooperative thread-block cluster (sometimes "CTA group"; same as cluster) |
| CTA | Cooperative Thread Array (a thread block, in PTX terminology) |
| CUTLASS | CUDA Templates for Linear Algebra Subroutines |
| DSA | DeepSeek Sparse Attention |
| EP | Expert Parallelism |
| FA | FlashAttention |
| FA2 / FA3 | FlashAttention v2 / v3 |
| FP4 | 4-bit floating-point (E2M1) |
| FP8 | 8-bit floating-point (E4M3 or E5M2) |
| FP16 | 16-bit floating-point (IEEE half) |
| FP32 | 32-bit floating-point (IEEE single) |
| GB200/GB300 | Grace–Blackwell superchips |
| GEMM | General Matrix-Matrix Multiplication |
| GPC | Graphics Processing Cluster (GPU subdivision) |
| GPU | Graphics Processing Unit |
| GQA | Grouped-Query Attention |
| HBM | High-Bandwidth Memory (used in datacenter Blackwell) |
| IB | InfiniBand |
| KV | Key/Value (in transformer attention) |
| L1 / L2 | Level-1 / Level-2 cache |
| MHA | Multi-Head Attention |
| MIG | Multi-Instance GPU |
| MLA | Multi-head Latent Attention |
| MMA | Matrix-Multiply-Accumulate (Tensor Core operation) |
| MoE | Mixture-of-Experts |
| NCCL | NVIDIA Collective Communications Library |
| NSA | DeepSeek Native Sparse Attention |
| NVFP4 | NVIDIA FP4 microscaled format |
| NVL72 | NVLink-72 rack interconnect |
| NVSHMEM | NVIDIA Symmetric Hierarchical Memory (PGAS for GPUs) |
| OOM | Out of memory |
| P2P | Peer-to-Peer (GPU-to-GPU access) |
| PCIe | PCI Express |
| PGAS | Partitioned Global Address Space |
| PP | Pipeline Parallelism |
| PTX | Parallel Thread eXecution (NVIDIA's GPU IR) |
| RDMA | Remote Direct Memory Access |
| REAP | Router-weighted Expert Activation Pruning |
| SASS | Shader Assembly (NVIDIA's GPU machine code) |
| SDPA | Scaled Dot-Product Attention |
| SM | Streaming Multiprocessor |
| SM90 / SM100 / SM120 | Compute capability tags (Hopper / Blackwell DC / Blackwell workstation) |
| SMEM | Shared Memory (per-SM, per-block) |
| SXM | NVIDIA's high-power form factor (used for datacenter cards) |
| T | Tera (10¹²); in tok/s contexts, tokens per second |
| TC | Tensor Core |
| TE | TransformerEngine (NVIDIA library) |
| TFLOPS | Tera-FLoating-point OPerations per Second |
| TMA | Tensor Memory Accelerator |
| TMEM | Tensor Memory (SM100/SM101 only) |
| TP | Tensor Parallelism |
| VRAM | Video RAM (i.e., GPU device memory) |
| W4A16 | 4-bit Weight, 16-bit Activation (a quantization scheme) |
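As a concrete illustration of the FP8 entry above, here is a minimal decoder sketch for the E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits; no infinities, and NaN only at the all-ones encoding). This is a from-scratch illustration in Python, not taken from any library; the function name is ours.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits.

    E4M3 has no infinities; the only NaN encoding is exponent and
    mantissa all ones, which frees S.1111.110 to be the largest
    finite value (448).
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent 1 - bias
        return sign * (man / 8) * 2.0 ** (1 - 7)
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

assert decode_e4m3(0b0_0111_000) == 1.0    # exponent = bias -> 2^0
assert decode_e4m3(0b0_1111_110) == 448.0  # largest finite E4M3 value
```

E5M2 differs only in the split (5 exponent bits, bias 15, 2 mantissa bits) and in keeping IEEE-style infinities and NaNs, trading precision for range.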

For full glossary entries with prose definitions, see overview/glossary.