SM100 vs SM120¶
The architectural diff between datacenter Blackwell (compute capability 10.0) and workstation/consumer Blackwell (compute capability 12.0). The most important page in this wiki.
The codenames¶
| Compute capability | NVIDIA codename | Products | Form factor |
|---|---|---|---|
| 10.0 (SM100) | GB100, GB200, GB300 | B100, B200, B300; GB200/GB300 superchips | SXM, PCIe |
| 12.0 (SM120) | GB202 | RTX PRO 6000 Blackwell Server Edition, RTX PRO 6000 Blackwell Workstation Edition, RTX 5090, RTX 5080 | PCIe |
Note on the "RTX PRO 6000 Blackwell" branding: both the Server Edition and the Workstation Edition are GB202 / SM120 — the same silicon as the RTX 5090, just in different cooling and power form factors. They are not datacenter Blackwell. SM100 is reserved for the B100/B200/B300 SXM parts and the GB200/GB300 superchips. If your card has "RTX PRO 6000" anywhere in the name, you are on SM120.
What stays the same¶
- Tensor Core generation is identical: gen 5, with native FP4/FP6/FP8/BF16/FP16 paths
- PTX ISA major version: 8 (with minor version 5 as of 2026)
- Driver and CUDA toolkit: same binaries support both
- Threading model: warps of 32, CTAs/blocks, grid hierarchy
- L1/L2 caching architecture: same conceptual structure (sizes differ)
- Native NVFP4 + MX-FP4 GEMM: both halves have it
What's different¶
1. The tcgen05 ISA family¶
SM100 has it. SM120 does not.
A new family of PTX instructions introduced with datacenter Blackwell:
- `tcgen05.alloc` / `tcgen05.dealloc` — manage Tensor Memory allocations
- `tcgen05.mma.cta_group::1` / `cta_group::2` — issue large-tile asynchronous MMA into TMEM
- `tcgen05.commit` — wait barrier on outstanding MMAs
- `tcgen05.cp.shared::cta::tmem` — copy from SMEM to TMEM (and back)
- `tcgen05.shift` — shift accumulator data within TMEM (used for layout transforms)
When you assemble PTX containing any of these for `--gpu-name=sm_120`, `ptxas` errors with "instruction not supported." There is no fallback path; the kernel has to be rewritten. See tcgen05-and-tmem.
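In source code, the usual defense is to gate the tcgen05 path at compile time so the same file builds for both halves. A minimal sketch, with hypothetical tile functions as placeholders; note that tcgen05 codegen additionally requires the arch-specific `sm_100a` target, so the plain `__CUDA_ARCH__` test below is a simplification:

```cuda
// Sketch: compile-time dispatch between the SM100-only tcgen05 path and a
// portable fallback. Both tile functions are hypothetical placeholders.
__device__ void gemm_tile_tcgen05()  { /* tcgen05 MMA into TMEM */ }
__device__ void gemm_tile_mma_sync() { /* per-warp mma.sync, register accum */ }

__device__ void gemm_tile() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) && (__CUDA_ARCH__ < 1200)
    gemm_tile_tcgen05();   // datacenter Blackwell (CC 10.x) only
#else
    gemm_tile_mma_sync();  // Hopper, SM120, and everything older
#endif
}
```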
2. Tensor Memory (TMEM)¶
SM100 has 256 KB/SM. SM120 has none.
A new on-chip memory class, separate from registers and SMEM, dedicated to Tensor Core accumulators. See fundamentals/memory-hierarchy and tcgen05-and-tmem.
Without TMEM, accumulators must live in either:
- Registers: limited to 255 32-bit registers/thread. A 128×128 FP32 accumulator (64 KB) doesn't fit in a single warp's worth of registers.
- SMEM: limited to 99 KiB/block on SM120, and shared with operand-staging buffers and pipeline state.
So SM120 GEMMs must use smaller accumulator tiles than their SM100 counterparts.
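The arithmetic behind that constraint, as a sketch using the numbers above (the tile shape is illustrative):

```cuda
#include <cstdio>

int main() {
    // Register budget: 255 32-bit registers per thread, 32 threads per warp.
    constexpr int kWarpRegBytes = 255 * 32 * 4;      // ~32 KiB per warp
    // A 128x128 FP32 accumulator tile, as used by SM100 tcgen05 kernels:
    constexpr int kAccBytes = 128 * 128 * 4;         // 64 KiB

    printf("warp register budget: %d B\n", kWarpRegBytes);   // 32640
    printf("128x128 f32 accumulator: %d B\n", kAccBytes);    // 65536
    // 64 KiB doesn't fit in one warp's ~32 KiB of registers, and it would
    // eat two thirds of SM120's 99 KiB SMEM that operand staging also needs.
    return 0;
}
```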
3. Per-block shared memory ceiling¶
SM100: 228 KiB per block. SM120: 99 KiB per block.
This is the single most consequential per-architecture number for kernel porting. CUTLASS templates, FlashAttention pipelines, and many custom kernels size pipeline-stage buffers as a function of available SMEM. Templates compiled for SM100 with `StageCountAutoCarveout` can request up to ~220 KiB and assume the launch will succeed.
On SM120 the same template still launches — but the memory request is silently truncated, the kernel writes past the 99 KiB boundary into adjacent SMEM banks (or into garbage), and the result is zeroed/scrambled outputs. No error code.
The canonical CUTLASS issue is NVIDIA/cutlass#3096, which tracks the SM120 audit pass needed to detect this at compile time.
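Until that audit lands, the portable pattern is to query the opt-in SMEM ceiling at runtime and derive the stage count from it instead of assuming 228 KiB. A minimal sketch; the kernel body and the 32 KiB stage size are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void pipelined_kernel() {
    extern __shared__ unsigned char smem[];  // stage buffers carved from here
    (void)smem;
}

int main() {
    // 228 KiB-class on SM100, 99 KiB on SM120; never assume, always ask.
    int max_optin = 0;
    cudaDeviceGetAttribute(&max_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);

    constexpr int kStageBytes = 32 * 1024;   // illustrative per-stage buffer
    int stages = max_optin / kStageBytes;    // stage count follows the hardware
    int smem_bytes = stages * kStageBytes;

    // Opt in to >48 KiB of dynamic SMEM; overshooting fails here, loudly.
    cudaFuncSetAttribute(pipelined_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes);
    pipelined_kernel<<<1, 128, smem_bytes>>>();
    printf("SMEM ceiling %d B -> %d stages\n", max_optin, stages);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```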
4. Thread block clusters¶
SM100 supports cluster sizes up to 16 CTAs. SM120 supports only cluster size 1 (i.e., no clustering).
A cluster (introduced Hopper, expanded Blackwell datacenter) is a group of CTAs that:
- Share a cluster shared memory address space across SMEM banks of co-located SMs
- Can synchronize via `cluster.sync`
- Can issue cluster-wide TMA (`cp.async.bulk.tensor.shared::cluster.global`)
A kernel declaring `.cluster_dim 2,1,1` requires SM100 to actually run with its two cooperating CTAs. On SM120 the cluster dim is silently downgraded to `1,1,1` — the kernel still launches but never gets its expected partner CTA. Anything that synchronizes against the partner deadlocks or produces wrong output.
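A more defensive pattern is to request the cluster shape through the extended launch API and check both the device attribute and the launch result, rather than baking `.cluster_dim` into the kernel. A hedged sketch (exactly what the driver returns for an unsupported cluster launch may vary by version):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void cluster_kernel() {
    // Real code would use cooperative_groups::this_cluster() and cluster.sync.
}

int main() {
    // 1 where thread block clusters exist (SM90/SM100), 0 on SM120.
    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrClusterLaunch, 0);
    printf("cluster launch supported: %d\n", supported);

    // Request a 2x1x1 cluster explicitly at launch time.
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(2, 1, 1);   // grid must be divisible by the cluster
    cfg.blockDim = dim3(128, 1, 1);
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    // Unlike a baked-in .cluster_dim, this path hands back an error code
    // to inspect when the device can't honor the cluster shape.
    cudaError_t err = cudaLaunchKernelEx(&cfg, cluster_kernel);
    printf("cluster launch: %s\n", cudaGetErrorString(err));
    return 0;
}
```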
5. TMA — Tensor Memory Accelerator¶
SM100 has both single-CTA and cluster-shared TMA. SM120 has single-CTA TMA only.
TMA is a hardware DMA engine (introduced Hopper) that asynchronously copies tensor tiles between global memory and SMEM, with built-in addressing logic for multidimensional tiles. Single-CTA TMA exists on both halves of Blackwell (and on Hopper). Cluster-shared TMA, which can deposit a single tile across multiple SMs' SMEM at once, exists only on SM100.
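On the host side, both TMA flavors start from the same tensor-map descriptor. A hedged sketch of building one for a 2-D FP32 tensor via the driver API (dimensions illustrative, error handling elided; link against `-lcuda`):

```cuda
#include <cuda.h>          // driver API: cuTensorMapEncodeTiled
#include <cuda_runtime.h>

int main() {
    cudaSetDevice(0);                      // establish a context for the driver call
    float* gmem = nullptr;
    cudaMalloc(&gmem, 1024 * 1024 * sizeof(float));

    // Describe a 1024x1024 row-major f32 tensor fetched in 64x64 tiles.
    CUtensorMap map;
    cuuint64_t dims[2]    = {1024, 1024};            // fastest-varying dim first
    cuuint64_t strides[1] = {1024 * sizeof(float)};  // row stride in bytes
    cuuint32_t box[2]     = {64, 64};                // SMEM tile shape
    cuuint32_t elem[2]    = {1, 1};                  // dense, no element striding
    cuTensorMapEncodeTiled(&map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2, gmem, dims, strides, box, elem,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);

    // Device code then issues cp.async.bulk.tensor against this descriptor.
    // The single-CTA form works on SM90/SM100/SM120; only the ::cluster
    // multicast form is SM100-exclusive.
    cudaFree(gmem);
    return 0;
}
```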
6. NVLink and MNNVL¶
SM100-class cards typically have NVLink. SM120-class cards do not.
This is product segmentation rather than strictly an SM feature, but in practice it correlates perfectly: NVIDIA does not ship NVLink bridges or NVLink endpoints on consumer Blackwell SKUs. Implications discussed in interconnect/.
7. Memory and bandwidth¶
| | SM100 (B100 example) | SM120 (RTX PRO 6000 Workstation example) |
|---|---|---|
| Memory type | HBM3e | GDDR7 |
| Capacity | 192 GB | 96 GB |
| Bandwidth | 8 TB/s | ~1.8 TB/s |
| L2 cache | ~50 MB | ~16 MB |
This is hardware-economic, not ISA. Even with perfect software, an SM120 card has less memory bandwidth — meaningful for memory-bound workloads (long-context decode, very small batches).
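A back-of-envelope sketch of why: batch-1 decode reads every weight once per token, so token rate is bounded by bandwidth divided by bytes per token. The 70 GB model size below is illustrative:

```cuda
#include <cstdio>

int main() {
    // Illustrative model: 70B parameters in FP8, ~70 GB of weights.
    const double weight_bytes = 70e9;
    const double sm100_bw = 8.0e12;  // HBM3e figure from the table above
    const double sm120_bw = 1.8e12;  // GDDR7 figure from the table above

    // Batch-1 decode is memory-bound: each token reads every weight once,
    // so bandwidth / bytes-per-token is a hard ceiling on token rate.
    printf("SM100 ceiling: ~%.0f tok/s\n", sm100_bw / weight_bytes);  // ~114
    printf("SM120 ceiling: ~%.0f tok/s\n", sm120_bw / weight_bytes);  // ~26
    return 0;
}
```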
8. PCIe generation¶
| | SM100 (PCIe parts) | SM120 (consumer cards) |
|---|---|---|
| PCIe | Gen5 x16 (typical) | Gen5 x16 (consumer/workstation), Gen4 on older platforms |
In principle the same; in practice, consumer cards often land in workstation motherboards that only run them at Gen4 (32 GB/s per direction) or at marginal Gen5.
9. P2P features¶
SM100 datacenter products typically support P2P over NVLink + MNNVL. SM120 supports PCIe P2P but with software-gated atomics.
Specifics in interconnect/p2p-and-atomics.
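Still, the basic reachability and atomics probes are portable; a minimal sketch for devices 0 and 1:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can_p2p = 0, native_atomics = 0;
    // Basic P2P reachability from device 0 to device 1.
    cudaDeviceCanAccessPeer(&can_p2p, 0, 1);
    // Whether atomics across the link are natively supported; this is the
    // attribute behind the SM120 busy-poll failure mode summarized below.
    cudaDeviceGetP2PAttribute(&native_atomics,
                              cudaDevP2PAttrNativeAtomicSupported, 0, 1);
    printf("p2p access: %d, native p2p atomics: %d\n", can_p2p, native_atomics);

    if (can_p2p) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
    }
    return 0;
}
```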
A side-by-side feature matrix¶
| Feature | SM 9.0 (Hopper) | SM 10.0 (Blackwell-DC) | SM 12.0 (Blackwell-WS) |
|---|---|---|---|
| `mma.sync` (gen 1+) | ✓ | ✓ | ✓ |
| `wgmma.async` | ✓ (`sm_90a` only) | | |
| `tcgen05.mma` (single-CTA) | | ✓ | |
| `tcgen05.mma` (CTA-pair) | | ✓ | |
| Tensor Memory (TMEM) | | ✓ | |
| FP8 Tensor Core | ✓ | ✓ | ✓ |
| FP6 Tensor Core | | ✓ | ✓ |
| FP4 / NVFP4 Tensor Core | | ✓ | ✓ |
| Single-CTA TMA | ✓ | ✓ | ✓ |
| Cluster-shared TMA | ✓ | ✓ | |
| Thread block clusters > 1 | ✓ | ✓ | |
| SMEM/block | 228 KiB | 228 KiB | 99 KiB |
| NVLink | ✓ (NVL4) | ✓ (NVL5) | |
| MNNVL | | ✓ | |
| HBM | HBM3 | HBM3e | GDDR7 |
Failure modes, summarized¶
The architectural deltas above produce these specific failure modes when SM100 software is run on SM120:
| Software depends on... | SM120 failure mode |
|---|---|
| `tcgen05.*` instructions | `ptxas` error "instruction not supported", or runtime "no kernel image" if pre-compiled |
| TMEM allocation | same as above |
| Cluster size > 1 | silent downgrade to cluster-1; subsequent `cluster.sync` deadlocks |
| Cluster-shared TMA | runtime error from TMA descriptor validation |
| 228 KiB SMEM ceiling | launches succeed; outputs are silently corrupted as SMEM allocation overflows the 99 KiB boundary |
| NVLink-class bandwidth | the operation completes, but at PCIe speeds (often 30–50× slower for all-to-all traffic) |
| P2P atomics | atomics report "not supported"; busy-poll loops never observe completion; watchdog timeout |
The first four are explicit failures — you get an error or a hang, and you know something is wrong. The last three are silent or performance failures — the program runs to completion but produces wrong outputs or crawls. The silent class is the more dangerous one.
What you can do about it¶
Three families of approaches, covered in compatibility/:
- Substitute — swap the SM100-only kernel for an equivalent SM120-targeted kernel from a different library
- Lower — rewrite the SM100 PTX to SM120 PTX, replacing `tcgen05` with chains of `mma.sync` and TMEM with smaller register/SMEM accumulators (see the sketch below)
- Re-plan — at the model-system level, change the parallelism plan (EP → TP+PP) so the SM100-dependent code path isn't taken at all
Each family has tradeoffs. The right answer depends on your performance budget and what you actually need.
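For the Lower route, the SM120 building block is warp-level `mma.sync` with register-held accumulators. A minimal sketch of one m16n8k16 FP16-in/FP32-accumulate tile op via inline PTX (the `ldmatrix` fragment loads and the surrounding loop are omitted; requires sm_80 or newer):

```cuda
// One warp-wide 16x8x16 MMA: D = A*B + C, f16 inputs, f32 accumulators.
// Fragments are distributed across the warp's registers in the layout the
// PTX ISA prescribes for mma.m16n8k16; ldmatrix is the usual loader.
__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
// A single tcgen05 128x128 tile lowers to many of these per warp, with the
// accumulator spread across registers instead of living in TMEM.
```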
See also¶
- tcgen05-and-tmem — the new ISA family in detail
- thread-block-clusters — the clusters story
- fundamentals/cuda-pipeline — `sm_100a` vs `sm_120` in the build pipeline
- kernels/cutlass — how CUTLASS handles the split
- NVIDIA Blackwell Architecture Whitepaper
- NVIDIA PTX ISA spec, sections that gate features behind `target sm_100a`