SM100 vs SM120

The architectural diff between datacenter Blackwell (compute capability 10.0) and workstation/consumer Blackwell (compute capability 12.0). The most important page in this wiki.

The codenames

| Compute capability | NVIDIA codename | Products | Form factor |
|---|---|---|---|
| 10.0 (SM100) | GB100, GB200, GB300 | B100, B200, B300; GB200/GB300 superchips | SXM, PCIe |
| 12.0 (SM120) | GB202, GB203 | RTX PRO 6000 Blackwell Server Edition, RTX PRO 6000 Blackwell Workstation Edition, RTX 5090, RTX 5080 | PCIe |

Note on the "RTX PRO 6000 Blackwell" branding: both the Server Edition and the Workstation Edition are GB202 / SM120 — the same silicon as the RTX 5090, just in different cooling and power form factors. They are not datacenter Blackwell. SM100 is reserved for the B100/B200/B300 SXM parts and the GB200/GB300 superchips. If your card has "RTX PRO 6000" anywhere in the name, you are on SM120.

What stays the same

  • Tensor Core generation is identical: gen 5, with native FP4/FP6/FP8/BF16/FP16 paths
  • PTX ISA major version: 8 (with minor version 5 as of 2026)
  • Driver and CUDA toolkit: same binaries support both
  • Threading model: warps of 32, CTAs/blocks, grid hierarchy
  • L1/L2 caching architecture: same conceptual structure (sizes differ)
  • Native NVFP4 + MX-FP4 GEMM: both halves have it

What's different

1. The tcgen05 ISA family

SM100 has it. SM120 does not.

A new family of PTX instructions introduced with datacenter Blackwell:

  • tcgen05.alloc / tcgen05.dealloc — manage Tensor Memory allocations
  • tcgen05.mma.cta_group::1 / cta_group::2 — issue large-tile asynchronous MMA into TMEM
  • tcgen05.commit — make an mbarrier complete once the outstanding asynchronous MMAs have finished
  • tcgen05.cp — asynchronously copy data from SMEM into TMEM (reads come back out through tcgen05.ld into registers)
  • tcgen05.shift — shift accumulator data within TMEM (used for layout transforms)

When you assemble PTX containing any of these with --gpu-name=sm_120, ptxas fails with "instruction not supported." There is no fallback path; the kernel has to be rewritten. See tcgen05-and-tmem.
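
If a codebase has to build for both halves, the usual pattern is to gate the tcgen05 path on the target architecture at compile time and keep an mma.sync path for everything else. A minimal sketch of that guard structure, with the two body functions as hypothetical stand-ins for real implementations:

```cuda
#include <cuda_runtime.h>

// Stand-ins for architecture-specific inner loops (hypothetical).
__device__ void gemm_body_sm100_tcgen05(const float*, const float*, float*) {}
__device__ void gemm_body_mma_sync(const float*, const float*, float*) {}

__global__ void gemm_kernel(const float* A, const float* B, float* C)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) && (__CUDA_ARCH__ < 1200)
    // sm_100 / sm_103: tcgen05.* and Tensor Memory are available; safe to take
    // the TMEM-accumulator path here (e.g. CUTLASS SM100 collectives).
    gemm_body_sm100_tcgen05(A, B, C);
#else
    // sm_120 and older architectures: no tcgen05, no TMEM. Accumulate in
    // registers with mma.sync and keep SMEM staging under the per-block limit.
    gemm_body_mma_sync(A, B, C);
#endif
}
```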

2. Tensor Memory (TMEM)

SM100 has 256 KB/SM. SM120 has none.

A new on-chip memory class, separate from registers and SMEM, dedicated to Tensor Core accumulators. See fundamentals/memory-hierarchy and tcgen05-and-tmem.

Without TMEM, accumulators must live in either:

  • Registers: limited to 255 32-bit registers/thread. A 128×128 FP32 accumulator (64 KB) doesn't fit in a single warp's worth of registers.
  • SMEM: limited to 99 KiB/block on SM120, and shared with operand-staging buffers and pipeline state.

So SM120 GEMMs must use smaller accumulator tiles than their SM100 counterparts.
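
The arithmetic behind that constraint, as a quick back-of-the-envelope check (the 64×64 comparison tile is just an illustration of where SM120 configs typically land, not a prescription):

```cuda
#include <cstdio>

int main() {
    // Register budget: 255 usable 32-bit registers per thread, 32 threads per warp.
    const int warp_reg_bytes = 255 * 32 * 4;   // ~32 KB of registers per warp

    // Accumulator footprints in FP32.
    const int acc_128x128 = 128 * 128 * 4;     // 64 KB: larger than a warp's whole register file
    const int acc_64x64   = 64 * 64 * 4;       // 16 KB: fits with room left for operands and indices

    std::printf("warp registers: %d B, 128x128 acc: %d B, 64x64 acc: %d B\n",
                warp_reg_bytes, acc_128x128, acc_64x64);
    return 0;
}
```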

3. Per-block shared memory ceiling

SM100: 228 KiB per block. SM120: 99 KiB per block.

This is the single most consequential per-architecture number for kernel porting. CUTLASS templates, FlashAttention pipelines, and many custom kernels size pipeline-stage buffers as a function of available SMEM. Templates compiled for SM100 with StageCountAutoCarveout can request up to ~220 KiB and assume the launch will succeed.

On SM120 the same template still launches — but the memory request is silently truncated, the kernel writes past the 99 KiB boundary into adjacent SMEM banks (or into garbage), and the result is zeroed/scrambled outputs. No error code.

The canonical CUTLASS issue is NVIDIA/cutlass#3096, which tracks the SM120 audit pass needed to detect this at compile time.
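
Until such an audit exists, a defensive host-side check catches the overflow before launch. A sketch using the standard runtime attributes (the kernel name is a placeholder):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void pipelined_gemm() { extern __shared__ char smem[]; /* ... */ }

// Refuse to launch when the requested dynamic SMEM exceeds what this device can
// grant, rather than finding out from corrupted outputs later.
bool launch_with_smem(size_t smem_bytes, dim3 grid, dim3 block, cudaStream_t stream) {
    int dev = 0, max_optin = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&max_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
    if (smem_bytes > static_cast<size_t>(max_optin)) {
        std::fprintf(stderr, "kernel wants %zu B of SMEM, device allows %d B\n",
                     smem_bytes, max_optin);
        return false;   // on SM120 this triggers for any request above the ~99 KiB class
    }
    // Opt in to the large carveout, then launch.
    cudaFuncSetAttribute(pipelined_gemm, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_bytes));
    pipelined_gemm<<<grid, block, smem_bytes, stream>>>();
    return cudaGetLastError() == cudaSuccess;
}
```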

4. Thread block clusters

SM100 supports cluster sizes up to 16 CTAs. SM120 supports only cluster size 1 (i.e., no clustering).

A cluster (introduced Hopper, expanded Blackwell datacenter) is a group of CTAs that:

  • Share a distributed shared memory address space spanning the SMEM of co-located SMs
  • Can synchronize via cluster.sync
  • Can issue cluster-wide TMA (cp.async.bulk.tensor.shared::cluster.global)

A kernel declaring a 2×1×1 cluster (e.g. __cluster_dims__(2, 1, 1) in CUDA C++) requires SM100 to actually run with two cooperating CTAs. On SM120 the cluster dimension is silently downgraded to 1×1×1: the kernel still launches, but without its expected partner CTA, and anything that synchronizes against the partner deadlocks or produces wrong output.

See thread-block-clusters.
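
When portability matters, the cluster dimension can be made conditional instead of baked into the kernel. A sketch of a launch wrapper that asks the device first (the attribute and launch APIs are standard; the kernel body is omitted):

```cuda
#include <cuda_runtime.h>

__global__ void cluster_kernel() { /* partner-CTA exchange and cluster sync omitted */ }

// Attach a 2-CTA cluster only when the device actually supports cluster launch;
// otherwise fall back to a plain launch so nothing waits on a partner that
// will never exist.
bool launch_maybe_clustered(dim3 grid, dim3 block) {
    int dev = 0, cluster_ok = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&cluster_ok, cudaDevAttrClusterLaunch, dev);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = grid;
    cfg.blockDim = block;

    cudaLaunchAttribute attr = {};
    if (cluster_ok) {
        attr.id = cudaLaunchAttributeClusterDimension;
        attr.val.clusterDim = {2, 1, 1};   // the SM100-style 2-CTA pairing
        cfg.attrs = &attr;
        cfg.numAttrs = 1;
    }
    return cudaLaunchKernelEx(&cfg, cluster_kernel) == cudaSuccess;
}
```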

5. TMA — Tensor Memory Accelerator

SM100 has both single-CTA and cluster-shared TMA. SM120 has single-CTA TMA only.

TMA is a hardware DMA engine (introduced Hopper) that asynchronously copies tensor tiles between global memory and SMEM, with built-in addressing logic for multidimensional tiles. Single-CTA TMA exists on both halves of Blackwell (and on Hopper). Cluster-shared TMA, which can deposit a single tile across multiple SMs' SMEM at once, exists only on SM100.
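
Since single-CTA TMA is the portable piece, the host-side descriptor setup looks the same on both architectures. A sketch of building a 2-D descriptor with the driver API (tile shape, dtype, and swizzle choices are illustrative; the kernel-side cp.async.bulk.tensor consumption and mbarrier handling are omitted, and a live driver context is assumed):

```cuda
#include <cuda.h>   // driver API: cuTensorMapEncodeTiled

// Encode a TMA descriptor for a row-major M x K float32 matrix, fetched in
// 64 x 64 tiles. Error checking is omitted for brevity.
CUtensorMap make_2d_tma_desc(void* gmem_ptr, cuuint64_t M, cuuint64_t K) {
    CUtensorMap desc;
    cuuint64_t global_dim[2]    = {K, M};                // innermost dimension first
    cuuint64_t global_stride[1] = {K * sizeof(float)};   // byte stride of the outer dimension
    cuuint32_t box_dim[2]       = {64, 64};              // SMEM tile shape
    cuuint32_t elem_stride[2]   = {1, 1};                // dense tile

    cuTensorMapEncodeTiled(&desc, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2, gmem_ptr,
                           global_dim, global_stride, box_dim, elem_stride,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return desc;
}
```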

6. NVLink

SM100-class cards typically have NVLink. SM120-class cards do not.

This is product segmentation rather than strictly an SM feature, but in practice the correlation is perfect: NVIDIA does not ship NVLink bridges or NVLink endpoints on consumer/workstation Blackwell SKUs. Implications are discussed in interconnect/.

7. Memory and bandwidth

| | SM100 (B100 example) | SM120 (RTX PRO 6000 Workstation example) |
|---|---|---|
| Memory type | HBM3e | GDDR7 |
| Capacity | 192 GB | 96 GB |
| Bandwidth | 8 TB/s | ~1.8 TB/s |
| L2 cache | ~50 MB | ~16 MB |

This is hardware-economic, not ISA. Even with perfect software, an SM120 card has less memory bandwidth — meaningful for memory-bound workloads (long-context decode, very small batches).
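
To make that concrete, a rough roofline for a memory-bound, batch-1 decode step, assuming a hypothetical ~70 GB weight set that has to be streamed from DRAM once per generated token (all numbers illustrative):

```cuda
#include <cstdio>

int main() {
    const double weight_bytes = 70e9;     // ~70 GB of weights (hypothetical model)
    const double bw_sm100     = 8.0e12;   // B100-class HBM3e, 8 TB/s
    const double bw_sm120     = 1.8e12;   // RTX PRO 6000-class GDDR7, ~1.8 TB/s

    // Upper bound on decode rate if weight streaming is the only cost.
    std::printf("bandwidth ceiling: %.0f tok/s (SM100) vs %.0f tok/s (SM120)\n",
                bw_sm100 / weight_bytes, bw_sm120 / weight_bytes);   // ~114 vs ~26
    return 0;
}
```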

8. PCIe generation

| SM100 (PCIe parts) | SM120 (consumer/workstation cards) |
|---|---|
| Gen5 x16 (typical) | Gen5 x16; Gen4 on older platforms |

In principle the two are the same; in practice, consumer cards often land in workstation motherboards that run them only at Gen4 (~32 GB/s per direction) or at marginal Gen5.

9. P2P features

SM100 datacenter products typically support P2P over NVLink + MNNVL. SM120 supports PCIe P2P but with software-gated atomics.

Specifics in interconnect/p2p-and-atomics.
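
A quick way to see where a given pair of SM120 cards stands is to query the P2P attributes directly before trusting peer-written flags or device-side polling loops:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Report whether two devices can reach each other over P2P and whether native
// atomics are supported on that path. On SM120 PCIe pairs, access may be 1
// while native atomics report 0, which is exactly the software-gated case.
void report_p2p(int dev_a, int dev_b) {
    int access = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       dev_a, dev_b);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, dev_a, dev_b);
    std::printf("GPU%d -> GPU%d: P2P access=%d, native atomics=%d\n",
                dev_a, dev_b, access, atomics);
}
```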

A side-by-side feature matrix

| Feature | SM 9.0 (Hopper) | SM 10.0 (Blackwell-DC) | SM 12.0 (Blackwell-WS) |
|---|---|---|---|
| mma.sync (gen 1+) | ✓ | ✓ | ✓ |
| wgmma.async | ✓ (lower throughput) | — | — |
| tcgen05.mma (single-CTA) | — | ✓ | — |
| tcgen05.mma (CTA-pair) | — | ✓ | — |
| Tensor Memory (TMEM) | — | ✓ | — |
| FP8 Tensor Core | ✓ | ✓ | ✓ |
| FP6 Tensor Core | — | ✓ | ✓ |
| FP4 / NVFP4 Tensor Core | — | ✓ | ✓ |
| Single-CTA TMA | ✓ | ✓ | ✓ |
| Cluster-shared TMA | ✓ | ✓ | — |
| Thread block clusters > 1 | ✓ | ✓ | — |
| SMEM/block | 228 KiB | 228 KiB | 99 KiB |
| NVLink | ✓ (NVL4) | ✓ (NVL5) | — |
| MNNVL | — | ✓ | — |
| Memory type | HBM3 | HBM3e | GDDR7 |

Failure modes, summarized

The architectural deltas above produce these specific failure modes when SM100 software is run on SM120:

| Software depends on... | SM120 failure mode |
|---|---|
| tcgen05.* instructions | ptxas error "instruction not supported", or runtime "no kernel image" if pre-compiled |
| TMEM allocation | same as above |
| Cluster size > 1 | silent downgrade to cluster-1; subsequent cluster.sync deadlocks |
| Cluster-shared TMA | runtime error from TMA descriptor validation |
| 228 KiB SMEM ceiling | launches succeed; outputs are silently corrupted as the SMEM allocation overflows the 99 KiB boundary |
| NVLink-class bandwidth | the operation completes, but at PCIe speeds (often 30–50× slower for all-to-all traffic) |
| P2P atomics | atomics return "not supported"; busy-poll loops never observe completion; watchdog timeout |

The first four are explicit failures — you get an error or a hang, and you know something is wrong. The last three are silent or performance failures — the program runs to completion but produces wrong outputs or crawls. The silent class is the more dangerous one.

What you can do about it

Three families of approaches, covered in compatibility/:

  1. Substitute — swap the SM100-only kernel for an equivalent SM120-targeted kernel from a different library
  2. Lower — rewrite the SM100 PTX to SM120 PTX, replacing tcgen05 with chains of mma.sync and TMEM with smaller register/SMEM accumulators
  3. Re-plan — at the model-system level, change the parallelism plan (EP → TP+PP) so the SM100-dependent code path isn't taken at all

Each family has tradeoffs. The right answer depends on your performance budget and what you actually need.
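
The simplest instance of the Substitute family is a runtime dispatch on compute capability, so the SM100 path is never even selected on SM120 machines. A sketch (the three launcher names are placeholders for whichever kernels or libraries you substitute in):

```cuda
#include <cuda_runtime.h>

// Placeholder launchers, one per architecture family.
void launch_sm100_tcgen05_path() {}   // tcgen05 / TMEM / cluster kernels
void launch_sm120_path()         {}   // mma.sync kernels sized for 99 KiB SMEM
void launch_generic_path()       {}   // Hopper and older

void run_kernel_for_this_device() {
    int dev = 0;
    cudaDeviceProp prop{};
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);

    if (prop.major == 10)      launch_sm100_tcgen05_path();
    else if (prop.major == 12) launch_sm120_path();
    else                       launch_generic_path();
}
```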

See also