SM100 vs SM120

The architectural diff between datacenter Blackwell (compute capability 10.0) and workstation/consumer Blackwell (compute capability 12.0). The most important page in this wiki.

The codenames

| Compute capability | NVIDIA codename | Products | Form factor |
|---|---|---|---|
| 10.0 (SM100) | GB100, GB200, GB300 | B100, B200, B300; GB200/GB300 superchips | SXM, PCIe |
| 12.0 (SM120) | GB202, GB203 | RTX PRO 6000 Blackwell Server Edition, RTX PRO 6000 Blackwell Workstation Edition, RTX 5090, RTX 5080 | PCIe |

Note on the "RTX PRO 6000 Blackwell" branding: both the Server Edition and the Workstation Edition are GB202 / SM120 — the same silicon as the RTX 5090, just in different cooling and power form factors. They are not datacenter Blackwell. SM100 is reserved for the B100/B200/B300 SXM parts and the GB200/GB300 superchips. If your card has "RTX PRO 6000" anywhere in the name, you are on SM120.

What stays the same

  • Tensor Core generation is identical: gen 5, with native FP4/FP6/FP8/BF16/FP16 paths
  • PTX ISA major version: 8 (with minor version 5 as of 2026)
  • Driver and CUDA toolkit: same binaries support both
  • Threading model: warps of 32, CTAs/blocks, grid hierarchy
  • L1/L2 caching architecture: same conceptual structure (sizes differ)
  • Native NVFP4 + MX-FP4 GEMM: both halves have it

What's different

1. The tcgen05 ISA family

SM100 has it. SM120 does not.

A new family of PTX instructions introduced with datacenter Blackwell:

  • tcgen05.alloc / tcgen05.dealloc — manage Tensor Memory allocations
  • tcgen05.mma.cta_group::1 / cta_group::2 — issue large-tile asynchronous MMA into TMEM
  • tcgen05.commit — make an mbarrier complete once the outstanding asynchronous MMAs have finished
  • tcgen05.cp — asynchronously copy data from SMEM into TMEM (reads come back out through tcgen05.ld into registers)
  • tcgen05.shift — shift accumulator data within TMEM (used for layout transforms)

When you assemble PTX containing any of these with --gpu-name=sm_120, ptxas fails with "instruction not supported." There is no fallback path; the kernel has to be rewritten. See tcgen05-and-tmem.
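
If a codebase has to build for both halves, the usual pattern is to gate the tcgen05 path on the target architecture at compile time and keep an mma.sync path for everything else. A minimal sketch of that guard structure, with the two body functions as hypothetical stand-ins for real implementations:

```cuda
#include <cuda_runtime.h>

// Stand-ins for architecture-specific inner loops (hypothetical).
__device__ void gemm_body_sm100_tcgen05(const float*, const float*, float*) {}
__device__ void gemm_body_mma_sync(const float*, const float*, float*) {}

__global__ void gemm_kernel(const float* A, const float* B, float* C)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) && (__CUDA_ARCH__ < 1200)
    // sm_100 / sm_103: tcgen05.* and Tensor Memory are available; safe to take
    // the TMEM-accumulator path here (e.g. CUTLASS SM100 collectives).
    gemm_body_sm100_tcgen05(A, B, C);
#else
    // sm_120 and older architectures: no tcgen05, no TMEM. Accumulate in
    // registers with mma.sync and keep SMEM staging under the per-block limit.
    gemm_body_mma_sync(A, B, C);
#endif
}
```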

2. Tensor Memory (TMEM)

SM100 has 256 KB/SM. SM120 has none.

A new on-chip memory class, separate from registers and SMEM, dedicated to Tensor Core accumulators. See fundamentals/memory-hierarchy and tcgen05-and-tmem.

Without TMEM, accumulators must live in either:

  • Registers: limited to 255 32-bit registers/thread. A 128×128 FP32 accumulator (64 KB) doesn't fit in a single warp's worth of registers.
  • SMEM: limited to 99 KiB/block on SM120, and shared with operand-staging buffers and pipeline state.

So SM120 GEMMs must use smaller accumulator tiles than their SM100 counterparts.
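
The arithmetic behind that constraint, as a quick back-of-the-envelope check (the 64×64 comparison tile is just an illustration of where SM120 configs typically land, not a prescription):

```cuda
#include <cstdio>

int main() {
    // Register budget: 255 usable 32-bit registers per thread, 32 threads per warp.
    const int warp_reg_bytes = 255 * 32 * 4;   // ~32 KB of registers per warp

    // Accumulator footprints in FP32.
    const int acc_128x128 = 128 * 128 * 4;     // 64 KB: larger than a warp's whole register file
    const int acc_64x64   = 64 * 64 * 4;       // 16 KB: fits with room left for operands and indices

    std::printf("warp registers: %d B, 128x128 acc: %d B, 64x64 acc: %d B\n",
                warp_reg_bytes, acc_128x128, acc_64x64);
    return 0;
}
```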

3. Per-block shared memory ceiling

SM100: 228 KiB per block. SM120: 99 KiB per block.

This is the single most consequential per-architecture number for kernel porting. CUTLASS templates, FlashAttention pipelines, and many custom kernels size pipeline-stage buffers as a function of available SMEM. Templates compiled for SM100 with StageCountAutoCarveout can request up to ~220 KiB and assume the launch will succeed.

On SM120 the same template still launches — but the memory request is silently truncated, the kernel writes past the 99 KiB boundary into adjacent SMEM banks (or into garbage), and the result is zeroed/scrambled outputs. No error code.

The canonical CUTLASS issue is NVIDIA/cutlass#3096, which tracks the SM120 audit pass needed to detect this at compile time.
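
Until such an audit exists, a defensive host-side check catches the overflow before launch. A sketch using the standard runtime attributes (the kernel name is a placeholder):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void pipelined_gemm() { extern __shared__ char smem[]; /* ... */ }

// Refuse to launch when the requested dynamic SMEM exceeds what this device can
// grant, rather than finding out from corrupted outputs later.
bool launch_with_smem(size_t smem_bytes, dim3 grid, dim3 block, cudaStream_t stream) {
    int dev = 0, max_optin = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&max_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
    if (smem_bytes > static_cast<size_t>(max_optin)) {
        std::fprintf(stderr, "kernel wants %zu B of SMEM, device allows %d B\n",
                     smem_bytes, max_optin);
        return false;   // on SM120 this triggers for any request above the ~99 KiB class
    }
    // Opt in to the large carveout, then launch.
    cudaFuncSetAttribute(pipelined_gemm, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_bytes));
    pipelined_gemm<<<grid, block, smem_bytes, stream>>>();
    return cudaGetLastError() == cudaSuccess;
}
```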

4. Thread block clusters

SM100 supports cluster sizes up to 16 CTAs. SM120 supports only cluster size 1 (i.e., no clustering).

A cluster (introduced Hopper, expanded Blackwell datacenter) is a group of CTAs that:

  • Share a distributed shared memory address space spanning the SMEM of co-located SMs
  • Can synchronize via cluster.sync
  • Can issue cluster-wide TMA (cp.async.bulk.tensor.shared::cluster.global)

A kernel declaring a 2×1×1 cluster (e.g. __cluster_dims__(2, 1, 1) in CUDA C++) requires SM100 to actually run with two cooperating CTAs. On SM120 the cluster dimension is silently downgraded to 1×1×1: the kernel still launches, but without its expected partner CTA, and anything that synchronizes against the partner deadlocks or produces wrong output.

See thread-block-clusters.
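
When portability matters, the cluster dimension can be made conditional instead of baked into the kernel. A sketch of a launch wrapper that asks the device first (the attribute and launch APIs are standard; the kernel body is omitted):

```cuda
#include <cuda_runtime.h>

__global__ void cluster_kernel() { /* partner-CTA exchange and cluster sync omitted */ }

// Attach a 2-CTA cluster only when the device actually supports cluster launch;
// otherwise fall back to a plain launch so nothing waits on a partner that
// will never exist.
bool launch_maybe_clustered(dim3 grid, dim3 block) {
    int dev = 0, cluster_ok = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&cluster_ok, cudaDevAttrClusterLaunch, dev);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = grid;
    cfg.blockDim = block;

    cudaLaunchAttribute attr = {};
    if (cluster_ok) {
        attr.id = cudaLaunchAttributeClusterDimension;
        attr.val.clusterDim = {2, 1, 1};   // the SM100-style 2-CTA pairing
        cfg.attrs = &attr;
        cfg.numAttrs = 1;
    }
    return cudaLaunchKernelEx(&cfg, cluster_kernel) == cudaSuccess;
}
```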

5. TMA — Tensor Memory Accelerator

SM100 has both single-CTA and cluster-shared TMA. SM120 has single-CTA TMA only.

TMA is a hardware DMA engine (introduced Hopper) that asynchronously copies tensor tiles between global memory and SMEM, with built-in addressing logic for multidimensional tiles. Single-CTA TMA exists on both halves of Blackwell (and on Hopper). Cluster-shared TMA, which can deposit a single tile across multiple SMs' SMEM at once, exists only on SM100.
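
Since single-CTA TMA is the portable piece, the host-side descriptor setup looks the same on both architectures. A sketch of building a 2-D descriptor with the driver API (tile shape, dtype, and swizzle choices are illustrative; the kernel-side cp.async.bulk.tensor consumption and mbarrier handling are omitted, and a live driver context is assumed):

```cuda
#include <cuda.h>   // driver API: cuTensorMapEncodeTiled

// Encode a TMA descriptor for a row-major M x K float32 matrix, fetched in
// 64 x 64 tiles. Error checking is omitted for brevity.
CUtensorMap make_2d_tma_desc(void* gmem_ptr, cuuint64_t M, cuuint64_t K) {
    CUtensorMap desc;
    cuuint64_t global_dim[2]    = {K, M};                // innermost dimension first
    cuuint64_t global_stride[1] = {K * sizeof(float)};   // byte stride of the outer dimension
    cuuint32_t box_dim[2]       = {64, 64};              // SMEM tile shape
    cuuint32_t elem_stride[2]   = {1, 1};                // dense tile

    cuTensorMapEncodeTiled(&desc, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2, gmem_ptr,
                           global_dim, global_stride, box_dim, elem_stride,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return desc;
}
```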

6. NVLink

SM100-class cards typically have NVLink. SM120-class cards do not.

This is product segmentation rather than strictly an SM feature, but in practice the correlation is perfect: NVIDIA does not ship NVLink bridges or NVLink endpoints on consumer/workstation Blackwell SKUs. Implications are discussed in interconnect/.

7. Memory and bandwidth

| | SM100 (B100 example) | SM120 (RTX PRO 6000 Workstation example) |
|---|---|---|
| Memory type | HBM3e | GDDR7 |
| Capacity | 192 GB | 96 GB |
| Bandwidth | 8 TB/s | ~1.8 TB/s |
| L2 cache | ~50 MB | ~16 MB |

This is hardware-economic, not ISA. Even with perfect software, an SM120 card has less memory bandwidth — meaningful for memory-bound workloads (long-context decode, very small batches).
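
To make that concrete, a rough roofline for a memory-bound, batch-1 decode step, assuming a hypothetical ~70 GB weight set that has to be streamed from DRAM once per generated token (all numbers illustrative):

```cuda
#include <cstdio>

int main() {
    const double weight_bytes = 70e9;     // ~70 GB of weights (hypothetical model)
    const double bw_sm100     = 8.0e12;   // B100-class HBM3e, 8 TB/s
    const double bw_sm120     = 1.8e12;   // RTX PRO 6000-class GDDR7, ~1.8 TB/s

    // Upper bound on decode rate if weight streaming is the only cost.
    std::printf("bandwidth ceiling: %.0f tok/s (SM100) vs %.0f tok/s (SM120)\n",
                bw_sm100 / weight_bytes, bw_sm120 / weight_bytes);   // ~114 vs ~26
    return 0;
}
```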

8. PCIe generation

| SM100 (PCIe parts) | SM120 (consumer/workstation cards) |
|---|---|
| Gen5 x16 (typical) | Gen5 x16; Gen4 on older platforms |

In principle the two are the same; in practice, consumer cards often land in workstation motherboards that run them only at Gen4 (~32 GB/s per direction) or at marginal Gen5.

9. P2P features

SM100 datacenter products typically support P2P over NVLink + MNNVL. SM120 supports PCIe P2P but with software-gated atomics.

Specifics in interconnect/p2p-and-atomics.
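
A quick way to see where a given pair of SM120 cards stands is to query the P2P attributes directly before trusting peer-written flags or device-side polling loops:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Report whether two devices can reach each other over P2P and whether native
// atomics are supported on that path. On SM120 PCIe pairs, access may be 1
// while native atomics report 0, which is exactly the software-gated case.
void report_p2p(int dev_a, int dev_b) {
    int access = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       dev_a, dev_b);
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, dev_a, dev_b);
    std::printf("GPU%d -> GPU%d: P2P access=%d, native atomics=%d\n",
                dev_a, dev_b, access, atomics);
}
```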

A side-by-side feature matrix

| Feature | SM 9.0 (Hopper) | SM 10.0 (Blackwell-DC) | SM 12.0 (Blackwell-WS) |
|---|---|---|---|
| mma.sync (gen 1+) | ✓ | ✓ | ✓ |
| wgmma.async | ✓ (lower throughput) | — | — |
| tcgen05.mma (single-CTA) | — | ✓ | — |
| tcgen05.mma (CTA-pair) | — | ✓ | — |
| Tensor Memory (TMEM) | — | ✓ | — |
| FP8 Tensor Core | ✓ | ✓ | ✓ |
| FP6 Tensor Core | — | ✓ | ✓ |
| FP4 / NVFP4 Tensor Core | — | ✓ | ✓ |
| Single-CTA TMA | ✓ | ✓ | ✓ |
| Cluster-shared TMA | ✓ | ✓ | — |
| Thread block clusters > 1 | ✓ | ✓ | — |
| SMEM/block | 228 KiB | 228 KiB | 99 KiB |
| NVLink | ✓ (NVL4) | ✓ (NVL5) | — |
| MNNVL | — | ✓ | — |
| Memory type | HBM3 | HBM3e | GDDR7 |

Failure modes, summarized

The architectural deltas above produce these specific failure modes when SM100 software is run on SM120:

| Software depends on... | SM120 failure mode |
|---|---|
| tcgen05.* instructions | ptxas error "instruction not supported", or runtime "no kernel image" if pre-compiled |
| TMEM allocation | same as above |
| Cluster size > 1 | silent downgrade to cluster-1; subsequent cluster.sync deadlocks |
| Cluster-shared TMA | runtime error from TMA descriptor validation |
| 228 KiB SMEM ceiling | launches succeed; outputs are silently corrupted as the SMEM allocation overflows the 99 KiB boundary |
| NVLink-class bandwidth | the operation completes, but at PCIe speeds (often 30–50× slower for all-to-all traffic) |
| P2P atomics | atomics return "not supported"; busy-poll loops never observe completion; watchdog timeout |

The first four are explicit failures — you get an error or a hang, and you know something is wrong. The last three are silent or performance failures — the program runs to completion but produces wrong outputs or crawls. The silent class is the more dangerous one.

What you can do about it

Three families of approaches, covered in compatibility/:

  1. Substitute — swap the SM100-only kernel for an equivalent SM120-targeted kernel from a different library
  2. Lower — rewrite the SM100 PTX to SM120 PTX, replacing tcgen05 with chains of mma.sync and TMEM with smaller register/SMEM accumulators
  3. Re-plan — at the model-system level, change the parallelism plan (EP → TP+PP) so the SM100-dependent code path isn't taken at all

Each family has tradeoffs. The right answer depends on your performance budget and what you actually need.
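
The simplest instance of the Substitute family is a runtime dispatch on compute capability, so the SM100 path is never even selected on SM120 machines. A sketch (the three launcher names are placeholders for whichever kernels or libraries you substitute in):

```cuda
#include <cuda_runtime.h>

// Placeholder launchers, one per architecture family.
void launch_sm100_tcgen05_path() {}   // tcgen05 / TMEM / cluster kernels
void launch_sm120_path()         {}   // mma.sync kernels sized for 99 KiB SMEM
void launch_generic_path()       {}   // Hopper and older

void run_kernel_for_this_device() {
    int dev = 0;
    cudaDeviceProp prop{};
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);

    if (prop.major == 10)      launch_sm100_tcgen05_path();
    else if (prop.major == 12) launch_sm120_path();
    else                       launch_generic_path();
}
```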

See also