tcgen05 and Tensor Memory
The new datacenter-Blackwell-only Tensor Core ISA family, and the on-chip memory class it depends on.
What tcgen05 is
tcgen05 is a family of PTX instructions introduced in PTX ISA 8.7 (CUDA 12.8), targeting compute capability 10.0 only (the architecture-specific sm_100a target, i.e., SM100). The gen05 suffix marks the family as specific to the fifth-generation Tensor Core. Its design goals:
- Decouple Tensor Core execution from warp execution. The warp issues an MMA and continues; the Tensor Core runs to completion in parallel.
- Support larger MMA tiles than `wgmma.async`: up to 128×128 single-CTA, 256×128 CTA-pair.
- Reduce register-file bandwidth pressure. Accumulators land in TMEM, not registers.
These goals together produce roughly 2–3× peak FP4/FP6/FP8 throughput relative to a wgmma.async-based kernel on the same SM.
The instructions
| Instruction | Role |
|---|---|
| `tcgen05.alloc.cta_group::N %dst, nbytes` | Allocate `nbytes` of TMEM; return the base address in `%dst` |
| `tcgen05.dealloc %addr, nbytes` | Free `nbytes` at `%addr` |
| `tcgen05.relinquish_alloc_permit` | Inform the runtime that this CTA won't allocate more TMEM |
| `tcgen05.cp.shared::cta::tmem.b64 [%tmem], [%smem]` | Asynchronously copy from SMEM to TMEM |
| `tcgen05.ld` / `tcgen05.st` | Move data between TMEM and registers (there is no direct TMEM-to-SMEM copy) |
| `tcgen05.shift [%tmem], shift_amount` | Logical shift within a TMEM allocation (used for layout transforms) |
| `tcgen05.mma.cta_group::1.kind::<dtype>` | Single-CTA MMA |
| `tcgen05.mma.cta_group::2.kind::<dtype>` | CTA-pair MMA |
| `tcgen05.commit.cta_group::N %sema` | Arm a barrier that completes when all outstanding MMAs finish |
| `tcgen05.wait.cta_group::N %sema` | Wait on a previously committed barrier |
`<dtype>` enumerates the supported MMA kinds: `f16`, `tf32`, `f8f6f4`, `mxf8f6f4`, `mxf4`, `mxf4nvf4`, `i8`. Operand forms throughout this table are simplified: the full PTX adds `.sync.aligned` qualifiers, `tcgen05.ld`/`tcgen05.st` carry shape and multiplicity suffixes, and `tcgen05.commit` arms an mbarrier object in shared memory.
A complete tcgen05 MMA in PTX
A datacenter-Blackwell GEMM tile, simplified:
.reg .b64 %tmem_base;
.reg .b32 %sema;
.reg .b32 %acc0;   // accumulator fragment for the TMEM readback in step 7
// 1. Allocate 16 KB of TMEM for accumulator
tcgen05.alloc.cta_group::1 %tmem_base, 16384;
// 2. (Operands A and B already staged in SMEM via TMA)
// 3. Issue MMA: D = A * B (FP4 inputs, FP32 accumulator in TMEM)
tcgen05.mma.cta_group::1.kind::mxf4nvf4
[%tmem_base], // accumulator
[%smem_a], // operand A (in SMEM)
[%smem_b], // operand B (in SMEM)
%scale_a, %scale_b; // NVFP4 scale factors (simplified; the full PTX stages scales through TMEM)
// 4. Commit barrier
tcgen05.commit.cta_group::1 %sema;
// 5. Continue doing other work while MMA runs
// ... more code ...
// 6. Wait for MMA completion
tcgen05.wait.cta_group::1 %sema;
// 7. Read the result from TMEM into registers, then store to SMEM
tcgen05.ld {%acc0, ...}, [%tmem_base];  // simplified; the full form carries shape/.x<N> suffixes
st.shared.b32 [%smem_out], %acc0;       // repeat per accumulator fragment
// 8. Free TMEM
tcgen05.dealloc %tmem_base, 16384;
tcgen05.relinquish_alloc_permit;
The contrast with wgmma.async:
- `wgmma` accumulators live in registers; `tcgen05` accumulators live in TMEM
- `wgmma` is issued by a warp group (4 warps); `tcgen05` is issued by a single warp (or one warp per CTA of a pair)
- `wgmma` tiles max out at m64n256k16; `tcgen05` tiles go to m128n128k64 (single CTA) or m256n128k64 (CTA pair)
- Both are async; both have commit/wait barriers
Tensor Memory (TMEM)
A new on-chip memory class. Properties:
- Capacity: 256 KB per SM
- Allocation granularity: 128 bytes
- Allocator: `tcgen05.alloc` returns a TMEM base address, `tcgen05.dealloc` frees it
- Addressing: TMEM addresses are separate from SMEM and global addresses; they're 32-bit logical addresses within the per-SM TMEM region
- Bandwidth: high enough to feed `tcgen05.mma` at peak rates
- Visibility: TMEM is per-SM; the issuing CTA (or CTA pair) can address it; other CTAs can't
TMEM exists for one specific reason: at the throughput levels of FP4/FP6 MMA, register file bandwidth becomes the bottleneck for a wgmma.async-style kernel. The Tensor Core wants to consume operands faster than a warp's worth of register reads can supply. By moving accumulators out of registers (and operand-staging out of registers, into SMEM and then TMEM), the warp's register file is free to serve only the issue and finalize stages, not the running MMA.
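A back-of-envelope check of that claim, assuming only the standard architectural ceiling of 255 registers per thread:

```cuda
// Why the accumulator cannot stay in registers: a 128x128 FP32 tile is
// larger than everything one warp (even two warps) can hold.
constexpr int kAccBytes     = 128 * 128 * 4;   // 64 KiB accumulator tile
constexpr int kWarpRegBytes = 32 * 255 * 4;    // ~31.9 KiB: a warp at max regs/thread
static_assert(kAccBytes > 2 * kWarpRegBytes,
              "the accumulator tile overflows two warps' register files");
```

With the accumulator parked in TMEM, the issuing warp's registers carry only addresses and loop state.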
A useful mental model: TMEM is to Tensor Cores what L1 cache is to ALUs.
TMEM layouts
tcgen05 supports several TMEM layouts:
- Default: row-major within each 32-element band
- Strided: configurable stride for transposed access
- Replicated: one logical operand replicated across multiple TMEM regions for CTA-pair MMA
- Compressed: NVFP4-aware layout that interleaves values and scales
The tcgen05.shift instruction transforms between layouts in place.
CTA-pair / cta_group::2 mode
The largest tcgen05.mma tile (m256n128k64) is too big to fit a single CTA's TMEM budget. cta_group::2 mode uses two CTAs cooperating:
- Both CTAs are launched as part of the same cluster (`.cluster_dim 2,1,1`)
- They share TMEM allocations across the cluster's SMEM link
- A single `tcgen05.mma.cta_group::2` instruction is issued (by both CTAs in lock-step) and produces a tile twice the size of single-CTA mode
This is one of the reasons SM100 supports thread block clusters > 1: tcgen05 CTA-pair mode requires it.
Workstation Blackwell does not support clusters > 1, and therefore cannot use tcgen05 CTA-pair MMA at all. A kernel compiled for SM120 must use single-CTA tile shapes only — or, more typically, must not use tcgen05 at all.
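For context, the cluster launch that CTA-pair mode presupposes looks like this on the CUDA side; a minimal sketch using the standard `__cluster_dims__` attribute (kernel name hypothetical, tcgen05 body omitted):

```cuda
#include <cuda_runtime.h>

// A fixed 2-CTA cluster: each cluster is the CTA pair that a
// tcgen05.mma.cta_group::2 would span on SM100.
__global__ void __cluster_dims__(2, 1, 1) cta_pair_gemm_tile() {
    // cluster ranks 0 and 1 would each issue their half of the
    // CTA-pair MMA here (SM100-only; omitted)
}

int main() {
    // the grid's x-dimension must be a multiple of the cluster size (2)
    cta_pair_gemm_tile<<<dim3(4, 1, 1), dim3(128, 1, 1)>>>();
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```

This compiles for any sm_90+ target, but workstation Blackwell caps the cluster size at 1, which is exactly why CTA-pair MMA is off the table there.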
Why workstation Blackwell doesn't have tcgen05
NVIDIA's likely reasoning (inferred from the architecture):
- TMEM costs significant die area (256 KB/SM is real silicon)
- Cluster execution requires extra SM-to-SM linkage (more silicon)
- Consumer workloads (gaming, content creation, light ML) get little benefit from m128n128k64 GEMMs
- Differentiating datacenter from consumer is a deliberate product strategy
The result: workstation Blackwell has the same Tensor Core hardware (gen 5, native FP4/FP6/FP8) but drives it only through register-bound `mma.sync` (`wgmma.async` is an sm_90a-only Hopper path and is not implemented on Blackwell). So peak FP4 throughput per SM is similar to Hopper's FP8 throughput per SM: useful, but not the 2–3× generational jump that SM100 sees.
What runs on what, with examples
Concrete examples of code that does and does not run:
// ✓ Runs on SM 9.0, 10.0, 12.0 — universal
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 ...;
// ✓ Runs on SM 9.0 (sm_90a) only — wgmma is Hopper-specific
wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16 ...;
// ✓ Runs on SM 10.0 only
tcgen05.mma.cta_group::1.kind::mxf4nvf4 ...;
// ✓ Runs on SM 10.0 only
tcgen05.mma.cta_group::2.kind::mxf4nvf4 ...;
// ✓ Runs on SM 10.0 only
tcgen05.cp.shared::cta::tmem.b64 ...;
Compile any of the last three with --gpu-name=sm_120 and ptxas rejects the instruction as unsupported for the target (the wgmma line likewise fails on anything other than sm_90a).
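The usual way to keep tcgen05 out of an sm_120 build is a compile-time architecture guard; a hedged sketch (kernel name and path bodies are placeholders):

```cuda
// __CUDA_ARCH__ is 1000 when ptxas targets compute capability 10.0 and
// 1200 for 12.0, so each target's ptxas only ever sees its own branch.
__global__ void gemm_tile_kernel(/* operands elided */) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ == 1000)
    // tcgen05 + TMEM path (datacenter Blackwell only)
#else
    // register-accumulator mma.sync path (sm_120 and everything else)
#endif
}
```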
How libraries handle this
CUTLASS exposes the choice:
// CUTLASS Blackwell datacenter template — uses tcgen05
using GemmKernel = cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm100, // ← architecture choice
...,
cutlass::gemm::collective::StageCountAutoCarveout<
sizeof(typename CollectiveEpilogue::SharedStorage)>,
cutlass::gemm::KernelTmaWarpSpecializedCooperative
>::CollectiveOp;
Switching to cutlass::arch::Sm120 selects a parallel template tree built on `mma.sync`, with smaller tile shapes that fit the 99 KiB SMEM ceiling.
DeepGEMM, by contrast, has only an Sm100-targeted code path as of early 2026; an Sm120 port is in progress but has not landed. Loading DeepGEMM kernels on workstation Blackwell fails at runtime.
FlashInfer ships separate Triton-based and CUTLASS-based attention kernels; the CUTLASS Blackwell path uses tcgen05 and the Triton path doesn't, so workstation Blackwell falls back to the Triton path at reduced throughput.
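Applications that ship both paths typically choose at runtime from the device's compute capability; a minimal sketch using the standard runtime API (the two launcher functions are hypothetical placeholders, not library entry points):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder launchers standing in for the two kernel families.
static void launch_tcgen05_gemm()  { std::puts("tcgen05 path (SM100)"); }
static void launch_mma_sync_gemm() { std::puts("mma.sync path (SM120 and others)"); }

void launch_best_gemm(int device) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    if (prop.major == 10) {
        launch_tcgen05_gemm();      // datacenter Blackwell, CC 10.x
    } else {
        launch_mma_sync_gemm();     // workstation Blackwell (CC 12.x), Hopper, ...
    }
}
```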
Translating tcgen05 to mma.sync
If you have an SM100-only kernel and need it on SM120, the conceptual translation:
| SM100 op | SM120 equivalent |
|---|---|
| `tcgen05.alloc N` | `__shared__` allocation of N bytes (counts against the 99 KiB ceiling) |
| `tcgen05.cp.shared::cta::tmem` | `cp.async.bulk` or plain SMEM staging |
| `tcgen05.mma.cta_group::1.kind::mxf4nvf4` m128n128k64 | ~256 sequential `mma.sync` m16n8k32 instructions, accumulating in registers |
| `tcgen05.commit` / barrier wait | `bar.sync`, or simply the last `mma.sync` completing |
| `tcgen05.ld` + `st.shared` epilogue | direct register-to-SMEM store (accumulators are already in registers) |
| `tcgen05.dealloc` | scope end |
The translation is mechanical but produces substantially more PTX: roughly 256× as many instructions for the largest tile. Achieved Tensor Core throughput per SM ends up around 40–70% of optimal SM120 throughput (which is itself a fraction of optimal SM100 throughput). See compatibility/translating-tcgen05 for the detailed pattern.
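As a structural illustration of that arithmetic, a hedged sketch of the fallback loop nest (fragment loads and the inline `mma.sync` PTX are elided; only the iteration shape is real):

```cuda
// One m128n128k64 tcgen05 tile decomposed into m16n8k32 mma.sync issues:
// (128/16) * (128/8) * (64/32) = 8 * 16 * 2 = 256 instructions.
constexpr int M = 128, N = 128, K = 64;   // tcgen05 tile
constexpr int m = 16,  n = 8,  k = 32;    // mma.sync tile
static_assert((M / m) * (N / n) * (K / k) == 256, "256 issues per tile");

__device__ void fallback_tile(/* operand fragments elided */) {
    for (int i = 0; i < M / m; ++i)
        for (int j = 0; j < N / n; ++j)
            for (int kk = 0; kk < K / k; ++kk) {
                // inline mma.sync.aligned.m16n8k32... goes here,
                // accumulating into per-thread registers
            }
    // epilogue: registers go straight to SMEM/global; no TMEM readback
}
```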
Checkpoint
You should be able to answer:
- What's TMEM and how is it different from SMEM and registers?
- Why does `tcgen05.mma` exist when `wgmma.async` already provided async MMA?
- What does `cta_group::2` mean?
- Why does SM120 not have `tcgen05`?
- Roughly how does SM120 throughput compare to SM100 for FP4 GEMM?
See also
- sm100-vs-sm120 — the full architectural diff
- thread-block-clusters — clusters and CTA-pair MMA
- fundamentals/tensor-cores — `mma.sync` and `wgmma.async` background
- compatibility/translating-tcgen05 — porting patterns
- NVIDIA PTX ISA 8.7, "TensorCore instructions" → "tcgen05 family"
- NVIDIA Blackwell Architecture Whitepaper, "Fifth-Generation Tensor Cores"