Compatibility patterns¶
General patterns for bridging SM100-targeted software to SM120 hardware. Pure techniques, not a specific implementation.
The three layers¶
graph TD
Lib[SM100 library binary]
Lib --> Compile[L1: rebuild from source<br/>with --target-arch sm_120]
Lib --> Substitute[L2: substitute a different<br/>SM120-targeted kernel]
Lib --> Lower[L3: lower SM100 PTX<br/>to SM120 PTX]
Compile --> Done1[Cleanest, but requires source]
Substitute --> Done2[Fastest path, but needs<br/>functional equivalent]
Lower --> Done3[Last resort,<br/>quality of result varies]
Each layer has tradeoffs. The right answer depends on what you have and what your performance budget is.
Pages in this section¶
translating-tcgen05— patterns for rewritingtcgen05PTX asmma.syncchainssmem-budget-management— fitting kernels into the 99 KiB SMEM ceilingcluster-rewriting— what to do about kernels that assume cluster size > 1ep-to-tp-rewriting— restructuring expert-parallel plans into tensor-parallel onesruntime-detection— detecting topology and architecture at startup, dispatching accordingly
When to use each¶
| Situation | Best approach |
|---|---|
| Open-source library, you have build infrastructure | L1: rebuild with SM120 target. Cleanest. |
| Pre-built library, no source available | L2: substitute. Use a different kernel library that has SM120 support. |
| Specific kernel needs to work, no equivalent exists | L3: lower the PTX. Most work, lowest performance, but always feasible. |
| Issue is at parallelism plan, not kernel | None of the above: rewrite the plan instead. See ep-to-tp-rewriting. |
A note on philosophy¶
These patterns are describing techniques, not advocating for any specific automation. Tools that automate them (compatibility shims, transpilers, plan rewriters) exist, but the patterns themselves are conceptual: how do you take an SM100-targeted thing and make it work on SM120?
The patterns don't make consumer Blackwell as fast as datacenter Blackwell. They make it work. The performance gap is hardware-fundamental: less memory bandwidth, fewer SMs, no NVLink. Software techniques close maybe half the gap; the other half is silicon.
Reading order¶
If you're porting a specific kernel: read translating-tcgen05 and smem-budget-management — those cover the two most common SM100-only constructs.
If you're working at the system level: ep-to-tp-rewriting is the highest-impact pattern. A correct plan rewrite often eliminates the need for kernel-level work.
If you're building tooling: runtime-detection describes the data structures and probes you'd need.