# GLM-5 family
Zhipu AI / Tsinghua's MoE family: GLM-4.5, GLM-4.7, GLM-5.0, and GLM-5.1 (2025–2026). This is the family most commonly cited as the easiest large MoE to run on workstation Blackwell, because of how its dependency stack is structured.
## The model
| | GLM-5.0 | GLM-5.1 (full) | GLM-5.1 (REAP-160) |
|---|---|---|---|
| Total parameters | ~620 B | 744 B | 478 B |
| Active per token | ~36 B | 42 B | ~36 B |
| Number of experts | 160 routed | 256 routed | 160 routed (after pruning) |
| Top-k | 6 | 8 | 8 |
| Hidden dim | 6144 | 7168 | 7168 |
| Attention | MHA + DSA option | MHA + DSA option | MHA + NSA option |
| Native quantization | FP8 | NVFP4 | NVFP4 |
GLM-5.1 introduced two attention variants: standard MHA, and DSA (Differential Sparse Attention) for long-context. DSA has a known kernel issue on SM120 (see below); MHA works fine.
REAP-160 is Zhipu's official pruning recipe that brings GLM-5.1 from 256 to 160 experts with minimal quality loss (~0.3-point perplexity increase). The resulting 478B-parameter model is specifically sized to fit 4× 96 GB workstation Blackwell.
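As a back-of-the-envelope check on that sizing claim, the sketch below estimates the weight footprint. The bytes-per-parameter figure (4-bit weights plus one FP8 block scale per 16 weights) is an assumed NVFP4 overhead factor, not a published number, and embeddings, activations, and framework overhead are ignored.

```python
# Back-of-the-envelope VRAM estimate for GLM-5.1 REAP-160 at NVFP4.
# The bytes/param figure (4-bit weights + one FP8 scale per 16 weights)
# is an assumed overhead factor, not an official number.
total_params = 478e9               # REAP-160 total parameters
bytes_per_param = 4 / 8 + 1 / 16   # ~0.5625 B/param
weight_bytes = total_params * bytes_per_param

gpus, vram_per_gpu = 4, 96e9       # 4x 96 GB workstation Blackwell
headroom = gpus * vram_per_gpu - weight_bytes

print(f"weights:  {weight_bytes / 1e9:.0f} GB")   # ~269 GB
print(f"headroom: {headroom / 1e9:.0f} GB")       # ~115 GB for KV cache etc.
```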
## What the reference deployment assumes
Zhipu's deployment guidance for GLM-5 targets:
- Hardware: H200 / B200, ideally with NVLink
- GEMM: stock CUTLASS NVFP4 templates
- Attention: FlashAttention-2 (MHA) or custom DSA kernel
- MoE: sglang or vLLM, with the engine choosing between FlashInfer-MoE and CUTLASS-MoE backends
- Parallelism: EP=N for the experts, TP within each expert (placement sketched below)
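To make the EP plan concrete, here is an illustrative sketch of the expert placement it implies. The function is not an sglang or vLLM API, just a picture of why top-k routing under EP forces an all-to-all.

```python
def expert_placement(num_experts: int, ep_size: int) -> dict[int, list[int]]:
    """Illustrative EP plan: assign routed experts round-robin to EP ranks.

    Real engines decide placement internally; this only shows why top-k
    routing under EP needs an all-to-all: each token's hidden state must be
    sent to whichever rank owns each expert it was routed to.
    """
    placement = {rank: [] for rank in range(ep_size)}
    for expert_id in range(num_experts):
        placement[expert_id % ep_size].append(expert_id)
    return placement

# GLM-5.1 (full): 256 routed experts over 8 EP ranks -> 32 experts per rank.
plan = expert_placement(num_experts=256, ep_size=8)
print(len(plan[0]), "experts on rank 0")
```

Under TP instead, every rank holds a 1/TP slice of every expert, so routed tokens stay on their rank and only the usual tensor-parallel all-reduces cross the interconnect.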
The dependency stack is the lightest of the three families covered here: stock CUTLASS, stock FlashAttention, no custom GEMM library. The only tcgen05-specific code path is the FlashInfer-MoE one-shot all-to-all, which is opt-in.
## What breaks on workstation Blackwell

### 1. EP plan (default for some configs)
If the inference engine defaults to EP, you hit the all-to-all bandwidth wall. Switch to TP.
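A minimal TP-only launch sketch using vLLM's offline API: `tensor_parallel_size`, `kv_cache_dtype`, and `max_model_len` are real vLLM engine arguments, but the checkpoint name is hypothetical and any GLM-5-specific defaults are assumptions here.

```python
from vllm import LLM, SamplingParams

# TP-only plan: every GPU holds a 1/4 slice of every expert, so no expert
# all-to-all crosses PCIe. In vLLM, expert parallelism is opt-in
# (enable_expert_parallel), so simply not enabling it keeps the TP plan.
llm = LLM(
    model="zai-org/GLM-5.1-REAP-160-NVFP4",  # hypothetical checkpoint name
    tensor_parallel_size=4,                  # TP=4 across the four cards
    kv_cache_dtype="fp8_e4m3",               # FP8 KV, as in the config below
    max_model_len=32768,                     # raise after checking headroom
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```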
### 2. DSA path on SM120
GLM-5.1's DSA kernel has a path that fails to compile on SM120 in some inference engine versions. This is a known issue; vLLM 0.7.x and sglang 0.5.10+ have patched it.
For long-context inference, you have two options (a version-gating sketch follows the list):
- Enable DSA with the patched engines
- Fall back to MHA (slower at long context, but correct)
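A small sketch of that decision as a version gate, assuming the 0.5.10 threshold quoted above. The "dsa"/"mha" strings are just labels for this sketch; how the choice is actually passed to the engine (flag, env var, config key) is engine-specific.

```python
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

# First sglang release cited above as carrying the SM120 DSA fix.
DSA_PATCHED_SGLANG = Version("0.5.10")

def pick_attention_variant() -> str:
    """Return 'dsa' only if the installed sglang should have the SM120 fix.

    'dsa' / 'mha' are labels for this sketch; feeding the choice to the
    engine is engine-specific.
    """
    try:
        installed = Version(version("sglang"))
    except PackageNotFoundError:
        return "mha"   # no sglang installed; stay on the known-good path
    return "dsa" if installed >= DSA_PATCHED_SGLANG else "mha"

print(pick_attention_variant())
```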
### 3. CUTLASS SMEM cliff
Same as elsewhere — CUTLASS NVFP4 templates may overflow SMEM. Use SM120 templates with smaller tiles.
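The arithmetic behind the cliff, as an illustrative sketch only: mainloop shared memory scales with the A and B tile footprints times the pipeline stage count. The per-SM budgets in the comments are approximate, assumed figures for the two chip classes, and real kernels also reserve SMEM for scale factors, barriers, and the epilogue.

```python
def mainloop_smem_bytes(tile_m: int, tile_n: int, tile_k: int,
                        bytes_per_elem: float, stages: int) -> int:
    """Approximate mainloop SMEM: (A tile + B tile) per pipeline stage.

    Ignores NVFP4 scale factors, mbarriers, and the epilogue, so real
    kernels need somewhat more than this.
    """
    a_tile = tile_m * tile_k * bytes_per_elem
    b_tile = tile_n * tile_k * bytes_per_elem
    return int((a_tile + b_tile) * stages)

# NVFP4 packs ~0.5 bytes per element (scales ignored here).
big = mainloop_smem_bytes(128, 256, 256, 0.5, stages=4)    # datacenter-style tile
small = mainloop_smem_bytes(128, 128, 128, 0.5, stages=3)  # smaller SM120 tile
print(f"{big / 1024:.0f} KiB vs {small / 1024:.0f} KiB")   # 192 KiB vs 48 KiB
# Assumed budgets: roughly 228 KiB of SMEM per SM on datacenter Blackwell vs
# roughly 99 KiB usable per block on SM120, so the big tile fits the former
# and overflows the latter.
```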
### 4. fp8_e4m3 KV on certain sglang versions
Some intermediate sglang versions hit a tcgen05.cp-based dequant path for FP8 KV. Pin to sglang 0.5.10.post1 or later which uses an mma.sync-based dequant path.
## What works on workstation Blackwell
A working configuration for GLM-5.1 REAP-160 NVFP4:
```yaml
weights: NVFP4 (REAP-160 prune of GLM-5.1, fits 4× 96 GB)
kv_cache: FP8 E4M3
attention: Triton-based with kv_splits=64
parallelism:
  tensor_parallel: 4
  pipeline_parallel: 1
  expert_parallel: 1
gemm_backend: cutlass (SM120 templates)
attention_backend: triton (auto-selected on SM120)
```
GLM-5.1 REAP-160 + NVFP4 quantization fits 4× 96 GB workstation Blackwell with substantial KV-cache headroom — enough for 200k-token context with FP8 KV. This is the model class for which workstation Blackwell is genuinely viable as a deployment target.
## Performance expectations
On 4× workstation Blackwell:
| Variant | Context (tokens) | Decode tok/s |
|---|---|---|
| GLM-5.1 REAP-160 | 256 | 46 |
| GLM-5.1 REAP-160 | 4K | 42 |
| GLM-5.1 REAP-160 | 16K | 38 |
| GLM-5.1 REAP-160 | 150K | 22 |
These are from a TP=4, NVFP4-weights, FP8-KV, kv_splits=64 configuration. Without the kv_splits=64 setting, the long-context number drops to ~5 tok/s — that one knob is the largest single perf lever on this hardware class.
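For intuition on why kv_splits dominates: a single decode query over a long KV cache has almost no parallelism unless the cache is split into chunks that are reduced independently and then merged (flash-decoding style). The numpy sketch below shows only that split-and-merge math; it is not the Triton kernel.

```python
import numpy as np

def decode_attention_split_kv(q, K, V, kv_splits=8):
    """Single-query attention over a long KV cache, computed in kv_splits chunks.

    Each chunk yields a partial (max, sum of exp, unnormalized output); the
    partials are merged with log-sum-exp rescaling. This is only the math
    behind flash-decoding-style kv_splits, not the actual kernel.
    """
    d = q.shape[-1]
    maxes, sums, accs = [], [], []
    for idx in np.array_split(np.arange(K.shape[0]), kv_splits):
        s = (K[idx] @ q) / np.sqrt(d)      # scores for this KV chunk
        m = s.max()
        p = np.exp(s - m)
        maxes.append(m)
        sums.append(p.sum())
        accs.append(p @ V[idx])            # unnormalized chunk output
    m_all = max(maxes)
    scales = [np.exp(m - m_all) for m in maxes]
    denom = sum(s_ * c for s_, c in zip(sums, scales))
    numer = sum(a * c for a, c in zip(accs, scales))
    return numer / denom

# Sanity check against plain softmax attention.
rng = np.random.default_rng(0)
T, d = 1024, 64
q = rng.standard_normal(d)
K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
s = (K @ q) / np.sqrt(d)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(decode_attention_split_kv(q, K, V), ref)
```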
For comparison: the same model on a 4× B100 (datacenter Blackwell) deployment hits ~120–150 tok/s in the same configuration. The ~5× gap is consistent across all the case studies.
## What makes GLM-5 the "easy" case

- Lighter custom-kernel surface: stock CUTLASS, stock FlashAttention. No DeepGEMM-equivalent.
- Pruning recipe sized for consumer hardware: REAP-160 specifically targets 4× 96 GB.
- MoE plan flexibility: the architecture works fine under TP-only, with no EP requirement.
- Active community: workstation Blackwell deployment is documented and tested.
If you're new to running MoE on workstation Blackwell and want a model that "just works," GLM-5.1 REAP-160 NVFP4 is the recommended starting point.
## What MTP and NSA add

- MTP (Multi-Token Prediction): speculative decoding for higher throughput. Adds an extra prediction head; requires page_size=64 and BF16 KV. Optional; opt-in via inference engine flags.
- NSA (Native Sparse Attention): sparsity in the attention computation for long-context. Has SM120 kernel issues in some versions; safest to leave disabled.
For initial deployments, leave both disabled. Enable MTP if you need throughput beyond what the baseline configuration provides.
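For intuition on what MTP buys, the toy sketch below shows the generic greedy draft-and-verify loop that a multi-token-prediction head feeds into. It is not GLM's MTP implementation, and in a real engine the verification of all drafted positions happens in a single batched forward pass rather than one call per token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step (illustration only).

    draft_next / target_next map a token sequence to the next token; they
    stand in for the cheap MTP head and the full model. Returns the tokens
    accepted this step: the longest drafted prefix the target agrees with,
    plus one token from the target, so progress is always >= 1 token.
    """
    # 1. Draft k tokens autoregressively with the cheap head.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify against the target model, accepting the agreeing prefix.
    accepted, ctx = [], list(prefix)
    for t in drafted:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)      # correction from the target
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))      # bonus token after a full match
    return accepted

# Toy models: the target counts up by one; the draft is right for a while,
# then diverges, so only part of its proposal gets accepted.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if len(seq) < 5 else seq[-1] + 2
print(speculative_step([0, 1, 2], draft, target))   # -> [3, 4, 5]
```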
## See also

- deepseek-v3-v4 — heavier dependency stack
- kimi-k2 — comparable size, more memory pressure
- compatibility/ — the patterns
- Zhipu AI's GLM-5 release notes and HuggingFace model cards
- The REAP pruning paper