Case studies
Why specific frontier models break on workstation Blackwell, and what each one teaches about the architecture split.
Pages in this section
- deepseek-v3-v4 — the canonical EP-over-NVLink design
- kimi-k2 — Moonshot's MoE family, similar profile
- glm-5 — Zhipu's MoE family, less aggressive on tcgen05
- generic-moe-on-consumer-blackwell — the synthesis: a checklist for any new MoE model
Why these models in particular
Each of these models was released between mid-2024 and early 2026 by a frontier lab targeting datacenter Blackwell as the deployment platform. They're representative of what the modern open-weight MoE looks like:
- 100B–700B total parameters, 30B–60B active per token
- 64–256 experts, top-k=2 to top-k=8 routing
- NVFP4 quantization shipped as part of the release
- Reference deployment built on DeepGEMM and FlashInfer kernels, served through sglang or vLLM
- Performance numbers benchmarked on B100 or H200
The common pattern: the reference deployment assumes datacenter Blackwell. Running on workstation Blackwell surfaces the same core issues each time, with model-specific variations.
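To see why these profiles strain workstation hardware, a rough weight-memory estimate helps. The sketch below assumes NVFP4's typical layout of 4-bit values with one FP8 scale per 16-element block; this is an approximation that ignores global scale factors and layers left unquantized, so treat the numbers as order-of-magnitude.

```python
def nvfp4_weight_gib(n_params: float, block_size: int = 16) -> float:
    """Rough NVFP4 weight footprint in GiB: 4-bit values plus one FP8
    scale per block (ignores global scales and unquantized layers)."""
    bits_per_param = 4 + 8 / block_size  # 4.5 bits/param at block=16
    return n_params * bits_per_param / 8 / 2**30

# Even the low end of the range (100B total parameters) needs ~52 GiB
# of weights alone, before KV cache and activations -- already tight
# on a single ~96 GB workstation Blackwell card.
print(f"{nvfp4_weight_gib(100e9):.1f} GiB")
print(f"{nvfp4_weight_gib(700e9):.1f} GiB")
```

At the 700B end the weights alone exceed 350 GiB, which is why even the "runs at all" configuration below still spans multiple workstation GPUs.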
How to use this section¶
Each case study answers four questions:
- What is the model? (architecture, size, intended use)
- What does its reference deployment assume? (hardware, kernels, parallelism)
- What breaks on workstation Blackwell? (specific kernels, specific failure modes)
- What's the workaround? (configuration, kernel substitution, plan rewriting)
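The four questions amount to a fixed schema, one record per model. A minimal sketch of that schema, with an illustrative entry whose values paraphrase this section (they are not a real spec):

```python
from dataclasses import dataclass

@dataclass
class CaseStudy:
    """One record per model, mirroring the four questions above."""
    model: str                        # what is the model?
    reference_assumes: list[str]      # hardware, kernels, parallelism
    breaks_on_workstation: list[str]  # specific kernels, failure modes
    workaround: str                   # the configuration that runs

# Illustrative entry; the details live on the model's own page.
example = CaseStudy(
    model="deepseek-v3-v4",
    reference_assumes=["datacenter Blackwell", "EP over NVLink", "DeepGEMM"],
    breaks_on_workstation=["EP collectives without NVLink", "tcgen05 kernels"],
    workaround="TP-only, NVFP4 weights, FP8 KV, Triton attention",
)
print(example.model)
```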
If you're trying to deploy a specific model, find it here and read the relevant page. If your model isn't covered, generic-moe-on-consumer-blackwell gives you the diagnostic procedure to apply yourself.
A meta-observation
The case studies all converge to the same answer: TP-only parallelism + NVFP4 (or W4A16) weights + FP8 KV + Triton attention with high kv-splits + DeepGEMM disabled.
This is approximately the recipe that any workstation-Blackwell MoE deployment ends up using by mid-2026. It's not optimal — there's a 30–50× throughput gap to a datacenter deployment — but it's the configuration that runs, which is what most users need.
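Written down as server configuration, the recipe looks roughly like the sketch below. The argument names follow the style of sglang's launch flags but are illustrative assumptions, not exact spellings; check the CLI help of your installed version before copying any of them.

```python
def workstation_fallback_config(model_path: str, n_gpus: int) -> dict:
    """The converged workstation-Blackwell recipe as engine arguments.
    Key names are illustrative (sglang-style), not a verified CLI."""
    return {
        "model_path": model_path,
        "tp_size": n_gpus,              # TP-only: no expert parallelism
        "quantization": "nvfp4",        # or "w4a16" where NVFP4 kernels fail
        "kv_cache_dtype": "fp8_e4m3",   # FP8 KV cache
        "attention_backend": "triton",  # avoid tcgen05-dependent kernels
        "triton_num_kv_splits": 16,     # high kv-splits for decode attention
        "disable_deep_gemm": True,      # DeepGEMM off (illustrative flag)
    }

cfg = workstation_fallback_config("deepseek-ai/DeepSeek-V3", n_gpus=4)
print(cfg["attention_backend"])
```

Everything above the dict is the part that converges across models; the values you actually tune per model are the quantization choice and the kv-split count.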
The rest of the deployment story (specific tile shapes, specific FP8 KV layouts, specific page sizes) varies between models, but the broad strokes are converging on this pattern.