Kimi-K2 family

Moonshot AI's MoE family, released across 2025–2026 (K2.0 → K2.6+). Its profile is similar to DeepSeek-V3/V4 (MoE with NVFP4 quantization), but a few distinguishing factors affect how it runs on workstation Blackwell.

The model

                     K2.0                             K2.6
Total parameters     ~600 B                           ~700 B
Active per token     ~32 B                            ~40 B
Number of experts    384                              384 (with refined routing)
Top-k                8                                6
Hidden dim           6144                             7168
Attention            GQA (Grouped-Query Attention)    GQA
Native quantization  FP8 → NVFP4 (later release)      NVFP4

Kimi uses standard GQA rather than DeepSeek's MLA. The KV cache is larger per token than with MLA, but the attention kernels are universally available: FlashAttention-2 supports GQA out of the box.
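
A back-of-the-envelope sketch makes the trade-off concrete. The layer count, KV-head count, and head dim below are illustrative assumptions, not published K2 numbers:

# Per-token KV-cache size under GQA (illustrative geometry, not K2 specs).
def gqa_kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical K2-like geometry with an FP8 KV cache (1 byte/element):
print(gqa_kv_bytes_per_token(61, 8, 128, 1))   # 124928 bytes ≈ 122 KiB/token

MLA instead caches a compressed per-layer latent, typically several times smaller per token, which is the size advantage Kimi gives up in exchange for portable kernels.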

What the reference deployment assumes

Moonshot's deployment guidance for K2 targets:

  • Hardware: H200 / B100 / B200, ideally with NVLink
  • GEMM: their fork of CUTLASS with custom NVFP4 templates targeting sm_100a
  • Attention: FlashAttention-2 or FlashInfer (GQA paths)
  • MoE: vLLM with FlashInfer-MoE for the all-to-all
  • Parallelism: EP for the experts, TP within each expert, PP across model layers (a toy rank-layout sketch follows this list)
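
For intuition, here is a toy sketch of how such an EP × TP × PP grid maps onto ranks. The degrees and the axis ordering are illustrative; each framework picks its own convention:

# Toy EP x TP x PP rank layout (illustrative degrees, not Moonshot's geometry).
EP, TP, PP = 4, 2, 2             # expert-, tensor-, pipeline-parallel degrees
WORLD = EP * TP * PP             # 16 GPUs total

def coords(rank):
    # TP innermost, then EP, then PP -- one common convention, not the only one.
    return dict(tp=rank % TP, ep=(rank // TP) % EP, pp=rank // (TP * EP))

for r in range(WORLD):
    print(r, coords(r))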

The dependency surface is less aggressive than DeepSeek's: Kimi doesn't ship a custom GEMM library (it uses CUTLASS), and attention goes through portable FlashAttention-2 rather than a custom MLA kernel. The MoE all-to-all is the single significant SM100-only assumption.

What breaks on workstation Blackwell

1. CUTLASS NVFP4 paths hit the SMEM cliff

Moonshot's CUTLASS templates inherit the same SMEM-budget assumptions as upstream CUTLASS. On SM120, the automatic carveout request exceeds the 99 KiB per-block limit and corrupts SMEM banks.

Fix: either patch the CUTLASS templates to set an explicit, smaller StageCount, or use the upstream SM120-targeted templates (whose tile shapes differ slightly from Moonshot's defaults).
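
A quick way to see why an explicit StageCount matters: each software-pipeline stage buffers one A-tile and one B-tile in shared memory, so the stage count multiplies the footprint. The tile shape below is an illustrative assumption, not Moonshot's template default, and NVFP4 scale-factor storage is ignored:

# How many pipeline stages fit under SM120's 99 KiB per-block SMEM limit.
SMEM_LIMIT = 99 * 1024            # bytes per block on SM120
M, N, K = 128, 128, 128           # hypothetical CTA tile shape
BYTES_PER_ELEM = 0.5              # NVFP4 = 4 bits/element (scales ignored)

tile_bytes = (M * K + K * N) * BYTES_PER_ELEM     # A-tile + B-tile per stage
print(tile_bytes, int(SMEM_LIMIT // tile_bytes))  # 16384.0, 6 stages max

Templates tuned for SM100's larger carveout can legitimately request more stages than this, which is exactly the cliff described above.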

2. EP-with-FlashInfer-a2a breaks on PCIe atomics

Same failure mode as DeepSeek: the fused all-to-all relies on atomics that PCIe doesn't provide. Use the NCCL fallback or switch to TP-only, as sketched below.
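
A minimal sketch of what the NCCL fallback does for token dispatch, assuming torch.distributed is initialized with the NCCL backend (CUDA tensors required). This stands in for the fused FlashInfer all-to-all; it is not Moonshot's implementation:

# MoE token dispatch over plain NCCL all-to-all (sketch, not production code).
import torch
import torch.distributed as dist

def dispatch_tokens(tokens, send_counts):
    # send_counts: int64 CUDA tensor of length world_size;
    # send_counts[i] = number of this rank's tokens routed to rank i's experts.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)   # exchange split sizes

    out = tokens.new_empty((int(recv_counts.sum()), tokens.shape[1]))
    dist.all_to_all_single(
        out, tokens,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return out   # tokens now sit on the ranks that own their experts

Over PCIe this is bandwidth-bound twice per MoE layer (dispatch and combine), which is where the EP-NCCL throughput numbers later in this page come from.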

3. The 384-expert count makes TP-only memory-tight

With 384 experts at ~1.5 GB each in NVFP4, total expert weight memory is ~576 GB. On a 4× 96 GB rig (384 GB total), TP=4 slices every expert across all four GPUs, so each GPU holds a quarter of all 384 experts: 576/4 = 144 GB per GPU, which doesn't fit in 96 GB.
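
The arithmetic, spelled out:

# Per-GPU expert memory with TP=4 on a 4x 96 GB rig.
n_experts, gb_per_expert, n_gpus = 384, 1.5, 4
total_gb = n_experts * gb_per_expert    # 576 GB of expert weights
print(total_gb, total_gb / n_gpus)      # 576.0, 144.0 -> far over 96 GB/GPU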

This is the case where pure TP doesn't work and you need a hybrid plan: TP=4 + PP=2 (split layers across GPU pairs), or accept EP-with-NCCL despite the bandwidth cost.

For K2.6, you typically need either:

  • Pruning to ~256 active experts (REAP-style), which fits TP-only
  • A hybrid TP × PP plan (slower per-token decode, lower memory pressure)
  • Accepting EP-with-NCCL (slow but works)

4. Routing kernel quirks

K2 uses a custom top-k routing kernel that assumes tcgen05-style asynchronous Tensor Core execution for the routing softmax. On SM120, this falls back to a slower path. Not a correctness issue, just a performance hit.
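
What the slow path has to compute is ordinary top-k routing. A reference sketch in PyTorch, with K2.6-like shapes (384 experts, top-6); this is a sketch, not Moonshot's kernel:

# Reference top-k routing: softmax over expert logits, keep top-k, renormalize.
import torch

def route(logits, k=6):
    # logits: [n_tokens, n_experts]
    probs = torch.softmax(logits, dim=-1)
    topk_p, topk_i = probs.topk(k, dim=-1)
    topk_p = topk_p / topk_p.sum(-1, keepdim=True)   # renormalize over top-k
    return topk_p, topk_i                            # weights, expert indices

weights, ids = route(torch.randn(8, 384))            # 8 tokens, 384 experts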

Working configuration

weights: NVFP4 (Moonshot's K2.6 release, 256 experts after REAP pruning)
kv_cache: FP8 E4M3
attention: FlashAttention-2 (GQA path) or Triton fallback
parallelism:
  tensor_parallel: 4
  pipeline_parallel: 1 (or 2 if memory tight)
  expert_parallel: 1
gemm_backend: cutlass with explicit StageCount=2 or 3

Note: with FlashAttention-2 working out of the box for GQA, Kimi-K2 is in some ways easier to run on workstation Blackwell than DeepSeek-V4, despite being a comparable-scale model.

Performance expectations

On 4× workstation Blackwell:

Variant                                    Decode tok/s
K2.6 with REAP pruning to 256 experts      30–50
K2.6 full 384 experts via TP × PP=2        15–30
K2.6 via EP-NCCL                           5–10

A datacenter B100 deployment hits 100–200 tok/s. The gap to datacenter hardware is similar to DeepSeek-V4's: roughly 5×.

What's distinctive about Kimi

  • No custom GEMM library: depends on CUTLASS, which has SM120 support. Easier to port.
  • Standard GQA attention: works on every kernel library.
  • High expert count: stresses memory more than DeepSeek (384 experts vs 256).
  • Routing kernel assumes tcgen05-style async execution: a specific kernel-level dependency, not a model-architecture one.

If you can run DeepSeek-V4 on workstation Blackwell, you can definitely run Kimi-K2 — except possibly for the memory pressure from 384 experts.
