
Bibliography

Selected sources for further reading, grouped by topic.

NVIDIA architecture & PTX

  • NVIDIA. Blackwell Architecture Whitepaper. 2024–2025. The canonical reference for SM100 and SM120 design.
  • NVIDIA. PTX ISA Reference Manual (8.5 and newer; current as of CUDA 12.x / 13.x). Contains specifications for mma.sync, tcgen05.*, cp.async, cp.async.bulk (TMA), thread-block clusters, and distributed shared memory.
  • NVIDIA. CUDA C++ Programming Guide (current edition). Sections on Tensor Cores, async copies, cooperative groups, and cluster launch.
  • NVIDIA. CUDA C++ Best Practices Guide. Memory hierarchy guidance, occupancy tuning.

Number formats

  • NVIDIA. NVFP4 Block Format Specification. Part of the CUTLASS docs. Defines the 16-element block size, the FP8 (E4M3) per-block scale, and the SM100 vs SM120 layouts; a small worked sketch follows this list.
  • OCP (Open Compute Project). Microscaling (MX) Formats Specification, v1.0. October 2023. Defines MX-FP4, MX-FP6, MX-FP8 (the open-format counterparts to NVFP4).
  • NVIDIA. FP8 Formats for Deep Learning. White paper, 2022. E4M3 and E5M2 details, scaling, and accuracy implications.
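
The block-scaling idea shared by NVFP4 and the MX formats is easy to make concrete. The following is a minimal numpy sketch under stated assumptions: it uses the FP4 (E2M1) magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, approximates the FP8 (E4M3) per-block scale with an ordinary float, and the function names are illustrative rather than any library's API.

```python
import numpy as np

# Toy NVFP4-style block quantization (illustration only, not NVIDIA's reference code).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 16  # elements sharing one scale

def quantize_block(x):
    """Map one 16-element block to (scale, signs, FP4 codes)."""
    scale = max(float(np.abs(x).max()) / FP4_GRID[-1], 1e-12)   # largest magnitude lands on 6.0
    codes = np.argmin(np.abs(np.abs(x)[:, None] / scale - FP4_GRID[None, :]), axis=1)
    return scale, np.sign(x), codes

def dequantize_block(scale, signs, codes):
    return signs * FP4_GRID[codes] * scale

x = np.random.randn(BLOCK).astype(np.float32)
x_hat = dequantize_block(*quantize_block(x))
print(np.abs(x - x_hat).max())  # per-block quantization error
```

The OCP MX formats follow the same pattern with a 32-element block and a shared power-of-two (E8M0) scale.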

Tensor Cores & matrix ops

  • CUTLASS GitHub repository: github.com/NVIDIA/cutlass. The primary source for understanding Tensor Core invocation patterns; the include/cutlass/arch/mma_sm100.h and mma_sm120.h headers document tcgen05 and mma.sync in code form.
  • NVIDIA. Programming Tensor Cores in CUDA. Developer blog post, multiple iterations.

FlashAttention

  • Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. The online-softmax tiling the series builds on is sketched after this list.
  • Dao, Tri. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." 2023.
  • Shah, Jay, et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision." 2024. Hopper-specific design; documents the warp-specialized 2-stage pipeline that doesn't extend cleanly to SM100, much less SM120.
  • FlashAttention GitHub repo: github.com/Dao-AILab/flash-attention. FA2 source is portable; FA3 is Hopper-only.
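
For orientation, the tiling trick this line of work builds on is online softmax: attention is computed one key/value block at a time while a running max, normalizer, and accumulator are rescaled as each new block arrives. Below is a toy single-query numpy sketch of that idea; the block size and names are arbitrary, and it is not the papers' CUDA implementation.

```python
import numpy as np

def blocked_attention(q, K, V, block=64):
    """Single-query attention over key/value blocks with a running
    (max, normalizer, accumulator) triple: the online-softmax idea."""
    d = q.shape[-1]
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)                # scores for this block
        m_new = max(m, float(s.max()))
        p = np.exp(s - m_new)
        acc = acc * np.exp(m - m_new) + p @ v_blk   # rescale old sum, add new block
        l = l * np.exp(m - m_new) + p.sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(256, 32)), rng.normal(size=(256, 32)), rng.normal(size=32)
s = (K @ q) / np.sqrt(32)
reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(blocked_attention(q, K, V), reference)  # matches the naive computation
```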

DeepSeek architecture

  • DeepSeek-AI. DeepSeek-V3 Technical Report. December 2024. Documents MLA (Multi-head Latent Attention), which first appeared in DeepSeek-V2.
  • DeepSeek-AI. DeepSeek-V3.2 Technical Report. Introduces DSA (DeepSeek Sparse Attention).
  • DeepSeek-AI. Native Sparse Attention. Paper introducing NSA, which V4 is widely speculated to use.
  • DeepGEMM GitHub repository: github.com/deepseek-ai/DeepGEMM. FP8/FP4 GEMM kernels, SM100-targeted.

Kimi K2

  • Moonshot AI. Kimi K2 Technical Report. 2025. Background on the K2 family, the MuonClip optimizer, and the MoE structure.
  • Kimi K2 model card on Hugging Face: hyperparameters and architectural details.

GLM-5

  • Zhipu AI / Tsinghua KEG. GLM-5.1 Technical Report (forthcoming or already released, depending on the version). Architecture details for GLM-5.1 478B-A42B.
  • REAP Pruning paper. Router-weighted expert pruning method enabling REAP-160 and REAP-172.

NCCL & collectives

  • NVIDIA. NCCL User Guide. Documents environment variables such as NCCL_P2P_LEVEL and NCCL_IB_DISABLE, and the choice between ring and tree algorithms.
  • NCCL GitHub repository: github.com/NVIDIA/nccl. Source for understanding ring-allreduce mechanics.
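
The ring algorithm itself is small enough to simulate in a few lines. The sketch below is a toy single-process numpy walk-through of the usual N-rank / N-chunk formulation (reduce-scatter, then all-gather); it is not NCCL's API, and the function name is made up for illustration.

```python
import numpy as np

def ring_allreduce(per_rank_chunks):
    """Toy ring allreduce (sum): N ranks, each holding N chunks.
    Reduce-scatter: after N-1 steps each rank owns one fully reduced chunk.
    All-gather: after N-1 more steps every rank holds all reduced chunks."""
    n = len(per_rank_chunks)
    data = [[np.array(c, dtype=float) for c in rank] for rank in per_rank_chunks]

    for step in range(n - 1):                  # reduce-scatter phase
        for r in range(n):
            c = (r - step) % n                 # chunk rank r passes along this step
            data[(r + 1) % n][c] += data[r][c]

    for step in range(n - 1):                  # all-gather phase
        for r in range(n):
            c = (r + 1 - step) % n             # reduced chunk rank r forwards this step
            data[(r + 1) % n][c] = data[r][c].copy()
    return data

ranks = 4
result = ring_allreduce([[np.full(2, r) for _ in range(ranks)] for r in range(ranks)])
assert all(np.array_equal(chunk, np.full(2, 6.0)) for rank in result for chunk in rank)
```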

NVSHMEM & PGAS

  • NVIDIA. NVSHMEM Documentation. PGAS abstraction for GPU clusters.
  • NVSHMEM GitHub repository: github.com/NVIDIA/nvshmem (or via the open-source mirror).

DeepEP

  • DeepEP GitHub repository: github.com/deepseek-ai/DeepEP. Expert-parallel all-to-all dispatch/combine kernels for MoE, built on NVSHMEM.

Inference engines

  • vLLM GitHub repository: github.com/vllm-project/vllm.
  • SGLang GitHub repository: github.com/sgl-project/sglang.

MoE all-to-all

  • PPLX library: an alternative MoE all-to-all implementation, sometimes used by SGLang and vLLM as a non-NVSHMEM path.

Marlin & weight-only quantization

  • Frantar, Elias, et al. "Marlin: Fast 4-bit Inference Kernels." 2024. Mixed-precision GEMM design.
  • Marlin GitHub repository: github.com/IST-DASLab/marlin.

Triton

  • Tillet, Philippe, et al. "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." MAPL 2019.
  • Triton GitHub repository: github.com/triton-lang/triton.

Mixture-of-Experts foundations

  • Shazeer, Noam, et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. Origin of modern transformer MoE; the top-k gating idea is sketched after this list.
  • Fedus, William, et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
  • Lepikhin, Dmitry, et al. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." ICLR 2021.
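
As a reading aid, here is a schematic numpy sketch of the top-k gating idea these papers revolve around: a linear router scores every expert per token, only the k highest-scoring experts receive the token, and their gate values are renormalized. Shapes and names are illustrative, and the noise and load-balancing terms of the original papers are omitted.

```python
import numpy as np

def topk_gating(x, w_gate, k=2):
    """Sparse top-k routing: softmax over only the k largest gate logits per token."""
    logits = x @ w_gate                                      # [tokens, experts]
    topk = np.argpartition(-logits, k, axis=-1)[:, :k]       # indices of the k largest logits
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, topk, np.take_along_axis(logits, topk, axis=-1), axis=-1)
    gates = np.exp(masked - masked.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)               # zero weight on unselected experts
    return gates, topk

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 8, 16, 4
gates, experts = topk_gating(rng.normal(size=(tokens, d_model)),
                             rng.normal(size=(d_model, n_experts)), k=2)
print(gates.round(2))  # each row has exactly two nonzero gates summing to 1
```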

REAP pruning

  • (REAP paper, 2024–2025). "Router-weighted Expert Activation Pruning" — describes REAP-160 / REAP-172 methodology.

This wiki itself is opinionated synthesis, not a primary source. When in doubt, the NVIDIA documentation and the kernel library source code are authoritative.