Bibliography¶
Selected sources for further reading, grouped by topic.
NVIDIA architecture & PTX¶
- NVIDIA. Blackwell Architecture Whitepaper. 2024–2025. The canonical reference for SM100 and SM120 design.
- NVIDIA. PTX ISA 8.5 Reference Manual. Current as of CUDA 12.x / 13.x. Contains specifications for `mma.sync`, `tcgen05.*`, `cp.async`, `cp.async.bulk` (TMA), thread-block clusters, and distributed shared memory.
- NVIDIA. CUDA C++ Programming Guide (current edition). Sections on Tensor Cores, async copies, cooperative groups, and cluster launch.
- NVIDIA. CUDA C++ Best Practices Guide. Memory hierarchy guidance, occupancy tuning.
Number formats¶
- NVIDIA. NVFP4 Block Format Specification. Part of CUTLASS docs. Defines the 16-element block size, FP8 (E4M3) per-block scale, and SM100 vs SM120 layouts.
- OCP (Open Compute Project). Microscaling (MX) Formats Specification, v1.0. October 2023. Defines MX-FP4, MX-FP6, MX-FP8 (the open-format counterparts to NVFP4).
- NVIDIA. FP8 Formats for Deep Learning. White paper, 2022. E4M3 and E5M2 details, scaling, and accuracy implications.
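The mechanic these specifications share — small fixed-size blocks of narrow values, each carrying one scale — can be illustrated with a toy Python sketch. This is illustrative only: the real formats encode the per-block scale in FP8 (E4M3) or a shared exponent and pack values into 4-bit fields, both of which are simulated here with plain floats.

```python
# Toy sketch of NVFP4 / MX-FP4-style microscaling block quantization:
# 16-element blocks, one scale per block, values snapped to the FP4
# (E2M1) representable grid. Scale is a plain float here; the real
# formats store it as FP8 E4M3 (NVFP4) or a power-of-two exponent (MX).

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes
BLOCK = 16

def quantize_block(xs):
    """Quantize one block: scale so the block max maps to 6.0 (the E2M1
    max), then snap each scaled value to the nearest grid point."""
    amax = max(abs(x) for x in xs)
    scale = amax / 6.0 if amax > 0 else 1.0
    q = []
    for x in xs:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        q.append(mag if x >= 0 else -mag)
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

def roundtrip(xs):
    """Quantize and dequantize a vector block-by-block."""
    out = []
    for i in range(0, len(xs), BLOCK):
        scale, q = quantize_block(xs[i:i + BLOCK])
        out.extend(dequantize_block(scale, q))
    return out
```

The point of the small block size is visible in the sketch: the scale adapts every 16 elements, so one outlier only degrades its own block rather than the whole tensor.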
Tensor Cores & matrix ops¶
- CUTLASS GitHub repository: github.com/NVIDIA/cutlass — the primary source for understanding Tensor Core invocation patterns. The `include/cutlass/arch/mma_sm100.h` and `mma_sm120.h` files document `tcgen05` and `mma.sync` in code form.
- NVIDIA. Programming Tensor Cores in CUDA. Developer blog post, multiple iterations.
FlashAttention¶
- Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
- Dao, Tri. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." 2023.
- Shah, Jay, et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision." 2024. Hopper-specific design; documents the warp-specialized 2-stage pipeline that doesn't extend cleanly to SM100, much less SM120.
- FlashAttention GitHub repo: github.com/Dao-AILab/flash-attention. FA2 source is portable; FA3 is Hopper-only.
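The core idea formalized across all three papers — exact softmax attention computed one key/value tile at a time, with a running max and normalizer so the full score matrix never materializes — fits in a short Python sketch. This is pedagogical only (single query, one head, no IO or hardware modeling):

```python
import math

def attention_online(q, K, V, tile=2):
    """Exact attention for one query vector q over keys K and values V,
    processed tile-by-tile with the online-softmax recurrence:
    running max m, running normalizer l, running weighted sum acc."""
    d = len(q)
    m = float("-inf")        # max score seen so far
    l = 0.0                  # sum of exp(score - m) so far
    acc = [0.0] * len(V[0])  # unnormalized weighted sum of values
    for start in range(0, len(K), tile):
        for k, v in zip(K[start:start + tile], V[start:start + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            m_new = max(m, s)
            # Rescale previous partial results when the running max grows.
            corr = math.exp(m - m_new) if m != float("-inf") else 0.0
            p = math.exp(s - m_new)
            l = l * corr + p
            acc = [a * corr + p * vi for a, vi in zip(acc, v)]
            m = m_new
    return [a / l for a in acc]
```

The output is bit-for-bit the same softmax attention a naive implementation computes (up to float rounding); the papers' contribution is organizing this recurrence around the GPU memory hierarchy.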
DeepSeek architecture¶
- DeepSeek-AI. DeepSeek-V3 Technical Report. December 2024. Introduces MLA (Multi-Latent Attention).
- DeepSeek-AI. DeepSeek-V3.2 Technical Report. Introduces DSA (DeepSeek Sparse Attention).
- DeepSeek-AI. Native Sparse Attention. Paper introducing NSA, which V4 is widely speculated to use.
- DeepGEMM GitHub repository: github.com/deepseek-ai/DeepGEMM. FP8/FP4 GEMM kernels, SM100-targeted.
Kimi K2¶
- Moonshot AI. Kimi K2 Technical Report. 2025. Background on the K2 family, the MuonClip optimizer, and MoE structure.
- Kimi K2 model card on Hugging Face: hyperparameters and architectural details.
GLM-5¶
- Zhipu AI / Tsinghua KEG. GLM-5.1 Technical Report (forthcoming or already released, depending on version). Architecture details for GLM-5.1 478B-A42B.
- REAP Pruning paper. Router-weighted expert pruning method enabling REAP-160 and REAP-172.
NCCL & collectives¶
- NVIDIA. NCCL User Guide. Documents `NCCL_P2P_LEVEL`, `NCCL_IB_DISABLE`, ring vs. tree algorithms, and other environment variables.
- NCCL GitHub repository: github.com/NVIDIA/nccl. Source for understanding ring-allreduce mechanics.
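For intuition before reading the NCCL source, ring all-reduce can be simulated in plain Python: a reduce-scatter phase (after N−1 steps each rank owns the full sum of one chunk) followed by an all-gather phase (the owned chunks circulate until every rank has every sum). A pedagogical sketch only — real NCCL pipelines these transfers and overlaps them with reduction:

```python
def ring_allreduce(rank_data):
    """Simulate ring all-reduce over n ranks. rank_data is a list of n
    equal-length vectors; each is split into n chunks. Returns each
    rank's final buffer, all equal to the elementwise sum."""
    n = len(rank_data)
    size = len(rank_data[0])
    assert size % n == 0
    c = size // n  # chunk length
    # bufs[r][k] = rank r's current copy of chunk k
    bufs = [[list(v[k * c:(k + 1) * c]) for k in range(n)] for v in rank_data]

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to its ring neighbor (r + 1) mod n, which adds it into its own copy.
    for s in range(n - 1):
        for r in range(n):
            k, dst = (r - s) % n, (r + 1) % n
            bufs[dst][k] = [a + b for a, b in zip(bufs[dst][k], bufs[r][k])]
    # Now rank r holds the fully reduced chunk (r + 1) mod n.

    # Phase 2: all-gather. The reduced chunks circulate around the ring.
    for s in range(n - 1):
        for r in range(n):
            k, dst = (r + 1 - s) % n, (r + 1) % n
            bufs[dst][k] = list(bufs[r][k])

    return [[x for k in range(n) for x in bufs[r][k]] for r in range(n)]
```

The simulation shows why the ring algorithm is bandwidth-optimal: each rank sends each chunk at most twice, independent of the number of ranks, at the cost of 2(N−1) latency steps (which is where tree algorithms win at small message sizes).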
NVSHMEM & PGAS¶
- NVIDIA. NVSHMEM Documentation. PGAS abstraction for GPU clusters.
- NVSHMEM GitHub repository: github.com/NVIDIA/nvshmem (or via the open-source mirror).
DeepEP¶
- DeepEP GitHub repository: github.com/deepseek-ai/DeepEP. MoE expert-parallel communication library, datacenter Blackwell targeted.
Inference engines¶
- vLLM GitHub repository: github.com/vllm-project/vllm. PagedAttention, continuous batching.
- sglang GitHub repository: github.com/sgl-project/sglang. Structured-output-aware inference engine.
- TensorRT-LLM GitHub repository: github.com/NVIDIA/TensorRT-LLM. NVIDIA's production inference compiler.
- FlashInfer GitHub repository: github.com/flashinfer-ai/flashinfer. Attention and MoE kernel library used by major engines.
MoE all-to-all¶
- PPLX library: alternative MoE all-to-all implementation, sometimes used by sglang and vLLM as a non-NVSHMEM path.
Marlin & weight-only quantization¶
- Frantar, Elias, et al. "Marlin: Fast 4-bit Inference Kernels." 2024. Mixed-precision GEMM design.
- Marlin GitHub repository: github.com/IST-DASLab/marlin.
Triton¶
- Tillet, Philippe, et al. "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." MAPL 2019.
- Triton GitHub repository: github.com/triton-lang/triton.
Mixture-of-Experts foundations¶
- Shazeer, Noam, et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. Origin of modern transformer MoE.
- Fedus, William, et al. "Switch Transformers." JMLR 2022.
- Lepikhin, Dmitry, et al. "GShard." ICLR 2021.
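The routing mechanism these papers introduce — a learned gate that sends each token to its top-k experts and combines their outputs under the renormalized gate weights — reduces to a few lines. An illustrative sketch only; the actual systems add gating noise, load-balancing auxiliary losses, and expert capacity limits:

```python
import math

def top_k_gate(logits, k=2):
    """Sparse MoE routing for one token: softmax over expert logits,
    keep the top-k experts, renormalize their weights to sum to 1."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return {i: probs[i] / z for i in top}  # expert index -> weight

def moe_forward(x, experts, logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gate = top_k_gate(logits, k)
    return sum(w * experts[i](x) for i, w in gate.items())
```

The sparsity is the whole point: compute scales with k, not with the total expert count, which is what makes the trillion-parameter MoE models in the sections above servable.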
REAP pruning¶
- (REAP paper, 2024–2025). "Router-weighted Expert Activation Pruning" — describes REAP-160 / REAP-172 methodology.
This wiki itself is opinionated synthesis, not a primary source. When in doubt, the NVIDIA documentation and the kernel library source code are authoritative.