P2P and atomics¶
Peer-to-peer GPU memory access, and the atomic operations that ride on top. The atomics question is the difference between "MoE all-to-all is slow" and "MoE all-to-all doesn't work at all."
What "P2P" means¶
P2P (peer-to-peer) GPU memory access is the ability of one GPU to read/write another GPU's memory directly, without copying through host RAM as an intermediate.
The driver enables P2P selectively. To check:
```
nvidia-smi topo -p2p r   # which pairs can do P2P reads
nvidia-smi topo -p2p w   # writes
nvidia-smi topo -p2p a   # atomic operations
```
Output is a matrix where each cell is one of:
- `OK` — supported and enabled
- `NS` — Not Supported (hardware/driver doesn't allow it)
- `X` — diagonal (a GPU's own memory)
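For scripting, the matrix can be checked programmatically. A minimal Python sketch, assuming a whitespace-separated matrix of the shape described above (the exact header format varies by driver version, so treat the sample layout as an assumption):

```python
# Sample in the shape nvidia-smi topo -p2p a prints (format assumed;
# adjust the parsing for your driver version's actual output).
SAMPLE = """\
      GPU0 GPU1
GPU0  X    NS
GPU1  NS   X
"""

def parse_p2p_matrix(text):
    """Return {(row_gpu, col_gpu): status} for off-diagonal cells."""
    lines = [line.split() for line in text.strip().splitlines()]
    cols = lines[0]                      # header row: GPU names
    result = {}
    for row in lines[1:]:
        row_gpu, cells = row[0], row[1:]
        for col_gpu, status in zip(cols, cells):
            if row_gpu != col_gpu:       # skip the X diagonal
                result[(row_gpu, col_gpu)] = status
    return result

def atomics_ok(text):
    """True only if every off-diagonal cell reports OK."""
    return all(s == "OK" for s in parse_p2p_matrix(text).values())

print(atomics_ok(SAMPLE))  # False: this sample shows NS everywhere
```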
Three operation classes¶
| Class | What | Latency | Use cases |
|---|---|---|---|
| Reads (`p2p r`) | GPU A reads from GPU B's HBM | high but tolerable | distributed inference activations, weight prefetch |
| Writes (`p2p w`) | GPU A writes to GPU B's HBM | similar to reads | NCCL P2P sends, NVSHMEM puts |
| Atomics (`p2p a`) | GPU A does an atomic op on GPU B's HBM | slowest; requires special hardware support | completion flags, lock-free synchronization, NVSHMEM signal/wait |
On NVLink, all three are typically supported and fast. On PCIe, reads and writes are usually OK; atomics are typically NS on consumer cards by default.
Why atomics are special¶
A peer-to-peer atomic operation (e.g., atomicAdd on a remote GPU's memory) requires the PCIe fabric and root complex to coordinate transactions in a way that's not needed for plain reads/writes. Specifically:
- The atomic must be indivisible with respect to other accesses to the same address
- The fabric must serialize concurrent atomics from different sources
- Each root complex must implement the AtomicOp PCIe extensions
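The indivisibility requirement is easiest to see with a lost-update example. The following is a host-side Python simulation of the hazard, not GPU code: two initiators increment the same remote location, first as an interleaved load/add/store sequence, then serialized the way a fabric with AtomicOp support would handle two atomicAdds.

```python
# Why indivisibility matters: a remote increment that is secretly
# load + add + store can lose updates when two initiators interleave.

def interleaved_nonatomic(mem, addr):
    # Both initiators load the same old value before either stores.
    a = mem[addr]          # initiator A loads 0
    b = mem[addr]          # initiator B loads 0
    mem[addr] = a + 1      # A stores 1
    mem[addr] = b + 1      # B stores 1 -- A's update is lost
    return mem[addr]

def serialized_atomic(mem, addr):
    # A fabric with AtomicOp support serializes the two increments.
    mem[addr] += 1         # A's increment completes indivisibly
    mem[addr] += 1         # then B's
    return mem[addr]

print(interleaved_nonatomic({0: 0}, 0))  # 1 -- lost update
print(serialized_atomic({0: 0}, 0))      # 2 -- correct
```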
PCIe Gen3 added the AtomicOp capability. PCIe Gen4 / Gen5 build on it. But whether a given PCIe path actually supports AtomicOp depends on:
- Both endpoints (GPU and root complex) declaring support
- The intermediate fabric (PCIe switches, root complex routing) propagating it
- The driver/firmware enabling it (often gated as a "datacenter feature")
On consumer GPUs, NVIDIA gates AtomicOp support off by default — partly to differentiate from datacenter products, partly because validation effort for consumer SKUs doesn't include atomic verification.
Enabling atomics on consumer Blackwell¶
Two settings, both required:
1. BIOS: ACS Disable¶
ACS (Access Control Services) is a PCIe feature that, when enabled, isolates devices into separate IOMMU groups and blocks P2P atomics across groups.
- ACS Enabled → atomics blocked
- ACS Disabled → atomics allowed (within hardware capability)
The setting lives in the motherboard BIOS. Naming and menu location vary by vendor (AMD and Intel platforms label it differently); "ACS" is the keyword to look for.
Some workstation motherboards do not expose this setting, or expose it but ignore the change (BIOS bug). Server motherboards almost always expose it.
2. Driver: RMDisableFeatureDisablement¶
An NVIDIA driver RegistryDwords setting that overrides the consumer-card atomic gating:
```
# In /etc/modprobe.d/nvidia-unlock.conf:
options nvidia NVreg_RegistryDwords="RMDisableFeatureDisablement=1"
```
Then reload the nvidia kernel module:
```
sudo systemctl stop docker.service
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
sudo systemctl start docker.service
```
This re-enables features that the driver normally disables for product segmentation, including P2P atomics where hardware permits.
Verifying atomics¶
After applying both settings, re-run the atomics check with `nvidia-smi topo -p2p a`. If the off-diagonal cells show `OK`, atomics are working. If they still show `NS` despite both settings being applied, one of the following is true:
- BIOS isn't actually applying the ACS Disable
- The motherboard's PCIe routing doesn't support AtomicOp for that path
- The driver RegistryDwords setting didn't take effect (verify with `cat /proc/driver/nvidia/params | grep RegistryDwords`)
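The last check can be scripted. A small helper, assuming the params file prints a `RegistryDwords:` line as the `grep` above implies (the exact format of `/proc/driver/nvidia/params` varies by driver version, so treat this parsing as an assumption):

```python
def registry_dword_set(params_text, key="RMDisableFeatureDisablement"):
    """Check whether the RegistryDwords line contains key=1.

    params_text is the contents of /proc/driver/nvidia/params;
    the line format here is an assumption, not a documented contract.
    """
    for line in params_text.splitlines():
        if line.startswith("RegistryDwords"):
            return f"{key}=1" in line
    return False

sample = 'RegistryDwords: "RMDisableFeatureDisablement=1"'
print(registry_dword_set(sample))  # True
```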
When atomics matter¶
Several inference-stack components depend on P2P atomics:
- FlashInfer's MoE one-shot all-to-all — busy-polls on completion flags updated via atomic writes
- NVSHMEM signal/wait pairs — same pattern, generalized
- Some custom MoE dispatch kernels — DeepEP variants, custom CUTLASS templates
- Some attention kernels with cross-GPU KV sharing — rare but present in some experimental stacks
If you're running any of these on a configuration where atomics are NS, the kernel busy-polls forever, the watchdog times out at 60 seconds, and the server crashes.
The error message looks like:
```
Rank 0 timed out waiting for completion flag from rank 1
Rank 1 timed out waiting for completion flag from rank 2
...
cudaFuncSetAttribute ... unspecified launch failure at cutlass_fused_moe_kernels.cuh:417
```
The "completion flag" terminology is your clue — it's a busy-poll on an atomic-updated flag.
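The pattern behind that failure can be sketched on the host (the names below are illustrative, not the FlashInfer or NVSHMEM API): a sender sets a flag in the receiver's memory with an atomic write, and the receiver busy-polls the flag under a watchdog. If the atomic never lands, the poll times out:

```python
import threading
import time

def wait_for_flag(flag, timeout_s=1.0):
    """Receiver: busy-poll the completion flag with a watchdog timeout."""
    deadline = time.monotonic() + timeout_s
    while flag["done"] == 0:
        if time.monotonic() > deadline:
            raise TimeoutError("timed out waiting for completion flag")
        time.sleep(0.001)
    return "completed"

def signal_flag(flag, atomics_supported):
    """Sender: the remote atomic write that sets the flag.

    If the fabric reports NS for atomics, the write never lands and
    the receiver's poll spins until the watchdog fires.
    """
    if atomics_supported:
        flag["done"] = 1

flag = {"done": 0}
t = threading.Thread(target=signal_flag, args=(flag, True))
t.start()
print(wait_for_flag(flag))  # completed
t.join()
```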
When atomics don't matter¶
Many workloads don't need atomics at all:
- NCCL collectives (allreduce, allgather, all-to-all) use SM-resident scheduling and don't need P2P atomics
- Tensor parallelism within a single TP group uses NCCL allreduce
- Pipeline parallelism uses NCCL P2P sends/recvs
- Most attention kernels are single-GPU (KV cache lives on one rank)
If your inference stack uses only NCCL and not NVSHMEM-style one-sided ops, atomics being NS is invisible.
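One way to audit a stack is to tag each communication pattern with its atomics dependency. The mapping below simply restates the lists above; the pattern names are illustrative labels, not identifiers from any library:

```python
# Which patterns from this page depend on P2P atomics.
NEEDS_P2P_ATOMICS = {
    "flashinfer-one-shot-a2a": True,   # busy-polls atomic completion flags
    "nvshmem-signal-wait": True,       # same pattern, generalized
    "nccl-allreduce": False,           # plain NCCL collectives
    "nccl-allgather": False,
    "nccl-p2p-send-recv": False,       # pipeline-parallel sends/recvs
}

def stack_safe_without_atomics(patterns):
    """True if none of the stack's patterns require P2P atomics."""
    return not any(NEEDS_P2P_ATOMICS.get(p, False) for p in patterns)

print(stack_safe_without_atomics(["nccl-allreduce", "nccl-p2p-send-recv"]))
```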
A pragmatic decision tree¶
```mermaid
graph TD
    Start[Need to run MoE on workstation Blackwell]
    Start --> CanBIOS{Can you change BIOS<br/>ACS to Disabled?}
    CanBIOS -- Yes --> SetACS[Set ACS Disabled,<br/>set driver RegistryDwords,<br/>verify atomics OK]
    CanBIOS -- No --> Avoid[Avoid atomic-dependent kernels:<br/>use TP not EP,<br/>NCCL not NVSHMEM,<br/>FlashInfer NCCL fallback not one-shot]
    SetACS --> Test{Atomics<br/>actually OK?}
    Test -- Yes --> EP[Can use EP MoE,<br/>FlashInfer one-shot]
    Test -- No --> Avoid
```
In practice, "can you change BIOS" is often "no" — many workstation motherboards don't expose ACS Disable, or the BIOS ignores it. So the most common workstation Blackwell deployment ends up in the "avoid atomic-dependent kernels" branch.
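The same decision logic as a small helper (function name and return values are illustrative, not an API):

```python
def plan_moe_deployment(can_disable_acs, atomics_verified_ok=False):
    """Mirror the decision tree: pick a kernel strategy."""
    if not can_disable_acs:
        # No BIOS access: TP not EP, NCCL not NVSHMEM, NCCL fallback.
        return "avoid-atomic-kernels"
    if atomics_verified_ok:
        # Both settings applied and verified: EP MoE + one-shot a2a.
        return "ep-one-shot"
    # Settings applied but atomics still NS: same fallback branch.
    return "avoid-atomic-kernels"

print(plan_moe_deployment(False))        # avoid-atomic-kernels
print(plan_moe_deployment(True, True))   # ep-one-shot
print(plan_moe_deployment(True, False))  # avoid-atomic-kernels
```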
Why this gate exists at all¶
NVIDIA's product segmentation: PCIe atomics is a "datacenter feature" that's part of what you pay for in a datacenter SKU. Consumer cards are deliberately limited to encourage commercial deployments to use datacenter parts.
The technical implementation is RMDisableFeatureDisablement: a runtime knob that flips the segmentation. It exists because validation infrastructure shares code between datacenter and consumer products; flipping the knob enables the datacenter code paths.
NVIDIA does not officially "support" this knob on consumer cards, but it's used widely in research and small-scale deployments. Some specific atomic patterns work; others have edge-case bugs because they were never tested on consumer hardware.
Summary¶
| Step | Effect |
|---|---|
| `nvidia-smi topo -p2p a` shows `NS` | atomics blocked; MoE one-shot a2a will fail |
| BIOS: ACS Enabled | one of two reasons atomics are blocked |
| BIOS: ACS Disabled | first half of the unlock |
| Driver default | second of two reasons atomics are blocked |
| Driver `RMDisableFeatureDisablement=1` | second half of the unlock |
| Both applied → `nvidia-smi topo -p2p a` shows `OK` | atomics enabled; MoE one-shot a2a should work |
| Hardware doesn't physically support AtomicOp | still `NS`; no software fix; use NCCL fallback |
See also¶
- `nvlink-vs-pcie` — the bandwidth context
- `moe-parallelism` — what to do when atomics aren't available
- `kernels/flashinfer` — the kernel that exposes this most
- PCIe Specification 4.0/5.0, "AtomicOp" extension
- NVIDIA developer forum threads on `RMDisableFeatureDisablement`