DCQCN — Quantized Congestion Control for RDMA
DCQCN is the closed-loop congestion control that keeps a RoCEv2 AI fabric out of PFC pause and away from packet loss simultaneously. The switch marks ECN early; the receiver echoes a CNP; the sender quantizes its rate. OcNOS-DC ships pre-tuned defaults for NCCL-class workloads — and exposes every threshold for fabrics that need to deviate.
The DCQCN Closed Loop
Sender NIC, congested switch, receiver NIC. The switch's WRED ECN marker fires before the queue hits the PFC pause threshold. The receiver generates a Congestion Notification Packet (CNP); the sender's reaction point reduces rate, then ramps back. Lossless, no PFC pressure, fast convergence.
The job DCQCN does in an AI fabric
RoCEv2 has two ways to handle congestion: PFC pause (back-pressure that propagates hop-by-hop) and DCQCN (an end-to-end rate-control loop). PFC alone works, but it pushes congestion upstream and risks pause storms and head-of-line blocking. DCQCN works ahead of PFC — marking packets with ECN before the queue reaches the pause threshold, so the sender slows down before the switch ever has to assert pause.
Tuned well, the fabric runs almost entirely on DCQCN feedback, with PFC as a safety backstop. Tuned badly, ECN thresholds are misaligned with PFC headroom and pause storms occur even with DCQCN configured. Threshold tuning is the whole game: OcNOS-DC publishes defaults that have been validated against NCCL workloads, while exposing every knob for fabrics with specific traffic patterns.
The three actors
- Reaction Point (sender NIC). Receives CNPs and runs the DCQCN α-update / multiplicative-decrease / additive-increase loop to quantize its sending rate.
- Congestion Point (switch). Marks ECN-capable packets to CE using a WRED curve when the queue depth crosses K-min, with marking probability rising linearly to P-max at K-max.
- Notification Point (receiver NIC). Generates a CNP back to the sender on each marked flow, rate-limited (typically one per 50 µs per flow).
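To make the reaction-point loop concrete, here is a minimal Python sketch of the α-update, multiplicative-decrease, and recovery steps as they are usually described for DCQCN. The class name, the gain `g = 1/256`, the additive-increase step, and the five fast-recovery stages are illustrative assumptions, not NIC firmware or OcNOS defaults, and the hyper-increase phase and byte counter are omitted.

```python
from dataclasses import dataclass


@dataclass
class ReactionPoint:
    """Simplified DCQCN sender state (hypothetical model, rates in Gb/s)."""
    line_rate: float
    g: float = 1 / 256            # alpha gain (illustrative value)
    rate_ai: float = 0.04         # additive-increase step (illustrative value)
    fast_recovery_stages: int = 5
    alpha: float = 1.0
    current_rate: float = 0.0
    target_rate: float = 0.0
    stage: int = 0

    def __post_init__(self):
        # A new flow starts at line rate.
        self.current_rate = self.line_rate
        self.target_rate = self.line_rate

    def on_cnp(self):
        """Multiplicative decrease when a CNP arrives."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.stage = 0

    def on_alpha_timer(self):
        """No CNP seen during the alpha timer: decay the congestion estimate."""
        self.alpha *= 1 - self.g

    def on_increase_timer(self):
        """Fast recovery toward the target, then additive increase."""
        self.stage += 1
        if self.stage > self.fast_recovery_stages:
            self.target_rate = min(self.target_rate + self.rate_ai, self.line_rate)
        self.current_rate = (self.target_rate + self.current_rate) / 2


rp = ReactionPoint(line_rate=100.0)
rp.on_cnp()             # first CNP with alpha = 1 halves the rate
rp.on_increase_timer()  # first recovery step moves halfway back to the target
```

In this model the first CNP halves the rate because α starts at 1; as α decays during quiet periods, later CNPs cut the rate less.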
The OcNOS DCQCN implementation
K-min, K-max, P-max
Per-priority-queue WRED ECN marking with configurable K-min and K-max thresholds and a P-max marking probability. NCCL-class defaults out of the box; exposed as YANG paths for tuning.
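The marking curve defined by these three parameters can be sketched in a few lines of Python. The function name and the kilobyte units are assumptions made for illustration, as is the behavior above K-max (every packet marked), which is the conventional RED/DCQCN model rather than something stated here.

```python
def ecn_mark_probability(queue_depth_kb, k_min_kb, k_max_kb, p_max):
    """Probability that an ECN-capable packet is marked CE at this queue depth."""
    if not 0.0 <= p_max <= 1.0:
        raise ValueError("p_max must be a probability in [0, 1]")
    if not 0 <= k_min_kb < k_max_kb:
        raise ValueError("require 0 <= K-min < K-max")
    if queue_depth_kb <= k_min_kb:
        return 0.0                      # below K-min: never mark
    if queue_depth_kb > k_max_kb:
        return 1.0                      # above K-max: mark everything (assumed)
    # Between K-min and K-max: linear ramp from 0 up to P-max.
    return p_max * (queue_depth_kb - k_min_kb) / (k_max_kb - k_min_kb)
```

With hypothetical values K-min = 100 KB, K-max = 400 KB, and P-max = 0.2, a queue at 250 KB sits halfway up the ramp and is marked with probability 0.1.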
Independent of PFC
ECN marking is configured independently from PFC pause thresholds. Misalignment between the two is the most common DCQCN configuration error, so OcNOS validates the relationship between K-max, headroom, and the PFC pause threshold before applying the configuration.
ECN over VXLAN
ECN bits are preserved through VXLAN encap/decap so DCQCN works end-to-end across an EVPN-VXLAN overlay — not just on the underlay.
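What "preserved" means for a tunnel can be illustrated with the normal-mode rules of RFC 6040: copy the inner ECN field to the outer header on encapsulation, and fold an outer CE mark back into the inner header on decapsulation. The sketch below assumes that behavior. Whether OcNOS follows RFC 6040 in every corner case is not stated here, and the function names are hypothetical.

```python
# ECN codepoints from RFC 3168 (two low-order bits of the IP TOS / Traffic Class byte).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def encap_outer_ecn(inner):
    """Normal-mode encapsulation: the outer header copies the inner ECN field."""
    return inner


def decap_inner_ecn(inner, outer):
    """ECN field of the forwarded inner packet, or None if it must be dropped."""
    if outer == CE:
        # Congestion was marked on the underlay. A Not-ECT inner packet cannot
        # carry the mark, so it is dropped rather than forwarded unmarked.
        return None if inner == NOT_ECT else CE
    if inner in (NOT_ECT, CE):
        return inner
    # Inner is ECT(0) or ECT(1): an outer ECT(1) takes precedence.
    return ECT1 if outer == ECT1 else inner


# A RoCEv2 packet sent as ECT(0), CE-marked by an underlay switch mid-tunnel:
outer = encap_outer_ecn(ECT0)
delivered = decap_inner_ecn(ECT0, CE)   # the receiver NIC sees CE and emits a CNP
```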
Per-queue ECN counters
gNMI-streamed counters for ECN-marked packets per egress queue, queue depth distribution, and CNP-trigger rates. Closed-loop tuning during cluster bring-up.
Verify before you commit
CLI sanity-check that K-min / K-max / PFC headroom are mathematically consistent with the buffer space allocated to the lossless priority. Fail fast on a misconfig.
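The kind of consistency check described here can be approximated offline. The sketch below is not the OcNOS CLI check; it assumes a simple model in which ECN must start and saturate below the PFC pause (XOFF) threshold, and the XOFF threshold plus headroom must fit in the buffer reserved for the lossless priority. All parameter names are hypothetical.

```python
def check_thresholds(k_min_kb, k_max_kb, xoff_kb, headroom_kb, buffer_kb):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if not 0 < k_min_kb < k_max_kb:
        problems.append("K-min must be positive and below K-max")
    if k_max_kb >= xoff_kb:
        problems.append("K-max must sit below the PFC XOFF threshold, "
                        "otherwise pause can assert before ECN saturates")
    if xoff_kb + headroom_kb > buffer_kb:
        problems.append("XOFF threshold plus headroom exceeds the buffer "
                        "allocated to the lossless priority")
    return problems


# Hypothetical values in KB: consistent, so no problems are reported.
ok = check_thresholds(k_min_kb=100, k_max_kb=400, xoff_kb=600,
                      headroom_kb=300, buffer_kb=1024)

# K-max above XOFF: PFC would fire before the ECN ramp completes.
bad = check_thresholds(k_min_kb=100, k_max_kb=700, xoff_kb=600,
                       headroom_kb=300, buffer_kb=1024)
```

Returning every violation at once, rather than stopping at the first, makes it easier to fix a misconfigured profile in a single pass.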
DC-PLUS license tier
Part of the OcNOS-DC PLUS SKU. Same image, same support; no per-feature add-on required to activate the lossless RDMA stack.
Why this matters more than it sounds
Most "RoCEv2 isn't behaving" support cases land on DCQCN threshold misalignment. Either ECN is configured but never marks (K-min too high) and PFC carries the whole congestion-control burden, or ECN marks too aggressively (K-min too low) and senders cut rate before there's any real congestion. OcNOS-DC ships defaults that work on most TH4 / TH5 fabrics; for fabrics that need to deviate, every parameter is YANG-modeled and verifiable.
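The two failure modes above leave different fingerprints in per-queue counters. The following Python heuristic is a rough triage sketch only: the counter names are hypothetical stand-ins for whatever telemetry is collected over an interval, and the 50% marking ratio is an arbitrary illustrative cutoff, not a validated threshold.

```python
def classify_queue(tx_packets, ecn_marked, pfc_pause_frames,
                   aggressive_mark_ratio=0.5):
    """Rough triage of one lossless queue from counter deltas over an interval."""
    if tx_packets <= 0:
        return "no-traffic"
    mark_ratio = ecn_marked / tx_packets
    if ecn_marked == 0 and pfc_pause_frames > 0:
        # PFC is doing all the work: K-min is probably too high.
        return "ecn-never-marks"
    if mark_ratio >= aggressive_mark_ratio and pfc_pause_frames == 0:
        # Most packets are marked yet pause never asserts: K-min may be too low.
        return "ecn-possibly-too-aggressive"
    return "no-obvious-misalignment"
```

A result from this function is a prompt to inspect the queue-depth distribution and the configured thresholds, not a diagnosis on its own.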