RoCEv2 — Lossless Ethernet for AI Fabrics

RDMA over Converged Ethernet v2 is what carries GPU collective traffic across modern AI fabrics. OcNOS implements the full RoCEv2 toolkit — PFC, ECN/DCQCN, adaptive load-balancing, and per-priority telemetry — on validated 400G and 800G open hardware.

AI Fabric Rail Topology

A compact rail slice — two spines and two leaves carrying RoCEv2 between four GPUs. PFC pause frames travel hop-by-hop on congestion, while ECN marks elephant flows for DCQCN reaction at the source.

[Figure: RoCEv2 leaf-spine AI fabric with PFC and lossless RDMA. A two-spine, two-leaf AI fabric carries lossless RoCEv2 RDMA traffic between four GPU servers (GPU-0 to GPU-3). Spine-1 and Spine-2 apply ECN with WRED; Leaf-1 and Leaf-2 apply PFC and DCQCN. PFC pause arrows (CoS 3) show priority-based flow control protecting the queues that carry RDMA traffic.]

Why RoCEv2 matters for AI/ML fabrics

GPU collectives (all-reduce, all-gather, all-to-all) generate elephant flows that can saturate individual fabric paths and demand near-zero loss to keep training jobs efficient. RDMA NICs commonly recover from loss with go-back-N, so a single dropped packet on a 400G RoCEv2 link forces the NIC to retransmit everything sent after it, and the collective stalls with GPUs idle until recovery completes. A lossless RoCEv2 fabric avoids those drops on a leaf-spine topology with three mechanisms: PFC (Priority Flow Control), ECN (Explicit Congestion Notification), and DCQCN (Data Center Quantized Congestion Notification).

The OcNOS RoCEv2 implementation

PFC

Per-priority pause

802.1Qbb PFC on configurable priority queues, paired with watchdog timers to detect deadlock conditions and auto-recover before they propagate.
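To make the watchdog idea concrete, here is a minimal Python sketch of a per-queue PFC watchdog state machine. It is a generic illustration, not the OcNOS implementation: the class name, the timer values, and the action strings are all assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class PfcWatchdog:
    """Toy per-queue PFC watchdog. All thresholds are illustrative assumptions."""
    detect_ms: int = 200    # continuous stuck-pause time before declaring a storm
    restore_ms: int = 400   # pause-free time before returning to lossless mode
    paused_for_ms: int = 0
    clear_for_ms: int = 0
    storm: bool = False

    def tick(self, paused: bool, tx_progress: bool, dt_ms: int) -> str:
        """Advance the watchdog by dt_ms and return the action for this tick."""
        if not self.storm:
            # A queue counts as stuck only if it is paused AND sent nothing.
            if paused and not tx_progress:
                self.paused_for_ms += dt_ms
            else:
                self.paused_for_ms = 0
            if self.paused_for_ms >= self.detect_ms:
                self.storm = True
                self.clear_for_ms = 0
                return "mitigate"  # e.g. stop honoring pause on this queue
            return "ok"
        # Storm state: wait for pause frames to stop before restoring.
        self.clear_for_ms = 0 if paused else self.clear_for_ms + dt_ms
        if self.clear_for_ms >= self.restore_ms:
            self.storm = False
            self.paused_for_ms = 0
            return "restore"
        return "mitigating"
```

Calling `tick(True, False, 100)` twice on a fresh watchdog trips the 200 ms detection timer on the second call; the queue returns to lossless mode only after 400 ms without pause frames.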

ECN + DCQCN

Adaptive marking

WRED-based ECN marking on a per-queue basis with DCQCN reaction-point feedback. Tuned defaults for NCCL-class workloads; parametric override for custom RDMA stacks.
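The two halves of this loop can be sketched in a few lines of Python. The marking curve is the usual WRED shape (no marking below a minimum threshold, a linear ramp up to a maximum probability, then always mark), and the sender update follows the rate-decrease and alpha rules from the published DCQCN algorithm. Threshold values, units, and all names here are assumptions for illustration, not OcNOS defaults.

```python
from dataclasses import dataclass


def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """WRED-style ECN marking probability for the current queue depth."""
    if queue_kb <= kmin_kb:
        return 0.0                       # below Kmin: never mark
    if queue_kb >= kmax_kb:
        return 1.0                       # above Kmax: mark every packet
    # Linear ramp from 0 at Kmin to pmax at Kmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


@dataclass
class DcqcnSender:
    """Simplified DCQCN reaction point (timer and byte-counter stages omitted)."""
    rate_gbps: float
    target_gbps: float
    alpha: float = 1.0
    g: float = 1 / 256                   # alpha gain; illustrative value

    def on_cnp(self) -> None:
        """Congestion notification received: remember the rate, then cut it."""
        self.target_gbps = self.rate_gbps
        self.rate_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP during the alpha timer window: decay the congestion estimate."""
        self.alpha *= 1 - self.g

    def fast_recovery(self) -> None:
        """Recover halfway toward the rate held before the last cut."""
        self.rate_gbps = (self.rate_gbps + self.target_gbps) / 2
```

In this sketch, a sender at 400 Gb/s with `alpha = 1.0` drops to 200 Gb/s on its first CNP, and one fast-recovery step brings it back to 300 Gb/s.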

Load Balancing

Adaptive flowlet

Dynamic Load Balancing (DLB) re-bins flowlets onto less-loaded links at sub-millisecond intervals when a link saturates. This avoids the static hash collisions that hurt symmetric topologies.
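A flowlet-based balancer can be sketched as follows. This is a generic illustration of the technique rather than the ASIC's DLB logic: a flow keeps its link while packets arrive close together, and may move to the least-loaded link once the inter-packet gap exceeds a threshold, so packets are not reordered within a burst. The class name, the gap value, and the byte-count load metric are assumptions, and the load counters never decay in this toy version.

```python
class FlowletBalancer:
    """Toy flowlet load balancer over n_links equal-cost uplinks."""

    def __init__(self, n_links: int, gap_us: int) -> None:
        self.n_links = n_links
        self.gap_us = gap_us                 # idle gap that ends a flowlet
        self.load = [0] * n_links            # bytes sent per link
        self.table = {}                      # flow -> (link, last_seen_us)

    def pick(self, flow: str, now_us: int, size: int) -> int:
        """Return the uplink for this packet and update the flowlet table."""
        entry = self.table.get(flow)
        if entry is None or now_us - entry[1] > self.gap_us:
            # New flowlet: safe to re-bin onto the least-loaded link.
            link = min(range(self.n_links), key=self.load.__getitem__)
        else:
            # Same flowlet: stay on the current link to preserve ordering.
            link = entry[0]
        self.table[flow] = (link, now_us)
        self.load[link] += size
        return link
```

With two links and a 50 µs gap, a flow that pauses for 90 µs starts a new flowlet and can move to the other link, while packets 10 µs apart stay where they are.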

Telemetry

Per-priority queue stats

gNMI streaming sensors for queue depth, PFC pause counters, ECN-marked packets, and microburst detection — exported at 1-second granularity.
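On the collector side, cumulative counters like these are usually turned into per-interval rates. The Python sketch below shows one way to do that for two consecutive samples of a single priority queue. The field names are hypothetical and do not correspond to actual OcNOS sensor paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    """One telemetry sample for a priority queue (hypothetical field names)."""
    ts_s: float          # sample timestamp in seconds
    pfc_pause_rx: int    # cumulative PFC pause frames received
    ecn_marked: int      # cumulative ECN-marked packets
    tx_packets: int      # cumulative transmitted packets


def summarize(prev: QueueSample, cur: QueueSample) -> dict:
    """Convert two cumulative samples into per-interval health indicators."""
    dt = cur.ts_s - prev.ts_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    tx = cur.tx_packets - prev.tx_packets
    marked = cur.ecn_marked - prev.ecn_marked
    return {
        "pause_per_s": (cur.pfc_pause_rx - prev.pfc_pause_rx) / dt,
        # Share of transmitted packets that carried an ECN mark.
        "ecn_mark_ratio": marked / tx if tx else 0.0,
    }
```

A rising `pause_per_s` alongside a low `ecn_mark_ratio` would suggest that PFC is doing work ECN should have done earlier, which is a common prompt to revisit marking thresholds.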

Topology

Rail-optimized fabrics

Validated for rail-aligned and scheduled-fabric topologies. Recipes for 256–4,096 GPU clusters using off-the-shelf 400G and 800G open switches.

Diagnostics

Lossless verification

CLI diagnostics to verify a known-good lossless config end-to-end: PFC headroom math, ECN threshold sanity, and a synthetic incast test.
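The headroom math can be approximated with a short calculation. The Python sketch below uses a common rule of thumb, not the exact formula the OcNOS diagnostics apply: after a port decides to send a pause, it must still absorb the data in flight over one cable round trip, whatever arrives during the peer's response time, and one maximum-size frame at each end that was already being transmitted. The function name, the 5 ns/m propagation figure, and the default response time are assumptions.

```python
import math


def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu_bytes: int,
                       response_ns: float = 0.0, ns_per_m: float = 5.0) -> int:
    """Approximate per-priority PFC headroom for one port, in bytes.

    Assumes about 5 ns/m propagation delay; response_ns is the peer's
    pause reaction time and is platform-specific.
    """
    bytes_per_ns = link_gbps / 8.0             # 1 Gb/s is 0.125 bytes/ns
    round_trip_ns = 2 * cable_m * ns_per_m     # pause out, in-flight data back
    in_flight = (round_trip_ns + response_ns) * bytes_per_ns
    # One MTU may be mid-transmission on each side when the pause is issued.
    return math.ceil(in_flight + 2 * mtu_bytes)
```

For a 400G link over 100 m of cable with a 4096-byte MTU and zero response time, this gives 50,000 bytes in flight plus 8,192 bytes of frames in progress, or 58,192 bytes of headroom per lossless priority.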

What you get with OcNOS

  • Open hardware choice. Run RoCEv2 on UfiSpace, Edgecore, Wedge, or Celestica platforms with the same NOS image — no vendor lock-in for the fabric layer.
  • Day-one feature parity. Adaptive LB, DCQCN tuning, and ASIC-native telemetry are not paid add-ons — they're part of the base OcNOS-DC license.
  • Reference designs. Validated configs for popular AI fabric topologies; we publish the configs and the test results.
  • Engineering access. Premium support tier includes direct dialog with the OcNOS RoCEv2 team during fabric bring-up.

Building or scaling an AI fabric? Talk to a network architect.
