Dynamic Load Balancing — Adaptive Routing for AI Fabrics
Static-hash ECMP was built for north-south web traffic, not GPU collectives. OcNOS Dynamic Load Balancing (DLB) re-bins flowlets onto less-congested paths in sub-millisecond intervals — closing the gap between Ethernet and InfiniBand for distributed training workloads.
Adaptive Routing on a Leaf-Spine Fabric
A 4-spine, 2-leaf slice carrying GPU AllReduce traffic. DLB measures local egress queue depth in real time. When Spine-3 saturates, the leaf re-bins the next flowlet onto Spine-2 — keeping all four uplinks balanced.
Why static ECMP fails on AI fabrics
Standard ECMP picks an egress port by hashing the 5-tuple at flow start and pinning the flow there for its entire lifetime. On north-south web traffic — millions of short-lived flows — the law of large numbers smooths utilisation across paths. On an AI fabric, you have a small number of elephant flows from GPU collectives (AllReduce, AllGather, All-to-All) that each consume an entire 400G or 800G uplink for seconds at a time. Two elephants hashed onto the same uplink will collide for the duration of the operation, while another uplink sits idle.
The result is persistent hash collisions and uneven link load, often loosely called hash polarisation: measured fabric utilisation around 50–60% with random hot-spots, and tail-latency outliers that stall the entire training job. DLB closes this gap by re-evaluating the path decision on every flowlet, a burst of packets within one flow that is separated from the previous burst by an idle gap longer than a configured threshold, using live egress queue-depth and port-utilisation telemetry from the ASIC.
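The collision behaviour of static ECMP described above can be sketched in a few lines of Python. This is an illustration only, not the ASIC's hash: the `FiveTuple` type, the SHA-256 based `ecmp_pick` function, and the sample addresses are all assumptions made for the example. UDP destination port 4791 is the standard RoCEv2 port.

```python
import hashlib
from collections import Counter
from typing import List, NamedTuple, Sequence


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def ecmp_pick(flow: FiveTuple, n_uplinks: int) -> int:
    """Static ECMP: hash the 5-tuple once; the flow stays on that uplink."""
    key = "|".join(str(part) for part in flow).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for the hardware hash
    return int.from_bytes(digest[:4], "big") % n_uplinks


def uplink_load(flows: Sequence[FiveTuple], n_uplinks: int) -> List[int]:
    """Number of flows pinned to each uplink index."""
    counts = Counter(ecmp_pick(f, n_uplinks) for f in flows)
    return [counts.get(i, 0) for i in range(n_uplinks)]


def collision_probability(n_flows: int, n_uplinks: int) -> float:
    """Chance that at least two flows share an uplink under a uniform hash."""
    if n_flows > n_uplinks:
        return 1.0
    p_all_distinct = 1.0
    for i in range(n_flows):
        p_all_distinct *= (n_uplinks - i) / n_uplinks
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    # Four hypothetical RoCEv2 elephant flows (UDP, destination port 4791).
    elephants = [
        FiveTuple("10.0.0.%d" % i, "10.0.1.%d" % i, 49152 + i, 4791, 17)
        for i in range(1, 5)
    ]
    print("flows per uplink:", uplink_load(elephants, 4))
    print("P(collision), 4 flows on 4 uplinks:", collision_probability(4, 4))
```

Even with an ideal uniform hash, four elephants spread over four uplinks share at least one link with probability 1 - 4!/4^4, roughly 0.906, which is why a handful of long-lived flows rarely balances on its own.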
The OcNOS DLB implementation
Sub-millisecond gap timer
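An ASIC-native flowlet inactivity timer (typically 16–256 µs) splits long elephant flows into flowlets that can be moved between paths without reordering TCP or RoCEv2 packets, provided the timer exceeds the worst-case delay difference between those paths.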
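As a rough software model of the gap timer, the sketch below marks a packet as the start of a new flowlet when the idle time since the previous packet of the same flow exceeds the timer. The `FlowletTracker` class, its field names, and the 64 µs default are hypothetical; the real timer runs in the ASIC pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable


@dataclass
class FlowletTracker:
    # Assumed inactivity timer, inside the 16-256 µs range quoted above.
    gap_us: float = 64.0
    _last_seen_us: Dict[Hashable, float] = field(default_factory=dict)

    def is_new_flowlet(self, flow_id: Hashable, now_us: float) -> bool:
        """True if this packet starts a new flowlet for flow_id."""
        last = self._last_seen_us.get(flow_id)
        self._last_seen_us[flow_id] = now_us
        # First packet of the flow, or idle longer than the gap timer.
        return last is None or (now_us - last) > self.gap_us


if __name__ == "__main__":
    tracker = FlowletTracker(gap_us=64.0)
    for t in (0.0, 10.0, 20.0, 200.0, 210.0):
        print(t, tracker.is_new_flowlet("flow-A", t))
```

Only the packets at 0 µs and 200 µs open a new flowlet here; the others arrive within 64 µs of their predecessor and stay on the current path. The design point is that a path change is only allowed where the idle gap gives earlier packets time to drain.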
Live queue-depth feedback
DLB consumes per-egress-port queue-occupancy and link-utilisation signals from the Tomahawk pipeline to score every ECMP next-hop in real time.
Adaptive next-hop selection
At each flowlet boundary, the leaf selects the highest-quality ECMP member. Member quality is recomputed every few microseconds, so a saturated spine drops out of the candidate set within one flowlet.
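The selection step can be sketched as follows, assuming a simple quality score that blends queue depth and link utilisation with equal weights. The `Member`, `quality`, and `DlbSelector` names, the 0 to 1 normalisation, and the weights are all assumptions for illustration; the actual scoring in the ASIC is not described here.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple


@dataclass
class Member:
    name: str
    queue_depth: float  # assumed normalised: 0.0 empty, 1.0 full
    utilisation: float  # assumed normalised: 0.0 idle, 1.0 line rate


def quality(m: Member, w_queue: float = 0.5, w_util: float = 0.5) -> float:
    """Higher is better; weights are illustrative only."""
    return 1.0 - (w_queue * m.queue_depth + w_util * m.utilisation)


class DlbSelector:
    def __init__(self, members: List[Member], gap_us: float = 64.0) -> None:
        self.members = members
        self.gap_us = gap_us
        # flow_id -> (time of last packet in µs, current member name)
        self._state: Dict[Hashable, Tuple[float, str]] = {}

    def route(self, flow_id: Hashable, now_us: float) -> str:
        state = self._state.get(flow_id)
        if state is not None and (now_us - state[0]) <= self.gap_us:
            chosen = state[1]  # mid-flowlet: keep the path to preserve order
        else:
            # Flowlet boundary: pick the best-scoring member right now.
            chosen = max(self.members, key=quality).name
        self._state[flow_id] = (now_us, chosen)
        return chosen


if __name__ == "__main__":
    spines = [
        Member("Spine-1", 0.4, 0.4),
        Member("Spine-2", 0.3, 0.3),
        Member("Spine-3", 0.1, 0.1),
        Member("Spine-4", 0.5, 0.5),
    ]
    dlb = DlbSelector(spines)
    print(dlb.route("allreduce-0", 0.0))    # lightest member at flow start
    spines[2].queue_depth, spines[2].utilisation = 0.95, 1.0  # Spine-3 saturates
    print(dlb.route("allreduce-0", 10.0))   # still inside the same flowlet
    print(dlb.route("allreduce-0", 200.0))  # gap elapsed: re-evaluated
```

In this run the flow starts on Spine-3, stays there while the flowlet is still active, and moves to Spine-2 at the next flowlet boundary once Spine-3 has saturated.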
Co-tuned with PFC & ECN
DLB integrates with the RoCEv2 lossless stack (PFC, ECN/DCQCN, and buffer headroom sizing) so that flowlet rebinding happens before pause frames propagate upstream.
gNMI export
Per-member rebind counts, flowlet-gap distributions, and member quality scores stream over gNMI dial-out for closed-loop fabric tuning.
TH4 / TH5 native
Validated on Broadcom Tomahawk 4 (25.6T) and Tomahawk 5 (51.2T) spine platforms — 64×400G and 64×800G port configurations — with no software fast-path penalty.
What DLB delivers in production AI fabrics
- Higher utilisation. Field measurements move fabric utilisation from ~55% on static ECMP toward 90%+ on the same hardware — without buying more uplinks.
- Lower tail latency. P99.9 collective completion time tightens because no single link saturates while others sit idle.
- Faster training. Less GPU idle time waiting on the slowest rank means measurable wall-clock improvement on AllReduce-heavy workloads.
- No NIC changes. DLB lives in the switch ASIC. Existing RoCEv2 NICs and NCCL stacks see correct in-order delivery without code changes.
- One license. DLB is part of the OcNOS-DC PLUS SKU — same image, same support contract, no per-feature add-on.