Dynamic Load Balancing — Adaptive Routing for AI Fabrics

Static-hash ECMP was built for north-south web traffic, not GPU collectives. OcNOS Dynamic Load Balancing (DLB) re-binds flowlets onto less-congested paths at sub-millisecond intervals, closing the gap between Ethernet and InfiniBand for distributed training workloads.

Adaptive Routing on a Leaf-Spine Fabric

A 4-spine, 2-leaf slice carrying GPU AllReduce traffic. DLB measures local egress queue depth in real time. When Spine-3 saturates, the leaf re-binds the next flowlet onto Spine-2, keeping load spread across all four uplinks.

[Figure: Dynamic Load Balancing across an AI leaf-spine fabric. GPU servers (GPU-0 to GPU-3) attached to Leaf-1 and Leaf-2 send AllReduce flows across four spines. Queue depth: Spine-1 18%, Spine-2 22%, Spine-3 92% (congested, shown in red), Spine-4 25%. Because Spine-3 is congested, the next flowlet is re-bound to Spine-2. DLB metrics shown: queue depth, port utilisation, flowlet rebind.]

Why static ECMP fails on AI fabrics

Standard ECMP picks an egress port by hashing the 5-tuple at flow start and pinning the flow there for its entire lifetime. On north-south web traffic — millions of short-lived flows — the law of large numbers smooths utilisation across paths. On an AI fabric, you have a small number of elephant flows from GPU collectives (AllReduce, AllGather, All-to-All) that each consume an entire 400G or 800G uplink for seconds at a time. Two elephants hashed onto the same uplink will collide for the duration of the operation, while another uplink sits idle.
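To put a number on that collision risk, here is a minimal sketch that assumes an idealised hash placing each flow uniformly and independently on one of the uplinks. The function name and the four-uplink example are illustrative, not taken from any OcNOS tooling.

```python
from math import perm

def collision_probability(flows: int, uplinks: int) -> float:
    """Probability that at least two flows land on the same uplink,
    assuming a uniform, independent hash (an idealisation)."""
    if flows > uplinks:
        return 1.0  # pigeonhole: some uplink must carry two flows
    # perm(uplinks, flows) counts the collision-free placements
    return 1.0 - perm(uplinks, flows) / uplinks ** flows

for flows in (2, 3, 4):
    print(flows, "elephants on 4 uplinks:",
          round(collision_probability(flows, 4), 3))
```

Under these assumptions, four elephant flows on four uplinks collide about nine times out of ten, and with so few flows there is no large population to average the imbalance away.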

The result is persistent hash collisions: measured fabric utilisation of around 50–60% with random hot-spots, and tail-latency outliers that stall the entire training job. DLB closes this gap by re-evaluating the path decision for every flowlet, a burst of packets within a flow that is separated from the next burst by an idle gap longer than a configured threshold, using live egress queue-depth and port-utilisation telemetry from the ASIC.

The OcNOS DLB implementation

Flowlet Detection

Sub-millisecond gap timer

An ASIC-native flowlet inactivity timer (typically 16–256 µs) splits long elephant flows into chunks that can be sprayed across paths without reordering TCP or RoCEv2 packets.
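The gap-timer idea can be sketched in a few lines. This is a software illustration only, not the ASIC logic: the function name, the 64 µs threshold, and the arrival times are assumptions chosen for the example.

```python
def split_into_flowlets(timestamps_us, gap_us=64.0):
    """Group packet arrival times (microseconds, ascending) into flowlets.
    A new flowlet starts whenever the idle gap since the previous
    packet exceeds gap_us."""
    flowlets = []
    current = []
    for ts in timestamps_us:
        if current and ts - current[-1] > gap_us:
            flowlets.append(current)  # idle gap seen: close the flowlet
            current = []
        current.append(ts)
    if current:
        flowlets.append(current)
    return flowlets

# Hypothetical arrival times for one elephant flow
arrivals = [0, 5, 10, 200, 204, 209, 500]
print(split_into_flowlets(arrivals, gap_us=64))
```

Moving a flow only at these boundaries is safe when the gap is longer than the latency difference between the candidate paths, because the last packet of the old flowlet has then arrived before the first packet of the new one can overtake it.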

Path Quality

Live queue-depth feedback

DLB consumes per-egress-port queue-occupancy and link-utilisation signals from the Tomahawk pipeline to score every ECMP next-hop in real time.
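As a rough illustration of scoring, the sketch below blends the two signals into a single quality value per member. The weighted-sum formula and the 0.7 weight are assumptions for illustration, not the scoring function used by the ASIC. The queue figures match the diagram above, and the utilisation figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MemberSignals:
    name: str
    queue_occupancy: float   # fraction of egress queue in use, 0.0 to 1.0
    link_utilisation: float  # fraction of line rate in use, 0.0 to 1.0

def quality(member: MemberSignals, queue_weight: float = 0.7) -> float:
    """Higher is better. The weighted sum is an assumed formula."""
    load = (queue_weight * member.queue_occupancy
            + (1.0 - queue_weight) * member.link_utilisation)
    return 1.0 - load

members = [
    MemberSignals("Spine-1", 0.18, 0.70),  # utilisation values are hypothetical
    MemberSignals("Spine-2", 0.22, 0.50),
    MemberSignals("Spine-3", 0.92, 0.98),
    MemberSignals("Spine-4", 0.25, 0.60),
]
ranked = sorted(members, key=quality, reverse=True)
for m in ranked:
    print(f"{m.name}: quality {quality(m):.3f}")
```

With these inputs Spine-2 ranks first even though Spine-1 has a slightly shallower queue, which shows why a score that also weighs link utilisation can pick a different member than queue depth alone.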

Re-bind

Adaptive next-hop selection

At each flowlet boundary, DLB selects the highest-quality ECMP member. Member quality is recomputed every few microseconds, so a saturated spine drops out of the candidate set within one flowlet.
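The selection rule can be sketched as a small flowlet table. This is a simplified model under stated assumptions: the class and method names are hypothetical, a Python dict stands in for the hardware's fixed-size hashed table, and the quality scores are invented for the example.

```python
class FlowletTable:
    """Keeps a flow pinned to its member inside a flowlet and
    re-selects the best member only after an idle gap."""

    def __init__(self, gap_us: float):
        self.gap_us = gap_us
        self._entries = {}  # flow_key -> (member, last_seen_us)

    def forward(self, flow_key, now_us, quality_by_member):
        entry = self._entries.get(flow_key)
        if entry is not None and now_us - entry[1] <= self.gap_us:
            member = entry[0]  # same flowlet: keep the path, no reordering
        else:
            # flowlet boundary: pick the highest-quality member
            member = max(quality_by_member, key=quality_by_member.get)
        self._entries[flow_key] = (member, now_us)
        return member

table = FlowletTable(gap_us=64)
scores = {"Spine-1": 0.66, "Spine-2": 0.70, "Spine-3": 0.75, "Spine-4": 0.64}
first = table.forward("flow-A", 0, scores)        # new flow: best member
scores["Spine-3"] = 0.06                          # Spine-3 becomes congested
mid_burst = table.forward("flow-A", 10, scores)   # still inside the flowlet
after_gap = table.forward("flow-A", 200, scores)  # idle gap exceeded: re-bind
print(first, mid_burst, after_gap)
```

The flow stays on Spine-3 while packets keep arriving inside the gap threshold, then moves to Spine-2 at the next boundary. The cost of this design is that relief from a congested member is delayed until the flow pauses for longer than the gap timer.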

Lossless

Co-tuned with PFC & ECN

DLB integrates with the RoCEv2 lossless stack — PFC, ECN/DCQCN, headroom math — so flowlet rebinding happens before pause frames propagate upstream.
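For a sense of scale on the headroom side, the sketch below gives a common first-order estimate of the buffer a port must reserve to absorb data that still arrives after it sends a PFC pause. The formula is a simplification, and every input (link rate, cable length, MTU, peer response time, 5 ns/m propagation) is an illustrative assumption, not an OcNOS default.

```python
def pfc_headroom_bytes(rate_gbps: float, cable_m: float, mtu_bytes: int,
                       response_ns: float, ns_per_m: float = 5.0) -> float:
    """First-order PFC headroom estimate in bytes (simplified model)."""
    bytes_per_ns = rate_gbps / 8.0  # 1 Gb/s equals 0.125 bytes per ns
    # Data on the wire during the round trip (pause out, data back)
    in_flight = 2 * cable_m * ns_per_m * bytes_per_ns
    # Data the peer sends while it reacts to the pause frame
    response = response_ns * bytes_per_ns
    # One maximum-size frame already committed at each end
    return in_flight + response + 2 * mtu_bytes

# Hypothetical 400G link over 100 m of fibre
print(pfc_headroom_bytes(rate_gbps=400, cable_m=100,
                         mtu_bytes=4200, response_ns=1000))
```

Under these assumptions the estimate is about 108 KB per lossless priority per port. The relevance to DLB is ordering: a re-bind that drains the queue before it reaches the pause threshold means this headroom is rarely consumed.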

Telemetry

gNMI export

Per-member rebind counts, flowlet-gap distributions, and member quality scores stream over gNMI dial-out for closed-loop fabric tuning.
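On the collector side, one plausible first step is turning rebind counters into rates. The sketch below assumes the counters are cumulative and count flowlets re-bound onto each member. The sample values are hypothetical, and the dictionary keys are not OcNOS gNMI paths.

```python
def rebind_rates(prev: dict, curr: dict, interval_s: float) -> dict:
    """Convert two cumulative counter snapshots into rebinds per second.
    Members missing from the earlier snapshot are treated as starting at 0."""
    return {member: (curr[member] - prev.get(member, 0)) / interval_s
            for member in curr}

# Hypothetical snapshots taken 10 seconds apart
prev = {"Spine-1": 1000, "Spine-2": 1200, "Spine-3": 900, "Spine-4": 1100}
curr = {"Spine-1": 1100, "Spine-2": 1500, "Spine-3": 910, "Spine-4": 1180}
rates = rebind_rates(prev, curr, interval_s=10.0)
print(rates)
```

A member that attracts far fewer rebinds than its peers, like Spine-3 here, is one that the quality score is steering traffic away from.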

Hardware

TH4 / TH5 native

Validated on Broadcom Tomahawk 4 (25.6T) and Tomahawk 5 (51.2T) spine platforms — 64×400G and 64×800G port configurations — with no software fast-path penalty.

What DLB delivers in production AI fabrics

  • Higher utilisation. In field measurements, fabric utilisation rises from ~55% on static ECMP toward 90%+ on the same hardware, without buying more uplinks.
  • Lower tail latency. P99.9 collective completion time tightens because no single link saturates while others sit idle.
  • Faster training. Less GPU idle time waiting on the slowest rank means measurable wall-clock improvement on AllReduce-heavy workloads.
  • No NIC changes. DLB lives in the switch ASIC. Existing RoCEv2 NICs and NCCL stacks see correct in-order delivery without code changes.
  • One license. DLB is part of the OcNOS-DC PLUS SKU — same image, same support contract, no per-feature add-on.
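The utilisation figures in the first point translate into usable bandwidth as follows. This is a back-of-the-envelope sketch assuming a leaf with four 400G uplinks, as in the diagram. The 55% and 90% values are the ones quoted above, and the function name is illustrative.

```python
def usable_gbps(uplinks: int, rate_gbps: float, utilisation: float) -> float:
    """Aggregate uplink bandwidth actually carrying traffic."""
    return uplinks * rate_gbps * utilisation

static_ecmp = usable_gbps(4, 400, 0.55)
with_dlb = usable_gbps(4, 400, 0.90)
print(f"static ECMP: {static_ecmp:.0f} Gb/s")
print(f"with DLB:    {with_dlb:.0f} Gb/s")
print(f"gain:        {with_dlb / static_ecmp:.2f}x")
```

Under these assumptions the same four uplinks carry roughly 1440 Gb/s instead of roughly 880 Gb/s, a gain of about 1.6x.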
