AI Fabric Topologies — Rail-Optimized & Scheduled Designs
The shape of your fabric decides the shape of your training job. This page lays out the three reference topologies OcNOS-DC ships against — rail-only, rail-optimized, and scheduled 3-stage Clos — sized in concrete port-counts on Broadcom Tomahawk 4 and Tomahawk 5 hardware. Pick the design that matches your scale; we'll publish the configs for it.
Choose by GPU count, not by buzzword
An AI fabric topology has one job: keep every GPU's outbound link saturated during a collective without creating tail-latency outliers. The right topology is the smallest one that does this for your GPU count, with a fall-back path for the next size up. Below: three reference designs OcNOS-DC validates today, with concrete port maths.
Rail-only single pod
One rack row, eight rail-aligned ToRs. No spine tier needed: the usual two tiers collapse into a single tier of rail switches.
Rail-optimized leaf-spine
Rail-aligned leaves with shared spine tier. East-west traffic between rails uses the spine; intra-rail traffic stays local.
3-stage Clos scheduled
Leaf, spine, super-spine. Non-blocking 1:1 oversubscription end-to-end. DLB at every tier; GLB end-to-end with OcNOS 7.1.
Scaled scheduled fabric
Multi-pod 3-stage Clos with a super-spine plane. Sized for the trillion-parameter training class.
Rail-Optimized Single Pod — 256–1,024 GPUs
Each GPU server has 8 NICs, one per "rail" (a dedicated NCCL channel). Each rail has its own dedicated leaf, so the 8 NICs on a server each land on a different leaf, and NIC N of every server lands on leaf N. AllReduce across rail N stays inside leaf N, so the dominant collective pattern puts no east-west pressure on the spine.
OcNOS pieces: BGP unnumbered underlay, EVPN-VXLAN overlay, RoCEv2 lossless on every leaf, DLB at the spine tier. Validated on UfiSpace S9600/S9700 and on Edgecore AS9736 (TH5) and AS9716 (TH4) platforms.
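The rail mapping described above can be sketched in a few lines of Python. This is an illustration only, not an OcNOS tool: the function names, the `leaf-N` naming scheme, and the figure of 8 GPUs per server (one GPU per NIC) are assumptions made for the example.

```python
from collections import defaultdict

RAILS = 8  # one NIC per rail on an 8-NIC GPU server


def rail_leaf_for(server: int, nic: int) -> str:
    """Return the leaf a NIC plugs into: NIC n of every server goes to leaf n.

    The server index does not affect the result; that is the point of a
    rail-aligned design.
    """
    if not 0 <= nic < RAILS:
        raise ValueError(f"nic must be in 0..{RAILS - 1}")
    return f"leaf-{nic}"  # hypothetical naming scheme


def build_cabling(num_servers: int) -> dict[str, list[tuple[int, int]]]:
    """Map each leaf name to the (server, nic) pairs cabled to it."""
    plan = defaultdict(list)
    for server in range(num_servers):
        for nic in range(RAILS):
            plan[rail_leaf_for(server, nic)].append((server, nic))
    return dict(plan)


if __name__ == "__main__":
    # Assumption: 8 GPUs per server, so 32 servers is a 256-GPU pod.
    plan = build_cabling(num_servers=32)
    for leaf, ports in sorted(plan.items()):
        print(leaf, "downlinks:", len(ports))
```

With 32 servers, each of the eight leaves ends up with 32 downlinks, one per server, all carrying the same rail. The number of servers a pod can hold is therefore bounded by the downlink ports available on a single leaf.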
Scheduled vs Rail-Aligned — what changes at scale
Rail-optimized stops scaling somewhere between 1k and 2k GPUs: you run out of leaf radix, or the spine tier becomes too oversubscribed. Above that, every modern AI fabric is a 3-stage Clos: leaf, spine, super-spine. The "scheduled" descriptor refers to running cell-based or credit-based scheduling on top of the Clos to drive utilisation toward 1.0, which is exactly what UEC and GLB are designed to do.
3-Stage Clos Scheduled Fabric — 4,096–16,384 GPUs
Three tiers: leaf, spine, super-spine. Any two GPUs are at most four inter-switch hops apart (leaf, spine, super-spine, spine, leaf); GPUs on the same leaf need none. Non-blocking when the radix maths works out. DLB at every hop, GLB across the full path with OcNOS 7.1, UEC packet-spray on UEC-capable NICs.
OcNOS pieces: eBGP unnumbered underlay, EVPN-VXLAN multi-tenant overlay, RoCEv2 lossless, DLB at every tier, GLB end-to-end on the OcNOS 7.1 train, gNMI streaming telemetry to your observability stack. Validated on TH5 64×800G chassis throughout.
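To make "when the radix maths works out" concrete, the sketch below applies the standard folded-Clos (fat-tree) sizing rule: every tier below the top splits its ports evenly between downlinks and uplinks, and the top tier uses all of its ports as downlinks. The function names are hypothetical, and the model assumes identical switches and a single port speed at every tier. Real designs also reserve ports for storage, management, or DCI, so treat the results as upper bounds rather than validated sizes.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoint ports in a non-blocking folded Clos.

    With identical switches of `radix` ports:
      1 tier  -> radix
      2 tiers -> radix**2 / 2
      3 tiers -> radix**3 / 4
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return radix * (radix // 2) ** (tiers - 1)


def worst_case_path(tiers: int) -> tuple[int, int]:
    """Return (switches traversed, inter-switch links) on the longest path.

    The longest path climbs to the top tier and comes back down.
    """
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    switches = 2 * tiers - 1
    return switches, switches - 1


if __name__ == "__main__":
    for tiers in (2, 3):
        print(f"{tiers} tiers, radix 64:",
              max_endpoints(64, tiers), "endpoints,",
              "path (switches, links) =", worst_case_path(tiers))
```

For a 64-port switch (the 64×800G configuration cited above), this gives a ceiling of 2,048 endpoints with two tiers and 65,536 with three. The two-tier figure is consistent with the 1k to 2k limit quoted for leaf-spine designs, and the three-tier worst case is five switches joined by four inter-switch links.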
Multi-DC and DCI for distributed training
When a single training run spans more than one data hall — increasingly common for trillion-parameter models — the fabric extends across the WAN. OcNOS-DC supports 400G ZR / ZR+ coherent optics directly on the spine for transponder-free DCI, with EVPN tunnel extension carrying VXLAN tenants across sites.
Multi-DC AI Fabric — Coherent DCI
Two AI data centers stitched together with 400G ZR/ZR+ on the spine. EVPN inter-DC carries L2/L3 tenant extension; the underlying 3-stage Clos in each site is unchanged.
OcNOS pieces: 400G ZR/ZR+ pluggable coherent optics on the spine itself, EVPN inter-DC for tenant L2/L3 extension, gNMI telemetry across sites. No external transponders required.
Design rules of thumb
- Match the topology to the GPU count. Below 256 GPUs, rail-only is enough. 256–1k, rail-optimized leaf-spine. Above 1k, 3-stage Clos is the only design that scales without oversubscription compromises.
- Always 1:1 oversubscription on the AI plane. Storage and CPU racks can run higher oversubscription. The GPU plane should not.
- Plan the rail count from NCCL, not from cabling convenience. 8 rails is the current de facto standard for 8-NIC GPU servers. Don't combine rails into fewer leaves.
- Pick the silicon by power and density, not the badge. TH4 (25.6T) and TH5 (51.2T) are the workhorses; the choice between them is rack power and breakout-cable cost.
- Plan for GLB / UEC at design time. Build the telemetry plane in from day one — even on a 7.0 fabric — so the OcNOS 7.1 GLB upgrade is purely a software step. See GLB and Ultra Ethernet.
- Validate against the HCL. Every reference here is built on hardware listed in the OcNOS Hardware Compatibility List; pick from there for first-class support.
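The first few rules above reduce to simple arithmetic, sketched here in Python. The function names are hypothetical, the "1k" boundary is taken as 1,024 GPUs to match the pod sizes quoted earlier, and the port count is a plain division of aggregate ASIC bandwidth by port speed. Which port configurations a given platform actually exposes is an assumption to check against its datasheet.

```python
# Aggregate switching capacity in Gb/s, from the figures quoted above.
ASIC_GBPS = {"TH4": 25_600, "TH5": 51_200}


def pick_topology(gpus: int) -> str:
    """Apply the GPU-count rule of thumb (1k interpreted as 1,024)."""
    if gpus < 1:
        raise ValueError("gpus must be >= 1")
    if gpus < 256:
        return "rail-only"
    if gpus <= 1024:
        return "rail-optimized leaf-spine"
    return "3-stage Clos"


def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> float:
    """Ratio of downlink to uplink bandwidth on one switch; 1.0 means 1:1."""
    if up_ports <= 0 or up_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return (down_ports * down_gbps) / (up_ports * up_gbps)


def port_count(asic: str, port_gbps: int) -> int:
    """Idealised port count: aggregate bandwidth divided by port speed."""
    return ASIC_GBPS[asic] // port_gbps


if __name__ == "__main__":
    print(pick_topology(512))
    print(oversubscription(32, 800, 32, 800))  # even split on a 64-port leaf
    print(port_count("TH4", 400), port_count("TH5", 800))
```

A GPU-plane leaf that splits its ports evenly between servers and spines at the same speed comes out at 1.0, which is the target for the AI plane. Any value above 1.0 is the kind of oversubscription the rules reserve for storage and CPU racks.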