AI Fabric Topologies — Rail-Optimized & Scheduled Designs

The shape of your fabric decides the shape of your training job. This page lays out the three reference topologies OcNOS-DC ships against — rail-only, rail-optimized, and scheduled 3-stage Clos — sized in concrete port-counts on Broadcom Tomahawk 4 and Tomahawk 5 hardware. Pick the design that matches your scale; we'll publish the configs for it.

Choose by GPU count, not by buzzword

An AI fabric topology has one job: keep every GPU's outbound link saturated during a collective without creating tail-latency outliers. The right topology is the smallest one that does this for your GPU count, with a fall-back path for the next size up. Below: three reference designs OcNOS-DC validates today, with concrete port maths.

256 GPUs

Rail-only single pod

One rack-row, eight rail-aligned ToRs. No spine tier needed. Two-tier collapsed design.

8 × TH3/TH4 leaves · 32 GPUs/leaf

1,024 GPUs

Rail-optimized leaf-spine

Rail-aligned leaves with shared spine tier. East-west traffic between rails uses the spine; intra-rail traffic stays local.

32 leaves · 8 spines · TH4 / TH5 mix

4,096 GPUs

3-stage Clos scheduled

Leaf, spine, super-spine. Non-blocking 1:1 oversubscription end-to-end. DLB at every tier; GLB end-to-end with OcNOS 7.1.

128 leaves · 64 spines · 16 super-spines (TH5)

16,384 GPUs

Scaled scheduled fabric

Multi-pod 3-stage Clos with a super-spine plane. Sized for the trillion-parameter training class.

512 leaves · 256 spines · 64 super-spines (TH5 800G)
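The port maths behind the four cards above can be sanity-checked in a few lines. This is a minimal sketch, not a sizing tool: the GPU and leaf counts are copied from the cards, while the `DESIGNS` table and the `gpus_per_leaf` helper are hypothetical names, and it assumes one fabric NIC per GPU so that GPUs per leaf equals GPU-facing ports per leaf.

```python
# GPU and leaf counts come from the design cards; names here are illustrative.
DESIGNS = {
    "rail-only single pod": {"gpus": 256, "leaves": 8},
    "rail-optimized leaf-spine": {"gpus": 1024, "leaves": 32},
    "3-stage Clos scheduled": {"gpus": 4096, "leaves": 128},
    "scaled scheduled fabric": {"gpus": 16384, "leaves": 512},
}


def gpus_per_leaf(design: dict) -> int:
    """GPU-facing ports each leaf needs, assuming one NIC per GPU."""
    gpus, leaves = design["gpus"], design["leaves"]
    if gpus % leaves:
        raise ValueError("GPU count does not divide evenly across leaves")
    return gpus // leaves


for name, design in DESIGNS.items():
    print(f"{name}: {gpus_per_leaf(design)} GPU-facing ports per leaf")
```

Under that assumption every design lands on the same 32 GPU-facing ports per leaf, which is half of a 64-port switch and leaves the other half for uplinks.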
Reference Design 1

Rail-Optimized Single Pod — 256–1,024 GPUs

Each GPU server has 8 NICs, one per "rail" (a dedicated NCCL channel). Each rail has its own dedicated leaf, so the 8 NICs on a server each land on a different leaf: NIC-N always connects to leaf-N. AllReduce across rail-N stays inside leaf-N, so the dominant collective pattern puts no east-west pressure on the spine.
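The rail-aligned cabling rule can be written down as a small mapping. This is an illustrative sketch only: `leaf_for` and `cabling` are hypothetical helpers, indices are zero-based, and the 32-server pod size is an assumption chosen to match 32 ports per leaf.

```python
RAILS = 8  # one NIC per rail on every server, as described above


def leaf_for(server: int, nic: int) -> int:
    """Rail-aligned cabling: NIC n of every server lands on leaf n."""
    if not 0 <= nic < RAILS:
        raise ValueError(f"NIC index must be in 0..{RAILS - 1}")
    return nic  # the server index does not influence the leaf choice


def cabling(num_servers: int) -> dict:
    """Map every (server, nic) pair to its leaf."""
    return {
        (server, nic): leaf_for(server, nic)
        for server in range(num_servers)
        for nic in range(RAILS)
    }


plan = cabling(32)  # hypothetical pod of 32 servers

# Every server touches all eight leaves exactly once.
leaves_seen_by_server_0 = {plan[(0, nic)] for nic in range(RAILS)}

# Every leaf terminates one NIC from each server.
ports_on_leaf_3 = sum(1 for leaf in plan.values() if leaf == 3)
print(len(leaves_seen_by_server_0), ports_on_leaf_3)
```

Because the leaf depends only on the NIC index, any two NICs on the same rail share a leaf, which is why same-rail traffic never needs the spine.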

Figure: Rail-optimized AI fabric with 8 rails, 8 leaves and a shared spine tier. GPU servers along the bottom each have eight NICs aligned to eight rail-leaves; rail-N from every server connects to leaf-N. A tier of four TH5 800G spines above the leaves carries cross-rail traffic. The dominant AllReduce traffic stays inside one rail and never traverses the spine.

OcNOS pieces: BGP unnumbered underlay, EVPN-VXLAN overlay, RoCEv2 lossless on every leaf, DLB at the spine tier. Validated on UfiSpace S9600/S9700 and on Edgecore AS9736 (TH5) and AS9716 (TH4) platforms.

Scheduled vs Rail-Aligned — what changes at scale

Rail-optimized stops scaling somewhere between 1k and 2k GPUs: you run out of leaf radix, or the spine tier becomes too oversubscribed. Above that, every modern AI fabric is a 3-stage Clos: leaf, spine, super-spine. The "scheduled" descriptor refers to running cell-based or credit-based scheduling on top of the Clos to drive utilisation toward 1.0, which is exactly what UEC and GLB are designed to do.
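The radix ceiling can be estimated with the standard folded-Clos formula. The sketch below assumes identical switches at every tier, a half-down/half-up port split below the top tier, and a 64-port radix (for example 64×400G or 64×800G); `max_endpoints` is a hypothetical helper, and real designs usually land below these figures once ports are reserved for other uses.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoints in a non-blocking folded Clos.

    Every tier except the top splits its ports half down, half up;
    the top tier points all ports down. That gives 2 * (radix / 2) ** tiers.
    """
    if tiers < 1:
        raise ValueError("need at least one tier")
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    return 2 * (radix // 2) ** tiers


RADIX = 64  # assumed port count per switch

print("leaf-spine:", max_endpoints(RADIX, 2))          # 2048
print("leaf-spine-super-spine:", max_endpoints(RADIX, 3))  # 65536
```

With 64-port switches a non-blocking two-tier fabric tops out at 2,048 endpoints, which is consistent with the 1k to 2k range quoted above; adding the third tier raises the bound to 65,536.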

Reference Design 2

3-Stage Clos Scheduled Fabric — 4,096–16,384 GPUs

Three tiers: leaf, spine, super-spine. Any two GPUs are at most four switch-to-switch hops apart (leaf, spine, super-spine, spine, leaf); GPUs under the same leaf or the same spine take fewer. Non-blocking when the radix maths works out. DLB at every hop, GLB across the full path with OcNOS 7.1, UEC packet-spray on UEC-capable NICs.
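The hop count in a leaf, spine, super-spine fabric depends on how far up the tree two GPUs have to go to meet. The following sketch models that under simplifying assumptions: a GPU is located by a `(pod, leaf)` pair, leaf indices are local to a pod, and spines are not shared between pods. `GpuLocation` and `switch_links` are hypothetical names.

```python
from typing import NamedTuple


class GpuLocation(NamedTuple):
    pod: int   # group of leaves sharing a set of spines (assumed model)
    leaf: int  # leaf index within the pod


def switch_links(a: GpuLocation, b: GpuLocation) -> int:
    """Switch-to-switch links on the shortest path between two GPUs."""
    if a == b:
        return 0  # same leaf: traffic turns around inside one switch
    if a.pod == b.pod:
        return 2  # leaf -> spine -> leaf
    return 4      # leaf -> spine -> super-spine -> spine -> leaf


def switches_traversed(a: GpuLocation, b: GpuLocation) -> int:
    """Number of switches on the path, one more than the link count."""
    return switch_links(a, b) + 1


print(switch_links(GpuLocation(0, 0), GpuLocation(0, 0)))  # 0
print(switch_links(GpuLocation(0, 0), GpuLocation(0, 5)))  # 2
print(switch_links(GpuLocation(0, 0), GpuLocation(3, 5)))  # 4
```

Counted as switch-to-switch links, the worst case is four, which matches the figure quoted above; counted as switches traversed, the same path touches five.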

Figure: 3-stage Clos scheduled AI fabric. Four super-spine switches at the top, eight spine switches in the middle, and 12 leaf switches (L1 to L12) feeding GPU pods at the bottom, with full-mesh links from leaf to spine and from spine to super-spine. Labels: 12 pods, ~340 GPUs/pod, 4,096 GPUs total, TH5 800G; DLB at every hop, GLB end-to-end (OcNOS 7.1), UEC-ready.

OcNOS pieces: eBGP unnumbered underlay, EVPN-VXLAN multi-tenant overlay, RoCEv2 lossless, DLB at every tier, GLB end-to-end on the OcNOS 7.1 train, gNMI streaming telemetry to your observability stack. Validated on TH5 64×800G chassis throughout.

Multi-DC and DCI for distributed training

When a single training run spans more than one data hall — increasingly common for trillion-parameter models — the fabric extends across the WAN. OcNOS-DC supports 400G ZR / ZR+ coherent optics directly on the spine for transponder-free DCI, with EVPN tunnel extension carrying VXLAN tenants across sites.

Reference Design 3

Multi-DC AI Fabric — Coherent DCI

Two AI data centers stitched together with 400G ZR/ZR+ on the spine. EVPN inter-DC carries L2/L3 tenant extension; the underlying 3-stage Clos in each site is unchanged.

Figure: Multi-DC AI fabric with 400G ZR/ZR+ DCI. Data Center A and Data Center B each run a leaf-spine fabric (two spines and three leaves shown per site) over their GPU pods. The spines connect across the WAN through 400G ZR/ZR+ coherent optics, and EVPN inter-DC tunnels extend tenants from one site to the other. Transponder-free coherent DCI.

OcNOS pieces: 400G ZR/ZR+ pluggable coherent optics on the spine itself, EVPN inter-DC for tenant L2/L3 extension, gNMI telemetry across sites. No external transponders required.
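When sizing the inter-site link, two back-of-envelope numbers are usually worth having: aggregate DCI capacity and the propagation delay the fibre adds. The sketch below is illustrative only; the spine and port counts and the 80 km distance are hypothetical inputs, and the delay uses the common approximation of about 5 microseconds per kilometre of fibre.

```python
ZR_PORT_GBPS = 400          # 400G ZR/ZR+ pluggable, per the text above
FIBER_DELAY_US_PER_KM = 5.0  # approximation: light in glass travels ~200,000 km/s


def dci_capacity_gbps(spines_per_site: int, zr_ports_per_spine: int) -> int:
    """Aggregate inter-site capacity in one direction, in Gbit/s."""
    return spines_per_site * zr_ports_per_spine * ZR_PORT_GBPS


def fiber_one_way_delay_us(distance_km: float) -> float:
    """One-way propagation delay over the fibre span, in microseconds."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    return distance_km * FIBER_DELAY_US_PER_KM


# Hypothetical example: 2 spines per site, 4 ZR ports each, 80 km apart.
print(dci_capacity_gbps(2, 4), "Gbit/s between sites")
print(fiber_one_way_delay_us(80), "microseconds one way")
```

Comparing these figures with the intra-site fabric bandwidth and latency gives a first indication of which collectives can reasonably cross the DCI.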

Design rules of thumb

  • Match the topology to the GPU count. Up to 256 GPUs, rail-only is enough. From 256 to about 1k, use a rail-optimized leaf-spine. Above 1k, 3-stage Clos is the only design that scales without oversubscription compromises.
  • Always 1:1 oversubscription on the AI plane. Storage and CPU racks can run higher oversubscription. The GPU plane should not.
  • Plan the rail count from NCCL, not from cabling convenience. 8 rails is the current de-facto standard for 8-NIC GPU servers. Don't combine rails into fewer leaves.
  • Pick the silicon by power and density, not the badge. TH4 (25.6T) and TH5 (51.2T) are the workhorses; the choice between them is rack power and breakout-cable cost.
  • Plan for GLB / UEC at design time. Build the telemetry plane in from day one — even on a 7.0 fabric — so the OcNOS 7.1 GLB upgrade is purely a software step. See GLB and Ultra Ethernet.
  • Validate against the HCL. Every reference here is built on hardware listed in the OcNOS Hardware Compatibility List; pick from there for first-class support.
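The 1:1 rule in the list above can be checked per switch by comparing downlink and uplink bandwidth. This is a minimal sketch with a hypothetical `oversubscription` helper; the GPU-leaf port split assumes a 64×800G switch, and the storage-leaf figures are invented purely for contrast.

```python
def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> float:
    """Ratio of host-facing bandwidth to fabric-facing bandwidth.

    1.0 means non-blocking (1:1); anything above 1.0 is oversubscribed.
    """
    if up_ports <= 0 or up_gbps <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# Assumed GPU leaf: 32 x 800G down to GPUs, 32 x 800G up to spines.
gpu_leaf = oversubscription(32, 800, 32, 800)

# Hypothetical storage leaf: 48 x 100G down, 4 x 400G up.
storage_leaf = oversubscription(48, 100, 4, 400)

print(f"GPU leaf {gpu_leaf}:1, storage leaf {storage_leaf}:1")
```

A GPU-plane leaf should come out at exactly 1.0; a result such as 3.0 is the kind of ratio the list reserves for storage and CPU racks.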

Designing your AI fabric? We'll do the port-count maths with you.
