AI Fabric Topologies: Rail-Optimized and Scheduled Designs

The shape of your fabric decides the shape of your training job. This page lays out the reference topologies OcNOS-DC ships against, from a rail-optimized single pod, through a scheduled 3-stage Clos, to coherent multi-DC DCI, each sized in concrete port counts on Broadcom Tomahawk 4 and Tomahawk 5 hardware.

Which AI fabric topology should you use? Pick the smallest non-blocking design that keeps every GPU's link saturated during collectives. Up to about 1,000 GPUs, use a rail-optimized leaf-spine pod (8 rails per GPU server, one rail per leaf). From roughly 1,000 to 16,000+ GPUs, move to a 3-stage Clos (leaf, spine, super-spine). To span data centers, extend with 400G ZR/ZR+ coherent DCI. All three run on OcNOS-DC over a RoCEv2 lossless (PFC + ECN) L3 fabric.

Choose by GPU count, not by buzzword

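An AI fabric topology has one job: keep every GPU's egress link saturated during collective operations without producing tail-latency outliers. The right topology is the smallest one that achieves this for your GPU count, with a path to grow to the next size up. Below are the three reference designs OcNOS-DC supports today, with concrete port math.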

Sizing your own cluster? The AI Fabric Design Suite gives a quick first estimate: it sizes a non-blocking two-tier leaf-spine pod, assuming one fabric NIC per GPU, and flags when you cross into three-tier scale. The reference designs below use the same non-blocking leaf/spine math and extend it to a 3-stage Clos at scale, so the switch counts line up with the tool. Rail-optimized here is the wiring discipline of an 8-NIC GPU server (one rail per leaf, so intra-rail AllReduce stays on the leaf) layered on that non-blocking fabric: it changes traffic locality, not the switch count. Use the tool for a ballpark; use these designs for the build.
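The non-blocking two-tier arithmetic can be sketched in a few lines of Python. This is an illustrative calculation, not the AI Fabric Design Suite's actual logic: the function name `size_two_tier` and its assumptions (radix-64 switches, half of each leaf's ports facing GPUs, one fabric port per GPU) are hypothetical choices made for the example.

```python
import math

def size_two_tier(gpus: int, radix: int = 64) -> dict:
    """Size a 1:1 non-blocking leaf-spine pod, one fabric port per GPU.

    Assumption: each leaf splits its ports evenly (radix/2 down to GPUs,
    radix/2 up to spines) and every spine port faces a leaf.
    """
    down_per_leaf = radix // 2
    leaves = math.ceil(gpus / down_per_leaf)
    uplinks = leaves * (radix - down_per_leaf)
    spines = math.ceil(uplinks / radix)
    # Each spine needs at least one port per leaf, so a two-tier pod
    # cannot hold more leaves than the switch radix.
    fits = leaves <= radix
    return {"leaves": leaves, "spines": spines, "fits_two_tier": fits}

print(size_two_tier(256))   # entry pod
print(size_two_tier(1024))  # single-pod scalable unit
print(size_two_tier(4096))  # too many leaves for two tiers
```

Under these assumptions the function returns 8 leaves and 4 spines for 256 GPUs and 32 leaves and 16 spines for 1,024 GPUs, matching the two-tier pods listed on this page. At 4,096 GPUs it reports `fits_two_tier` as false, which is the point where a super-spine tier is needed.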

256 GPUs

Entry-level non-blocking pod

A single row of rail-aligned leaves under a small spine tier. Two-tier folded Clos, 1:1 non-blocking.

8 leaves · 4 spines · TH4 · 400G

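1,024 GPUs

Rail-optimized two-tier pod

Rail-aligned leaves with a 1:1 non-blocking spine. Intra-rail AllReduce completes on the leaf; cross-rail traffic goes through the spine. The standard single-pod scalable unit.

32 leaves · 16 spines · TH5 · 800G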

4,096 GPUs

3-stage Clos

Leaf, spine, super-spine. Each 1,024-GPU pod is 1:1 non-blocking; a super-spine plane scales across pods. DLB at every tier; GLB end-to-end on the OcNOS 7.1 train.

128 leaves · 64 spines · 32 super-spines · TH5 · 800G

16,384 GPUs

3-stage Clos at scale

Multi-pod 3-stage Clos with super-spine planes. Designed for trillion-parameter-class training.

512 leaves · 256 spines · 128 super-spines · TH5 · 800G

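Reference design 1

Rail-optimized single pod

Each GPU server has 8 NICs, each on its own "rail", a dedicated xCCL (NCCL / RCCL / oneCCL) collective-communication channel. Every rail has its own dedicated leaf switch, so a server's 8 NICs land on 8 different leaves. AllReduce across rail N stays inside leaf N, so the dominant collective pattern puts no east-west pressure on the spine.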

Rail-optimized single pod: each GPU server's 8 NICs map one per rail to 8 dedicated leaves, so same-rail AllReduce stays on the leaf and only cross-rail traffic reaches the spine. The diagram shows four 800G Tomahawk 5 spines above eight rail leaves and four GPU servers.

OcNOS components: BGP-unnumbered L3 underlay, RoCEv2 lossless (PFC + ECN) on every leaf, DLB at the spine tier. Built on HCL-listed hardware: the 800G scalable unit uses TH5 64×800G leaves and spines (Edgecore AIS800-64D or UfiSpace S9321-64E); the entry 256-GPU pod uses Edgecore AS9736-64D (TH4, 64×400G).
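The rail wiring rule for this design can be expressed as a small mapping. The sketch below is illustrative only: the zero-based server, NIC, and leaf numbering and the function names `leaf_for` and `wire_pod` are assumptions for the example, and it models the single-row case where each rail has exactly one leaf.

```python
from collections import defaultdict

RAILS = 8  # one NIC per rail on an 8-NIC GPU server

def leaf_for(server: int, nic: int) -> int:
    """Rail wiring: NIC n of every server lands on rail leaf n.

    `server` is unused because a single-row pod has one leaf per rail
    (hypothetical numbering, not a vendor convention).
    """
    if not 0 <= nic < RAILS:
        raise ValueError("nic index out of range")
    return nic

def wire_pod(servers: int) -> dict:
    """Return {leaf: [(server, nic), ...]} for a single-row rail pod."""
    ports = defaultdict(list)
    for s in range(servers):
        for n in range(RAILS):
            ports[leaf_for(s, n)].append((s, n))
    return dict(ports)

pod = wire_pod(32)  # 32 servers x 8 GPUs = 256 GPUs
for leaf, attached in sorted(pod.items()):
    print(f"leaf {leaf}: {len(attached)} GPU-facing ports")
```

With 32 servers, each of the 8 leaves receives 32 GPU-facing ports, and every port on leaf N belongs to rail N. That is why a same-rail AllReduce never has to leave its leaf.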

Scheduled vs. rail-aligned: what changes at scale

Rail-optimized stops scaling somewhere between 1k and 2k GPUs: you run out of leaf radix, or the spine tier becomes too oversubscribed. Above that, most modern AI fabrics move to a 3-stage Clos: leaf, spine, super-spine. The hard part on a Clos is spreading flows evenly so no link becomes a hot spot. Approaches run from per-flow ECMP, through adaptive (dynamic) load balancing, to per-packet spray, the model Ultra Ethernet uses with reordering handled at the NIC. A separate family, cell-based scheduled fabrics such as Broadcom DDC, segments traffic into cells and schedules it inside the fabric. OcNOS keeps the GPU plane balanced with DLB today and adds fabric-wide GLB on the 7.1 train, and is UEC-ready as UEC NICs arrive.
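To make the difference between the two ends of that range concrete, the sketch below contrasts per-flow ECMP with per-packet spray on a handful of long-lived flows. It is a toy model under stated assumptions: the CRC32 hash, the four-uplink count, the addresses, and the round-robin spray are illustrative stand-ins, not the hashing or spraying logic of any switch ASIC, OcNOS, or UEC NIC. DLB, which sits between the two, is not modeled.

```python
import zlib
from collections import Counter

UPLINKS = 4  # assumed number of equal-cost uplinks

def ecmp_link(flow: tuple) -> int:
    """Per-flow ECMP: hash the 5-tuple so every packet of a flow uses one link."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % UPLINKS

def spray_link(packet_seq: int) -> int:
    """Per-packet spray (simplified to round-robin); the receiver must reorder."""
    return packet_seq % UPLINKS

# Four long-lived RoCEv2 flows (UDP, destination port 4791), 1,000 packets each.
flows = [("10.0.0.1", f"10.0.1.{i}", 17, 49152 + i, 4791) for i in range(4)]

ecmp_load = Counter()
spray_load = Counter()
seq = 0
for flow in flows:
    for _ in range(1000):
        ecmp_load[ecmp_link(flow)] += 1
        spray_load[spray_link(seq)] += 1
        seq += 1

print("per-flow ECMP:   ", dict(ecmp_load))
print("per-packet spray:", dict(spray_load))
```

Spray always lands 1,000 packets on each link here. ECMP moves load only in whole-flow units, so with few large flows any hash collision leaves one link carrying a multiple of another's traffic.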

Reference design 2

3-Stage Clos Scheduled Fabric: 4,096-16,384 GPUs

Three tiers: leaf, spine, super-spine. Any two GPUs are at most four switch hops apart; same-leaf and same-pod peers are closer. Non-blocking within a pod, with the super-spine plane setting the cross-pod ratio. DLB at every hop, GLB across the full path on the OcNOS 7.1 train, UEC packet-spray on UEC-capable NICs. The diagram is schematic: it draws a reduced tier count; the 4,096-GPU build is 128 leaves / 64 spines / 32 super-spines on TH5 800G.

Scheduled 3-stage Clos: leaf, spine, and super-spine tiers scale a non-blocking GPU plane to thousands of GPUs, with DLB at every hop and fabric-wide GLB on the OcNOS 7.1 train. The diagram shows a super-spine tier over a spine tier over leaf switches feeding GPU pods, sized to 4,096 GPUs on 800G Tomahawk 5.

OcNOS components: eBGP-unnumbered L3 underlay, RoCEv2 lossless (PFC + ECN), DLB at every tier, GLB end-to-end on the OcNOS 7.1 train, and gNMI streaming telemetry to your observability stack. Built on HCL-listed TH5 64×800G chassis throughout.

Subscription is a dial, not a fixed rule. These counts make each 1,024-GPU pod 1:1 non-blocking and use a cost-optimized ~2:1 super-spine for cross-pod traffic, the rail-optimized approach hyperscale Ethernet fabrics rely on (published large-scale designs oversubscribe the top tier far more, because collective traffic stays pod-local). Want maximal any-to-any headroom instead? A fully non-blocking 1:1 build is 128 / 128 / 64 at 4,096 GPUs and 512 / 512 / 256 at 16,384; only the spine and super-spine counts change. Model either in the AI Fabric Design Suite.
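The fully non-blocking counts quoted above follow from simple port arithmetic, sketched here in Python. The function name `size_three_stage` and its assumptions (radix-64 switches, one fabric port per GPU, leaves and spines splitting ports evenly between down and up, super-spines with every port facing down) are hypothetical. It models only the 1:1 variant; the cost-optimized ~2:1 counts depend on per-tier port allocations that are not spelled out here.

```python
import math

def size_three_stage(gpus: int, radix: int = 64) -> tuple:
    """Leaf / spine / super-spine counts for a fully 1:1 non-blocking build.

    Assumptions: one fabric port per GPU; leaves and spines use radix/2
    ports down and radix/2 up; super-spines use all ports facing down.
    """
    half = radix // 2
    leaves = math.ceil(gpus / half)
    leaf_uplinks = leaves * half
    spines = math.ceil(leaf_uplinks / half)       # each spine takes radix/2 leaf links
    spine_uplinks = spines * half
    super_spines = math.ceil(spine_uplinks / radix)
    return leaves, spines, super_spines

for gpus in (4096, 16384):
    print(gpus, size_three_stage(gpus))
```

Under these assumptions the function yields 128 / 128 / 64 at 4,096 GPUs and 512 / 512 / 256 at 16,384 GPUs, the same figures as the fully non-blocking build.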

Multi-DC and DCI for distributed training

When a single training run spans more than one data hall, increasingly common for trillion-parameter models, the fabric extends across the WAN. OcNOS-DC supports 400G ZR / ZR+ coherent optics directly on the spine for transponder-free DCI across sites.

Reference design 3

Multi-data-center AI network: coherent DCI

Two AI data centers stitched together with 400G ZR/ZR+ coherent optics on the spine. The underlying 3-stage Clos in each site is unchanged.

Coherent multi-DC DCI: two AI data centers joined by 400G ZR and ZR+ optics on the spine, extending the fabric across sites without external transponders.

OcNOS components: 400G ZR/ZR+ pluggable coherent optics on a DWDM-capable spine or border-leaf port, with gNMI telemetry across sites. No external transponders required. Reach: 400ZR to roughly 120 km amplified; OpenZR+ reaches farther on oFEC.

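Design rules of thumb

Match the topology to the GPU count. Smallest pods (within a single leaf's NIC-facing radix): rail-only is enough. Single-pod scale: use a rail-optimized leaf-spine. Multi-pod scale: a 3-stage Clos is the only design that scales without compromising the subscription ratio.
Keep the AI plane at 1:1 subscription. Storage and CPU racks can tolerate higher oversubscription; the GPU plane should not.
Plan the rail count around xCCL, not cabling convenience. For 8-NIC GPU servers, 8 rails is the current de facto standard. Do not collapse multiple rails onto fewer leaves.
Pick silicon by power and density, not by brand badge. TH4 (25.6T) and TH5 (51.2T) are the workhorse chips; the trade-off between them is rack power and breakout cabling cost.
Plan for GLB / UEC at design time. Build the telemetry plane from day one, even on a 7.0 fabric, so the OcNOS 7.1 GLB upgrade is purely a software operation. See GLB and Ultra Ethernet.
Validate against the HCL. Every reference design here is built on listed hardware; see the OcNOS Hardware Compatibility List. Starting there gets you first-class support.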

Frequently asked questions

AI fabric topology FAQ

What is a rail-optimized topology, and how is it different from rail-only?

Rail-optimized wiring connects each of a GPU server's 8 NICs to its own dedicated rail leaf, so the dominant same-rail AllReduce traffic stays on one leaf and never traverses the spine. Rail-only is the small-cluster case: a single rack-row of rail-aligned leaves with no spine tier, where cross-rail traffic relies on the GPU scale-up domain. Rail-optimized adds a non-blocking spine so cross-rail flows have a network path.

How many GPUs can a 3-stage Clos scale to?

It depends on how each 800G port is mapped to GPUs. On radix-64 Tomahawk 5 switches, a 2-tier leaf-spine reaches about 2,048 GPUs at one 800G fabric port per GPU (1:1 non-blocking), and scales toward 8,000+ GPUs when 800G ports break out to multiple GPU NICs. A 3-stage Clos with a super-spine tier extends that to 16,000+ GPUs in the reference designs above, and up to about 65,000 GPUs at the theoretical fat-tree limit. Because most collective traffic stays rail-local, the super-spine plane is sized to the cross-pod ratio you actually need.
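Those ceilings follow from the switch radix. The short sketch below reproduces them; the function names and the 4-way breakout factor are illustrative assumptions, and the three-tier figure is the classic k³/4 fat-tree host limit for k-port switches.

```python
def two_tier_max(radix: int = 64, gpus_per_port: int = 1) -> int:
    """Max GPUs in a 1:1 non-blocking leaf-spine.

    At most `radix` leaves (one spine port each), radix/2 GPU-facing
    ports per leaf, and `gpus_per_port` GPUs per port after breakout.
    """
    return radix * (radix // 2) * gpus_per_port

def fat_tree_max(radix: int = 64) -> int:
    """Classic three-tier fat-tree host limit: k^3 / 4 for k-port switches."""
    return radix ** 3 // 4

print(two_tier_max())                 # 2048
print(two_tier_max(gpus_per_port=4))  # 8192, assuming a 4-way breakout
print(fat_tree_max())                 # 65536
```

The breakout case trades per-GPU bandwidth for GPU count: each GPU NIC runs at a quarter of the port speed, which is how a two-tier pod stretches past 8,000 GPUs.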

Should I use Tomahawk 4 or Tomahawk 5?

Both run OcNOS-DC. Tomahawk 4 (25.6 Tbps, 64×400G) is the cost-optimized choice for entry pods and 400G GPU NICs. Tomahawk 5 (51.2 Tbps, 64×800G) is the workhorse for 800G GPU servers and larger fabrics. Tomahawk 4 has no native 800G, so match the switch to your NIC speed.

Do I need InfiniBand, or is Ethernet enough?

Ethernet is now a first-class AI-fabric transport. RoCEv2 with PFC and ECN delivers lossless RDMA today. Ultra Ethernet (UEC) removes the network-wide PFC dependency using endpoint packet-spray, selective retransmission, and link-level retry as UEC NICs ship. OcNOS-DC runs the RoCEv2 fabric today and is UEC-ready.

Where does scale-up end and scale-out begin?

Inside a GPU server and its NVLink domain (for example GB200 NVL72), GPUs communicate over the scale-up fabric at terabit speeds. The rail, leaf-spine, and Clos network is the scale-out fabric between servers and pods. Most same-rail collective traffic is absorbed by scale-up first, so the network carries the cross-rail and cross-pod remainder, which is why 1:1 non-blocking matters most on the GPU plane.
