OcNOS-DC · UEC 1.0 Aligned · DCQCN + DLB + GLB · Up to 16k GPU Scale

An open AI fabric — built for what your training job actually feels.

At thousands of accelerators you don't measure switches in Tbps — you measure job completion time, GPU utilization, and tail latency under microbursts. OcNOS-DC moves those numbers on open merchant silicon with a 24/7 carrier-grade SLA: the same technical floor as the closed AI stacks, none of the lock-in.

Up to 16k GPUs · Reference design ceiling
Sub-ms DLB · Flowlet rebinding
UEC 1.0 · Fabric-profile aligned
24/7 SLA · Carrier-grade global
16k GPU
Reference design ceiling
DCQCN
xCCL-tuned, every threshold YANG-modeled
DLB + GLB
Flowlet local + fabric-wide adaptive routing
UEC 1.0
Fabric-profile aligned · open answer to IB
The builder's question

"Will my training job actually finish faster?"

At scale, traditional network metrics lose their meaning. What matters is Job Completion Time, GPU utilization, and tail latency under microbursts — because every minute a multi-billion-dollar cluster waits on a synchronization step is capital burned.

The lossless, low-latency performance AI needs no longer requires a closed, proprietary stack. On open merchant silicon with a carrier-grade SLA, OcNOS-DC matches the technical floor of closed architectures with no vendor lock-in — congestion management, sub-millisecond dynamic routing, and Ultra Ethernet alignment, tuned for the bursty patterns of collective traffic. GPUs spend their time processing data, not waiting on the network.

Every threshold is exposed, so your team can tune it against real xCCL (NCCL / RCCL / oneCCL) traffic. Below: each workload pattern, the mechanism that handles it, and what the operator gets back.

AllReduce / AllGather
Every GPU talks to every other GPU at once.
Static ECMP pins elephant flows to one spine link — hot spots, idle uplinks, slow sync.
DLB rebinds flowlets sub-ms on live queue depth.
GLB (OcNOS 7.1) scores leaf · spine · super-spine.
Result: no hash-collision hot spots; AllReduce holds near line rate.
Microburst / Incast
N senders converge on one queue in microseconds.
A drop restarts the collective; a pause storm blocks the line. Either way the run stalls.
DCQCN (xCCL-tuned ECN + CNP) caps rate before the drop.
PFC Watchdog auto-drains stuck queues per-port.
Result: jobs survive bursts; deadlocks self-recover — no 3 a.m. power-cycle.
Multi-rail / Scale-out
One flow needs every parallel path simultaneously.
Hash-pinned single-path ECMP leaves multi-rail bandwidth idle.
UEC 1.0: packet spray + multi-path RDMA + out-of-order delivery.
→ The switch you buy today stays when UEC NICs land.
Result: tail-latency outliers shrink as UEC NICs roll out — the open answer to InfiniBand.
~55% → 90%+

Field measurement. DLB lifts fabric utilization from ~55% on static ECMP to 90%+ on the same hardware — no extra uplinks. Local at each hop; system-wide across the AllReduce.

DLB deep-dive →
What it looks like in a rack row

800G spine-leaf, lossless from rack to rack.

A 3-stage Clos: eBGP unnumbered underlay, ECMP at every tier, PFC/ECN per priority group, isolated out-of-band bus for ZTP and telemetry.

Figure: 800G AI fabric topology with full-mesh eBGP and isolated OOB management. Three GPU racks on the left feed two leaf VTEPs running OcNOS-DC, which connect to two 51.2 Tbps spines over a full-mesh eBGP ECMP underlay with DLB. An isolated out-of-band management bus across the top carries ZTP and telemetry. Leaf-attached NVMe-oF/NFS GPU storage sits to the right.

  • OOB Mgmt: isolated OOB management bus, isolated network, ZTP · telemetry
  • GPU Rack 1, 2, 3: 8× GPU nodes each, RoCEv2 / RDMA
  • Leaf-01: OcNOS-DC, 64 × 400G, Tomahawk 4, PFC / DCBX / ZTP, lossless RoCEv2
  • Leaf-02: OcNOS-DC, 32 × 400G, Tomahawk 3, PFC / DCBX / ZTP, lossless RoCEv2
  • Leaf-01 to Leaf-02: MLAG peer
  • Leaf to spine: eBGP ECMP full mesh
  • Spine-01, Spine-02: OcNOS-DC, 51.2 Tbps, eBGP · ECMP · DLB
  • GPU Storage: NVMe-oF / NFS, RDMA-optimized

Caption: OcNOS-DC AI fabric, horizontal Clos · PFC · ECN · DLB · 800G
OcNOS-DC leaf/spine
OcNOS-DC spine (DLB)
GPU servers / storage

Full HCL: 40+ validated platforms at ipinfusion.com/hcl

600+ · Production OcNOS networks
26 yr · ZebOS routing stack in service
24/7 · Carrier-grade global SLA
Inside the fabric

Four layers of losslessness — correct on Day 1.

Most AI fabric failures trace to one misconfigured PFC priority group or an ECN threshold tuned for cloud, not RDMA. OcNOS-DC ships RoCEv2 buffer profiles validated per Broadcom ASIC — so your first AllReduce runs lossless without a tuning sprint.

PFC + ECN — priority-group lossless control

PFC pauses per-priority traffic before buffers overflow; ECN marks packets early for sender-side slowdown. No drops, no port-wide stall. PFC over L3 for routed multi-row fabrics.
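The "ECN marks early, PFC pauses last" ordering can be sketched as a WRED-style marking curve. This is a minimal illustration, not OcNOS-DC's implementation: the function names and every threshold value below are hypothetical, chosen only to show how a marking window is expected to sit below the PFC pause point.

```python
def ecn_mark_probability(queue_bytes: int, k_min: int, k_max: int, p_max: float) -> float:
    """WRED-style ECN curve: no marking below k_min, linear ramp up to p_max
    at k_max, and every packet marked once the queue passes k_max."""
    if queue_bytes <= k_min:
        return 0.0
    if queue_bytes >= k_max:
        return 1.0
    return p_max * (queue_bytes - k_min) / (k_max - k_min)


def thresholds_are_ordered(k_min: int, k_max: int, pfc_xoff: int) -> bool:
    """ECN should get a chance to slow senders before PFC has to pause them,
    so the marking window must sit entirely below the PFC XOFF threshold."""
    return 0 < k_min < k_max < pfc_xoff


# Hypothetical per-queue values in bytes; real profiles depend on ASIC and link speed.
K_MIN, K_MAX, P_MAX, PFC_XOFF = 100_000, 400_000, 0.2, 800_000

halfway = ecn_mark_probability(250_000, K_MIN, K_MAX, P_MAX)
ordered = thresholds_are_ordered(K_MIN, K_MAX, PFC_XOFF)
```

If the ECN window overlaps or exceeds the PFC threshold, pauses fire before senders ever see a mark, which is the cloud-tuned misconfiguration this section warns about.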

DLB — flowlet-level adaptive routing

Static-hash ECMP collides when 8 NICs hash to the same spine. DLB watches live queue depth and rebinds flowlets to less-loaded paths sub-ms — the AllReduce stops dragging on the slowest link.
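To see why static hashing hurts, here is a toy model, assuming equal-size flows and a uniformly random hash. It is not a reproduction of any field measurement; the function names, trial count, and seed are illustrative assumptions. Because a collective finishes only when its slowest link finishes, the useful figure is mean link load divided by the busiest link's load.

```python
import random
from collections import Counter


def static_ecmp_utilization(num_flows: int, num_uplinks: int, rng: random.Random) -> float:
    """Pin each equal-size flow to one uplink by a random 'hash' and return
    mean link load divided by the busiest link's load (1.0 = perfectly even)."""
    loads = Counter(rng.randrange(num_uplinks) for _ in range(num_flows))
    busiest = max(loads.values())
    return (num_flows / num_uplinks) / busiest


def average_utilization(num_flows: int, num_uplinks: int,
                        trials: int = 2000, seed: int = 0) -> float:
    """Average the ratio over many random hash outcomes."""
    rng = random.Random(seed)
    total = sum(static_ecmp_utilization(num_flows, num_uplinks, rng)
                for _ in range(trials))
    return total / trials


# Eight NIC flows hashed onto eight spine uplinks, as in the paragraph above.
eight_by_eight = average_utilization(num_flows=8, num_uplinks=8)
```

With eight flows and eight uplinks, a perfectly even spread is rare under random hashing, so the busiest link usually carries two or more flows while other uplinks sit idle. That gap is what flowlet rebinding targets.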

DCBX — server config auto-pushed over LLDP

The leaf pushes correct PFC and ETS config to the GPU server automatically — no silent loss of losslessness when a node gets re-imaged, the most common production failure mode.

gNMI on-change telemetry — sub-second visibility

PFC pauses, ECN marking, DCQCN thresholds, and buffer depths as gNMI on-change sensor paths — straight into Prometheus / Grafana / OpenTelemetry. Catch congestion before it stalls a job.
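On the collector side, on-change updates can drive a very small alerting rule. The sketch below is an assumption-heavy illustration using only the standard library: the `PfcMonitor` class, its `alert_delta` threshold, and the update shape are hypothetical, and a real pipeline would more likely express this as a Prometheus or Grafana rule.

```python
from dataclasses import dataclass, field


@dataclass
class PfcMonitor:
    """Tracks the last PFC-Rx counter per (interface, priority) and flags
    any update whose increase reaches alert_delta pause frames."""
    alert_delta: int = 100  # hypothetical threshold, tune per fabric
    last_seen: dict = field(default_factory=dict)

    def on_update(self, interface: str, priority: int, pfc_rx: int) -> bool:
        key = (interface, priority)
        previous = self.last_seen.get(key)
        self.last_seen[key] = pfc_rx
        if previous is None:
            return False  # first sample only establishes a baseline
        return pfc_rx - previous >= self.alert_delta


monitor = PfcMonitor(alert_delta=100)
baseline = monitor.on_update("et-0/0/1", 3, 0)
quiet = monitor.on_update("et-0/0/1", 3, 50)
hot = monitor.on_update("et-0/0/1", 3, 500)
```

Because on-change subscriptions only send a value when it moves, a quiet lossless fabric produces almost no updates, and any burst of pause-counter deltas is itself the signal.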

ai-leaf01: gNMI lossless fabric telemetry streaming
$ gnmic subscribe --path /qos/pfc/ \
    --stream-mode on-change --encoding proto
RoCEv2 Priority Group 3 (real-time)
et-0/0/1 PG3 PFC-Rx: 0 Tx: 0 Drops: 0
et-0/0/2 PG3 PFC-Rx: 0 Tx: 0 Drops: 0
et-0/0/3 PG3 PFC-Rx: 0 Tx: 0 Drops: 0
$ gnmic subscribe --path /interfaces/counters/
et-0/0/1 in: 780 Gbps out: 776 Gbps
et-0/0/2 in: 795 Gbps out: 791 Gbps
→ Telegraf → Prometheus → Grafana
✓ Lossless · 0 drops · fabric healthy
Validated AI fabric platforms
AIS800-64D
Edgecore — Spine
800G · TH5
S9321-64E
UfiSpace — Spine
800G · TH5
AS9736-64D
Edgecore — Leaf
400G / 25.6T
AS9716-32D
Edgecore
400G / 12.8T

40+ validated platforms · Full HCL →

Ultra Ethernet · UEC 1.0 Aligned

The fabric profile is ready before the NICs are. That's the point.

RoCEv2 is the production transport in 2026; UEC is what comes next. The UEC 1.0 fabric profile adds packet spray, multi-path RDMA, and out-of-order-friendly forwarding — closing the single-hash limit that kept earlier RoCE a step behind InfiniBand on multi-rail collectives. OcNOS-DC tracks the UEC 1.0 fabric profile today, while UEC NICs roll out. The point isn't leading the standard — everyone is aligning to it. It's that the switch you buy this quarter won't need replacing when your UEC NIC arrives.

Packet spray

Single flow uses every parallel path simultaneously instead of being pinned to one ECMP hash. Multi-rail bandwidth is no longer left on the table.

Multi-path RDMA

Reorder buffers handle out-of-order delivery in hardware. Modern congestion control takes over from NACK-based loss recovery, a common source of tail latency.
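The reorder-buffer idea can be shown in a few lines. This is a software toy under stated assumptions, not the UEC wire protocol or any NIC's hardware design: the `ReorderBuffer` class and its sequence-number scheme are hypothetical, and real endpoints also bound the buffer and handle timeouts.

```python
class ReorderBuffer:
    """Holds packets that arrive ahead of sequence and releases them
    to the application strictly in order."""

    def __init__(self) -> None:
        self.next_seq = 0   # next sequence number owed to the application
        self.pending = {}   # seq -> payload for packets that arrived early

    def receive(self, seq: int, payload: str) -> list:
        """Accept one packet; return the payloads that became deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate of something already delivered or held
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


# Three packets sprayed over different paths arrive as 1, 2, 0.
buf = ReorderBuffer()
first = buf.receive(1, "b")
second = buf.receive(2, "c")
third = buf.receive(0, "a")
```

Nothing is released until the gap at sequence 0 is filled, at which point all three payloads come out together and in order. That is why spraying a single flow across paths does not have to surface reordering to the application.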

Same hardware, forward path

The TH4 and TH5 platforms validated for OcNOS-DC today extend into UEC. No fork. No second SKU line. One fabric, two transport generations.

Read the Ultra Ethernet deep-dive →
If you're picking a fabric in 2026

Where OcNOS-DC sits — honestly, by name.

The race has converged on a shared floor: lossless RoCEv2, DCQCN, adaptive routing, UEC alignment. Everyone ships these. The real differentiator is solution shape — vertical lock-in vs. open NOS, locked vs. open hardware, closed-loop IB vs. standards Ethernet. Pick the trade-off you can live with for five years.

Solution shape | Examples | Trade-off
Closed vertical AI stack | NVIDIA Spectrum-X + Quantum + ConnectX | Excellent integrated performance. NIC, switch, and fabric software locked to one vendor, and to one GPU roadmap.
Locked merchant-silicon NOS | Arista EOS · Cisco NX-OS · Juniper Junos | Same Broadcom silicon underneath. Per-port licensing premium. Telemetry and tuning constrained to the vendor's own pipeline.
Cell-based proprietary chassis fabric | DriveNets Network Cloud | Different architecture: scheduled cell fabric, not Ethernet NOS. Strong at hyperscale; not portable to standard switches.
Closed-loop InfiniBand | NVIDIA Quantum InfiniBand | Best-in-class for tight collectives today. Separate cabling, separate operations, single-vendor ecosystem. UEC closes the gap on Ethernet.
Open NOS, no AI hardening | Community SONiC | Open hardware, free software, no SLA. xCCL-tuned defaults, deadlock watchdog, and tuning maturity are left entirely to the operator.
Open NOS, AI-hardened, UEC-aligned | OcNOS-DC on Edgecore / UfiSpace | Same Broadcom silicon. xCCL-tuned DCQCN out of the box, sub-ms DLB, GLB on the 7.1 roadmap, PFC deadlock watchdog. UEC 1.0 fabric profile. 24/7 carrier-grade SLA. No NIC, GPU, or hardware lock-in.

Every row ships a real product — including OcNOS-DC. The question is rarely a missing feature; it's the trade-off you'll live with.

Wait — so what's an "AI fabric" exactly?

What it actually is — and where it stops.

An AI cluster is three layers. The fabric moves bytes between switches; the NIC terminates RDMA; the scheduler decides what runs where. "AI-aware fabric" usually means one vendor bundled all three under one SKU. OcNOS-DC owns the fabric, exposes every threshold, and stays out of the layers above. Here's the boundary, named.

Layer 1 · Fabric

What OcNOS-DC owns.

  • Lossless RoCEv2 transport — PFC + ECN + ETS + DCBX
  • DCQCN with xCCL-validated default thresholds, every knob YANG-modeled
  • DLB sub-ms flowlet rebinding on live ASIC queue depth
  • GLB fabric-wide path scoring (OcNOS 7.1)
  • PFC deadlock watchdog — per-port, per-priority
  • UEC 1.0 fabric-profile alignment — packet-spray-friendly forwarding
  • gNMI on-change telemetry, OpenConfig YANG, sub-second cadence
Shipping today on Edgecore / UfiSpace TH4 + TH5. GLB on the OcNOS 7.1 train.
Layer 2 · NIC + Transport

Your NIC vendor's job.

  • xCCL collective implementation and tuning
  • RDMA verbs, queue pairs, retransmit logic
  • UEC packet spray endpoint + reorder buffer (UEC NICs)
  • GPU-direct memory access, NVLink coordination
  • Per-flow rate limiting and end-host congestion response
NVIDIA ConnectX, BlueField, AMD Pensando, Intel Mt. Evans, Cornelis, future UEC silicon. OcNOS interoperates with all of them — and never replaces the choice.
Layer 3 · Cluster Scheduler

Your orchestration platform's job.

  • Training-job placement, gang scheduling, gradient-sync windows
  • Epoch / training-phase awareness
  • Tenant isolation, queue priority, resource quotas
  • xCCL ring topology assignment, rail-group affinity
  • Cross-job interference detection
Slurm, Kubernetes, Run:ai, NVIDIA Base Command, in-house schedulers. OcNOS-DC streams gNMI telemetry into them — it doesn't try to replace them.
Why the line is here: a fabric that owns layers 2 and 3 can never be swapped — NIC locked to switch, scheduler to NIC, GPU roadmap to vendor. InfiniBand owned all three for fifteen years and operators paid for it. OcNOS-DC ships every fabric mechanism a 2026 workload needs, validates it against xCCL traffic, and stops at the wire. That's why "AI-aware fabric" is the wrong question — the right one is whether the fabric does its job well enough that the NIC and scheduler don't have to fight it.
Going deeper

Every mechanism on this page has its own deep-dive.

The page above is for picking a fabric. These are for tuning one — packet captures, ASIC behavior, YANG paths, and where each feature ships in the release train.

AI Fabric · Lossless

RoCEv2 + PFC + ECN + DCQCN

The lossless RDMA transport layer for GPU collectives. Buffer profiles pre-tuned per Broadcom ASIC, xCCL-class DCQCN defaults, sub-µs jitter under load.

Read deep-dive →
AI Fabric · Local

Adaptive Dynamic Load Balancing (DLB)

Sub-millisecond flowlet rebinding using live ASIC queue-depth telemetry. Closes the ECMP hash-collision gap on AllReduce elephant flows.
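Flowlet rebinding can be sketched as a small table keyed by flow. This is a minimal model, not the ASIC's logic: the `FlowletBalancer` class, the 50 µs idle gap, and the `queue_depth` list standing in for live queue telemetry are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FlowletEntry:
    port: int
    last_seen_us: float


class FlowletBalancer:
    """Keeps a flow on its current port while packets arrive back to back,
    and re-picks the least-loaded port only after an idle gap."""

    def __init__(self, num_ports: int, gap_us: float = 50.0) -> None:
        self.queue_depth = [0] * num_ports  # stand-in for live ASIC queue depth
        self.gap_us = gap_us                # hypothetical flowlet idle gap
        self.table = {}                     # flow_id -> FlowletEntry

    def pick_port(self, flow_id: str, now_us: float) -> int:
        entry = self.table.get(flow_id)
        if entry is None or now_us - entry.last_seen_us > self.gap_us:
            # New flowlet: safe to move, since earlier packets have drained.
            port = min(range(len(self.queue_depth)),
                       key=self.queue_depth.__getitem__)
            entry = FlowletEntry(port, now_us)
            self.table[flow_id] = entry
        entry.last_seen_us = now_us
        return entry.port


lb = FlowletBalancer(num_ports=2, gap_us=50.0)
lb.queue_depth = [10, 0]
start = lb.pick_port("allreduce-0", now_us=0.0)      # least-loaded port at start
lb.queue_depth = [0, 90]
mid_burst = lb.pick_port("allreduce-0", now_us=10.0)  # inside the burst: stays put
after_gap = lb.pick_port("allreduce-0", now_us=100.0) # idle gap passed: rebinds
```

Rebinding only at idle gaps is the design point: moving a flow mid-burst would reorder packets, while moving it between bursts lets an elephant flow escape a congested link without the receiver noticing.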

Read deep-dive →
AI Fabric · Fabric-wide OcNOS 7.1

Global Load Balancing (GLB)

End-to-end path scoring across leaf · spine · super-spine for clusters up to 16k GPU. The multi-hop adaptive layer DLB cannot see alone.

Read deep-dive →
AI Fabric · Frontier UEC 1.0

Ultra Ethernet (UEC)

Packet spray, multi-path RDMA, out-of-order delivery, modern congestion control. The standards-based open answer to InfiniBand.

Read deep-dive →
AI Fabric · Reference Designs

Topologies — 1k / 4k / 16k GPU

Rail-only and rail-optimized designs map the fabric shape directly onto the xCCL 8-rail multi-NIC pattern. 3-stage Clos for scale-out beyond 1k GPU. Port counts on TH4 / TH5.
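The port arithmetic behind these designs is short enough to check by hand. The sketch below assumes non-blocking (1:1) fabrics built from identical switches and counts endpoints as NIC ports, not GPUs; the function names are illustrative, and the reference designs themselves may use different oversubscription or rail layouts.

```python
def two_tier_endpoints(radix: int) -> int:
    """Non-blocking leaf-spine: each leaf splits its ports half down, half up,
    and each spine port connects one leaf, so there are at most `radix` leaves."""
    return radix * (radix // 2)


def three_tier_endpoints(radix: int) -> int:
    """Non-blocking three-tier fat-tree (leaf, spine, super-spine) built from
    identical radix-k switches: k pods x k/2 leaves x k/2 endpoints = k^3 / 4."""
    return radix ** 3 // 4


# 64-port and 128-port logical radix, for example 64 x 800G or the same
# 51.2 Tbps switch broken out to 128 x 400G.
sizes = {
    "2-tier, radix 64": two_tier_endpoints(64),
    "2-tier, radix 128": two_tier_endpoints(128),
    "3-tier, radix 64": three_tier_endpoints(64),
}
```

Under these assumptions a 64-port radix tops out at 2,048 endpoints in two tiers and a 128-port breakout at 8,192, so very large clusters need either higher breakout, oversubscription, or a third tier.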

Read deep-dive →
AI Fabric · Congestion Control

DCQCN — RDMA Congestion Control

WRED ECN marking, CNP feedback, quantized rate control. xCCL-class defaults out of the box; every threshold YANG-modeled for tuning.
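The sender side of that loop can be sketched from the published DCQCN algorithm. This is a simplified model, not NIC firmware or an OcNOS-DC API: the `DcqcnSender` class is hypothetical, `g = 1/256` is the commonly cited default, and the additive and hyper-increase phases are left out.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Reaction-point state: current rate, target rate, and the congestion
    estimate alpha that scales each rate cut."""
    line_rate_gbps: float
    g: float = 1 / 256        # alpha gain; commonly cited default
    alpha: float = 1.0
    current_rate: float = 0.0
    target_rate: float = 0.0

    def __post_init__(self) -> None:
        # Senders start at line rate until the first CNP arrives.
        self.current_rate = self.target_rate = self.line_rate_gbps

    def on_cnp(self) -> None:
        """A Congestion Notification Packet arrived: remember the old rate,
        cut the current rate by alpha/2, and raise alpha."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP during the timer window: decay the congestion estimate."""
        self.alpha *= 1 - self.g

    def on_fast_recovery(self) -> None:
        """Recovery step: move halfway back toward the target rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2


sender = DcqcnSender(line_rate_gbps=400.0)
sender.on_cnp()
rate_after_cnp = sender.current_rate
sender.on_fast_recovery()
rate_after_recovery = sender.current_rate
```

The switch only decides when to mark; how hard the sender brakes depends on alpha, the gain, and the timers. That is why ECN thresholds and NIC-side parameters have to be tuned as a pair against real collective traffic.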

Read deep-dive →
AI Fabric · Survival

Watchdog — PFC Deadlock Detection

Per-port, per-priority watchdog detects paused-queue cycles and auto-drains the affected queue before training jobs hang.
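A watchdog of this kind is essentially a per-queue timer. The sketch below is a simplification under stated assumptions: it detects a queue that stays paused too long instead of tracing the full pause cycle, and the `PfcWatchdog` class, its state names, and both intervals are hypothetical values, not OcNOS-DC defaults.

```python
from dataclasses import dataclass, field


@dataclass
class PfcWatchdog:
    """Per (port, priority) timer: a queue paused for detect_ms is drained,
    then returned to normal PFC handling after restore_ms."""
    detect_ms: int = 200    # hypothetical detection interval
    restore_ms: int = 400   # hypothetical drain duration
    paused_since: dict = field(default_factory=dict)
    draining_since: dict = field(default_factory=dict)

    def poll(self, port: int, priority: int, is_paused: bool, now_ms: int) -> str:
        key = (port, priority)
        if key in self.draining_since:
            if now_ms - self.draining_since[key] >= self.restore_ms:
                del self.draining_since[key]
                self.paused_since.pop(key, None)
                return "restored"
            return "draining"
        if not is_paused:
            self.paused_since.pop(key, None)
            return "ok"
        started = self.paused_since.setdefault(key, now_ms)
        if now_ms - started >= self.detect_ms:
            self.draining_since[key] = now_ms
            return "draining"
        return "paused"


wd = PfcWatchdog(detect_ms=200, restore_ms=400)
states = [wd.poll(1, 3, True, t) for t in (0, 100, 200, 300, 600)]
```

Keeping the state per port and per priority means a stuck RoCEv2 queue is drained without touching other traffic classes on the same port.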

Read deep-dive →
AI Fabric · Decision Guide

InfiniBand vs Ethernet for AI

Workload-specific decision guide. Where modern Ethernet (RoCEv2 + DLB + UEC) closes the gap, where IB still wins, and how to pick.

Read deep-dive →
Observability

gNMI Streaming Telemetry

gNMI Subscribe over gRPC, OpenConfig YANG, dial-out collectors. Integrations with Telegraf, Prometheus, and Grafana.

Read deep-dive →
What people are actually building

Three cluster shapes. Three fabric stories.

Framed by what the job feels, not by switch features. Pick the shape closest to yours; the deep-dives have the configs.

SHAPE 01 · LLM PRE-TRAINING

The multi-week LLM pre-training run.

AllReduce dominates the network. Every GPU must hold >90% utilization in-collective and survive microbursts without restarting a nine-day run.

Mechanisms: DCQCN + DLB + PFC Watchdog. Rail-optimized below 1k GPU; 3-stage Clos with GLB above.
Result: AllReduce at line rate, zero collective restarts, JCT inside schedule.

SHAPE 02 · LIVE INFERENCE

The high-throughput inference fleet behind a public API.

Real-time inference where p99 tail latency drives the SLO. Inference must never queue behind batch retraining, and ops needs per-flow visibility the moment latency drifts.

Mechanisms: ETS strict-priority + gNMI on-change telemetry into Prometheus / OpenTelemetry.
Result: p99 held inside SLO; regressions caught in milliseconds, not the support queue.

SHAPE 03 · GPU-AS-A-SERVICE

The neocloud renting H100 / H200 / Blackwell to tenants.

A multi-tenant GPU cloud. Each tenant needs isolated lossless RoCEv2 paths — without a separate fabric segment per customer or a second NOS image.

Mechanisms: EVPN-VXLAN isolation + lossless RoCEv2 on one OcNOS-DC instance.
Result: per-tenant isolation, one ops model, one SLA, one image to upgrade.

Talk to a network architect

Bring your topology. We'll show you the path.

Every IPI architecture review is led by a network engineer running production OcNOS — no slides, no sales theatre. Bring your GPU count, NIC choice, and target JCT; we'll map it to topology, SKUs, and configs that ship today.

Questions an AI cluster architect actually asks

The honest FAQ.

Is OcNOS-DC actually "AI-native" — or just RoCEv2 with extras?
No merchant-silicon Ethernet NOS is literally AI-native — none reason about xCCL (NCCL / RCCL / oneCCL) collectives or schedule jobs at the switch; that lives in the NIC and scheduler. OcNOS-DC implements every fabric mechanism a 2026 AI workload needs — lossless RoCEv2, DCQCN with xCCL-validated defaults, sub-ms DLB, GLB (OcNOS 7.1), PFC deadlock watchdog, UEC 1.0 alignment — and stays out of the layers above. "AI-aware fabric" usually just means one vendor sells NIC + switch + scheduler as one locked SKU.
Where does OcNOS-DC stop, and where do the NIC and cluster scheduler take over?
OcNOS-DC owns layer 1 — lossless RDMA transport, congestion control, adaptive routing, deadlock recovery, telemetry. The NIC owns layer 2 (xCCL, RDMA verbs, packet spray, GPU-direct memory); the scheduler owns layer 3 (job placement, gradient-sync windows, tenant isolation). OcNOS-DC streams gNMI telemetry into layer 3 but never tries to be the scheduler — that separation keeps your NIC, GPU, and orchestration swappable.
How does OcNOS AI Fabric compare to NVIDIA Spectrum-X, SONiC, Arista, Cisco, or DriveNets?
Spectrum-X is a closed NVIDIA NIC + switch + software stack — excellent performance, single-vendor lock-in. Arista, Cisco, and Juniper run similar RoCEv2 features on locked hardware with proprietary licensing. Community SONiC is open but ships no AI-hardened defaults, watchdog, or SLA. DriveNets DDC is a proprietary cell fabric, not an Ethernet NOS. OcNOS-DC: open NOS on the same Broadcom silicon, UEC-aligned, xCCL-tuned DCQCN, 24/7 SLA — same technical floor, no lock-in.
What does Ultra Ethernet (UEC) 1.0 mean for OcNOS AI Fabric?
UEC 1.0 brings packet spray, multi-path RDMA, and out-of-order delivery to Ethernet — the open answer to InfiniBand. Production fabrics run RoCEv2 + DCQCN + DLB today, all fully supported; UEC parallelizes every flow across paths instead of pinning it to one ECMP hash. OcNOS-DC tracks the UEC 1.0 fabric profile so the switch you buy today moves to UEC NICs without a NOS or hardware swap. See the Ultra Ethernet deep-dive.
What is RoCEv2, and why does it need a lossless Ethernet fabric?
RoCEv2 enables direct GPU-to-GPU memory transfer with no CPU overhead for collectives like AllReduce and AllGather. Loss recovery is costly: the NIC has to retransmit, and every GPU in the collective waits on the slowest transfer, so a single dropped packet can stall the whole operation. That makes a lossless fabric (PFC + ECN) a hard requirement in production. OcNOS-DC ships RoCEv2 buffer profiles and DCQCN defaults aligned to xCCL collective patterns.
How does OcNOS-DC guarantee zero packet loss — and what protects against PFC deadlock?
Three mechanisms: PFC pauses per-priority traffic before buffers overflow, ECN marks packets early to slow senders, and ETS keeps RDMA flows ahead of lower-priority traffic. On top, a per-port, per-priority deadlock watchdog detects paused-queue cycles and auto-drains the queue before jobs hang — the failure mode that used to force mid-job switch power-cycles. PFC over L3 is supported across routed boundaries.
What is DLB, and what changes with GLB in OcNOS 7.1?
Standard ECMP pins a flow to one uplink for its lifetime, causing elephant-flow collisions during AllReduce. DLB uses live ASIC queue-depth telemetry to rebind flowlets to less-loaded paths sub-ms, closing the gap at the local hop. GLB (OcNOS 7.1) extends this end-to-end — spines publish path-quality telemetry back to ingress leaves so routing uses the full multi-hop score, scaling cleanly to clusters up to 16k GPU.
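The difference between a local and a fabric-wide decision fits in two functions. This is a hedged sketch, not the GLB algorithm: scoring a path by its worst hop is an assumption made for illustration, and the spine names and queue values are invented.

```python
def pick_local(uplink_queue: dict) -> str:
    """DLB-style choice: look only at this leaf's own uplink queues."""
    return min(uplink_queue, key=uplink_queue.get)


def pick_global(uplink_queue: dict, downstream_queue: dict) -> str:
    """GLB-style choice: score each path by its most congested hop,
    using queue state reported back from the spine."""
    return min(uplink_queue,
               key=lambda spine: max(uplink_queue[spine], downstream_queue[spine]))


# Hypothetical queue depths: the uplink to spine1 looks idle from the leaf,
# but spine1's port toward the destination leaf is congested.
uplinks = {"spine1": 10, "spine2": 30}
downstream = {"spine1": 90, "spine2": 20}

local_choice = pick_local(uplinks)
global_choice = pick_global(uplinks, downstream)
```

The local view picks spine1 and walks into the downstream hot spot; the fabric-wide view picks spine2. That blind spot is the multi-hop gap an end-to-end score is meant to close.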
What scale does OcNOS AI Fabric support — and what are the validated reference designs?
OcNOS-DC supports 400G and 800G leaf-spine fabrics. Tomahawk 5 spines (Edgecore AIS800-64D, UfiSpace S9321-64E) deliver 51.2 Tbps / 64 × 800G; Tomahawk 4 leaves run 400G / 25.6 Tbps on a shared on-chip packet buffer; Trident 4 covers smaller 100G/400G fabrics. Reference designs cover rail-only, rail-optimized, and 3-stage Clos topologies up to 16k GPU; see the AI Fabric Topologies deep-dive.
Does OcNOS-DC support automation and telemetry for AI fabric operations?
Yes. DCBX automates server-to-switch RoCEv2 config, ZTP (IPv4/IPv6) handles zero-touch onboarding, and gNMI streams on-change telemetry over OpenConfig YANG. PFC pauses, ECN marking, DCQCN thresholds, and buffer depths are gNMI sensor paths consumable by Prometheus, InfluxDB, Telegraf, Grafana, or any OpenTelemetry pipeline. Ansible playbooks and a Terraform provider cover Day-0 through Day-2.
Resource

OcNOS 800G AI Fabric Solution Brief

Complete a short form and the PDF is delivered instantly by our resource centre.

Download PDF