UEC 1.0 Aligned · DCQCN · DLB · GLB (OcNOS 7.1) · Up to 16k GPU

An open AI fabric — built for what your training job actually feels.

At thousands of accelerators you don't measure switches in Tbps — you measure job completion time, GPU utilization, and tail latency under microbursts. OcNOS-DC moves those numbers on open merchant silicon with a 24/7 carrier-grade SLA: the same technical floor as the closed AI stacks, none of the lock-in.

Book an Architecture Review · View 800G Hardware

Up to 16k GPUs · Reference design ceiling

Sub-ms DLB · Flowlet rebinding

UEC 1.0 · Fabric-profile aligned

24/7 SLA · Carrier-grade global

Protocols:

RoCEv2 · PFC / DCQCN · ECMP / UCMP · BGP · 400G / 800G · EVPN-VXLAN · Lossless Fabric · gNMI Streaming Telemetry · OpenConfig · NETCONF / YANG · Adaptive Routing (DLB) · FLFM · GLB · UEC 1.0 · Model-Based OS

16k GPU

Reference design ceiling

DCQCN

xCCL-tuned, every threshold YANG-modeled

DLB + GLB

Flowlet local + fabric-wide adaptive routing

UEC 1.0

Fabric-profile aligned · open answer to IB

Architecture Briefs

Take it offline. Read it on a plane.

Two short downloads that go deeper than this page: the lossless AI fabric architecture and the EVPN-VXLAN data center reference.

Solution Brief

OcNOS 800G Ethernet-Based Lossless AI Fabric

Non-blocking RoCEv2 fabric on Tomahawk 4/5 spines — SKU tiers, validated platforms, and deployment architecture.

Get the brief

Solution Brief

EVPN-VXLAN Data Center Fabric

Carrier-grade leaf-spine data center fabric: symmetric IRB, Type-2/Type-5 routes, distributed anycast gateway.

Get the brief

The builder's question

"Will my training job actually finish faster?"

At scale, traditional network metrics lose their meaning. What matters is Job Completion Time, GPU utilization, and tail latency under microbursts — because every minute a multi-billion-dollar cluster waits on a synchronization step is capital burned.

The lossless, low-latency performance AI needs no longer requires a closed, proprietary stack. On open merchant silicon with a carrier-grade SLA, OcNOS-DC matches the technical floor of closed architectures with no vendor lock-in — congestion management, sub-millisecond dynamic routing, and Ultra Ethernet alignment, tuned for the bursty patterns of collective traffic. GPUs spend their time processing data, not waiting on the network.

Every threshold is exposed, so your team can tune it against real xCCL (NCCL / RCCL / oneCCL) traffic. Below: each workload pattern, the mechanism that handles it, and what the operator gets back.

AllReduce / AllGather

Every GPU talks to every other GPU at once.

Static ECMP pins elephant flows to one spine link — hot spots, idle uplinks, slow sync.
→ DLB rebinds flowlets sub-ms on live queue depth.
→ GLB (OcNOS 7.1) scores leaf · spine · super-spine.

Result: no hash-collision hot spots; AllReduce holds near line rate.

Microburst / Incast

N senders converge on one queue in microseconds.

A drop restarts the collective; a pause storm blocks the line. Either way the run stalls.
→ DCQCN (xCCL-tuned ECN + CNP) caps rate before the drop.
→ PFC Watchdog auto-drains stuck queues per-port.

Result: jobs survive bursts; deadlocks self-recover — no 3 a.m. power-cycle.

Multi-rail / Scale-out

One flow needs every parallel path simultaneously.

Hash-pinned single-path ECMP leaves multi-rail bandwidth idle.
→ UEC 1.0: packet spray + multi-path RDMA + out-of-order delivery.
→ The switch you buy today stays when UEC NICs land.

Result: tail-latency outliers shrink as UEC NICs roll out — the open answer to InfiniBand.

~55% → 90%+

Reference benchmark. DLB lifts fabric utilization from ~55% on static ECMP to 90%+ on the same hardware, with no extra uplinks. The decision is local at each hop; the gain shows up system-wide across the AllReduce. (Industry-published Broadcom flowlet-rebalancing figure, replicable on TH4/TH5.)

DLB deep-dive →
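To make the hash-collision argument above concrete, here is a minimal Python sketch. It is a toy model, not Broadcom's DLB algorithm, and it does not attempt to reproduce the ~55% and 90%+ figures. The function names, the CRC32 hash, and the "pick the least-loaded uplink" rule are all assumptions made for illustration; real DLB rebinds flowlets at idle gaps using ASIC queue depth.

```python
import zlib


def static_ecmp_load(flow_ids, n_links, rate=1.0):
    """Static ECMP: each flow is pinned to one uplink by a hash of its id."""
    load = [0.0] * n_links
    for fid in flow_ids:
        # CRC32 stands in for the switch hash (assumption for illustration).
        link = zlib.crc32(fid.encode("utf-8")) % n_links
        load[link] += rate
    return load


def least_loaded_load(flow_ids, n_links, rate=1.0):
    """Adaptive placement: each flowlet goes to the currently least-loaded uplink."""
    load = [0.0] * n_links
    for _ in flow_ids:
        link = min(range(n_links), key=load.__getitem__)
        load[link] += rate
    return load


def balance(load):
    """Mean load divided by max load; 1.0 means no uplink is a hot spot."""
    return (sum(load) / len(load)) / max(load)


if __name__ == "__main__":
    # Eight equal-rate elephant flows over eight uplinks (hypothetical names).
    flows = [f"nic{i}->spine" for i in range(8)]
    print("static ECMP balance:", balance(static_ecmp_load(flows, 8)))
    print("adaptive balance:   ", balance(least_loaded_load(flows, 8)))
```

Because a collective proceeds at the pace of its slowest link, the balance ratio is a rough proxy for how much of the fabric the job can actually use.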

What it looks like in a rack row

800G spine-leaf, lossless from rack to rack.

A 3-stage Clos: eBGP unnumbered underlay, ECMP at every tier, PFC/ECN per priority group, isolated out-of-band bus for ZTP and telemetry.

OcNOS-DC leaf/spine

OcNOS-DC spine (DLB)

GPU servers / storage

Full HCL: 40+ validated platforms at ipinfusion.com/hcl

600+ · Production OcNOS networks

26 yr · ZebOS routing stack in service

24/7 · Carrier-grade global SLA

Inside the fabric

Four layers of losslessness — correct on Day 1.

Most AI fabric failures trace to one misconfigured PFC priority group or an ECN threshold tuned for cloud, not RDMA. OcNOS-DC ships RoCEv2 buffer profiles validated per Broadcom ASIC — so your first AllReduce runs lossless without a tuning sprint.

PFC + ECN — priority-group lossless control

PFC pauses per-priority traffic before buffers overflow; ECN marks packets early for sender-side slowdown. No drops, no port-wide stall. PFC over L3 is supported for routed multi-row fabrics.

DLB — flowlet-level adaptive routing

Static-hash ECMP collides when 8 NICs hash to the same spine. DLB watches live queue depth and rebinds flowlets to less-loaded paths sub-ms — the AllReduce stops dragging on the slowest link.

DCBX — server config auto-pushed over LLDP

The leaf pushes correct PFC and ETS config to the GPU server automatically — no silent loss of losslessness when a node gets re-imaged, the most common production failure mode.

gNMI on-change telemetry — sub-second visibility

PFC pauses, ECN marking, DCQCN thresholds, and buffer depths as gNMI on-change sensor paths — straight into Prometheus / Grafana / OpenTelemetry. Catch congestion before it stalls a job.
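The PFC + ECN layer above comes down to two threshold checks on queue depth, sketched here in Python. The ECN function follows the usual WRED-style ramp used with DCQCN (no marking below a minimum threshold, a linear ramp up to a maximum probability, always mark above the maximum threshold). The PFC function models XOFF/XON hysteresis. All names and numbers are hypothetical and are not OcNOS defaults.

```python
def ecn_mark_probability(queue_kb, k_min_kb, k_max_kb, p_max):
    """WRED-style ECN marking probability for a given queue depth."""
    if queue_kb <= k_min_kb:
        return 0.0                      # shallow queue: never mark
    if queue_kb > k_max_kb:
        return 1.0                      # deep queue: always mark
    # Linear ramp from 0 at k_min up to p_max at k_max.
    return p_max * (queue_kb - k_min_kb) / (k_max_kb - k_min_kb)


def pfc_paused(queue_kb, xoff_kb, xon_kb, paused):
    """Next pause state for one priority group, with XOFF/XON hysteresis."""
    if not paused and queue_kb >= xoff_kb:
        return True                     # crossed XOFF: ask the sender to pause
    if paused and queue_kb <= xon_kb:
        return False                    # drained below XON: resume
    return paused                       # between thresholds: keep current state


if __name__ == "__main__":
    # Hypothetical thresholds, chosen so ECN reacts before PFC does.
    for depth in (100, 500, 900):
        p = ecn_mark_probability(depth, k_min_kb=200, k_max_kb=800, p_max=0.2)
        print(depth, "KB -> mark probability", round(p, 3))
```

Keeping the ECN thresholds below the PFC XOFF threshold is what lets senders slow down before a pause frame is ever needed.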

ai-leaf01 — gNMI lossless fabric telemetry STREAMING

$ gnmic subscribe --path /qos/pfc/ \
    --mode stream --stream-mode on-change --encoding proto

RoCEv2 Priority Group 3 — real-time

et-0/0/1 PG3 PFC-Rx: 0 Tx: 0 Drop: 0

et-0/0/2 PG3 PFC-Rx: 0 Tx: 0 Drop: 0

et-0/0/3 PG3 PFC-Rx: 0 Tx: 0 Drop: 0

$ gnmic subscribe --path /interfaces/counters/

et-0/0/1 in: 780 Gbps out: 776 Gbps

et-0/0/2 in: 795 Gbps out: 791 Gbps

→ Telegraf → Prometheus → Grafana

✓ lossless — 0 drops — fabric healthy
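As a small illustration of consuming counters like the ones on the screen above, the following Python sketch parses the per-port summary lines and reports whether any drop counter is non-zero. The line format is taken from the illustrative screen only; a real gNMI collector would emit structured updates rather than this text, and the function names here are hypothetical.

```python
import re

# Matches lines such as: "et-0/0/1 PG3 PFC-Rx: 0 Tx: 0 Drop: 0"
PFC_LINE = re.compile(
    r"^(?P<port>\S+)\s+PG(?P<pg>\d+)\s+PFC-Rx:\s*(?P<rx>\d+)"
    r"\s+Tx:\s*(?P<tx>\d+)\s+Drop:\s*(?P<drop>\d+)$"
)


def parse_pfc_line(line):
    """Return a dict of counters for one summary line, or None if it does not match."""
    m = PFC_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "port": m["port"],
        "pg": int(m["pg"]),
        "rx": int(m["rx"]),
        "tx": int(m["tx"]),
        "drop": int(m["drop"]),
    }


def fabric_lossless(lines):
    """True when at least one line parsed and every parsed line shows zero drops."""
    samples = [s for s in map(parse_pfc_line, lines) if s is not None]
    return bool(samples) and all(s["drop"] == 0 for s in samples)


if __name__ == "__main__":
    screen = [
        "et-0/0/1 PG3 PFC-Rx: 0 Tx: 0 Drop: 0",
        "et-0/0/2 PG3 PFC-Rx: 0 Tx: 0 Drop: 0",
    ]
    print("lossless" if fabric_lossless(screen) else "drops detected")
```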

Validated AI Fabric Platforms

AIS800-64D

Edgecore — Spine

800G · TH5

S9321-64E

UfiSpace — Spine

800G · TH5

AS9736-64D

Edgecore — Leaf

400G / 25.6T

S9321-64EO

UfiSpace — Spine (OSFP)

800G · TH5

40+ validated platforms — view full HCL →

Ultra Ethernet · UEC 1.0 Aligned

The fabric profile is ready before the NICs are. That's the point.

RoCEv2 is the production transport in 2026; UEC is what comes next. The UEC 1.0 fabric profile adds packet spray, multi-path RDMA, and out-of-order-friendly forwarding — closing the single-hash limit that kept earlier RoCE a step behind InfiniBand on multi-rail collectives. OcNOS-DC tracks the UEC 1.0 fabric profile today, while UEC NICs roll out. The point isn't leading the standard — everyone is aligning to it. It's that the switch you buy this quarter won't need replacing when your UEC NIC arrives.

Packet spray

A single flow uses every parallel path simultaneously instead of being pinned to one ECMP hash. Multi-rail bandwidth is no longer left on the table.

Multi-path RDMA

Reorder buffers handle out-of-order delivery in hardware. Modern congestion control replaces NACK-based loss recovery, cutting tail latency.

Same hardware, forward path

The TH4 and TH5 platforms validated for OcNOS-DC today extend into UEC. No fork. No second SKU line. One fabric, two transport generations.

Read the Ultra Ethernet deep-dive →
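The interplay of packet spray and reorder buffers described above can be sketched in a few lines of Python. This is a toy model, not the UEC transport: round-robin spraying, the fixed per-path delays, and the class and function names are all assumptions for illustration.

```python
import heapq


def spray(n_packets, n_paths):
    """Round-robin spray: packet seq is sent on path seq % n_paths."""
    return [(seq, seq % n_paths) for seq in range(n_packets)]


def arrival_order(sprayed, path_delay):
    """Order packets by arrival time = send time (seq) + delay of the chosen path."""
    timed = sorted((seq + path_delay[path], seq) for seq, path in sprayed)
    return [seq for _, seq in timed]


class ReorderBuffer:
    """Holds early packets and releases them strictly in sequence order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = []  # min-heap of sequence numbers that arrived early

    def receive(self, seq):
        heapq.heappush(self.pending, seq)
        delivered = []
        while self.pending and self.pending[0] == self.next_seq:
            delivered.append(heapq.heappop(self.pending))
            self.next_seq += 1
        return delivered


if __name__ == "__main__":
    # Four parallel paths with unequal (hypothetical) delays.
    wire = arrival_order(spray(8, 4), path_delay=[0, 5, 1, 3])
    buf = ReorderBuffer()
    in_order = [seq for s in wire for seq in buf.receive(s)]
    print("on the wire:", wire)
    print("delivered:  ", in_order)
```

The flow uses all four paths at once, arrives out of order, and is still handed to the application in sequence, which is why the reorder stage belongs in the endpoint rather than the switch.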

If you're picking a fabric in 2026

Where OcNOS-DC sits — honestly, by name.

The race has converged on a shared floor: lossless RoCEv2, DCQCN, adaptive routing, UEC alignment. Everyone ships these. The real differentiator is solution shape — vertical lock-in vs. open NOS, locked vs. open hardware, closed-loop IB vs. standards Ethernet. Pick the trade-off you can live with for five years.

| Solution shape | Examples | Trade-off |
| --- | --- | --- |
| Closed vertical AI stack | NVIDIA Spectrum-X + Quantum + ConnectX | Excellent integrated performance. NIC, switch, and fabric software locked to one vendor, and to one GPU roadmap. |
| Locked merchant-silicon NOS | Arista EOS · Cisco NX-OS · Juniper Junos | Same Broadcom silicon underneath. Per-port licensing premium. Telemetry and tuning constrained to the vendor's own pipeline. |
| Cell-based proprietary chassis fabric | DriveNets Network Cloud | Different architecture: a scheduled cell fabric, not an Ethernet NOS. Strong at hyperscale; not portable to standard switches. |
| Closed-loop InfiniBand | NVIDIA Quantum InfiniBand | Best-in-class for tight collectives today. Separate cabling, separate operations, single-vendor ecosystem. UEC closes the gap on Ethernet. |
| Open NOS, no AI hardening | Community SONiC | Open hardware, free software, no SLA. xCCL-tuned defaults, deadlock watchdog, and tuning maturity are left entirely to the operator. |
| Open NOS, AI-hardened, UEC-aligned | OcNOS-DC on Edgecore / UfiSpace | Same Broadcom silicon. xCCL-tuned DCQCN out of the box, sub-ms DLB, GLB on the 7.1 roadmap, PFC deadlock watchdog. UEC 1.0 fabric profile. 24/7 carrier-grade SLA. No NIC, GPU, or hardware lock-in. |

Every row ships a real product — including OcNOS-DC. The question is rarely a missing feature; it's the trade-off you'll live with.

Wait — so what's an "AI fabric" exactly?

What it actually is — and where it stops.

An AI cluster is three layers. The fabric moves bytes between switches; the NIC terminates RDMA; the scheduler decides what runs where. "AI-aware fabric" usually means one vendor bundled all three under one SKU. OcNOS-DC owns the fabric, exposes every threshold, and stays out of the layers above. Here's the boundary, named.

Layer 1 · Fabric

What OcNOS-DC owns.

Lossless RoCEv2 transport — PFC + ECN + ETS + DCBX
DCQCN with xCCL-validated default thresholds, every knob YANG-modeled
DLB sub-ms flowlet rebinding on live ASIC queue depth
GLB fabric-wide path scoring (OcNOS 7.1)
PFC deadlock watchdog — per-port, per-priority
UEC 1.0 fabric-profile alignment — packet-spray-friendly forwarding
gNMI on-change telemetry, OpenConfig YANG, sub-second cadence

Shipping today on Edgecore / UfiSpace TH4 + TH5. GLB on the OcNOS 7.1 train.

Layer 2 · NIC + Transport

Your NIC vendor's job.

xCCL collective implementation and tuning
RDMA verbs, queue pairs, retransmit logic
UEC packet spray endpoint + reorder buffer (UEC NICs)
GPU-direct memory access, NVLink coordination
Per-flow rate limiting and end-host congestion response

NVIDIA ConnectX, BlueField, AMD Pensando, Intel Mt. Evans, Cornelis, future UEC silicon. OcNOS interoperates with all of them — and never replaces the choice.

Layer 3 · Cluster Scheduler

Your orchestration platform's job.

Training-job placement, gang scheduling, gradient-sync windows
Epoch / training-phase awareness
Tenant isolation, queue priority, resource quotas
xCCL ring topology assignment, rail-group affinity
Cross-job interference detection

Slurm, Kubernetes, Run:ai, NVIDIA Base Command, in-house schedulers. OcNOS-DC streams gNMI telemetry into them — it doesn't try to replace them.

Why the line is here: a fabric that owns layers 2 and 3 can never be swapped — NIC locked to switch, scheduler to NIC, GPU roadmap to vendor. InfiniBand owned all three for fifteen years and operators paid for it. OcNOS-DC ships every fabric mechanism a 2026 workload needs, validates it against xCCL traffic, and stops at the wire. That's why "AI-aware fabric" is the wrong question — the right one is whether the fabric does its job well enough that the NIC and scheduler don't have to fight it.

Going deeper

Every mechanism on this page has its own deep-dive.

The page above is for picking a fabric. These are for tuning one — packet captures, ASIC behavior, YANG paths, and where each feature ships in the release train.

AI Fabric · Lossless

RoCEv2 + PFC + ECN + DCQCN

The lossless RDMA transport layer for GPU collectives. Buffer profiles pre-tuned per Broadcom ASIC, xCCL-class DCQCN defaults, sub-µs jitter under load.

Read deep-dive →

AI Fabric · Local

Adaptive Dynamic Load Balancing (DLB)

Sub-millisecond flowlet rebinding using live ASIC queue-depth telemetry. Closes the ECMP hash-collision gap on AllReduce elephant flows.

Read deep-dive →

AI Fabric · Fabric-wide · OcNOS 7.1

Global Load Balancing (GLB)

End-to-end path scoring across leaf · spine · super-spine for clusters up to 16k GPU. The multi-hop adaptive layer DLB cannot see alone.

Read deep-dive →

AI Fabric · Frontier · UEC 1.0

Ultra Ethernet (UEC)

Packet spray, multi-path RDMA, out-of-order delivery, modern congestion control. The standards-based open answer to InfiniBand.

Read deep-dive →

AI Fabric · Reference Designs

Topologies — single-pod to 16k GPU

Rail-only and rail-optimized designs map the fabric shape directly onto the xCCL 8-rail multi-NIC pattern. 3-stage Clos for multi-pod scale-out to the 16k-GPU ceiling. Port counts on TH4 / TH5.

Read deep-dive →

AI Fabric · Congestion Control

DCQCN — RDMA Congestion Control

WRED ECN marking, CNP feedback, quantized rate control. xCCL-class defaults out of the box; every threshold YANG-modeled for tuning.

Read deep-dive →

AI Fabric · Survival

Watchdog — PFC Deadlock Detection

Per-port, per-priority watchdog detects paused-queue cycles and auto-drains the affected queue before training jobs hang.

Read deep-dive →

AI Fabric · Decision Guide

InfiniBand vs Ethernet for AI

Workload-specific decision guide. Where modern Ethernet (RoCEv2 + DLB + UEC) closes the gap, where IB still wins, and how to pick.

Read deep-dive →

Observability

gNMI Streaming Telemetry

gNMI Subscribe over gRPC, OpenConfig YANG, dial-out collectors. Integrations with Telegraf, Prometheus, and Grafana.

Read deep-dive →

What people are actually building

Three cluster shapes. Three fabric stories.

Framed by what the job feels, not by switch features. Pick the shape closest to yours; the deep-dives have the configs.

SHAPE 01 · LLM PRE-TRAINING

The multi-week LLM pre-training run.

AllReduce dominates the network. Every GPU must hold high in-collective utilization and survive microbursts without restarting a nine-day run.

Mechanisms: DCQCN + DLB + PFC Watchdog. Rail-optimized for single-pod; 3-stage Clos with GLB for multi-pod scale-out.
Outcome: AllReduce near line rate, zero collective restarts, JCT inside schedule.

SHAPE 02 · LIVE INFERENCE

The high-throughput inference fleet behind a public API.

Real-time inference where p99 tail latency drives the SLO. Inference must never queue behind batch retraining, and ops needs per-flow visibility the moment latency drifts.

Mechanisms: ETS strict-priority + gNMI on-change telemetry into Prometheus / OpenTelemetry.
Outcome: p99 held inside SLO; regressions caught within a second, not in the support queue.

SHAPE 03 · GPU-AS-A-SERVICE

The neocloud renting H100 / H200 / Blackwell to tenants.

A multi-tenant GPU cloud. Each tenant needs isolated lossless RoCEv2 paths — without a separate fabric segment per customer or a second NOS image.

Mechanisms: EVPN-VXLAN isolation + lossless RoCEv2 on one OcNOS-DC instance.
Outcome: per-tenant isolation, one ops model, one SLA, one image to upgrade.

Talk to a network architect

Bring your topology. We'll show you the path.

Every IPI architecture review is led by a network engineer running production OcNOS — no slides, no sales theatre. Bring your GPU count, NIC choice, and target JCT; we'll map it to topology, SKUs, and configs that ship today.

Book an Architecture Review · Download OcNOS VM Free

After you've picked the AI fabric

Connect it to everything else.

AI is one segment of the data center. DC Fabric and DCI extend the same OcNOS image into compute, storage, and remote sites — same NOS, same CLI, same SLA.

Questions an AI cluster architect actually asks

The honest FAQ.

Is OcNOS-DC actually "AI-native" — or just RoCEv2 with extras?

No merchant-silicon Ethernet NOS is literally AI-native — none reason about xCCL (NCCL / RCCL / oneCCL) collectives or schedule jobs at the switch; that lives in the NIC and scheduler. OcNOS-DC implements every fabric mechanism a 2026 AI workload needs — lossless RoCEv2, DCQCN with xCCL-validated defaults, sub-ms DLB, GLB (OcNOS 7.1), PFC deadlock watchdog, UEC 1.0 alignment — and stays out of the layers above. "AI-aware fabric" usually just means one vendor sells NIC + switch + scheduler as one locked SKU.

Where does OcNOS-DC stop, and where do the NIC and cluster scheduler take over?

OcNOS-DC owns layer 1 — lossless RDMA transport, congestion control, adaptive routing, deadlock recovery, telemetry. The NIC owns layer 2 (xCCL, RDMA verbs, packet spray, GPU-direct memory); the scheduler owns layer 3 (job placement, gradient-sync windows, tenant isolation). OcNOS-DC streams gNMI telemetry into layer 3 but never tries to be the scheduler — that separation keeps your NIC, GPU, and orchestration swappable.

How does OcNOS AI Fabric compare to NVIDIA Spectrum-X, SONiC, Arista, Cisco, or DriveNets?

Spectrum-X is a closed NVIDIA NIC + switch + software stack — excellent performance, single-vendor lock-in. Arista, Cisco, and Juniper run similar RoCEv2 features on locked hardware with proprietary licensing. Community SONiC is open but ships no AI-hardened defaults, watchdog, or SLA. DriveNets DDC is a proprietary cell fabric, not an Ethernet NOS. OcNOS-DC: open NOS on the same Broadcom silicon, UEC-aligned, xCCL-tuned DCQCN, 24/7 SLA — same technical floor, no lock-in.

What does Ultra Ethernet (UEC) 1.0 mean for OcNOS AI Fabric?

UEC 1.0 brings packet spray, multi-path RDMA, and out-of-order delivery to Ethernet — the open answer to InfiniBand. Production fabrics run RoCEv2 + DCQCN + DLB today, all fully supported; UEC parallelizes every flow across paths instead of pinning it to one ECMP hash. OcNOS-DC tracks the UEC 1.0 fabric profile so the switch you buy today moves to UEC NICs without a NOS or hardware swap. See the Ultra Ethernet deep-dive.

What is RoCEv2 and why does it require a lossless Ethernet fabric?

RoCEv2 enables direct GPU-to-GPU memory transfer with minimal CPU involvement for collectives like AllReduce and AllGather. RDMA loss recovery is expensive: retransmission in the NIC is coarse (commonly go-back-N), so a single dropped packet can stall the operation across every GPU. A lossless fabric (PFC + ECN) is therefore a hard requirement in production. OcNOS-DC ships RoCEv2 buffer profiles and DCQCN defaults aligned to xCCL collective patterns.

How does OcNOS-DC guarantee zero packet loss — and what protects against PFC deadlock?

Three mechanisms: PFC pauses per-priority traffic before buffers overflow, ECN marks packets early to slow senders, and ETS keeps RDMA flows ahead of lower-priority traffic. On top, a per-port, per-priority deadlock watchdog detects paused-queue cycles and auto-drains the queue before jobs hang — the failure mode that used to force mid-job switch power-cycles. PFC over L3 is supported across routed boundaries.
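The watchdog behaviour described in this answer can be pictured as a small per-queue timer, sketched below in Python. It is a simplified model under stated assumptions: one instance per (port, priority), a hypothetical `detect_ms` detection interval, and a "no forward progress while paused" test standing in for whatever the ASIC actually measures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PfcWatchdog:
    """Tracks one (port, priority) queue and flags it when it stays stuck."""

    detect_ms: int                          # hypothetical detection interval
    paused_since_ms: Optional[int] = None   # when the current stall began

    def poll(self, now_ms, paused, tx_progress):
        """Return 'ok', 'watching', or 'drain' for this polling interval."""
        if not paused or tx_progress:
            self.paused_since_ms = None     # queue is moving: reset the timer
            return "ok"
        if self.paused_since_ms is None:
            self.paused_since_ms = now_ms   # stall starts now
        if now_ms - self.paused_since_ms >= self.detect_ms:
            self.paused_since_ms = None     # act once, then re-arm
            return "drain"
        return "watching"


if __name__ == "__main__":
    wd = PfcWatchdog(detect_ms=200)
    for t, paused, progress in [(0, True, False), (100, True, False),
                                (200, True, False), (300, False, True)]:
        print(t, "ms ->", wd.poll(t, paused, progress))
```

Scoping the state to a single port and priority is what keeps the recovery action from touching healthy queues on the same switch.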

What is DLB, and what changes with GLB in OcNOS 7.1?

Standard ECMP pins a flow to one uplink for its lifetime, causing elephant-flow collisions during AllReduce. DLB uses live ASIC queue-depth telemetry to rebind flowlets to less-loaded paths sub-ms, closing the gap at the local hop. GLB (OcNOS 7.1) extends this end-to-end — spines publish path-quality telemetry back to ingress leaves so routing uses the full multi-hop score, scaling cleanly to clusters up to 16k GPU.
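The difference between a local and a multi-hop decision can be shown with a short Python sketch. The scoring rule used here (a path is as congested as its worst hop) is an assumption chosen for illustration; the text above only states that GLB uses a full multi-hop score, and the path names and congestion values are hypothetical.

```python
def path_score(hop_congestion):
    """Score a path by its most congested hop (assumed rule; lower is better)."""
    return max(hop_congestion)


def pick_local(paths):
    """Local view: choose by first-hop congestion only."""
    return min(paths, key=lambda name: paths[name][0])


def pick_global(paths):
    """Fabric-wide view: choose by the score of the whole path."""
    return min(paths, key=lambda name: path_score(paths[name]))


if __name__ == "__main__":
    # Congestion per hop: [leaf uplink, spine downlink], values in 0..1.
    paths = {
        "via-spine1": [0.2, 0.9],   # quiet uplink, congested downstream
        "via-spine2": [0.4, 0.3],   # busier uplink, clear downstream
    }
    print("local choice: ", pick_local(paths))
    print("global choice:", pick_global(paths))
```

The local view picks the path whose first hop looks idle and runs into downstream congestion; the fabric-wide view avoids it because the remote hop is part of the score.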

What scale does OcNOS AI Fabric support — and what are the validated reference designs?

OcNOS-DC supports 400G and 800G leaf-spine fabrics. Tomahawk 5 spines (Edgecore AIS800-64D, UfiSpace S9321-64E) deliver 51.2 Tbps / 64 × 800G; Tomahawk 4 leaves run 400G / 25.6 Tbps on a shared on-chip packet buffer; Trident 4 covers smaller 100G/400G fabrics. Reference designs cover rail-only, rail-optimized, and 3-stage Clos topologies up to 16k GPU; see the AI Fabric Topologies deep-dive.
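For a rough feel of how switch radix bounds fabric size, the Python sketch below uses the textbook non-blocking fat-tree formulas: a two-tier leaf-spine built from radix-k switches carries k²/2 endpoints, and a three-tier design carries k³/4. This is an idealized model with one fabric port per endpoint and no oversubscription; it does not describe the rail-only or rail-optimized reference designs, and the function names are hypothetical.

```python
def two_tier_endpoints(radix):
    """Leaf-spine: each leaf splits its ports half down, half up."""
    return radix * radix // 2


def three_tier_endpoints(radix):
    """Leaf, spine, super-spine fat-tree built from identical switches."""
    return radix ** 3 // 4


def min_tiers(endpoints, radix):
    """Smallest number of tiers that fits the endpoint count without blocking."""
    if endpoints <= radix:
        return 1
    if endpoints <= two_tier_endpoints(radix):
        return 2
    if endpoints <= three_tier_endpoints(radix):
        return 3
    raise ValueError("more than three tiers needed at this radix")


if __name__ == "__main__":
    # 64 matches a 64-port switch; 128 and 256 model port breakout (assumption).
    for radix in (64, 128, 256):
        print(radix, "ports:", two_tier_endpoints(radix), "endpoints in two tiers,",
              min_tiers(16384, radix), "tiers for 16,384 endpoints")
```

Breaking a port out into more, slower ports raises the effective radix, which is why the same silicon can reach very different endpoint counts depending on per-GPU link speed.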

Does OcNOS-DC support automation and telemetry for AI fabric operations?

Yes. DCBX automates server-to-switch RoCEv2 config, ZTP (IPv4/IPv6) handles zero-touch onboarding, and gNMI streams on-change telemetry over OpenConfig YANG. PFC pauses, ECN marking, DCQCN thresholds, and buffer depths are gNMI sensor paths consumable by Prometheus, InfluxDB, Telegraf, Grafana, or any OpenTelemetry pipeline. Ansible playbooks and a Terraform provider cover Day-0 through Day-2.