AI Fabric & Lossless RoCEv2
Your GPU cluster is only as fast as the network connecting it. OcNOS-DC delivers a production-grade 800G lossless RoCEv2 fabric on validated open hardware — with the carrier-grade SLA your AI investment demands.
One dropped packet stalls every GPU in the job.
RoCEv2 recovers from loss with go-back-N retransmission: a single dropped packet forces the sender to replay an entire window, stalling the AllReduce across every GPU in the job. Your network must be lossless — or your cluster is running slower than it should be.
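The cost compounds because every GPU in the job idles while the collective stalls. A minimal back-of-envelope sketch — all numbers here are illustrative assumptions, not measured OcNOS-DC figures:

```python
# Back-of-envelope: GPU-time wasted by packet drops during collectives.
# Cluster size, stall duration, and drop rate are illustrative assumptions.

def wasted_gpu_seconds(num_gpus: int, stall_s: float, drops_per_hour: float) -> float:
    """Every GPU in the job idles for the duration of each stall."""
    return num_gpus * stall_s * drops_per_hour

# 1024-GPU job, 50 ms stall per drop, 100 drops per hour
waste = wasted_gpu_seconds(1024, 0.050, 100)
```

At these assumed rates the fabric burns over 5,000 GPU-seconds every hour — which is why a lossless fabric pays for itself at cluster scale.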
OcNOS-DC ships pre-tuned for RoCEv2 on every supported Broadcom ASIC. PFC, ECN, ETS, DCBX, DLB — correct from Day 1, on open hardware.
800G Spine-Leaf AI Fabric — Lossless RoCEv2
A 3-stage Clos fabric with eBGP unnumbered underlay, ECMP at every tier, and PFC/ECN tuned per priority group. ZTP provisions each rack-level leaf switch automatically at boot.
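Sizing a non-blocking Clos of this shape follows directly from switch radix. A hedged sketch, assuming 64-port 800G switches (Tomahawk 5 class) and a 1:1 split of leaf ports between servers and spine uplinks:

```python
# Sketch: sizing a non-blocking 3-stage Clos from switch radix.
# Radix and the 1:1 downlink/uplink split are illustrative assumptions.

def clos_capacity(radix: int):
    down = radix // 2          # server-facing ports per leaf (1:1, non-blocking)
    up = radix - down          # uplinks per leaf — one link to each spine
    spines = up                # one spine per leaf uplink
    max_leaves = radix         # each spine has `radix` ports, one per leaf
    return down * max_leaves, spines, max_leaves

hosts, spines, leaves = clos_capacity(64)   # 2048 host ports, 32 spines, 64 leaves
```

At 64x800G, the fabric tops out at 2,048 non-blocking 800G host ports across 64 leaves and 32 spines; larger clusters move to a 5-stage design.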
Full HCL: 40+ validated platforms at ipinfusion.com/hcl
Four layers of losslessness — built into OcNOS-DC.
Most AI fabric failures come from a single misconfigured PFC priority group or an ECN threshold set for cloud workloads rather than RDMA. OcNOS-DC ships with buffer profiles validated for RoCEv2 on each supported Broadcom ASIC — so Day-1 configuration is correct, not trial-and-error.
PFC + ECN — Priority-group lossless control
PFC (Priority Flow Control) pauses per-priority traffic before buffer overflow. ECN marks packets early to trigger sender-side slowdown. Together they prevent drops without stalling the entire port. OcNOS-DC supports PFC over L3 for routed AI fabrics.
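The division of labor between the two mechanisms can be sketched as a toy model of one priority queue. The cell thresholds below are illustrative placeholders, not OcNOS-DC defaults — real values come from the per-ASIC buffer profiles:

```python
# Toy model of how ECN and PFC interact on one lossless priority queue.
# ECN marks early to slow senders; PFC pauses as the last line of defense.
# Both thresholds are illustrative assumptions, not real buffer-profile values.

ECN_MIN_CELLS = 1000   # start marking CE above this queue depth
XOFF_CELLS    = 4000   # send PFC pause above this depth (last resort)

def queue_action(depth_cells: int) -> str:
    if depth_cells > XOFF_CELLS:
        return "pfc-pause"     # stop the sender before the buffer overflows
    if depth_cells > ECN_MIN_CELLS:
        return "ecn-mark"      # mark packets so the sender's DCQCN backs off
    return "forward"
```

The key property: because `ECN_MIN_CELLS` sits well below `XOFF_CELLS`, senders slow down long before PFC ever has to pause the priority, so pauses stay rare and drops never happen.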
Dynamic Load Balancing (DLB) — flow-level ECMP
Standard consistent-hash ECMP creates hotspots when many GPU-to-GPU flows collide on the same spine link. DLB in OcNOS-DC monitors real-time queue depth and non-disruptively reassigns elephant flows to less-loaded paths — maximizing fabric utilization during AllReduce.
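The idea behind queue-depth-aware reassignment can be sketched in a few lines. This is a conceptual model only, not the Broadcom DLB implementation — the threshold and the flow-selection policy are assumptions:

```python
# Conceptual sketch of DLB-style path selection: static hashing pins a
# flow to one uplink; depth-aware logic moves it when that path congests.
# Not the Broadcom implementation — threshold and policy are assumptions.

def pick_uplink(queue_depth: dict[int, int], current: int, move_threshold: int) -> int:
    # Only move the flow if its current path is congested enough to justify
    # the reordering risk (real DLB reassigns at flowlet boundaries).
    if queue_depth[current] < move_threshold:
        return current
    return min(queue_depth, key=queue_depth.get)   # least-loaded uplink

depths = {0: 9000, 1: 200, 2: 450, 3: 300}         # per-uplink queue depth (cells)
new_path = pick_uplink(depths, current=0, move_threshold=4000)   # moves off uplink 0
```

Static hashing would leave the elephant flow on uplink 0 forever; the depth-aware version drains the hotspot while leaving uncongested flows untouched.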
DCBX — automated server-to-switch configuration
Data Center Bridging Exchange (DCBX) runs over LLDP and pushes the correct PFC and ETS configuration from OcNOS-DC leaf switches to attached GPU servers automatically — eliminating the manual misconfiguration that silently breaks losslessness.
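Concretely, DCBX advertises which priorities are lossless in an organizationally-specific LLDP TLV. A hedged sketch of the IEEE 802.1Qaz PFC Configuration TLV as we read the spec (OUI 00-80-C2, subtype 0x0B) — verify field layout against the standard before relying on it:

```python
import struct

# Hedged sketch of the IEEE 802.1Qaz PFC Configuration TLV that DCBX
# carries over LLDP. Field layout is our reading of the spec; verify
# against 802.1Qaz before depending on exact byte positions.

def pfc_config_tlv(willing: bool, pfc_cap: int, enabled_priorities: list[int]) -> bytes:
    oui = b"\x00\x80\xc2"                       # IEEE 802.1 OUI
    subtype = 0x0B                              # PFC Configuration
    flags = (0x80 if willing else 0) | (pfc_cap & 0x0F)
    enable_bitmap = 0
    for p in enabled_priorities:                # one bit per 802.1p priority
        enable_bitmap |= 1 << p
    value = oui + bytes([subtype, flags, enable_bitmap])
    header = struct.pack("!H", (127 << 9) | len(value))   # org-specific TLV type 127
    return header + value

# PFC enabled on priority 3 (the usual RoCEv2 class), 8 classes supported
tlv = pfc_config_tlv(willing=False, pfc_cap=8, enabled_priorities=[3])
```

A server NIC that receives this TLV (and is "willing") adopts the switch's PFC enable bitmap — which is exactly how OcNOS-DC keeps the GPU host and the leaf in agreement without anyone touching the NIC config.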
gNMI on-change telemetry — PFC counter visibility
PFC pause counters, ECN marking rates, and per-priority buffer depths are exposed as gNMI sensor paths with on-change subscriptions. Feed directly to Prometheus and Grafana to detect congestion events in milliseconds — before they cascade into training stalls.
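A consumer of those on-change updates can flag a pause storm from counter deltas instead of polling. A minimal sketch — the port name, payload shape, and storm threshold are all assumptions for illustration, not actual OcNOS-DC sensor paths:

```python
# Sketch of a consumer for on-change PFC counter updates. Payload shape
# and the storm threshold are illustrative assumptions; the point is
# alerting on deltas the moment an update arrives, not on a poll cycle.

PAUSE_STORM_DELTA = 1000   # pause frames between updates treated as a storm

class PfcWatcher:
    def __init__(self):
        self.last = {}         # (port, priority) -> last counter value

    def on_update(self, port: str, priority: int, rx_pause_frames: int) -> bool:
        """Return True if this update looks like a PFC pause storm."""
        key = (port, priority)
        delta = rx_pause_frames - self.last.get(key, rx_pause_frames)
        self.last[key] = rx_pause_frames
        return delta > PAUSE_STORM_DELTA

w = PfcWatcher()
baseline = w.on_update("eth1/1", 3, 10)      # first sample: baseline, no alarm
storm = w.on_update("eth1/1", 3, 5000)       # +4990 pauses since last update
```

In practice the same delta logic runs inside a Prometheus exporter or an OpenTelemetry pipeline stage, firing within one on-change interval rather than a 30-second scrape.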
Validated AI Fabric Platforms
40+ validated platforms — view full HCL →
Why operators are moving AI fabric to open hardware.
Proprietary AI switch vendors charge a premium for switching ASICs that are, in most cases, the same Broadcom merchant silicon available in open ODM hardware. OcNOS-DC gives you the same lossless RoCEv2 performance — without the lock-in.
❌ Proprietary Vendor
Hardware and software bundled — you pay the vendor margin on both, every refresh cycle.
PFC/ECN profiles are vendor-tuned and not exposed to operators — you trust defaults you cannot inspect.
Single-vendor ECMP implementation — no DLB, or DLB locked to a specific proprietary protocol.
Proprietary telemetry stack — data only flows into the vendor's own observability products.
Support requires separate hardware and software contracts from the same vendor.
✓ OcNOS-DC on Open Hardware
Hardware from Edgecore or UfiSpace. Software from IP Infusion. One combined SLA — two vendor relationships eliminated.
Broadcom Tomahawk buffer profiles fully configurable and documented. Validated PFC/ECN settings ship with OcNOS-DC for each platform.
DLB (Dynamic Load Balancing) standard in OcNOS-DC — monitors real-time queue depth, reassigns flows non-disruptively.
gNMI with on-change subscriptions — all PFC/ECN/buffer data feeds into standard Prometheus, InfluxDB, or OpenTelemetry pipelines.
Single IP Infusion support contract covers software, TAC, and hardware RMA coordination globally, 24/7.
Where OcNOS AI Fabric is deployed today.
GPU-Dense AI Training Clusters
Large-scale GPU training clusters running distributed jobs require a non-blocking lossless fabric with consistent latency across all GPU-to-GPU paths. OcNOS-DC delivers PFC/ECN and DLB on 800G spine-leaf topologies, ensuring AllReduce operations complete without collective restarts.
AI Inference at Scale
High-throughput inference clusters serving real-time API endpoints require predictable low-latency paths between GPU nodes. OcNOS-DC's ETS scheduling ensures inference traffic is never queued behind batch jobs, and streaming telemetry provides per-flow visibility to detect latency regression in production.
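The guarantee ETS provides is a minimum bandwidth share per traffic class when the link is contended. A sketch of that arithmetic — the class names and weights are illustrative assumptions, and the real scheduler runs in ASIC hardware per IEEE 802.1Qaz:

```python
# Sketch of ETS-style bandwidth sharing between traffic classes.
# Class names and weights are illustrative; the real 802.1Qaz scheduler
# enforces these shares in switch hardware, per egress port.

def share_bandwidth(weights: dict[str, int], link_gbps: float) -> dict[str, float]:
    """Minimum guaranteed bandwidth per class when all classes are backlogged."""
    total = sum(weights.values())
    return {tc: link_gbps * w / total for tc, w in weights.items()}

# Inference keeps a guaranteed share even when batch jobs are backlogged
guarantee = share_bandwidth({"inference": 60, "batch": 30, "mgmt": 10}, link_gbps=800)
```

With these example weights, inference traffic is never starved below 480 Gb/s on an 800G port no matter how much batch traffic queues behind it; idle shares are redistributed, so nothing is wasted when a class is quiet.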
GPU-as-a-Service / Cloud AI
Cloud providers offering GPU compute to tenants need multi-tenant fabric isolation alongside lossless RoCEv2. OcNOS-DC combines EVPN-VXLAN tenant isolation from its data center fabric feature set with the RoCEv2 lossless stack — both in a single NOS instance on the same hardware.
Resources for the AI fabric.
Architecture detail, SKUs, and validated platforms for the OcNOS lossless RoCEv2 fabric.
OcNOS 800G Ethernet-Based Lossless AI Fabric
Non-blocking RoCEv2 fabric on Tomahawk 4/5 spines — SKU tiers, validated platforms, and deployment architecture.
Download → Solution Brief · PDF
EVPN-VXLAN Data Center Fabric
Carrier-grade leaf-spine data center fabric: symmetric IRB, Type-2/Type-5 routes, distributed anycast gateway.
Download →
Customer Stories — Production AI & DC deployments
Real OcNOS data center and AI fabric deployments from operators running carrier-grade workloads in production.
Browse →
Bring your topology. We'll show you the path.
Every IPI demo is led by a network architect with production OcNOS deployments — no slides, no sales theatre. Just your specific AI fabric topology and real configuration walkthroughs.
Complete your DC strategy.
AI Fabric is the compute layer. DC Fabric and DCI extend your open networking strategy across the full data center and between sites.