PFC Deadlock Detection & Recovery
PFC pause is what makes RoCEv2 lossless. Under rare topology and routing conditions, PFC can also create a circular dependency in which every switch is paused waiting for the next switch in the cycle, and traffic stops indefinitely. OcNOS-DC ships a watchdog that detects the stuck priority within a configurable timeout (typically 100–400 ms) and drains the affected queue automatically, before training jobs hang.
A 3-Switch Pause Cycle
Three switches in a circular dependency. Each is paused on its lossless priority queue, waiting for the next switch to drain. Without intervention, the cycle is stable forever. The OcNOS watchdog fires after the configured timeout, drains queue-3 on switch-A, and the cycle collapses.
How PFC creates a deadlock
PFC is a hop-by-hop pause: switch-A asserts pause to its upstream when its lossless ingress queue fills past the threshold, and the upstream stops sending. This works fine on a tree topology where there's a single direction of traffic flow. On a leaf-spine fabric with multiple paths, ECMP rerouting around a link failure can — under specific conditions — create a circular path where every switch is paused waiting for the next.
Once the cycle forms, it's stable: there's enough memory to hold the paused frames, the routing protocol thinks everything is fine, and PFC keeps reasserting on every switch. Without intervention, the affected lossless priority is hung indefinitely. RoCEv2 traffic stops, NCCL collectives time out, the training job stalls.
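The circular dependency described above can be pictured as a cycle in a "waits on" graph, where an edge from one switch to another means the first cannot send until the second drains. The following is a minimal illustrative sketch, not how OcNOS-DC detects deadlocks (the watchdog uses per-port timers, not graph analysis). The switch names and the `waits_on` mapping are hypothetical.

```python
from typing import Dict, List, Optional

def find_pause_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in the wait-for graph (first node repeated at the end), or None."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    colour: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for nxt in waits_on.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:             # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in waits_on:
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# Tree-like flow: pause only propagates in one direction, so no cycle can form.
tree = {"leaf-1": ["spine-1"], "leaf-2": ["spine-1"], "spine-1": []}

# Hypothetical 3-switch ring after a reroute: each switch waits on the next.
ring = {"switch-A": ["switch-B"], "switch-B": ["switch-C"], "switch-C": ["switch-A"]}

print(find_pause_cycle(tree))   # None
print(find_pause_cycle(ring))   # ['switch-A', 'switch-B', 'switch-C', 'switch-A']
```

Dropping frames on any one edge of the ring removes that dependency, which is why draining a single queue is enough to collapse the whole cycle.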
The OcNOS-DC watchdog
Per-port, per-priority timer
A timer runs per ingress port and per lossless priority. If the priority is paused continuously for the configured interval (typically 100–400 ms), the watchdog fires.
Automatic queue drain
On fire, the affected ingress queue is drained: frames on that priority are dropped temporarily so the cycle collapses. The dropped frames trigger RoCEv2 retransmissions underneath NCCL, but the alternative is an indefinite hang.
Auto-restore after recovery
After the configured restore interval, normal PFC operation resumes on the affected priority. No operator intervention required; the fabric is back to lossless within seconds.
gNMI counters
Watchdog fire events, drain durations, and per-priority pause counters stream over gNMI for closed-loop fabric monitoring. SREs see deadlock events as alerts, not as silent training stalls.
Operator-configurable timers
Detection timeout, drain duration, and restore interval are CLI-configurable per port and per priority. Defaults work for most fabrics; the operator can shorten timers on high-stakes clusters.
DC-PLUS license tier
Part of the OcNOS-DC PLUS SKU alongside the rest of the lossless RoCEv2 stack. Confirmed on Broadcom Tomahawk 4 and Tomahawk 5 platforms.
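The detect, drain, and restore behaviour described in the cards above can be sketched as a small state machine for one (port, priority) pair. This is a simplified model under stated assumptions, not the OcNOS-DC implementation: the class and field names are hypothetical, the timer values are illustrative, and drain duration and restore interval are collapsed into a single `restore_ms` for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

class State(Enum):
    NORMAL = "normal"        # PFC honoured as usual
    DRAINING = "draining"    # frames on this priority are dropped

@dataclass
class PfcWatchdog:
    detect_ms: int = 200     # illustrative; the text cites 100-400 ms as typical
    restore_ms: int = 400    # illustrative restore interval
    state: State = State.NORMAL
    paused_since: Optional[int] = None
    drain_started: Optional[int] = None
    fires: int = 0

    def tick(self, now_ms: int, paused: bool) -> State:
        """Advance the timer for one (port, priority) pair."""
        if self.state is State.NORMAL:
            if not paused:
                self.paused_since = None          # any unpaused sample resets the timer
            elif self.paused_since is None:
                self.paused_since = now_ms        # start of a continuous pause
            elif now_ms - self.paused_since >= self.detect_ms:
                self.state = State.DRAINING       # watchdog fires
                self.drain_started = now_ms
                self.fires += 1
        elif now_ms - self.drain_started >= self.restore_ms:
            self.state = State.NORMAL             # auto-restore lossless behaviour
            self.paused_since = None
        return self.state

def simulate(wd: PfcWatchdog, step_ms: int = 50, end_ms: int = 1000) -> List[Tuple[int, str]]:
    """Feed a stuck pause into the watchdog and record state transitions."""
    events: List[Tuple[int, str]] = []
    deadlocked = True
    for now in range(0, end_ms + 1, step_ms):
        before = wd.state
        after = wd.tick(now, paused=deadlocked)
        if after is not before:
            events.append((now, after.value))
        if after is State.DRAINING:
            deadlocked = False                    # dropping frames breaks the cycle
    return events

print(simulate(PfcWatchdog()))   # [(200, 'draining'), (600, 'normal')]
```

With these illustrative values the priority is back to normal 600 ms after the pause began. Shortening `detect_ms` recovers faster, at the cost of a higher chance of firing on a long but legitimate pause.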
When you'll see this fire
In a well-designed fabric with proper topology and routing, PFC deadlocks are rare; most operators never see one in years of running. The watchdog matters because "rare" doesn't mean "never": a link failure during a routing convergence window, a misconfiguration of PFC priorities on a single port, or a transient congestion event on an unusual traffic pattern can all create the conditions. Without the watchdog, when a deadlock does occur, the AI cluster stops and the operations team has hours of debugging ahead. With the watchdog, you get a brief retransmission burst and a logged event.
The bottom line
- Lossless safety net. The watchdog is the difference between "PFC is theoretically risky on production AI fabrics" and "PFC is safe to deploy at scale."
- Sub-second recovery. Detection + drain + restore typically completes inside one second. The RDMA transport retransmits a small window of traffic underneath NCCL; the job continues.
- Standard configuration. The watchdog is on by default in OcNOS-DC's lossless template. You don't need to remember to turn it on.
- Observable. Every fire is logged, counted, and streamed via gNMI. Closed-loop monitoring with your existing observability stack.
- Tunable for stakes. Lower the timers on critical training clusters; defaults are fine for general DC fabric.
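As a rough sketch of the "Observable" point, a monitoring consumer could turn cumulative watchdog-fire counters into alerts by watching for increments. The sample format, key layout, and port names below are assumptions for illustration. They are not OcNOS-DC gNMI paths, and no gNMI client library is used here.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, int]   # hypothetical key: (switch, port, priority)

def new_fires(samples: Iterable[Tuple[Key, int]]) -> List[Tuple[Key, int]]:
    """Return (key, increment) for every sample where a cumulative counter went up."""
    last: Dict[Key, int] = {}
    alerts: List[Tuple[Key, int]] = []
    for key, count in samples:
        prev = last.get(key)
        # The first sample per key only sets a baseline; a lower value
        # (for example after a counter reset) is treated as a new baseline.
        if prev is not None and count > prev:
            alerts.append((key, count - prev))
        last[key] = count
    return alerts

a = ("switch-A", "eth1", 3)
b = ("switch-B", "eth2", 3)
stream = [(a, 0), (a, 0), (a, 1), (b, 5), (a, 1), (b, 7)]

for key, delta in new_fires(stream):
    print(f"watchdog fired {delta}x on {key}")
```

In practice the stream would come from a telemetry subscription, and the alerts would feed whatever paging or dashboard system is already in place.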