PFC Deadlock Detection & Recovery

PFC pause is what makes RoCEv2 lossless. Under rare topology and routing conditions, PFC can also create a circular dependency in which every switch in a cycle is paused waiting for the next one, and traffic stops indefinitely. OcNOS-DC ships a watchdog that detects the stalled queue within a configurable timeout (typically 100–400 ms) and drains it automatically, before training jobs hang.

A 3-Switch Pause Cycle

Three switches in a circular dependency. Each is paused on its lossless priority queue, waiting for the next switch to drain. Without intervention, the cycle is stable forever. The OcNOS watchdog fires after the configured timeout, drains queue-3 on switch-A, and the cycle collapses.

Figure: PFC deadlock cycle and watchdog recovery. Three switches are arranged in a triangle, each paused on Q3 (CoS 3) and waiting on the next: Switch-A on B, Switch-B on C, Switch-C on A. PFC pause arrows point clockwise from each switch to the next. A watchdog (WD) at Switch-A fires its deadlock timer and drains queue 3 to break the cycle.

How PFC creates a deadlock

PFC is a hop-by-hop pause: switch-A asserts pause to its upstream when its lossless ingress queue fills past the threshold, and the upstream stops sending. This works fine on a tree topology where there's a single direction of traffic flow. On a leaf-spine fabric with multiple paths, ECMP rerouting around a link failure can — under specific conditions — create a circular path where every switch is paused waiting for the next.

Once the cycle forms, it's stable: there's enough memory to hold the paused frames, the routing protocol thinks everything is fine, and PFC keeps reasserting on every switch. Without intervention, the affected lossless priority is hung indefinitely. RoCEv2 traffic stops, NCCL collectives time out, the training job stalls.
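The circular dependency can be pictured as a cycle in a "waits on" graph. The sketch below is only an illustration of that idea, assuming one lossless priority and one blocking neighbour per switch; the function name and switch labels are hypothetical, and no real switch has this global view, which is why the watchdog described next relies on a local timer instead.

```python
from typing import Dict, List, Optional


def find_pause_cycle(waits_on: Dict[str, str]) -> Optional[List[str]]:
    """Return the switches forming a pause cycle, or None if there is none.

    waits_on maps each paused switch to the single switch it is waiting on
    (an illustrative simplification, not how a switch sees the fabric).
    """
    for start in waits_on:
        path: List[str] = []
        seen = set()
        node = start
        # Follow the chain of "waiting on" links until it ends or repeats.
        while node in waits_on and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_on[node]
        if node in seen:
            # Walked back onto the path: everything from `node` on is the cycle.
            return path[path.index(node):]
    return None


# The three-switch example: A waits on B, B on C, C on A.
triangle = {"switch-A": "switch-B", "switch-B": "switch-C", "switch-C": "switch-A"}
# A chain with no loop: switch-C is not paused, so it can drain.
chain = {"switch-A": "switch-B", "switch-B": "switch-C"}
```

In the chain, switch-C is never blocked, so the pauses unwind on their own. In the triangle, nobody can drain first, which is the condition the watchdog has to break.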

The OcNOS-DC watchdog

Detection

Per-port, per-priority timer

A timer runs per ingress port and per lossless priority. If the priority is paused continuously for the configured interval (typically 100–400 ms), the watchdog fires.
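A minimal sketch of that per-port, per-priority timer follows, to make the "paused continuously" condition concrete. It is not OcNOS code: the class name, the sampling model, and the 200 ms value are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class PauseWatchdog:
    # Hypothetical value inside the 100-400 ms range mentioned above.
    detection_ms: int = 200
    # (port, priority) -> timestamp at which the current pause began.
    _paused_since: Dict[Tuple[str, int], int] = field(default_factory=dict)

    def observe(self, port: str, priority: int, paused: bool, now_ms: int) -> bool:
        """Record one pause-state sample; return True if the watchdog fires."""
        key = (port, priority)
        if not paused:
            # Any unpaused sample means the pause was not continuous: reset.
            self._paused_since.pop(key, None)
            return False
        start = self._paused_since.setdefault(key, now_ms)
        return now_ms - start >= self.detection_ms
```

Because the timer resets on any unpaused sample, ordinary congestion that pauses and releases a queue does not trip it; only a pause that holds for the whole interval does.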

Recovery

Automatic queue drain

When the watchdog fires, the affected ingress queue is drained: frames are dropped temporarily so the cycle collapses. The dropped frames have to be retransmitted by the RoCEv2 transport underneath NCCL, but the alternative is an indefinite hang.

Restoration

Auto-restore after recovery

After the configured restore interval, normal PFC operation resumes on the affected priority. No operator intervention required; the fabric is back to lossless within seconds.
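The fire, drain, and restore sequence can be read as a small state machine. The sketch below assumes the drain duration and the restore interval run back to back; that ordering, the state names, and the function signature are all assumptions for illustration, not documented OcNOS behaviour.

```python
from enum import Enum


class QueueState(Enum):
    LOSSLESS = "lossless"                  # normal PFC operation
    DRAINING = "draining"                  # watchdog fired, queued frames are dropped
    AWAITING_RESTORE = "awaiting_restore"  # drain done, PFC not yet resumed


def step(state: QueueState, watchdog_fired: bool, ms_in_state: int,
         drain_ms: int, restore_ms: int) -> QueueState:
    """Advance one (port, priority) queue by one evaluation tick."""
    if state is QueueState.LOSSLESS:
        return QueueState.DRAINING if watchdog_fired else state
    if state is QueueState.DRAINING:
        return QueueState.AWAITING_RESTORE if ms_in_state >= drain_ms else state
    # AWAITING_RESTORE: return to lossless once the restore interval has elapsed.
    return QueueState.LOSSLESS if ms_in_state >= restore_ms else state
```

The only way out of LOSSLESS is a watchdog fire, and every other state leads back to LOSSLESS on a timer, which is what makes the recovery hands-off.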

Telemetry

gNMI counters

Watchdog fire events, drain durations, and per-priority pause counters stream over gNMI for closed-loop fabric monitoring. SREs see deadlock events as alerts, not as silent training stalls.
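On the monitoring side, turning counters into alerts mostly means comparing successive snapshots. The sketch below assumes a collector has already decoded the gNMI stream into plain dictionaries keyed by (port, priority); the function name, the data shape, and the alert text are hypothetical, and no actual OcNOS gNMI paths are shown.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (port, priority)


def new_fire_alerts(previous: Dict[Key, int], current: Dict[Key, int]) -> List[str]:
    """Compare two snapshots of a watchdog-fire counter and report increases."""
    alerts: List[str] = []
    for key, count in sorted(current.items()):
        delta = count - previous.get(key, 0)
        if delta > 0:
            port, priority = key
            alerts.append(f"PFC watchdog fired {delta}x on {port} priority {priority}")
    return alerts
```

Alerting on the counter delta rather than its absolute value means a queue that deadlocked once last month stays quiet, while a fresh fire raises exactly one alert.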

Tunable

Operator-configurable timers

Detection timeout, drain duration, and restore interval are CLI-configurable per port and per priority. Defaults work for most fabrics; the operator can shorten timers on high-stakes clusters.
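To reason about what shortening the timers buys, it helps to add the three intervals up per port and priority. This is a planning sketch only: the values and names below are hypothetical, not OcNOS defaults or CLI syntax, and it assumes the three phases run one after another.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class WatchdogTimers:
    detection_ms: int
    drain_ms: int
    restore_ms: int

    def worst_case_ms(self) -> int:
        """Time from the start of the pause to the return of normal PFC."""
        return self.detection_ms + self.drain_ms + self.restore_ms


# Hypothetical fabric-wide values; real defaults are not stated here.
DEFAULTS = WatchdogTimers(detection_ms=200, drain_ms=100, restore_ms=400)

# Hypothetical per-(port, priority) override for a critical training cluster.
OVERRIDES: Dict[Tuple[str, int], WatchdogTimers] = {
    ("eth1", 3): WatchdogTimers(detection_ms=100, drain_ms=50, restore_ms=200),
}


def timers_for(port: str, priority: int) -> WatchdogTimers:
    """Per-port, per-priority override if present, else the defaults."""
    return OVERRIDES.get((port, priority), DEFAULTS)
```

With these illustrative numbers the default path totals 700 ms and the overridden port 350 ms. Shorter detection timers recover faster but leave less margin before ordinary heavy congestion is mistaken for a deadlock.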

Scope

DC-PLUS license tier

Part of the OcNOS-DC PLUS SKU alongside the rest of the lossless RoCEv2 stack. Confirmed on Broadcom Tomahawk 4 and Tomahawk 5 platforms.

When you'll see this fire

In a well-designed fabric with proper topology and routing, PFC deadlocks are rare; most operators never see one in years of running. The watchdog matters because "rare" doesn't mean "never": a link failure during a routing convergence window, a misconfiguration of PFC priorities on a single port, or a transient congestion event on an unusual traffic pattern can all create the conditions. Without the watchdog, when a deadlock does occur, the AI cluster stops and the operations team has hours of debugging ahead. With the watchdog, you get a brief retransmission burst and a logged event.

The bottom line

  • Lossless safety net. The watchdog is the difference between "PFC is theoretically risky on production AI fabrics" and "PFC is safe to deploy at scale."
  • Sub-second recovery. Detection + drain + restore typically completes inside one second. A small window of RDMA traffic is retransmitted; the job continues.
  • Standard configuration. The watchdog is on by default in OcNOS-DC's lossless template. You don't need to remember to turn it on.
  • Observable. Every fire is logged, counted, and streamed via gNMI. Closed-loop monitoring with your existing observability stack.
  • Tunable for stakes. Lower the timers on critical training clusters; defaults are fine for general DC fabric.

Validating lossless behaviour on a new fabric? Start with the watchdog.
