BCM56996 · TSMC 7 nm · On-package HBM deep buffer

Broadcom Tomahawk 4 Switch · 25.6 Tbps · 64 × 400G · the deep-buffer 400G generation.

One open platform validated on OcNOS-DC: Edgecore AS9736-64D. The HBM deep-buffer variant of Tomahawk 4 — the silicon for 400G AI fabrics where buffer headroom matters more than 800G port count, and for DCI/aggregation roles where bursts run deep.

25.6 Tbps
Switch Capacity
64 × 400G
Native Port Radix
~70 GB
HBM Deep Buffer
7 nm
TSMC N7 Process
50G PAM4
SerDes Per Lane
01
The Switch
Open hardware running Tomahawk 4

One platform. One purpose: deep-buffer 400G.

Edgecore AS9736-64D — a 2RU 64×400G QSFP-DD switch on the BCM56996 deep-buffer Tomahawk 4. ONIE pre-loaded, runs the same OcNOS-DC image as the TH5 spines and TD4 leaves. One validated platform, one architectural niche the rest of the portfolio doesn't cover.

Edgecore · DCS520 platform family
Deep-buffer 400G AI fabric · DCI

AS9736-64D

Validated on OcNOS-DC · ONIE pre-loaded
Ports
64 × QSFP-DD (400G)
Breakout: 2×200 / 4×100 / 8×50 (up to 256 logical ports)
Form
2RU · 21.5 kg
Power
~2100 W typical · hot-swap redundant AC
~33 W per QSFP-DD cage
CPU
Intel Xeon D-class · 4 GB RAM
▌ Pick this when

400G AI fabric for 256–1k GPU clusters where deep buffer matters more than 800G ports — and for 400G aggregation / DCI roles where the HBM absorbs bursts that smaller-buffer switches drop.

You are here · 25.6 Tbps

Tomahawk 4 — 64 × 400G

Pick when 400G NICs anchor the cluster, deep-buffer headroom is on the requirements list, or the box must absorb DCI/aggregation bursts that a smaller-buffer chip would drop.

Step up · 51.2 Tbps

Tomahawk 5 — 64 × 800G

Pick when the cluster needs 800G ports natively, GPU count exceeds ~1k, or the radix that collapses a Clos tier is worth the per-port premium. Tomahawk 5 page →

Smaller box · 12.8 Tbps

Trident 4 — DC leaf

Pick when the role is DC leaf at 100G/400G with a smaller capacity envelope. Different chip family, same OcNOS-DC image, much cheaper per port. (Trident 4 page coming.)

02
Inside the Silicon
What HBM-backed deep buffer buys you

Tomahawk 4 — and the variant that put HBM on the package.

Standard Tomahawk 4 (BCM56990) is a 25.6 Tbps switch with on-die shared buffer in the few-hundred-megabyte range — the same class as TH3 and TH5. The HBM variant — BCM56996, the chip in the AS9736-64D — adds on-package High-Bandwidth Memory as a deep-buffer extension pool. Roughly 70 GB of buffer attached at HBM bandwidth, addressable by the same forwarding pipeline.

Why that matters: lossless RoCEv2 normally relies on PFC (priority flow control) propagating backpressure upstream when a queue fills. With HBM headroom, transient AllReduce micro-bursts and DCI long-flow congestion absorb into the deep pool instead of triggering pause storms. PFC still arms — but it triggers far less often, and when it does, deadlock cycles have time to resolve before the watchdog drains them.
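To put a rough number on "headroom", the sketch below estimates how long a buffer can absorb an incast before it fills. It is back-of-envelope arithmetic, not a model of how OcNOS-DC or the ASIC carves the pool: it assumes the whole ~70 GB is available to one congested queue, and the 8:1 incast ratio is a hypothetical chosen for illustration.

```python
import math

def absorb_seconds(buffer_bytes, ingress_gbps, egress_gbps):
    """Seconds until a buffer fills when arrivals exceed the drain rate."""
    excess_gbps = ingress_gbps - egress_gbps
    if excess_gbps <= 0:
        return math.inf  # the queue never grows
    return buffer_bytes * 8 / (excess_gbps * 1e9)

# ~70 GB from this page; decimal gigabytes are assumed.
HBM_POOL_BYTES = 70e9

# Hypothetical incast: eight 400G senders converge on one 400G egress port.
t = absorb_seconds(HBM_POOL_BYTES, ingress_gbps=8 * 400, egress_gbps=400)
print(f"{t * 1000:.0f} ms of sustained 8:1 incast before the pool is full")
```

In practice the pool is shared across ports and priorities, so the per-queue figure will be smaller; the point is the order of magnitude relative to a micro-burst.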

Specs cross-checked against Broadcom's BCM56990/56996 product page and the live OcNOS feature matrix.

Process
TSMC N7
Series
StrataXGS
Buffer
On-die + HBM
Routing
Cognitive · DLB
Shipping
Since 2020

· What 64 × 400G looks like

BCM56996 die · 25.6 Tbps
+ On-package HBM · ~70 GB deep buffer
512 lanes × 50G PAM4 = 25.6 Tbps. Eight lanes per cage → 400G. The buffer extension is the differentiator.
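The lane arithmetic above can be checked in a few lines. The figures are the ones quoted on this page; treating "up to 256 logical ports" as a hard cap on breakout is an assumption made for the sketch.

```python
# Figures quoted on this page. The 256-port cap is read from the breakout
# line of the spec card and treated here as a hard limit (an assumption).
LANES = 512
LANE_GBPS = 50
LANES_PER_CAGE = 8
MAX_LOGICAL_PORTS = 256

def capacity_tbps(lanes=LANES, lane_gbps=LANE_GBPS):
    """Aggregate switching capacity in Tb/s."""
    return lanes * lane_gbps / 1000

def cages(lanes=LANES, lanes_per_cage=LANES_PER_CAGE):
    """Number of QSFP-DD cages the lanes fill."""
    return lanes // lanes_per_cage

def cage_gbps(lanes_per_cage=LANES_PER_CAGE, lane_gbps=LANE_GBPS):
    """Native speed of one cage in Gb/s."""
    return lanes_per_cage * lane_gbps

def logical_ports(split, cap=MAX_LOGICAL_PORTS):
    """Logical ports when every cage is broken out `split` ways."""
    return min(cages() * split, cap)

if __name__ == "__main__":
    print(capacity_tbps(), "Tbps across", cages(), "cages at", cage_gbps(), "G")
    for split in (1, 2, 4, 8):
        print(f"{split}-way breakout: {logical_ports(split)} logical ports")
```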
Four design choices that matter

Why TH4 stays in the AI fabric conversation even after TH5 ships.

Three of these four choices are shared with TH3 and TH5. The HBM extension is the one that makes the BCM56996 variant unique.

PRINCIPLE 02

50G PAM4 SerDes — 512 lanes.

The same lane count as TH3 (25G NRZ) and TH5 (100G PAM4), with TH4 as the middle generation. Eight lanes per QSFP-DD cage give 400G native; breakout extends to 200G/100G/50G for mixed-speed deployments.

512 lanes · 50G PAM4
PRINCIPLE 03

Hardware adaptive routing.

Broadcom Cognitive Routing — flowlet-aware load-balancing in the ASIC, no controller round-trip. OcNOS-DC turns this on as DLB Reactive-Path Rebalance. With the HBM headroom, hash-collision rebinding plus burst absorption work together.

DLB · flowlet rebinding
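Flowlet rebinding is easier to picture in code. The sketch below is a deliberately simplified software model, not Broadcom's Cognitive Routing or the OcNOS-DC DLB implementation: the class name, the 50 µs idle gap, and the "least bytes sent" path metric are all hypothetical. The idea it shows is that a flow may move to a better path only after an idle gap long enough that packets cannot be reordered.

```python
class FlowletBalancer:
    """Toy flowlet load balancer: rebind a flow only after an idle gap."""

    def __init__(self, n_paths, gap_us):
        self.gap_us = gap_us            # idle time that starts a new flowlet
        self.load = [0] * n_paths       # bytes sent on each path so far
        self.state = {}                 # flow id -> (path, last_seen_us)

    def route(self, flow, now_us, size):
        entry = self.state.get(flow)
        if entry is None or now_us - entry[1] > self.gap_us:
            # New flow or new flowlet: pick the least-loaded path.
            path = min(range(len(self.load)), key=self.load.__getitem__)
        else:
            # Same flowlet: stay on the current path to preserve ordering.
            path = entry[0]
        self.state[flow] = (path, now_us)
        self.load[path] += size
        return path


lb = FlowletBalancer(n_paths=2, gap_us=50)
first = lb.route("flow-a", now_us=0, size=1000)     # new flow
second = lb.route("flow-a", now_us=10, size=1000)   # 10 us gap: same flowlet
third = lb.route("flow-a", now_us=100, size=1000)   # 90 us gap: may rebind
print(first, second, third)
```

A hardware implementation would use link-quality signals such as queue depth rather than a byte counter, but the gap test is the part that makes rebinding safe.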
PRINCIPLE 04

Mature 7 nm silicon.

Shipping in volume since 2020 — four-plus years of bug fixes, predictable behaviour, and a known thermal envelope. For brownfield refresh of a TH3 fabric, this is the boring-and-predictable choice.

TSMC N7 · 4+ years shipping
03
Generation Jump
Tomahawk 3 → Tomahawk 4

Capacity doubled. Process shrank. HBM appeared.

TH3 (12.8 Tbps · 32×400G · 16 nm · 25G NRZ) was the workhorse of the pre-AI-fabric era. TH4 doubled the spec sheet — and the BCM56996 variant added the architectural twist that's still its differentiator.

Switching capacity
12.8 Tbps → 25.6 Tbps

Doubled at the same rack footprint. 2RU stayed 2RU.

Native port radix
32 × 400G → 64 × 400G

Twice the ports at the same speed — fits Clos designs without an extra tier.

Process node
16 nm → 7 nm

Two-step shrink. Power-per-port headroom for 400G optics without active per-port cooling.

SerDes per lane
25G NRZ → 50G PAM4

Same 512 lanes, twice the per-lane speed. The doubling came from faster SerDes, not from adding lanes.
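The radix point can be made concrete with the textbook folded-Clos formulas. This is a sizing sketch under stated assumptions: non-blocking (1:1) oversubscription, every switch with the same radix, and one switch port per GPU NIC. Real designs with rail optimisation or oversubscription will land on different numbers.

```python
def two_tier_endpoints(radix):
    """Leaf/spine: each leaf splits its ports half down, half up."""
    return radix * radix // 2

def three_tier_endpoints(radix):
    """Classic k-ary fat tree: k^3 / 4 endpoints."""
    return radix ** 3 // 4

for name, radix in (("32 x 400G", 32), ("64 x 400G", 64)):
    print(
        f"{name}: {two_tier_endpoints(radix)} endpoints in two tiers, "
        f"{three_tier_endpoints(radix)} in three"
    )
```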

The next jump: TH5 doubles again to 51.2 Tbps and 64 × 800G with 100G PAM4 SerDes. TH5 went back to a standard shared buffer, though, which leaves the BCM56996's HBM deep buffer as a one-generation feature. Tomahawk 5 page →
04
What OcNOS-DC Ships
OcNOS-DC on this silicon

Same image as the TH5 spine. HBM-aware buffer profiles.

OcNOS-DC runs identically across TH3, TH4, and TH5 platforms. On TH4 it does one thing extra: maps NCCL-aligned DCQCN profiles onto the HBM extension pool so RoCEv2 lossless rides through bursts that a non-deep-buffer fabric would have to PFC-pause through.
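For readers new to DCQCN, the sender-side reaction can be sketched as follows. This follows the update rules in the published DCQCN description (rate cut on a congestion notification packet, exponential averaging of the congestion estimate); it is not the OcNOS-DC profile itself, it omits the rate-increase phases, and the class name and the 1/256 gain are illustrative defaults.

```python
class DcqcnSender:
    """Minimal DCQCN reaction point: rate decrease and alpha update only."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.rate = line_rate_gbps     # current sending rate
        self.target = line_rate_gbps   # rate to recover towards later
        self.alpha = 1.0               # running congestion estimate
        self.g = g                     # averaging gain

    def on_cnp(self):
        """A congestion notification arrived: remember the rate, then cut."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No notification in the last interval: decay the estimate."""
        self.alpha *= 1 - self.g


sender = DcqcnSender(line_rate_gbps=400)
sender.on_cnp()
print(sender.rate, sender.target, sender.alpha)
```

The fewer ECN marks the switch has to generate, the less often senders take this cut, which is where buffer headroom and congestion control interact.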

Lossless RoCEv2 · HBM-backed

PFC + ECN pre-tuned to NCCL — and the deep pool absorbs what's left.

Standard PFC + ETS + Dynamic ECN configuration plus HBM-aware buffer profiles. Most AllReduce micro-bursts never reach the PFC threshold because the HBM headroom takes them. Tail latency stays bounded under the synchronised many-to-one traffic that takes shallow-buffer fabrics down.
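The relationship between ECN marking and the PFC threshold can be sketched with a RED-style marking curve. All threshold values below are hypothetical and chosen only to show the shape: marking ramps up between a minimum and maximum queue depth, and PFC pause is the backstop that fires only if the queue keeps growing past a separate XOFF threshold.

```python
def ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
    """RED-style curve: 0 below kmin, linear up to pmax, 1.0 at or above kmax."""
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * ((queue_kb - kmin_kb) / (kmax_kb - kmin_kb))

def pfc_should_pause(queue_kb, xoff_kb):
    """PFC sends a pause frame once the queue crosses the XOFF threshold."""
    return queue_kb >= xoff_kb

# Hypothetical profile: ECN works between 100 KB and 400 KB, PFC at 2000 KB.
for depth in (50, 250, 500, 2500):
    p = ecn_mark_probability(depth, kmin_kb=100, kmax_kb=400, pmax=0.2)
    print(depth, "KB:", f"mark p={p:.2f}", "pause" if pfc_should_pause(depth, 2000) else "no pause")
```

A deeper pool lets the XOFF threshold sit further above the ECN range, so senders have more time to slow down before a pause frame is needed.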

Adaptive Routing

DLB rebinds flowlets in the ASIC.

Cognitive Routing on TH4 runs the same DLB Reactive-Path Rebalance OcNOS-DC ships on TH5. The combination — HBM headroom + flowlet rebinding — handles ECMP hash collision and burst absorption in the same forwarding pass.

PFC Deadlock Watchdog

Per-port, per-priority. Auto-drain.

Detects paused-queue cycles before they hang training jobs. With HBM headroom many would-be deadlocks never form — but the watchdog still arms.
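A per-port, per-priority watchdog can be modelled roughly like this. It is a minimal sketch of the general technique, not the OcNOS-DC implementation: the class name, the polling interface, the 200 ms detection time, and the action strings are all hypothetical.

```python
class PfcWatchdog:
    """Toy PFC watchdog keyed by (port, priority)."""

    def __init__(self, detect_ms):
        self.detect_ms = detect_ms
        self.stuck_since = {}  # (port, priority) -> time the pause began

    def poll(self, port, priority, paused, now_ms):
        key = (port, priority)
        if not paused:
            # Queue is flowing again: forget any earlier pause.
            self.stuck_since.pop(key, None)
            return "ok"
        start = self.stuck_since.setdefault(key, now_ms)
        if now_ms - start >= self.detect_ms:
            return "drain"  # paused too long: drop or drain to break the cycle
        return "watch"


wd = PfcWatchdog(detect_ms=200)
for t, paused in ((0, True), (150, True), (200, True), (250, False)):
    print(t, wd.poll("port-1", 3, paused, t))
```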

Streaming Telemetry

HBM occupancy on the wire.

gNMI on-change for buffer depth (on-die and HBM extension), ECN marks, PFC pause counts. Visibility into the deep pool — not a black box.

Real Network

BGP · OSPF · IS-IS · EVPN-VXLAN.

Full carrier-grade Layer 3 stack on the same silicon. The TH4 spine is also a real router — operate it like the rest of your network, not like a black box.

Validated feature surface

Same OcNOS-DC image as TH5 — every feature lights up where the silicon supports it.

Layer 3 routing · L1/L2 · AI/ML fabric primitives · Multicast · QoS · Security · Hardware · Management. Per-platform validation visible on the public matrix.

RoCEv2 / PFC · DCQCN · DLB · EVPN-VXLAN · BGP / OSPF / IS-IS · gNMI / NETCONF · ZTP · HBM telemetry
Day-0 to Day-2

ZTP. gNMI on-change. NETCONF + YANG. DCBX.

Bring up the AS9736-64D in the rack with zero-touch provisioning. Stream every counter — including HBM occupancy — to your observability stack. Tune every threshold via YANG-modelled config. No glue scripts.

ZTP IPv4/IPv6 · gNMI · NETCONF · OpenConfig YANG · DCBX · LLDP · Ansible · Terraform provider
Who builds this stack

Three operator profiles. One silicon for all three.

The 64×400G + HBM combination puts AS9736-64D in three different conversations — AI fabric, DCI, and brownfield refresh. Same switch, different framing of the same architectural question.

AI Cluster Operator · 256–1k GPU

400G NIC fabric without paying for 800G silicon.

"Our cluster is 400G NICs. We don't need 800G ports yet — but we do need the deep buffer. AllReduce on shallow-buffer fabrics keeps tripping PFC."

TH4 spines on AS9736-64D, RoCEv2 with NCCL-tuned DCQCN, HBM-aware buffer profiles, sub-millisecond DLB rebinding. Three-tier Clos at 1k GPU — same OcNOS-DC image as the TH5 deployment next door.

DC · Deep-Buffer Spine
DCI · Deep-Aggregation Architect

Long-flow congestion without losing packets.

"Our DCI box has to absorb bursts from cross-DC TCP flows that run for minutes. Standard switches drop. Chassis routers cost ten times what this should."

~70 GB HBM extension pool sized for long-flow burst absorption. EVPN-VXLAN inter-DC, full L3 stack, per-tenant gNMI telemetry. Open hardware at merchant-silicon economics.

DC · DCI · Aggregation
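For the DCI case, a useful yardstick is the bandwidth-delay product of one port: the amount of data in flight on a long link. The sketch below compares that with the ~70 GB pool quoted on this page. The 10 ms round-trip time is a hypothetical inter-DC figure, and the comparison assumes the whole pool is usable, which a real buffer-carving profile may not allow.

```python
def bdp_bytes(rate_gbps, rtt_ms):
    """Bandwidth-delay product in bytes (Gb/s x ms = 1e6 bits)."""
    return rate_gbps * rtt_ms * 1_000_000 // 8

HBM_POOL_BYTES = 70 * 10**9   # ~70 GB from this page, decimal gigabytes assumed
RTT_MS = 10                   # hypothetical inter-DC round-trip time

per_port = bdp_bytes(400, RTT_MS)
print(f"One 400G port at {RTT_MS} ms RTT: {per_port / 1e9:.1f} GB in flight")
print(f"Pool holds about {HBM_POOL_BYTES // per_port} such port-RTTs")
```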
Brownfield · TH3 Refresh

Doubled capacity, same operations model.

"We have a TH3 fabric in production. We need more capacity, but we don't want to redesign the NOS layer or retrain the network team."

Same OcNOS-DC image runs on TH3 and TH4. Brownfield refresh keeps configs, automation, and gNMI pipelines intact. The capacity doubles. The operational model stays.

DC · Refresh
Frequently Asked

The questions architects actually ask.

Which Tomahawk 4 platforms are validated on OcNOS-DC?
One platform: the Edgecore AS9736-64D, a 2RU 64×400G QSFP-DD switch built on the Broadcom BCM56996 (Tomahawk 4 with on-package HBM deep buffer). It ships ONIE pre-loaded and runs the same OcNOS-DC image as the TH5 spines and the TD4 leaves. The validated platform set is one switch, but it is the deep-buffer 400G switch in the OcNOS portfolio.

Why choose TH4 when TH5 is available?
Two reasons. First, BCM56996 has an on-package HBM deep buffer, whereas TH5 went back to a standard shared-buffer architecture. For 400G aggregation and DCI roles where flows queue deeply, TH4 absorbs bursts that a TH5 (or TH3) drops. Second, at the 256–1k GPU scale on 400G NICs, a TH4 fabric is cheaper per port than TH5 with no architectural compromise: three-tier Clos still fits, and the OcNOS-DC feature surface is identical.

What does the HBM deep buffer change?
On-package HBM extends the chip's effective packet buffer from a few hundred megabytes to roughly 70 GB. In an AI fabric, AllReduce micro-bursts can absorb into HBM rather than triggering tail-drop or PFC pause storms. In a DCI/aggregation role, long-lived TCP flows survive transient congestion without retransmits. It changes the lossless story from "PFC + ECN + careful tuning" to "PFC + ECN + headroom that hides most of the failure modes."

When should I pick TH5 instead of TH4?
Pick TH5 (AIS800-64D) when 800G ports are on the BoM, GPU count is above ~1k, or you want the radix that collapses one Clos tier. Pick TH4 (AS9736-64D) when 400G NICs are the cluster anchor, deep buffer is the architectural choice (DCI, deep-aggregation, mixed-flow fabrics), or the per-port budget rules out 800G silicon. Both run the same OcNOS-DC image, and mixing them in a multi-tier fabric is a supported deployment.

Does TH4 support adaptive routing for AI fabrics?
Yes. TH4 has the same Cognitive Routing primitives as TH5: flowlet-aware load-balancing in the ASIC, with no controller round-trip. OcNOS-DC turns this on as DLB Reactive-Path Rebalance. Combined with the HBM deep buffer, a TH4 fabric resolves elephant-flow hash collisions and rides through the resulting transient queue depth without dropping. PFC deadlock detection and recovery, DCQCN, and ETS are all available.

How does TH4 compare with TH3 and TH5?
Capacity doubled twice (12.8 → 25.6 → 51.2 Tbps). Process shrank twice (16 → 7 → 5 nm). Per-lane SerDes doubled twice (25G NRZ → 50G PAM4 → 100G PAM4). Lane count stayed at 512 across the family. Buffer architecture: TH3 standard shared, TH4 added HBM (BCM56996 variant only), TH5 returned to standard shared. OcNOS-DC supports all three with the same image, so a brownfield refresh keeps configs and gNMI pipelines intact.

When is TH4 the wrong choice?
The 64×400G radix is overkill for a sub-1 Tbps SP edge or cell-site gateway; pick Qumran (Q2C, Q2C+) or Qumran 2A/2U for those. For a pure DC leaf at 100G/25G it is also the wrong shape; pick Trident 4 (TD4) at 12.8 Tbps. And if the cluster genuinely needs 800G ports today, TH4 forces an extra Clos tier, so pick TH5. The TH4 sweet spot is "400G is enough, deep buffer is required."

Designing a deep-buffer 400G fabric? Let's size it together.

30-minute architecture session with an OcNOS network architect. Bring your GPU count, NIC speed, and burst-pattern expectations — leave with a sized BoM around AS9736-64D and a placement plan vs the TH5 / TD4 alternatives.