
GPU Networking Basics Part 3: Scale-Out

AI networking is heating up, and Chipstrat is here to make sense of it. If you’re new to the series, start with Part 1 and Part 2. In Part 3 we go deeper on scale-out networking for AI and why it matters for training at cluster scale.

We will explain why low latency and low jitter govern iteration time in distributed training. We will show where traditional Ethernet falls short and why InfiniBand became the default fabric for HPC-style, lockstep workloads.
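To build intuition for why jitter matters, here is a toy simulation (illustrative numbers only, not a model of any real fabric): in synchronous training, every worker must finish its step before the next one begins, so step time is the maximum across workers, not the average. At thousands of workers, even modest per-worker jitter drags the expected step time toward the worst case.

```python
import random

random.seed(0)

def step_time(num_workers, base_ms, jitter_ms):
    """One lockstep step: the slowest worker sets the pace."""
    return max(base_ms + random.uniform(0, jitter_ms) for _ in range(num_workers))

def mean_step_time(num_workers, base_ms, jitter_ms, steps=2_000):
    """Average step time over many simulated iterations."""
    return sum(step_time(num_workers, base_ms, jitter_ms) for _ in range(steps)) / steps

# Same 5 ms base compute, different per-worker network jitter.
low_jitter  = mean_step_time(num_workers=1024, base_ms=5.0, jitter_ms=0.5)
high_jitter = mean_step_time(num_workers=1024, base_ms=5.0, jitter_ms=5.0)

# With 1024 workers, the max of the jitter distribution is almost always
# realized somewhere, so throughput is set by the tail, not the mean.
print(f"low jitter : {low_jitter:.2f} ms/step")
print(f"high jitter: {high_jitter:.2f} ms/step")
```

The takeaway: cutting tail latency (jitter) buys you nearly as much as cutting base latency, which is why lossless, low-jitter fabrics matter so much for cluster-scale training.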

We will also clear up common misconceptions. (No worries, I had these coming into this article too!) For example, Mellanox did not invent InfiniBand. InfiniBand is not proprietary, but rather an open standard born from an industry consortium. Mellanox, and later Nvidia, have long supported Ethernet for scale-out via RoCE and its evolution. Along the way we will define what “open” actually means in networking. And more!

We will then contrast InfiniBand with modern Ethernet-for-AI stacks such as Nvidia Spectrum-X and the Ultra Ethernet Consortium’s 1.0 spec.

Then behind the paywall we’ll examine Nvidia’s monster networking business. We will compare Nvidia’s mix across InfiniBand, NVLink, and Spectrum-X with Broadcom and Arista to show why networking is an important piece of Nvidia’s expanding TAM.

But first, context. I’ll walk through a simple example to make it clear why AI networking is a different beast than networking of the past.

If you already know the basics, feel free to skip ahead! Many readers have said they value starting in the shallow end before diving deep, so we’ll ease in there first.

AI Networking is Different

So what are the networking needs of an AI workload?

BTW: when I say AI training here, I mean LLMs and transformer variants driving the Generative AI boom.

LLM Training 🤝 Networking

At its core, LLM training is a distributed computing workload, with thousands of machines working together on a single problem.
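A minimal sketch of that division of labor (illustrative only, not any real framework's API): in data-parallel training, each worker computes a gradient on its own shard of the data, then an "all-reduce" averages the gradients so every worker applies the identical update. That averaging step is exactly the traffic the network must carry every single iteration.

```python
# Data-parallel training sketch: fit y = w*x across 4 simulated "workers".

def local_gradient(w, shard):
    """Gradient of mean squared error for y = w*x on one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the collective operation that runs over the network fabric."""
    return sum(values) / len(values)

# Data for the true relationship y = 3x, split across 4 workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w, lr = 0.0, 0.01
for _ in range(200):
    grads = [local_gradient(w, s) for s in shards]  # compute, in parallel
    g = all_reduce_mean(grads)                      # communicate, in lockstep
    w -= lr * g                                     # identical update everywhere

print(round(w, 3))  # converges toward 3.0
```

Because every worker blocks on that all-reduce before the next step, compute and communication are serialized on the critical path: a slow network directly inflates every iteration.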

The idea of distributed computing isn’t new. Anyone remember Folding@Home, which harnessed volunteer PCs to run protein simulations?

Vijay Pande: So in 2000 we had the idea of how we could actually use lots and lots of computers to solve our problems—instead of waiting a million days on one computer, we could get the problem done in 10 days on 100,000 computers. But then you reach a sort of fork in the road where you decide whether you just want to

...