Wikipedia Deep Dive

InfiniBand

Based on Wikipedia: InfiniBand

The Hidden Plumbing of Artificial Intelligence

When we talk about the artificial intelligence revolution, we tend to focus on the flashy parts: the graphics processing units, the massive data centers, the neural networks with billions of parameters. But there's a quieter story happening inside those data centers, one about how all those powerful chips actually talk to each other. That story is InfiniBand.

Think of it this way: you could have the fastest computers ever built, but if they can't exchange information quickly enough, the whole system grinds into a traffic jam. InfiniBand is the express highway that prevents that jam.

What InfiniBand Actually Does

InfiniBand is a networking standard designed for one purpose: moving enormous amounts of data between computers with almost no delay. The technical term for that delay is latency, and reducing it matters enormously when thousands of processors need to coordinate on a single calculation.

Your home network uses Ethernet, the same basic technology that has connected computers since the 1970s. Ethernet is versatile and cheap, but it was designed for general-purpose networking—sending emails, loading web pages, streaming video. InfiniBand was designed specifically for the brutal demands of supercomputing, where processors need to share results millions of times per second.

The difference is like comparing a city bus system to a Formula One pit crew. Both move things around, but one is optimized for raw speed and precision.

A Brief History of Going Faster

InfiniBand emerged from corporate warfare. In the late 1990s, computer engineers realized they had a problem: processors were getting faster and faster, but the connections between them weren't keeping pace. The metaphor they used was "bottleneck"—like pouring water through a funnel, no matter how much water you have, it can only flow as fast as the narrowest point allows.

Two rival camps formed to solve this problem. One group, led by Intel with support from Sun Microsystems and Dell, developed something called Next Generation Input/Output. The other group, backed by Compaq, IBM, and Hewlett-Packard, created Future I/O. Both were trying to replace the aging Peripheral Component Interconnect bus—the internal highway that moved data between a computer's processor and its various components.

In 1999, these rivals did something unusual in the tech industry: they merged their efforts. The InfiniBand Trade Association was born, bringing together hardware giants and software companies like Microsoft. The first official specification arrived in 2000, promising to replace not just the internal connections in computers but also the Ethernet cables connecting machines and the Fibre Channel links connecting storage systems.

It was an ambitious vision. Perhaps too ambitious.

The Dot-Com Crash and the Survivor

The early 2000s were unkind to ambitious technology ventures. When the dot-com bubble burst, companies became cautious about investing in revolutionary new standards. By 2002, Intel announced it would focus on developing PCI Express instead—an incremental improvement to existing technology rather than a wholesale replacement. Microsoft abandoned InfiniBand development to extend Ethernet. The grand vision narrowed.

But InfiniBand didn't die. It found its niche in the one market willing to pay premium prices for premium performance: supercomputers.

In 2003, Virginia Tech built a supercomputer called System X that ranked among the three most powerful computers on Earth. It used InfiniBand to connect its processors. The academic and scientific computing community had discovered a tool that made their work dramatically faster, and they weren't going to give it up just because the mainstream market had lost interest.

The Rise of Mellanox

A company called Mellanox had been founded in 1999 to build the original Next Generation Input/Output technology. When the industry merged around InfiniBand, Mellanox pivoted. By 2001, they were shipping InfiniBand products at ten gigabits per second—remarkably fast for the era.

Over the next two decades, Mellanox would become synonymous with InfiniBand. Through a series of acquisitions and the exit of competitors, they achieved something rare in technology: near-monopoly status in a critical infrastructure market.

This consolidation wasn't always organic. Cisco, the networking giant, saw InfiniBand as a threat to its Ethernet business. According to industry observers, Cisco pursued a "buy to kill" strategy, acquiring InfiniBand switching companies like Topspin and then shutting them down. The goal was to eliminate competition to Ethernet rather than to develop InfiniBand products.

In 2010, Mellanox merged with Voltaire, another InfiniBand vendor. Intel acquired the InfiniBand division of QLogic in 2012, but eventually exited the market. This left Mellanox as essentially the only independent supplier.

Then, in 2019, Nvidia acquired Mellanox for seven billion dollars.

Why Nvidia Wanted the Plumbing

On the surface, Nvidia's acquisition seems puzzling. Nvidia makes graphics processing units—the chips that have become essential for training artificial intelligence models. Why would they want a networking company?

The answer lies in understanding how modern AI systems work. Training a large language model requires distributing calculations across thousands of GPUs simultaneously. Those GPUs need to constantly share intermediate results with each other. The faster they can share, the faster the training completes.

When you're spending millions of dollars on a training run that might take weeks, shaving even ten percent off that time represents enormous value. InfiniBand provides that edge.

By owning Mellanox, Nvidia controls both the processing power and the communication links in modern AI data centers. It's like owning both the engines and the transmission system in a fleet of racing cars.

How Fast Is Fast?

InfiniBand speeds have increased dramatically over the years, following a naming convention that started sensibly and became increasingly arbitrary.

The original speeds were called Single Data Rate, Double Data Rate, and Quad Data Rate—abbreviated SDR, DDR, and QDR. This made intuitive sense: each generation doubled or quadrupled the throughput. But as speeds continued climbing, the industry ran out of obvious names. They settled on three-letter codes whose expansions no longer match the numbers: FDR for "Fourteen Data Rate" (about fourteen gigabits per second per lane, or roughly fifty-six over a standard four-lane link), EDR for "Enhanced Data Rate," HDR for "High Data Rate," and most recently NDR for "Next Data Rate."

The current generation, NDR, delivers four hundred gigabits per second over a single connection. To put that in perspective: four hundred gigabits per second works out to roughly fifty gigabytes every second, enough to copy the entire contents of a typical laptop's drive in a matter of seconds. Modern data centers use thousands of these connections simultaneously.
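For a rough sense of scale, here is a small back-of-the-envelope calculation in C. The link rates are the commonly quoted nominal figures for a standard four-lane connection, and the 512-gigabyte dataset is just an illustrative stand-in for a laptop drive; real-world throughput is somewhat lower once protocol overhead is accounted for.

```c
/* Illustrative only: how long a fixed dataset takes to move at the
 * nominal four-lane rate of each InfiniBand generation. */
#include <stdio.h>

int main(void)
{
    struct { const char *name; double gbit_per_s; } gen[] = {
        { "SDR", 10 }, { "DDR", 20 }, { "QDR", 40 },
        { "FDR", 56 }, { "EDR", 100 }, { "HDR", 200 }, { "NDR", 400 },
    };
    const double dataset_gb = 512.0;   /* e.g. a 512 GB laptop drive */
    int n = sizeof gen / sizeof gen[0];

    for (int i = 0; i < n; i++) {
        double gbyte_per_s = gen[i].gbit_per_s / 8.0;   /* bits -> bytes */
        printf("%s: %6.1f Gbit/s -> %5.1f GB/s -> %6.1f s for %.0f GB\n",
               gen[i].name, gen[i].gbit_per_s, gbyte_per_s,
               dataset_gb / gbyte_per_s, dataset_gb);
    }
    return 0;
}
```

Even at NDR's fifty gigabytes per second, that laptop-sized dataset still takes around ten seconds to move over one link, which is why large clusters bundle thousands of links in parallel.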

The Secret Sauce: Remote Direct Memory Access

Raw speed isn't InfiniBand's only advantage. The technology also provides something called Remote Direct Memory Access, or RDMA.

Normally, when one computer wants to send data to another, the process involves multiple layers of software. The application asks the operating system to send data. The operating system copies the data to a network buffer. The network hardware sends the data. On the receiving end, the network hardware receives the data, copies it to a buffer, notifies the operating system, and the operating system eventually makes it available to the application.

All that copying and notification takes time and consumes processor cycles. RDMA eliminates most of it. With RDMA, one computer can read from or write to another computer's memory directly, without involving either computer's processor in the transfer. The hardware handles everything.

This matters enormously for AI workloads. When GPUs are working on a training run, you want them spending their time on mathematics, not waiting for data or managing network traffic. RDMA lets the networking happen almost invisibly in the background.
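To make the idea concrete, here is a minimal sketch in C of how a one-sided RDMA write is described to the adapter through the widely used libibverbs interface (more on that software layer below). It assumes a queue pair has already been created and connected, the local buffer has already been registered, and the remote address and access key were exchanged out of band beforehand; all of that setup is omitted, and the helper function name and its parameters are just for illustration.

```c
/* Sketch of a one-sided RDMA write using libibverbs. Setup (device open,
 * queue pair creation, connection, out-of-band exchange of remote_addr
 * and remote_rkey) is assumed to have happened already. */
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t length,
                    uint64_t remote_addr, uint32_t remote_rkey)
{
    /* Describe the local data: where it lives and which registration
     * (lkey) covers it. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = length,
        .lkey   = mr->lkey,
    };

    /* Describe the operation: write these bytes straight into the remote
     * node's memory at remote_addr. The remote CPU never gets involved;
     * its adapter places the data directly. */
    struct ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    /* Hand the work request to the adapter ("post send" in verbs terms). */
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

Notice what is absent: no system call on the data path, no copy into a kernel buffer, and nothing for the receiving side's processor to do. The work request simply names the remote memory, and the hardware carries it out.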

The Physical Reality

InfiniBand connections use a specific type of cable and connector. The most common connector is called QSFP, which stands for Quad Small Form-factor Pluggable. The "quad" refers to the fact that these connectors bundle four individual lanes together, multiplying the bandwidth.

For short distances—up to about ten meters—InfiniBand can run over copper cables. For longer distances, it uses fiber optic cables, which can extend up to ten kilometers. Modern data centers typically use a mix: copper for connections within a single rack of equipment, and fiber for connections between racks or between buildings.

The newest generation, NDR, has introduced a larger connector called OSFP (Octal Small Form-factor Pluggable). Here the name does correspond to eight electrical lanes, but in practice a single OSFP cage is often split into two four-lane ports, so the once-tidy naming pattern has become looser than it looks.

The Competition

InfiniBand isn't the only option for connecting computers in data centers. Its main competitors are Ethernet, Fibre Channel, and Intel's Omni-Path.

Ethernet is everywhere. It's the technology that connects most computers to most networks. Over the years, Ethernet has gotten much faster—modern standards reach one hundred gigabits per second and beyond. Ethernet's advantage is ubiquity: the equipment is cheaper, more engineers understand it, and it integrates easily with the broader internet.

Fibre Channel is specialized for storage networks, connecting computers to disk arrays and storage systems. It offers good performance but lacks InfiniBand's extremely low latency.

Intel's Omni-Path was a direct competitor to InfiniBand, designed specifically for supercomputing. Intel launched it in 2015 but essentially abandoned development in 2019—the same year Nvidia acquired Mellanox. Some industry observers suspect Intel saw the writing on the wall and decided it couldn't compete with an integrated Nvidia-Mellanox offering.

The Software Layer

Hardware is only part of the story. To use InfiniBand, computers need software that knows how to talk to the hardware.

Unusually for a networking standard, InfiniBand doesn't define a specific application programming interface. Instead, it defines a set of "verbs"—abstract operations like "open device" or "post send." Each vendor implements these verbs in their own way, which theoretically could create compatibility problems.

In practice, the industry coalesced around a single implementation: the OpenFabrics Enterprise Distribution (OFED), developed by the OpenFabrics Alliance. This software stack runs on Linux, FreeBSD, Windows, and several other operating systems. Because virtually everyone uses the same software, the theoretical flexibility in the standard hasn't caused significant fragmentation.
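As a small illustration of what the verbs look like in practice, here is a sketch in C using libibverbs, the library at the heart of that shared software stack. It simply lists the RDMA-capable devices on a machine, opens each one (the "open device" verb mentioned above), and prints the state of its first port. It assumes the libibverbs development package is installed and is compiled with -libverbs; on a machine with no InfiniBand hardware it just reports that nothing was found.

```c
/* Enumerate RDMA devices and query their first port via the verbs API. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;

    /* "Get device list" and "open device" are among the abstract verbs the
     * InfiniBand specification defines; libibverbs implements them. */
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices || num_devices == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devices[i]);
        if (!ctx)
            continue;

        /* Query port 1 (InfiniBand ports are numbered from 1). */
        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: port state %s, LID 0x%x\n",
                   ibv_get_device_name(devices[i]),
                   ibv_port_state_str(port.state), port.lid);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(devices);
    return 0;
}
```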

Linux kernel support for InfiniBand was merged in 2005, with kernel version 2.6.11. This early adoption helped establish InfiniBand as the default choice for Linux-based supercomputers.

InfiniBand Today

Between 2014 and 2016, InfiniBand was the most common interconnect technology in the TOP500 list—the ranking of the world's most powerful supercomputers. Since then, high-speed Ethernet has regained some ground, particularly in systems where cost matters more than absolute peak performance.

But in the specific domain of AI training, InfiniBand remains dominant. Many of the largest clusters built by companies like Meta, Microsoft, and OpenAI rely on InfiniBand to connect thousands of GPUs. When you hear about systems with tens of thousands of GPUs working in concert, InfiniBand is very often the invisible infrastructure making that coordination possible.

The irony is striking. InfiniBand was originally envisioned as a replacement for everything—internal computer connections, Ethernet, storage networks. That vision failed. But in failing to become everything, InfiniBand became essential for something: the most demanding computational workloads humanity has ever attempted.

The Monopoly Question

Nvidia's control of both GPUs and InfiniBand networking creates an interesting market dynamic. If you want to build a large-scale AI training system, you probably need Nvidia GPUs. And if you're using Nvidia GPUs at scale, you probably need InfiniBand networking. And if you need InfiniBand networking, you're buying from Nvidia.

This vertical integration has advantages—Nvidia can optimize their GPUs and networking hardware to work together seamlessly. But it also raises concerns about competition. Some in the industry worry that Nvidia's dominance makes it difficult for alternatives to emerge.

Oracle, the database giant, reportedly considered engineering its own InfiniBand hardware in 2016. Nothing came of it. The technical barriers to entry are formidable, and the market may not be large enough to justify the investment for a company that isn't already in the semiconductor business.

Looking Forward

The future of InfiniBand is tied to the future of large-scale computing. As AI models continue growing, the demands on interconnect technology will only increase. The next generation of InfiniBand, whatever it's called, will need to deliver even higher bandwidth and lower latency.

At the same time, alternative technologies continue developing. Ultra Ethernet, a new standard specifically designed for AI workloads, aims to bring Ethernet's cost advantages to the high-performance computing market. Whether it can match InfiniBand's performance remains to be seen.

For now, InfiniBand occupies a peculiar position in the technology landscape: invisible to most people, essential to the systems that generate the AI capabilities everyone talks about. The next time you interact with a large language model, remember that somewhere, data is flowing through InfiniBand connections at speeds measured in hundreds of billions of bits per second, enabling the mathematics that makes the magic work.

The plumbing matters.

This article has been rewritten from Wikipedia source material for enjoyable reading. Content may have been condensed, restructured, or simplified.