Wikipedia Deep Dive

Hopper (microarchitecture)

Based on Wikipedia: Hopper (microarchitecture)

The Chip That Made Billionaires Beg

Picture this: Larry Ellison, one of the richest people on Earth, sitting at a sushi dinner with Elon Musk and Nvidia's CEO Jensen Huang. What are they doing? Begging. Actually begging for computer chips.

"I guess begging is the best way to describe it," Ellison later admitted. "An hour of sushi and begging."

The object of their desperation? The Nvidia H100, a processor so valuable that by early 2024, Nvidia was reportedly shipping them to data centers in armored cars. Individual chips were selling for over forty thousand dollars on eBay. This wasn't gold or diamonds—it was a piece of silicon about the size of your palm.

So what makes this chip so special that it turned titans of industry into supplicants?

What Exactly Is Hopper?

Hopper is what engineers call a "microarchitecture"—essentially the blueprint for how a computer chip organizes and processes information. Think of it like the floor plan of a factory. Two factories might make the same product, but one might be laid out far more efficiently, with shorter paths between workstations and smarter automation. That's what a microarchitecture does for chips.

Nvidia named this architecture after Grace Hopper, a remarkable figure in computing history. Hopper was a United States Navy rear admiral who became one of the first programmers of the Harvard Mark I, an early electromechanical computer that weighed five tons. She's credited with popularizing the term "debugging" after a moth was found trapped in a relay of the Mark II computer. The name choice was fitting—Grace Hopper pioneered how humans communicate with machines, and this chip architecture was designed to revolutionize how machines learn to think.

The Hopper name first surfaced in a Twitter leak in November 2019, and the architecture was officially unveiled in March 2022. It built upon Nvidia's previous designs—Turing and Ampere—but represented a significant leap forward in several key areas.

The Numbers Behind the Magic

The flagship chip using the Hopper architecture is the H100. Let's put its scale in perspective.

It contains eighty billion transistors. A transistor is the fundamental building block of all digital electronics—a tiny switch that can be either on or off, representing the ones and zeros of computing. Eighty billion of these switches, each smaller than a virus, all working in concert on a piece of silicon.

To manufacture something this intricate, Nvidia uses a process called TSMC N4, made by Taiwan Semiconductor Manufacturing Company. The "N4" refers to four nanometers—roughly the width of twenty atoms lined up in a row. At this scale, the laws of physics start behaving strangely. Electrons can "tunnel" through barriers they shouldn't be able to cross. Heat becomes incredibly difficult to manage. Yet somehow, humans have figured out how to mass-produce these devices reliably.

The chip can connect to up to eighty gigabytes of a special type of memory called HBM3, which stands for High Bandwidth Memory, third generation. This memory can transfer data at roughly three terabytes per second. To put that in human terms: you could stream the text of the entire Library of Congress—around seventeen million books—in a matter of seconds.

Why Graphics Chips Became Intelligence Chips

Here's something that might seem strange: why is a graphics processor—originally designed to render video games—now the most sought-after tool for artificial intelligence?

The answer lies in how these chips think differently from regular processors.

Your laptop's main processor, called a Central Processing Unit or CPU, is like a brilliant mathematician who works alone. It can solve incredibly complex problems, but it handles them one at a time, sequentially. First this calculation, then the next, then the next.

A Graphics Processing Unit, or GPU, is more like a massive auditorium filled with thousands of average mathematicians who can all work simultaneously. None of them individually is as clever as the lone genius, but when you need to perform the same calculation on millions of different numbers at once—which is exactly what rendering graphics requires—the auditorium wins by a landslide.
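
For readers who like to see the idea in code, here is a minimal sketch in CUDA (the language used to program Nvidia GPUs). It doubles a list of numbers two ways: a CPU-style loop that visits each element in turn, and a GPU kernel in which thousands of threads each handle one element at once. The function names and launch configuration are purely illustrative.

```cuda
// A minimal sketch of the "lone genius vs. auditorium" contrast.
// CPU style: one worker walks through all elements, one after another.
// GPU style: one thread per element, all running in parallel.
#include <cuda_runtime.h>

void double_on_cpu(const float *in, float *out, int n) {
    for (int i = 0; i < n; ++i)      // one element at a time
        out[i] = in[i] * 2.0f;
}

__global__ void double_on_gpu(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks its own element
    if (i < n)
        out[i] = in[i] * 2.0f;       // millions of these run at (nearly) the same time
}

// Example launch: double_on_gpu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```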

It turns out that training artificial intelligence works the same way. Neural networks, the mathematical structures that underpin modern AI, require performing vast numbers of similar calculations across enormous datasets. The same architecture that made video games beautiful makes AI possible.

The Transformer Engine: Hopper's Secret Weapon

The Hopper architecture introduced something called the transformer engine, and understanding this requires a brief detour into how AI actually works.

Modern AI systems like ChatGPT are built on a design called a "transformer," introduced in a famous 2017 research paper with the modest title "Attention Is All You Need." Transformers revolutionized AI by allowing systems to process entire sequences of data simultaneously rather than word by word, and to focus "attention" on the most relevant parts of their input.

But transformers are computationally ravenous. Training a large language model might require performing quintillions of mathematical operations—that's a one followed by eighteen zeros.

The transformer engine in Hopper addresses this hunger through a clever trick involving numerical precision.

When computers store numbers, they use a format called floating-point, which is essentially scientific notation. The number 3,140,000 might be stored as 3.14 × 10⁶. The precision of this representation—how many decimal places you keep—determines both accuracy and computational cost. More precision means more accurate calculations but slower performance.

Hopper's transformer engine can dynamically adjust this precision on the fly. When it detects that a calculation can tolerate less precision without significantly affecting the final result, it automatically downshifts from higher-precision formats such as FP16 (sixteen bits per number) to FP8 (eight bits per number). It can even redistribute those eight bits between the significant digits and the exponent to maximize accuracy for each specific situation.

This is like a car with an automatic transmission that's been perfected to an almost supernatural degree, always finding the exact right gear for maximum efficiency.
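
To make the trade-off concrete, here is a small, self-contained sketch in plain C++ (compilable with CUDA's nvcc) that decodes the same eight bits under the two published FP8 layouts: E4M3, which spends more bits on the significant digits, and E5M2, which spends more on the exponent. The decoding functions are an illustration of the formats, not Nvidia's implementation, and special values such as NaN are omitted for brevity.

```cuda
// Decode one byte under the two FP8 layouts the transformer engine can choose between.
// E4M3: 1 sign, 4 exponent, 3 mantissa bits (bias 7)  -> finer precision, smaller range.
// E5M2: 1 sign, 5 exponent, 2 mantissa bits (bias 15) -> coarser precision, wider range.
#include <cmath>
#include <cstdint>
#include <cstdio>

float decode_e4m3(uint8_t bits) {
    int sign = (bits >> 7) & 1;
    int exp  = (bits >> 3) & 0xF;   // 4 exponent bits
    int man  = bits & 0x7;          // 3 mantissa bits
    float value = (exp == 0)
        ? std::ldexp(man / 8.0f, -6)              // subnormal: no implicit leading 1
        : std::ldexp(1.0f + man / 8.0f, exp - 7); // normal: implicit leading 1
    return sign ? -value : value;
}

float decode_e5m2(uint8_t bits) {
    int sign = (bits >> 7) & 1;
    int exp  = (bits >> 2) & 0x1F;  // 5 exponent bits
    int man  = bits & 0x3;          // 2 mantissa bits
    float value = (exp == 0)
        ? std::ldexp(man / 4.0f, -14)
        : std::ldexp(1.0f + man / 4.0f, exp - 15);
    return sign ? -value : value;
}

int main() {
    uint8_t sample = 0x4A;  // the same byte, read two different ways
    std::printf("0x%02X as E4M3: %g, as E5M2: %g\n",
                sample, decode_e4m3(sample), decode_e5m2(sample));
    return 0;  // prints 5 and 12: same bits, different precision/range trade-off
}
```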

Clusters, Warps, and the Strange Vocabulary of Parallel Computing

To truly understand what makes Hopper special, we need to venture into the peculiar terminology of GPU computing.

Nvidia GPUs organize their work through structures called streaming multiprocessors, abbreviated as SM. Think of each SM as a small factory within the larger chip. The H100 contains up to 144 of these factories, each capable of running independent operations.

Within each streaming multiprocessor, work is organized into "warps"—groups of thirty-two threads that execute the same instruction simultaneously. The word comes from weaving, where a warp is the set of threads held in tension on a loom. In computing, these threads move in lockstep through their calculations.
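
A small CUDA sketch shows what lockstep execution buys you: the thirty-two threads of a warp can pass values directly between their registers, without touching memory, precisely because they execute each instruction together. The kernel name is illustrative, and it assumes blocks are launched with a multiple of thirty-two threads.

```cuda
// A minimal warp-level reduction: the 32 threads of a warp add up their values
// by shuffling registers between lanes, exploiting their lockstep execution.
__global__ void warp_sum(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);  // grab a value from a lane 'offset' away
    if (threadIdx.x % 32 == 0)                         // lane 0 now holds the warp's total
        out[i / 32] = v;
}
```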

Hopper introduced something new: thread block clusters. This allows multiple streaming multiprocessors to work together more intimately, sharing data through what Nvidia calls "distributed shared memory." Imagine our factories not just operating independently but being able to pass materials directly to each other through pneumatic tubes rather than routing everything through a central warehouse.

The practical effect is faster communication and better coordination for complex tasks.
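
For the programming-inclined, here is a minimal sketch of the idea using the thread block cluster API that CUDA 12 exposes for Hopper-class (compute capability 9.0) GPUs. Each block in a two-block cluster publishes a value in its own shared memory, and its neighbor reads that value directly through distributed shared memory. The kernel name and the particular exchange are illustrative.

```cuda
// Two blocks in a cluster exchange values through each other's shared memory.
// Requires compiling for sm_90 (Hopper).
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) neighbor_exchange(int *out) {
    cg::cluster_group cluster = cg::this_cluster();
    __shared__ int my_value;

    if (threadIdx.x == 0)
        my_value = (int)cluster.block_rank();   // each block publishes its rank
    cluster.sync();                              // make every block's shared memory ready

    // Map the same shared-memory slot, but in the *other* block of the cluster.
    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *peer_value = cluster.map_shared_rank(&my_value, peer);

    if (threadIdx.x == 0)
        out[blockIdx.x] = *peer_value;           // read the neighbor's value directly
    cluster.sync();                              // don't exit while a peer may still be reading us
}
```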

The Tensor Memory Accelerator

One of Hopper's more esoteric innovations is the Tensor Memory Accelerator, or TMA.

A tensor, in this context, is a multi-dimensional array of numbers. A one-dimensional tensor is just a list. A two-dimensional tensor is a table or matrix. Three dimensions gives you a cube of numbers. AI systems routinely work with tensors of four, five, or even more dimensions.

Moving these massive data structures between different types of memory within the chip was traditionally slow and cumbersome. The TMA allows tensors of up to five dimensions to be transferred asynchronously—meaning the main computation can continue while data moves in the background. It's like having a robotic arm that can rearrange your workshop while you keep working, never breaking your concentration.
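
Programming the TMA directly involves building a "tensor map" through lower-level CUDA interfaces, which is too much machinery for a short example. But the underlying idea—start a copy, keep computing, and wait only when the data is actually needed—can be sketched with the asynchronous copy API CUDA has offered since the previous (Ampere) generation. Kernel and variable names are illustrative, and the kernel assumes one thread per tile element.

```cuda
// A minimal sketch of asynchronous data movement, the idea the TMA accelerates
// in hardware: overlap a bulk global-to-shared copy with unrelated work.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

#define TILE 1024  // launch with TILE threads per block

__global__ void overlap_copy_and_compute(const float *src, float *dst, float *scratch) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    // Kick off the copy; threads do not stall here.
    cg::memcpy_async(block, tile, src + blockIdx.x * TILE, sizeof(float) * TILE);

    // Useful work that does not depend on the tile proceeds while data moves.
    scratch[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x * 0.5f;

    // Only now wait for the copy, right before the tile is needed.
    cg::wait(block);
    dst[blockIdx.x * TILE + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```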

The Bioinformatics Connection

Here's a fascinating detail that reveals the breadth of Hopper's intended applications: the chip includes specialized hardware for running the Smith-Waterman algorithm.

Smith-Waterman is a technique from computational biology, used to find similarities between DNA or protein sequences. If you're trying to determine whether two genes are related, or searching for a known pattern within a genome, Smith-Waterman helps find the best match even when there are gaps, insertions, or deletions in the sequences.

The algorithm is computationally intensive—comparing two sequences of length n might require n² calculations. With modern genomic datasets containing billions of base pairs, this adds up quickly.
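
Here is a minimal CPU-side sketch of the Smith-Waterman recurrence (plain C++) showing where that n² figure comes from: every cell of a score matrix is filled in from its three neighbors, and the best local alignment is simply the largest value anywhere in the matrix. The scoring values are illustrative.

```cuda
// Fill an (n+1) x (m+1) score matrix; each cell depends on three neighbors.
#include <algorithm>
#include <string>
#include <vector>

int smith_waterman_score(const std::string &a, const std::string &b) {
    const int match = 3, mismatch = -3, gap = -2;   // illustrative scoring scheme
    int n = (int)a.size(), m = (int)b.size(), best = 0;
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1, 0));

    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {              // n * m cells in total
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i][j] = std::max({0,                       // local alignment may restart
                                H[i - 1][j - 1] + s,     // match / mismatch
                                H[i - 1][j] + gap,       // gap in b
                                H[i][j - 1] + gap});     // gap in a
            best = std::max(best, H[i][j]);
        }
    }
    return best;  // score of the best local alignment
}
```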

By building Smith-Waterman support directly into the hardware, Nvidia made Hopper chips dramatically faster for genomics research. The chip also accelerates the related Needleman-Wunsch algorithm, another sequence alignment technique. It's a reminder that AI chips aren't just about chatbots and image generators—they're transforming fields as diverse as drug discovery and evolutionary biology.

The Power Problem

All this computational power comes at a cost, measured in watts.

The H100 in its SXM5 configuration—a specialized socket designed for data centers—has a thermal design power of 700 watts. That's roughly equivalent to a small space heater running continuously. A data center with thousands of these chips faces enormous cooling challenges.

Nvidia also produces the GH200, which combines an H100 GPU with a 72-core Grace CPU—that's Nvidia's own processor, also named after Grace Hopper—on a single module. This combined unit can draw up to 1,000 watts. For reference, a typical desktop computer draws about 200-400 watts under load.

The good news is that Hopper's architecture is designed for efficiency at the application level. Its asynchronous features mean less time waiting for data, which means less energy wasted on idle transistors. The chip can achieve high utilization rates, squeezing more work out of each watt consumed.

The China Complication

The H100's dominance has made it a flashpoint in the technological rivalry between the United States and China.

In late 2022, the U.S. government imposed regulations limiting the export of advanced AI chips to China, citing national security concerns about the potential military applications of artificial intelligence. Nvidia responded by creating a modified version called the H800, with reduced interconnect bandwidth to comply with the restrictions while still serving the Chinese market.

But in late 2023, the U.S. government tightened the restrictions further, targeting the A800 and H800 among others. Nvidia adapted again with the H20, an even more limited variant of the Hopper design. Despite its constraints, the H20 had become the most prominent AI chip in the Chinese market by 2025.

This cat-and-mouse dynamic—regulations followed by compliance-engineered variants followed by tighter regulations—illustrates how a piece of semiconductor technology has become entangled in geopolitics.

The Mechanics of Memory

Let's dive deeper into how Hopper manages data, because this is where much of its performance advantage lives.

Chips have multiple levels of memory, each with different speeds and sizes. The fastest memory is closest to the processing cores but holds the least data. The slowest memory can store gigabytes but takes many more clock cycles to access. Managing this hierarchy efficiently is crucial for performance.

Hopper increases the capacity of its L1 cache—the fastest tier—to 256 kilobytes per streaming multiprocessor. This cache is shared between regular data caching and something called shared memory, which programmers can manage directly. A configuration option allows developers to decide how to divide this space based on their application's needs.
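
As a concrete illustration, here is a small sketch of how a developer can nudge that split toward shared memory using standard CUDA runtime attributes. The kernel is hypothetical, and the exact shared-memory capacity available per block depends on the specific chip, so treat the numbers as placeholders.

```cuda
// Steering the L1 / shared-memory split for a kernel that wants a big shared tile.
#include <cuda_runtime.h>

// Hypothetical kernel using dynamically sized shared memory.
__global__ void big_tile_kernel(float *data) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

void configure_shared_memory() {
    // Ask the carveout to favor shared memory over L1 data cache.
    cudaFuncSetAttribute(big_tile_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    // Opt in to more dynamic shared memory per block than the default cap.
    // The 200 KB figure is a placeholder; query the device for the real limit.
    cudaFuncSetAttribute(big_tile_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         200 * 1024);
}
```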

The chip also features automatic inline compression. When data has patterns or redundancy, the chip can compress it on the fly, effectively increasing the bandwidth between memory and processors. Crucially, this happens transparently—programmers don't need to manage it explicitly. The system automatically selects appropriate compression algorithms based on the data characteristics.

NVLink and the Challenge of Connecting Giants

Modern AI systems rarely use just one chip. Large language models are trained across clusters of thousands of GPUs, all of which need to communicate with each other constantly.

NVLink is Nvidia's proprietary interconnect technology, essentially a private superhighway between chips that's much faster than standard connections like PCIe (Peripheral Component Interconnect Express, the typical way expansion cards connect to computers).

Hopper introduced a new generation of NVLink with increased bandwidth. But speed is only part of the challenge. The chip also includes sophisticated mechanisms for managing memory consistency—ensuring that when one chip updates a value, other chips see the update correctly.

One subtle improvement involves "fence" operations. When a chip needs to ensure all its writes are visible to other chips before proceeding, it issues a fence. Previously, this could cause the chip to wait on memory operations that didn't actually matter for the communication at hand. Hopper can intelligently narrow the scope of these fences, waiting only for the writes that truly need to complete.
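
To see why fences matter at all, here is a generic CUDA sketch of the classic publish pattern: write the data, fence, then raise a flag, so that a reader elsewhere in the system never sees the flag without the data. This shows ordinary CUDA fences, not the narrowed-scope fences Hopper adds; kernel names are illustrative.

```cuda
// Producer/consumer handshake across the system: the fence orders the payload
// write before the "ready" flag so no observer sees the flag without the data.
#include <cuda_runtime.h>

__global__ void producer(volatile int *data, volatile int *ready) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *data = 42;                // write the payload
        __threadfence_system();    // make the payload visible before the flag, system-wide
        *ready = 1;                // publish
    }
}

__global__ void consumer(volatile int *data, volatile int *ready, int *out) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        while (*ready == 0) { }    // spin until the producer's flag arrives
        __threadfence_system();    // order the flag read before the data read
        *out = *data;              // now guaranteed to observe the payload
    }
}
```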

The SXM5 Advantage

The H100 comes in different physical configurations, and not all H100s are created equal.

The SXM5 version uses a specialized socket designed for Nvidia's data center systems. This socket provides higher memory bandwidth than a standard PCIe slot—think of it as the difference between a garden hose and a fire hose. Applications that need to move massive amounts of data between the chip and its memory see substantially better performance in the SXM5 configuration.

This creates an interesting market dynamic. Organizations can buy H100s in the more common PCIe form factor and install them in standard servers, but they'll get inferior performance compared to buying Nvidia's complete systems with SXM5. It's a gentle lock-in that encourages customers to buy the full package.

The Scarcity Economy

Why were these chips so hard to get in 2023 and 2024?

The answer involves the entire supply chain, from physics to geopolitics.

Manufacturing chips at the four-nanometer scale requires equipment that only a handful of companies can produce. ASML, a Dutch company, is the sole supplier of the extreme ultraviolet lithography machines needed for this process. Each machine costs over $150 million and is itself a marvel of engineering, using plasma-generated light to etch patterns smaller than the wavelength of visible light.

TSMC, which actually manufactures the chips, has limited capacity for its most advanced processes. Nvidia competes for this capacity with Apple, AMD, Qualcomm, and others.

Then came the AI boom of 2023. Suddenly, every major technology company wanted thousands of H100s immediately. Cloud providers needed them to offer AI services. Startups needed them to train models. Research labs needed them to advance the science. Demand exploded far beyond any reasonable supply projection.

Nvidia's response was to ramp up production as quickly as physics and factory capacity allowed. But the gap between supply and demand remained vast, creating the conditions for Larry Ellison's sushi-table begging session.

Looking at the Numbers

A look across Nvidia's recent accelerator generations shows the pace of change.

Each generation has roughly doubled or tripled performance for AI workloads, with power consumption growing far more slowly than performance. The H100 draws considerably more power than its predecessor, the A100 (700 watts versus 400 in their data-center SXM configurations), but delivered performance improvements of two to three times for many AI tasks, and up to six times for transformer-specific workloads.

These numbers explain the frantic demand. If you're in a race to train the best AI model, and newer chips let you train several times faster, you simply cannot afford to fall behind.

What Comes Next

By the time you read this, Nvidia will likely have announced or released successor architectures. The company typically maintains a roughly two-year cadence between major architecture releases.

But Hopper will remain historically significant as the architecture that powered the most dramatic phase of the AI boom. The models that captured public imagination in 2023 and 2024—the chatbots that could write essays, the image generators that could dream up photorealistic scenes, the coding assistants that could write and debug software—were trained on clusters of Hopper chips.

In a very real sense, Hopper is the hardware foundation of the AI revolution. When future historians look back at this period, they'll talk about the software breakthroughs and the cultural shifts. But underneath it all, there were these remarkable pieces of silicon, eighty billion transistors each, running at seven hundred watts, being begged for over sushi by billionaires who understood that whoever controlled the chips controlled the future.

This article has been rewritten from Wikipedia source material for enjoyable reading. Content may have been condensed, restructured, or simplified.