How Google’s Tensor Processing Unit (TPU) Works

Deep Dives

Explore related topics with these Wikipedia articles, rewritten for enjoyable reading:

  • Systolic array 10 min read

    The central architecture of the TPU, explained in detail in this article. Readers likely know general-purpose computing but not this specialized parallel computing architecture, invented in 1978 by H.T. Kung.

  • Von Neumann architecture 10 min read

    The article contrasts the TPU design with this fundamental computer architecture and explains the 'Von Neumann bottleneck' as a key motivation for TPU development.

  • AlphaGo versus Lee Sedol 14 min read

    The article opens with this historic 2016 match as the public debut of TPU technology. The match’s details and cultural significance provide rich context beyond that brief mention.

4 Key Insights for Scaling LLM Applications (Sponsored)

LLM workflows can be complex, opaque, and difficult to secure. Get the latest ebook from Datadog for practical strategies to monitor, troubleshoot, and protect your LLM applications in production. You’ll get key insights into how to overcome the challenges of deploying LLMs securely and at scale, from debugging multi-step workflows to detecting prompt injection attacks.


Disclaimer: The details in this post have been derived from the details shared online by the Google Engineering Team. All credit for the technical details goes to the Google Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When DeepMind’s AlphaGo defeated Go world champion Lee Sedol in March 2016, the world witnessed a big moment in artificial intelligence. The match was powered by hardware Google had been running in production for over a year, but had never publicly acknowledged.

The Tensor Processing Unit, or TPU, represented something more profound than just another fast chip. It marked a fundamental shift in computing philosophy: sometimes doing less means achieving more.

Since 2015, Google’s TPU family has evolved through seven generations, scaling from single chips serving image recognition queries to 9,216-chip supercomputers training the largest language models in existence. In this article, we look at why Google built custom silicon and how it works, revealing the physical constraints and engineering trade-offs the company had to make.

The Need for TPU

In 2013, Google’s infrastructure team ran a calculation. If Android users adopted voice search at the scale Google anticipated, using it for just three minutes per day, the computational demand would require doubling the company’s entire global data center footprint.
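To make the shape of that estimate concrete, here is a minimal back-of-envelope sketch in Python. Every number in it is a hypothetical placeholder (the user count, per-second inference cost, and per-server throughput are our illustrative assumptions, not figures Google has published); the point is the structure of the calculation, not the values.

```python
# Back-of-envelope sketch of the kind of capacity estimate described above.
# All numbers are illustrative assumptions, not figures published by Google.

daily_users = 100_000_000        # hypothetical number of daily voice-search users
audio_seconds_per_user = 3 * 60  # three minutes of speech per user per day
gflop_per_audio_second = 10      # assumed neural-net inference cost per second of audio (GFLOP)

server_gflops = 100              # assumed sustained throughput of one CPU server (GFLOP/s)
seconds_per_day = 24 * 60 * 60

# Total daily inference demand, in GFLOP.
daily_demand_gflop = daily_users * audio_seconds_per_user * gflop_per_audio_second

# Servers needed to keep up, assuming perfect utilization around the clock.
servers_needed = daily_demand_gflop / (server_gflops * seconds_per_day)

print(f"Daily inference demand: {daily_demand_gflop:.2e} GFLOP")
print(f"Servers required at full utilization: {servers_needed:,.0f}")
```

Even with these generous placeholder numbers, a single feature translates into tens of thousands of dedicated machines; plug in Google-scale values for each assumption and you arrive at the kind of result that made doubling the data center footprint look unavoidable.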

This was a problem with no obvious solution at the time. Building more data centers filled with traditional processors was economically infeasible. More critically, Moore’s Law had been slowing for years. For decades, the semiconductor industry had relied on the observation that transistor density doubles roughly every two years, delivering regular performance improvements without architectural changes. However, by 2013, this trend was weakening. Google couldn’t simply wait for Intel’s next generation of CPUs to solve its problem.

The root cause of this situation

...
Read full article on ByteByteGo Newsletter →