AWS Trainium3 Deep Dive | A Potential Challenger Approaching
Trainium3: A New Challenger Approaching!
Hot on the heels of our 10K word deep dive on TPUs, Amazon launched Trainium3 (Trn3) general availability and announced Trainium4 (Trn4) at its annual AWS re:Invent. Amazon has had the longest and broadest history of custom silicon in the datacenter. While they were behind in AI for quite some time, they are rapidly progressing to be competitive. Last year we detailed Amazon’s ramp of its Trainium2 (Trn2) accelerators aimed at internal Bedrock workloads and Anthropic’s training/inference needs.
Since then, through our datacenter model and accelerator model, we detailed the huge ramp that led to our blockbuster call that AWS would accelerate on revenue.
Today, we are publishing our next technical bible on the step-function improvement that is the Trainium3 chip: its microarchitecture, system and rack architecture, scale-up network, profilers, software platform, and datacenter ramps. This is the most detailed piece we've written on an accelerator and its hardware/software; on desktop, a table of contents makes it possible to jump to specific sections.
Amazon Basics GB200 aka GB200-at-Home
With Trainium3, AWS remains laser-focused on optimizing performance per total cost of ownership (perf per TCO). Their hardware North Star is simple: deliver the fastest time to market at the lowest TCO. Rather than committing to any single architectural design, AWS maximizes operational flexibility. This extends from their work with multiple partners on custom silicon, to managing their own supply chain, to multi-sourcing components across vendors.
On the systems and networking front, AWS is following an “Amazon Basics” approach that optimizes for perf per TCO. Design choices such as whether to use a 12.8T, 25.6T, or 51.2T scale-out switch, or whether to select liquid versus air cooling, are merely means to an end: delivering the best TCO for a given client and a given datacenter.
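To make the switch-bandwidth options above concrete, here is a back-of-envelope sketch (our arithmetic, not AWS's deployment details) of how an ASIC's aggregate bandwidth translates into port counts at common port speeds. The radix, in turn, determines how many accelerators or downstream switches a single tier can fan out to.

```python
def port_count(switch_tbps: float, port_gbps: int) -> int:
    """Number of ports of `port_gbps` that a switch ASIC with
    `switch_tbps` of aggregate bandwidth can expose."""
    return int(switch_tbps * 1000) // port_gbps

# The three scale-out switch classes mentioned above, at two
# common port speeds:
for tbps in (12.8, 25.6, 51.2):
    for speed in (400, 800):
        print(f"{tbps}T switch -> {port_count(tbps, speed)} x {speed}G ports")
```

For example, a 51.2T ASIC yields 128 ports at 400G or 64 ports at 800G, which is one reason a higher-bandwidth switch can flatten a network by collapsing tiers, trading switch cost against cabling and hop count.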
For the scale-up network, while Trn2 supports only a 4x4x4 3D torus topology, Trainium3 adds a switched fabric that is somewhat similar to the GB200 NVL36x2 topology, with a few key differences. This switched fabric was added because a switched scale-up topology delivers better absolute performance and perf per TCO for frontier Mixture-of-Experts (MoE) model architectures.
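As a minimal sketch of why the torus-versus-switch distinction matters: in Trn2's 4x4x4 3D torus, each of the 64 devices has direct links only to its six wrap-around neighbors, so any other traffic must hop through intermediate devices; a switched fabric gives every device one hop to every other. The code below (our illustration, not AWS's actual rank mapping) enumerates a device's neighbors in such a torus.

```python
DIM = 4  # 4x4x4 3D torus, 64 devices, as described for Trn2

def coords(rank: int) -> tuple[int, int, int]:
    """Map a linear rank 0..63 to (x, y, z) torus coordinates."""
    return (rank % DIM, (rank // DIM) % DIM, rank // (DIM * DIM))

def rank_of(x: int, y: int, z: int) -> int:
    """Inverse of coords()."""
    return x + DIM * y + DIM * DIM * z

def torus_neighbors(rank: int) -> list[int]:
    """The six directly connected devices: +/-1 along each axis,
    with wrap-around (the wrap is what makes it a torus, not a mesh)."""
    x, y, z = coords(rank)
    nbrs = []
    for axis in range(3):
        for step in (-1, 1):
            c = [x, y, z]
            c[axis] = (c[axis] + step) % DIM
            nbrs.append(rank_of(*c))
    return nbrs
```

With only six direct neighbors per device, collectives like the all-to-all exchanges that MoE expert routing generates must traverse multiple hops on a torus, which is the performance gap a single-hop switched fabric closes.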
Even for the switches used in this scale-up architecture, AWS has decided to not decide: they will go with three different scale-up switch solutions over the lifecycle of Trainium3, starting with ...
This excerpt is provided for preview purposes. Full article content is available on the original publication.