Google's TPU Bet: Why Broadcom and MediaTek Are Splitting the Chip Design for AI
Google has fundamentally changed how it builds custom AI chips by splitting its eighth-generation Tensor Processing Unit (TPU) into two distinct designs for the first time in the program's decade-long history. The TPU 8t handles large-scale model training, while the TPU 8i is optimized for low-latency inference and reasoning workloads. More significantly, the split ends Broadcom's exclusive hold on TPU design, which dates to 2015, as MediaTek joins the program as a second silicon design partner.
The partnership restructuring carries major implications for the semiconductor industry. Both chips are fabricated on Taiwan Semiconductor Manufacturing Company (TSMC) N3 process technology with HBM3E memory and will be available to Google Cloud customers later this year. According to Bank of America analyst Vivek Arya, the dual-sourcing arrangement could reduce per-chip cost by up to 30% compared with sourcing solely from Broadcom. Broadcom's role is nonetheless secured through at least 2031, per an April 6 Securities and Exchange Commission (SEC) filing.
Why Is Google Splitting Its TPU Design Strategy?
The decision to create separate training and inference chips reflects a fundamental shift in how companies approach AI infrastructure. Training requires sustained compute power to process massive datasets and adjust billions of parameters, while inference demands low-latency responses for real-time applications. By optimizing each chip for its specific workload, Google can deliver better performance per dollar spent on infrastructure.
The TPU 8t delivers 12.6 FP4 PFLOPs (petaFLOPS at 4-bit floating-point precision) with 216 gigabytes of HBM3E memory running at 6,528 gigabytes per second. The TPU 8i offers 10.1 FP4 PFLOPs with 288 gigabytes of HBM3E at 8,601 gigabytes per second, plus 384 megabytes of on-chip SRAM. For context, NVIDIA's Vera Rubin R200 reaches 35 FP4 PFLOPs for training and the Advanced Micro Devices (AMD) MI455X reaches 40, meaning Google's individual chips trail in raw compute per socket by roughly a 3-to-1 margin.
However, Google compensates for this gap through architectural advantages at scale. A TPU 8t superpod packs 9,600 chips into a single cluster with two petabytes of shared HBM, connected by a proprietary inter-chip interconnect running at double the previous generation's bandwidth; Google claims 121 FP4 ExaFLOPs from a single superpod. The new Virgo Network fabric then ties up to 134,000 TPU 8t chips into a single non-blocking data center fabric with 47 petabytes per second of bisection bandwidth, and can scale past 1 million chips across multiple sites.
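These figures are straightforward to sanity-check against one another. A minimal back-of-envelope sketch in Python, using only the per-chip numbers quoted above (illustrative arithmetic, not a performance model):

```python
# Per-chip FP4 throughput and memory figures as quoted in this article.
TPU_8T_PFLOPS = 12.6       # FP4 PFLOPs per TPU 8t chip
RUBIN_R200_PFLOPS = 35.0   # FP4 PFLOPs per NVIDIA Vera Rubin R200
MI455X_PFLOPS = 40.0       # FP4 PFLOPs per AMD MI455X
SUPERPOD_CHIPS = 9_600     # TPU 8t chips per superpod
HBM_PER_CHIP_GB = 216      # HBM3E capacity per TPU 8t chip

# Per-socket gap: roughly 2.8x to 3.2x, i.e. the "3-to-1 margin".
print(round(RUBIN_R200_PFLOPS / TPU_8T_PFLOPS, 2))   # 2.78
print(round(MI455X_PFLOPS / TPU_8T_PFLOPS, 2))       # 3.17

# Pod-level totals: ~121 ExaFLOPs and ~2 petabytes of pooled HBM.
print(SUPERPOD_CHIPS * TPU_8T_PFLOPS / 1_000)        # 120.96 ExaFLOPs
print(SUPERPOD_CHIPS * HBM_PER_CHIP_GB / 1_000_000)  # ~2.07 petabytes
```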
How Does the TPU 8i Differ From Previous Generations?
The TPU 8i represents a radical departure in network topology design. It abandons the 3D Torus interconnect that has been inside TPU pods since the second generation, replacing it with a topology Google calls "Boardfly," inspired by the 2008 Kim/Dally Dragonfly paper. Boardfly uses a three-tier hierarchy: four-chip building blocks connected into 32-chip groups by copper cabling, with 36 groups linked by optical circuit switches into a pod of up to 1,024 active chips.
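Multiplying out that hierarchy gives slightly more silicon than the 1,024-chip active figure, which suggests spare capacity is built into each pod. A minimal sketch, assuming eight 4-chip blocks per 32-chip group (the article gives only the 4-chip, 32-chip, and 36-group figures, so the per-group block count is inferred):

```python
# Boardfly pod hierarchy as described above; the 8-blocks-per-group split
# is inferred from 32 / 4 and is an assumption, not a confirmed layout.
CHIPS_PER_BLOCK = 4                        # copper-connected building block
BLOCKS_PER_GROUP = 32 // CHIPS_PER_BLOCK   # 8 blocks form a 32-chip group
GROUPS_PER_POD = 36                        # linked by optical circuit switches

total_chips = CHIPS_PER_BLOCK * BLOCKS_PER_GROUP * GROUPS_PER_POD
print(total_chips)   # 1,152 chips physically present per pod
# Google quotes "up to 1,024 active chips," leaving roughly 128 chips of
# headroom per pod -- presumably spares or reconfiguration margin (the
# article does not say why the two counts differ).
```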
This architectural change delivers measurable benefits. In a 1,024-chip 3D Torus configuration, the worst-case packet path traverses 16 hops. Boardfly cuts that to seven, a 56% reduction in network diameter that directly benefits mixture-of-experts (MoE) models, where token routing requires frequent all-to-all communication across unpredictable chip pairs.
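The 16-hop figure matches what a wraparound 3D torus predicts: the worst-case path is the sum of half of each dimension. A short sketch, assuming a hypothetical 8 x 8 x 16 arrangement of the 1,024 chips (Google has not published the exact dimensions):

```python
# Network diameter of a torus with wraparound links: the longest
# shortest-path is floor(dim / 2) summed over all dimensions.
def torus_diameter(dims):
    return sum(d // 2 for d in dims)

# Hypothetical 8 x 8 x 16 layout of a 1,024-chip pod (an assumption).
print(torus_diameter((8, 8, 16)))   # 4 + 4 + 8 = 16 hops worst case

# Boardfly's quoted worst case is 7 hops: a ~56% smaller diameter.
print((16 - 7) / 16)                # 0.5625
```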
The TPU 8i also replaces the SparseCore embedding accelerators that Google has used since TPU v4 with a new fixed-function block called the Collectives Acceleration Engine (CAE). The CAE offloads reduction and synchronization operations during autoregressive decoding, cutting on-chip collective latency by as much as a factor of five. Combined with the tripled SRAM, which keeps more of the KV cache on-chip during long-context inference, these changes yield what Google claims is 80% better performance per dollar than Ironwood for large MoE models at low-latency targets.
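To see why the extra SRAM matters during decode, consider how many tokens of KV cache fit in 384 megabytes. Every model parameter in the sketch below is hypothetical, chosen only to illustrate the scale of the numbers involved:

```python
# Hypothetical decoder configuration -- none of these values describe any
# specific Google model; they exist only to size the KV cache.
LAYERS = 48
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 1                 # FP8 KV cache
SRAM_BYTES = 384 * 1024 ** 2          # TPU 8i on-chip SRAM

# Keys and values are both cached per layer, hence the factor of 2.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
print(bytes_per_token)                # 98,304 bytes (~96 KB) per token
print(SRAM_BYTES // bytes_per_token)  # ~4,096 tokens resident on-chip
```

Under these assumed parameters, a few thousand tokens of context can stay in SRAM instead of spilling to HBM on every decode step, which is the kind of traffic the larger SRAM is meant to absorb.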
How the Broadcom and MediaTek Partnership Changes the Supply Chain
- Design Responsibility Split: MediaTek handles the design of the TPU 8i inference chip while Broadcom manages the design of the 8t training chip, ending Broadcom's exclusive role in the TPU program since 2015.
- Cost Reduction Potential: The dual-sourcing arrangement could reduce per-chip cost by up to 30% compared to solely sourcing from Broadcom, according to Bank of America analyst Vivek Arya.
- Manufacturing Scale: TrendForce reported that MediaTek initially booked 20,000 TSMC CoWoS wafers for the program, with allocation potentially scaling to 150,000 by 2027.
- Long-Term Commitment: Broadcom's role is secured through at least 2031 per an April 6 SEC filing, which also formalized a 3.5 gigawatt TPU capacity commitment from Anthropic starting in 2027.
The partnership restructuring signals Google's confidence in MediaTek's ability to handle complex AI chip design while simultaneously reducing dependency on a single supplier. This approach mirrors broader industry trends toward diversifying semiconductor supply chains, particularly for mission-critical AI infrastructure.
What Does This Mean for Google's AI Customers?
Google's TPU strategy extends far beyond internal use. Meta has signed a separate multi-year, multi-billion-dollar TPU rental agreement, estimated to involve 500,000 to 800,000 TPU chips by 2027 if initial testing meets expectations. Apple is routing Gemini-powered Siri workloads to Google Cloud on TPU infrastructure, in a deal valued at roughly $1 billion per year. The same April 6 SEC filing formalized a 3.5 gigawatt TPU capacity commitment from Anthropic starting in 2027, on top of the one gigawatt of Anthropic capacity already coming online this year under a separate Google Cloud agreement.
At Cloud Next, Google also announced Vera Rubin NVLink72 instances running over the same Virgo Network fabric, making clear that TPUs are not intended to act as a direct replacement for NVIDIA silicon. Instead, Google is positioning TPUs as a complementary option for customers seeking an alternative or a more cost-effective fit for specific workloads.
The TPU 8 generation demonstrates that while individual NVIDIA GPUs remain faster in raw compute per socket, Google holds an advantage in pod-level throughput at scale. Training workloads consume thousands of accelerators, not one, and NVIDIA's current-generation GPUs top out at 576 accelerators in a single NVLink deployment. That architectural difference could prove decisive for large-scale AI infrastructure buildouts, where total system cost and efficiency matter more than single-chip performance.
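A rough way to frame that scale argument, again using only the per-unit figures quoted earlier (illustrative arithmetic that ignores utilization, networking overhead, and price):

```python
# FP4 throughput of the largest single coherent domain on each side, using
# the figures quoted in this article; a sizing sketch, not a benchmark.
NVLINK_DOMAIN_GPUS = 576    # largest single NVLink deployment cited
RUBIN_R200_PFLOPS = 35.0    # FP4 PFLOPs per Vera Rubin R200
SUPERPOD_CHIPS = 9_600      # TPU 8t chips per superpod
TPU_8T_PFLOPS = 12.6        # FP4 PFLOPs per TPU 8t chip

print(NVLINK_DOMAIN_GPUS * RUBIN_R200_PFLOPS / 1_000)  # ~20 ExaFLOPs
print(SUPERPOD_CHIPS * TPU_8T_PFLOPS / 1_000)          # ~121 ExaFLOPs
```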