Custom AI ASICs are eating Nvidia's market — Google TPUs, Broadcom's $73B backlog, and the race to build chips for inference
At a glance:
- Nvidia still holds roughly 70% of the AI chip market, but custom ASIC shipments are projected to grow 44.6% YoY in 2026, with ASIC-based AI servers reaching 27.8% of the market.
- Broadcom reported $8.4B in AI semiconductor revenue in Q1 FY2026 (up 106% YoY) and guided $10.7B for Q2, backed by a $73B AI backlog and six confirmed XPU customers including Google, OpenAI, Meta, ByteDance, and Fujitsu.
- Every major hyperscaler — Google, Amazon, Meta, Microsoft, and OpenAI — is now running custom silicon roadmaps, with TSMC's CoWoS packaging capacity emerging as the single most binding constraint on the entire ecosystem.
The Broadcom bet that ties the market together
Broadcom has arguably become the linchpin of the custom AI ASIC ecosystem. In Q1 FY2026, the company reported $8.4 billion in AI semiconductor revenue — a 106% year-over-year jump — and guided investors to $10.7 billion for Q2. CEO Hock Tan told shareholders the company has "line of sight to achieve AI revenue from chips in excess of $100 billion in 2027," backed by a disclosed $73 billion AI backlog.
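Those growth figures imply two numbers the article does not state directly: the size of the year-ago quarter and the sequential step-up built into the Q2 guide. A minimal sketch of that arithmetic follows, using only the figures quoted above. The function and variable names are illustrative and do not come from Broadcom's reporting.

```python
def implied_prior(current, yoy_growth):
    """Back out the year-ago value from a current value and its YoY growth rate."""
    return current / (1 + yoy_growth)


def growth(new, old):
    """Fractional change from old to new."""
    return new / old - 1


q1_ai_revenue_b = 8.4   # Q1 FY2026 AI semiconductor revenue in $B (from the article)
q1_yoy = 1.06           # 106% year-over-year growth (from the article)
q2_guide_b = 10.7       # Q2 FY2026 guidance in $B (from the article)

year_ago_b = implied_prior(q1_ai_revenue_b, q1_yoy)
sequential = growth(q2_guide_b, q1_ai_revenue_b)

print(f"Implied year-ago quarter: ${year_ago_b:.1f}B")
print(f"Implied Q1-to-Q2 growth: {sequential:.0%}")
```

On these inputs the year-ago quarter works out to about $4.1 billion, and the guide implies roughly 27% sequential growth.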
The company confirmed six major XPU customers. Those named include Google, a partner since 2014 with seven generations of co-designed TPUs; OpenAI, which signed a multi-year collaboration in October 2025 for 10 gigawatts of custom accelerators, with first deployment targeting the second half of 2026 using both 3nm and 2nm designs; and Meta, ByteDance, and Fujitsu. Analysts have also flagged Apple and Arm/SoftBank as potential future engagements. Notably, OpenAI was widely reported to be behind a separate $10 billion order, but Broadcom semiconductor president Charlie Kawwas joked on CNBC that "OpenAI has not given me that PO yet," leaving the identity of the mystery customer officially unconfirmed.
Arm is separately developing a custom CPU for OpenAI's Broadcom-built accelerator, a contract that could be worth billions to SoftBank. The technical backbone is Broadcom's 3.5D XDSiP platform, which uses face-to-face 3D stacking via TSMC's SoIC process combined with 2.5D CoWoS integration. The platform supports packages exceeding 6,000 mm² of silicon with up to 12 HBM stacks — well beyond the roughly 2,500 mm² limit of conventional 2.5D designs. In February, Broadcom began shipping the industry's first 2nm compute SoC on this platform, integrating four N2 compute dies, one I/O die, and six HBM modules.
On the networking side, Broadcom's Tomahawk 6 switch chip entered volume production in March as the industry's first 102.4 Tbps Ethernet part. Its companion Jericho 4 fabric chip (51.2 Tbps) began shipping last August and is designed to interconnect over one million XPUs across data centers. Nvidia's competing Spectrum-X1600 isn't expected in volume until the second half of 2026.
Google's TPU v7 Ironwood and the inference cost advantage
Google's TPU program is the most mature custom AI silicon effort among the hyperscalers. The latest generation, TPU v7 codenamed Ironwood, was announced at Cloud Next in April 2025 and entered preview in November. Each chip delivers 4,614 FP8 TFLOPS with 192 GB of HBM3E memory at 7.37 TB/s bandwidth. It's manufactured on TSMC's N3P process in a dual-chiplet design co-developed with Broadcom and MediaTek, and features two TensorCores with doubled 256x256 MXU arrays plus four SparseCores.
The 9,216-chip superpod configuration delivers 42.5 FP8 exaflops with 1.77 PB of aggregate HBM. SemiAnalysis estimates that TPUs sustain model FLOPs utilization of roughly 90% on transformers, versus 70% to 80% for GPUs, which narrows or erases the real-world performance gap. Google claims the total cost of ownership per Ironwood chip is roughly 44% lower than a GB200 server from its own procurement perspective.
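The pod-level totals follow directly from the per-chip figures. The sketch below reproduces them and applies the roughly 90% utilization estimate to the peak number. It assumes decimal prefixes (1 exaflop = 1,000,000 teraflops, 1 PB = 1,000,000 GB) and treats utilization as a simple multiplier on peak throughput, which is a simplification. The helper names are illustrative.

```python
CHIP_FP8_TFLOPS = 4_614   # per-chip FP8 peak (from the article)
CHIP_HBM_GB = 192         # per-chip HBM3E capacity (from the article)
POD_CHIPS = 9_216         # chips per superpod (from the article)


def pod_totals(chips, tflops_per_chip, hbm_gb_per_chip):
    """Return (peak exaflops, HBM petabytes) using decimal prefixes."""
    exaflops = chips * tflops_per_chip / 1_000_000
    petabytes = chips * hbm_gb_per_chip / 1_000_000
    return exaflops, petabytes


def sustained(peak, utilization):
    """Scale a peak figure by a utilization fraction (simplifying assumption)."""
    return peak * utilization


peak_ef, hbm_pb = pod_totals(POD_CHIPS, CHIP_FP8_TFLOPS, CHIP_HBM_GB)
print(f"Peak: {peak_ef:.1f} FP8 exaflops, HBM: {hbm_pb:.2f} PB")
print(f"At 90% utilization: {sustained(peak_ef, 0.90):.1f} exaflops")
```

The peak and memory totals match the 42.5 exaflops and 1.77 PB quoted above. At the estimated 90% utilization, sustained throughput would be about 38.3 exaflops.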
Google is now selling TPU access aggressively beyond its own services. Anthropic committed to up to one million TPUs in the largest deal in Google Cloud history back in October, while Meta entered talks for multi-billion-dollar TPU deployments in February this year. The current-generation TPU v6e Trillium remains widely available on Google Cloud at $2.70 per chip-hour on demand, delivering roughly four times better price-performance than H100 instances for LLM workloads, according to Google's own benchmarks. Google's Axion Arm CPU, based on Neoverse V2 and reportedly manufactured on TSMC 3nm, complements TPUs for general-purpose cloud workloads.
Amazon Trainium3, Trainium4, and the $11 billion Rainier campus
AWS has matched Google's pace with an aggressive custom silicon roadmap developed by Annapurna Labs, the Israeli chip design house Amazon acquired in 2015. Trainium3, which went generally available at re:Invent in December, is AWS's first 3nm chip. Each Trainium3 delivers 2.517 PFLOPS FP8 with 144 GB HBM3E at 4.9 TB/s bandwidth — roughly double the compute and 1.5 times the memory of its predecessor. The new Trn3 UltraServer packs 144 chips delivering 362 FP8 petaflops with 20.7 TB of memory, a 4.4 times improvement over Trn2 UltraServers.
AWS CEO Matt Garman said at re:Invent 2025 that the company had "already deployed more than 1 million Trainium processors" and was selling them as fast as production allowed. Amazon CEO Andy Jassy called it "already a multibillion-dollar business." The Project Rainier facility in Indiana, an $11 billion, 2.2 GW campus, had roughly 500,000 Trainium2 chips running for Anthropic by October 2025, and AWS also confirmed an OpenAI deal to supply 2 GW of Trainium computing capacity.
Trainium4 was announced in December 2025 for late 2026 or early 2027 availability, promising three times FP8 performance, six times FP4 throughput, and four times memory bandwidth over Trainium3, with an estimated 288 GB of memory. One notable feature is support for Nvidia NVLink Fusion, enabling hybrid clusters that mix Trainium and Nvidia GPUs. AWS's Graviton5 ARM CPU (192 cores, TSMC 3nm, Neoverse V3) was also announced at re:Invent 2025.
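Applying the announced multipliers to Trainium3's published per-chip figures gives a rough idea of what Trainium4 could look like. This is only a sketch: it assumes the multipliers apply per chip, which the article does not specify, and it leaves out FP4 because no Trainium3 FP4 baseline is quoted. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    fp8_pflops: float
    hbm_bandwidth_tbs: float


def project(base, name, fp8_multiple, bandwidth_multiple):
    """Scale a baseline chip by vendor-stated multipliers (assumed per chip)."""
    return Accelerator(
        name,
        base.fp8_pflops * fp8_multiple,
        base.hbm_bandwidth_tbs * bandwidth_multiple,
    )


trainium3 = Accelerator("Trainium3", fp8_pflops=2.517, hbm_bandwidth_tbs=4.9)
trainium4 = project(trainium3, "Trainium4 (projected)", 3, 4)

print(f"{trainium4.name}: {trainium4.fp8_pflops:.2f} PFLOPS FP8, "
      f"{trainium4.hbm_bandwidth_tbs:.1f} TB/s")
```

Under that per-chip assumption, the projection comes to about 7.55 PFLOPS FP8 and 19.6 TB/s of memory bandwidth.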
Meta's MTIA roadmap and the Nvidia partnership that coexists
Meta disclosed one of the most ambitious custom chip roadmaps in the industry in March, unveiling four new MTIA generations (300 through 500) for deployment through 2027, in addition to the already-shipping MTIA 100 and 200. The company has deployed hundreds of thousands of MTIA chips for inference across Facebook and Instagram.
The specifications reveal a rapid escalation in capability:
- MTIA 400 delivers 6 PFLOPS FP8 and 18 PFLOPS MX4 with 288 GB HBM at 9.2 TB/s bandwidth in a 1,200W envelope.
- MTIA 500, scheduled for 2027 mass deployment, scales to 10 PFLOPS FP8 and 30 PFLOPS MX4 with up to 512 GB HBM at 27.6 TB/s in a 2x2 chiplet configuration, consuming 1,700W.
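Dividing the quoted compute figures by the power envelopes gives a quick read on efficiency. The sketch below does that for the two chips listed. It compares peak figures only and says nothing about sustained workloads, and the class name is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    fp8_pflops: float
    mx4_pflops: float
    watts: int

    def tflops_per_watt(self, pflops):
        """Convert a peak PFLOPS figure to TFLOPS per watt of the power envelope."""
        return pflops * 1_000 / self.watts


# Figures as quoted in the list above.
chips = [
    Chip("MTIA 400", fp8_pflops=6, mx4_pflops=18, watts=1_200),
    Chip("MTIA 500", fp8_pflops=10, mx4_pflops=30, watts=1_700),
]

for chip in chips:
    fp8 = chip.tflops_per_watt(chip.fp8_pflops)
    mx4 = chip.tflops_per_watt(chip.mx4_pflops)
    print(f"{chip.name}: {fp8:.2f} FP8 TFLOPS/W, {mx4:.2f} MX4 TFLOPS/W")
```

On these numbers, peak FP8 efficiency rises from 5.0 to about 5.9 TFLOPS per watt between the two generations, so most of the headline compute gain comes from a larger power budget rather than from efficiency.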
From the MTIA 300 to the 500, HBM bandwidth increases 4.5 times and compute scales 25 times, with a new chip arriving roughly every six months. Meta has been explicit that MTIA is not a replacement for Nvidia GPUs. The company expanded its Nvidia partnership in February for "millions of AI chips," including Grace Blackwell and future Vera Rubin platforms, in a deal reportedly worth tens of billions. Custom silicon handles optimized inference at massive scale, while Nvidia handles frontier model training.
With $115-135 billion in 2026 capex guidance, Meta is buying everything it can from both sources. The MTIA chips are fabricated on TSMC's advanced nodes: MTIA 100 on 7nm, MTIA 200 on 5nm, and the 300-series onward reportedly moving to 3nm with CoWoS packaging.
Microsoft Maia 200, Tesla's Dojo shutdown, and other players
Microsoft's custom silicon program took a significant step forward in January with the deployment of Maia 200, manufactured on TSMC 3nm with over 140 billion transistors. The chip delivers more than 10 PFLOPS FP4 and 5 PFLOPS FP8 with 216 GB HBM3E at 7 TB/s bandwidth in a 750W envelope. Microsoft claims it offers 30% better performance per dollar than the best hardware in its existing fleet and calls it "the most performant first-party silicon from any hyperscaler." Maia 200 currently serves GPT-5.2 models for OpenAI and powers Microsoft 365 Copilot workloads from its Des Moines data center.
The path to Maia 200 was far from smooth. The original Maia 100, built on TSMC 5nm, was reportedly designed more for image processing than generative AI and never powered production AI services at scale. Maia 200 was delayed roughly six months due to design changes requested by OpenAI that caused simulation instability, plus chip team turnover. CEO Satya Nadella has emphasized that Microsoft will continue purchasing Nvidia and AMD chips alongside Maia. Microsoft's Cobalt 200 Arm CPU (TSMC 3nm, 132 Neoverse V3 cores) was announced at Ignite 2025 and is now live in Azure data centers.
Tesla's Dojo project met a very different fate. Despite years of development and an innovative D1 chip (TSMC 7nm, 50 billion transistors, 362 TFLOPS BF16, with a unique 354-core mesh architecture), Tesla disbanded the Dojo team in August. Lead architect Peter Bannon departed, and roughly 20 engineers left to found DensityAI. Elon Musk explained that "once it became clear that all paths converged to AI6, I had to shut down Dojo." Tesla is now focusing on AI5 and AI6 inference chips, with AI6 backed by a $16.5 billion Samsung fabrication deal, while relying on Nvidia hardware for current training needs.
Among other contenders, Intel's Gaudi 3 has struggled with software maturity and missed targets. Shipment goals were cut by more than 30% in 2024, and the Habana Labs brand is being absorbed into Intel's broader accelerator efforts under CEO Lip-Bu Tan. In China, Huawei's Ascend 910C (SMIC 7nm, roughly 800 TFLOPS FP16, 128 GB HBM) targets 600,000 units in 2026 but is held back by yields of around 20%. Cambricon, meanwhile, plans to triple output to 500,000 chips.
TSMC is the bottleneck that matters most
TSMC is the indispensable enabler across all these custom AI ASIC efforts. The foundry generated $122.4 billion in 2025 revenue, up 36% year-over-year, and forecasts a 60% compound annual growth rate for AI chip revenue through 2029.
Its CoWoS advanced packaging capacity is scaling from roughly 65,000-75,000 wafers per month in 2025 to a target of 120,000-130,000 wafers per month in 2026, with capital expenditure of up to $56 billion planned for 2026. The 2nm node entered mass production at the back end of last year, with capacity fully booked and a target of over 60,000 wafers per month by the end of this year. CoWoS allocation breaks down as follows:
- Nvidia has secured roughly 60% of CoWoS allocation (c. 595,000 wafers)
- Broadcom about 15% (c. 150,000 wafers)
- AMD approximately 11% (c. 105,000 wafers)
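Each share and wafer count in the list implies a total allocation pool, which gives a quick consistency check on the figures. The sketch below backs out that total from each entry and sums what is left for other customers. The article does not state the period these wafer counts cover, so the totals are only as meaningful as that unstated window. The names are illustrative.

```python
# (share of CoWoS allocation, approximate wafers), as quoted in the list above.
allocations = {
    "Nvidia": (0.60, 595_000),
    "Broadcom": (0.15, 150_000),
    "AMD": (0.11, 105_000),
}


def implied_total(share, wafers):
    """Total pool implied by one customer's share and wafer count."""
    return wafers / share


for name, (share, wafers) in allocations.items():
    print(f"{name}: implies about {implied_total(share, wafers):,.0f} wafers in total")

named_share = sum(share for share, _ in allocations.values())
named_wafers = sum(wafers for _, wafers in allocations.values())

print(f"Named customers: {named_share:.0%} of allocation, {named_wafers:,} wafers")
print(f"Left for everyone else: {1 - named_share:.0%}")
```

The implied totals range from roughly 955,000 to 1,000,000 wafers, so the percentages are rounded but broadly consistent with one another. The three named companies account for about 86% of allocation, leaving about 14% for every other customer.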
Every custom ASIC in this story depends on CoWoS or its newer CoWoS-L variant for HBM integration, and TSMC's packaging capacity is now a more binding constraint than wafer fabrication itself.
The driving factor behind custom ASIC adoption is the rapid growth of inference workloads. Deloitte projects inference to account for two-thirds of all AI compute this year. With custom silicon offering up to a 65% TCO advantage over conventional GPUs for inference at production scale, it's easy to see why so many hyperscalers are pursuing purpose-built chips. Broadcom and Marvell together control roughly 95% of the ASIC co-design market. The question is no longer whether custom silicon will take share from Nvidia, but how quickly it erodes Nvidia's pricing power as these programs reach full production scale.
Prepared by the editorial stack from public data and external sources.