Cerebras Just Quietly Became a Different Company. Here's Why That Matters.
Cerebras Systems has undergone a dramatic transformation that most observers missed. The Abu Dhabi-backed AI chip maker was once dismissed as a geopolitically risky bet dependent on a single customer. But according to its April 2026 IPO filing, that story is largely over. The company's revenue concentration has collapsed: a single customer accounted for 85-87% of revenue in 2024 and accounts for just 24% today. In its place are major deals with OpenAI and Amazon Web Services (AWS) and a $24.6 billion order backlog.
What Changed at Cerebras Between 2024 and 2026?
The shift happened quickly and with major implications. In December 2025, OpenAI signed a $10 billion to $20 billion compute agreement with Cerebras, including a $1 billion working capital loan. AWS is deploying Cerebras hardware directly into its own data centers, with a commercial launch planned for the second half of 2026. These aren't just revenue events; they're validation signals from customers with every incentive to evaluate whether the technology actually delivers on its promises.
OpenAI, in particular, runs inference at a scale where even small efficiency gains translate directly to product quality and profit margins. The company wouldn't purchase Cerebras hardware based on a compelling sales pitch alone. AWS integration is equally significant; Amazon doesn't integrate third-party silicon into its infrastructure unless it clears a high bar for reliability, software compatibility, and supply chain predictability. The fact that AWS has scheduled a commercial launch for H2 2026 suggests the company has already completed or is well along on rigorous evaluation.
How Does Cerebras' Wafer-Scale Architecture Actually Work?
To understand why Cerebras matters, you need to understand why the semiconductor industry stopped building giant single chips around 2010. The reticle limit, the maximum area a lithography system can expose in a single shot, sits around 800 square millimeters. Anything larger requires stitching, which introduces yield-killing complexity. So AMD, Intel, and NVIDIA all converged on the same approach: design multiple smaller dies, test them individually, and connect them via high-speed interconnects. This is called the chiplet approach.
Cerebras went the opposite direction. The WSE-3 (Wafer Scale Engine 3) is a full 300-millimeter wafer treated as a single processor. It contains 46,225 square millimeters of silicon, 4 trillion transistors, 900,000 cores, and 44 gigabytes of on-chip SRAM (static random-access memory). To put that in perspective, the die is 57 times larger than an NVIDIA H100 GPU.
The architectural payoff is substantial. The WSE-3 delivers 21 petabytes per second of on-chip memory bandwidth, compared to roughly 3.35 terabytes per second for the H100's HBM3 (high-bandwidth memory). That's a gap of roughly 6,000-fold, or close to four orders of magnitude. This matters because modern large language model (LLM) inference is not compute-bound; it's memory-bandwidth-bound. The bottleneck is moving model weights from memory to arithmetic units fast enough to continuously feed the computation. Every GPU cluster in the world fights this problem through HBM, NVLink, and elaborate model-parallel scheduling. Cerebras sidesteps it entirely because the memory and the compute are on the same piece of silicon.
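The bandwidth comparison can be checked with a few lines of arithmetic. The sketch below uses the two bandwidth figures quoted above; the `bandwidth_bound_tokens_per_s` helper, the 16-bit weight size, and the "read every active weight once per token" rule are illustrative assumptions, not figures from Cerebras or NVIDIA. It ignores batching, caches, KV-cache traffic, and multi-device sharding, so treat the result as a rough ceiling rather than a prediction.

```python
import math

# Figures quoted in the article.
WSE3_SRAM_BW = 21e15    # bytes per second, WSE-3 on-chip SRAM bandwidth
H100_HBM_BW = 3.35e12   # bytes per second, H100 HBM3 bandwidth


def bandwidth_bound_tokens_per_s(bandwidth_bytes_per_s, active_params, bytes_per_param=2):
    """Rough single-stream decode ceiling.

    ASSUMPTION: every active weight is read from memory exactly once per
    generated token, and nothing else competes for bandwidth.
    """
    return bandwidth_bytes_per_s / (active_params * bytes_per_param)


# Raw bandwidth gap and its size in orders of magnitude.
ratio = WSE3_SRAM_BW / H100_HBM_BW
orders = math.log10(ratio)

# Hypothetical workload: 17e9 active parameters stored at 16 bits each.
h100_ceiling = bandwidth_bound_tokens_per_s(H100_HBM_BW, 17e9)

print(f"bandwidth ratio: {ratio:,.0f}x ({orders:.1f} orders of magnitude)")
print(f"single-H100 bandwidth-bound ceiling: {h100_ceiling:.0f} tokens/s")
```

The point of the second number is the shape of the constraint: under these assumptions the decode rate scales linearly with memory bandwidth, which is why the on-chip SRAM figure matters more than peak FLOPs.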
What Do the Real-World Performance Numbers Show?
The inference benchmarks are concrete enough to analyze. On Meta's Llama 4 Maverick, a 17 billion active parameter, 400 billion total parameter mixture-of-experts model, the CS-3 achieved 2,500 tokens per second per user against approximately 1,000 tokens per second for an NVIDIA DGX B200 node. That's a 2.5-fold advantage on a current-generation model from the dominant lab in open-weight AI. On the 120 billion parameter GPT-OSS model, the gap is 2,700 versus 900 tokens per second, roughly 3-fold. The more aggressive 21-fold claim applies to Llama 3 70B under specific reasoning workload conditions, a smaller model with characteristics that particularly favor on-chip memory saturation.
For context, NVIDIA's own blog recently highlighted breaking 1,000 tokens per second per user on Llama 4 Maverick as a milestone. That framing implicitly validates the Cerebras 2,500 figure as a real and meaningful gap. Whether a 2.5-fold to 3-fold throughput advantage is economically decisive comes down to cost: it only wins if the Cerebras system costs less than 2.5 to 3 times as much as the GPU system it replaces to deliver those tokens.
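The break-even logic can be made explicit. This is a minimal sketch using only the token rates quoted above; `relative_cost_per_token` and its `price_ratio` argument are hypothetical names, and no actual system prices are implied. It also assumes the per-user speed advantage carries over to aggregate throughput, which the benchmarks quoted here do not establish.

```python
# Per-user decode rates quoted in the article: (Cerebras, NVIDIA) tokens per second.
BENCHMARKS = {
    "Llama 4 Maverick": (2500, 1000),
    "GPT-OSS 120B": (2700, 900),
}


def speedup(cerebras_tps, gpu_tps):
    """Throughput advantage of the Cerebras system over the GPU system."""
    return cerebras_tps / gpu_tps


def relative_cost_per_token(price_ratio, throughput_ratio):
    """Cerebras cost per token divided by GPU cost per token.

    price_ratio is a HYPOTHETICAL input: the Cerebras-to-GPU ratio of total
    cost per hour of serving (hardware, power, hosting). Values below 1.0
    mean Cerebras is cheaper per token.
    """
    return price_ratio / throughput_ratio


for model, (cerebras_tps, gpu_tps) in BENCHMARKS.items():
    s = speedup(cerebras_tps, gpu_tps)
    # The break-even price ratio equals the speedup itself.
    print(f"{model}: {s:.1f}x faster, break-even price ratio {s:.1f}")
```

For example, `relative_cost_per_token(2.0, 2.5)` is below 1.0, so a system that costs twice as much per hour but serves 2.5 times the tokens would still come out ahead per token; at four times the cost it would not.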
Why the Yield Question Remains Unanswered
Here is where the analysis runs into an honest limitation. Standard semiconductor yield models, in which the chance of a defect-free die falls steeply as area grows, imply that a monolithic 46,225 square millimeter die should have a yield approaching zero. Cerebras' answer is architectural redundancy: each of the 900,000 cores is small enough (approximately 0.05 square millimeters) that when a defect kills a core, the on-chip fabric routes the workload to a redundant neighbor. The company claims this achieves a 100-fold improvement in defect tolerance compared to conventional processors, enabling commercially viable yields.
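To see why redundancy changes the picture, here is a rough sketch using the simple Poisson yield model, Y = exp(-A * D0), where A is area and D0 is defect density. The Poisson form is one common textbook choice, not necessarily the model Cerebras or TSMC uses, and the defect density of 0.1 per square centimeter is an illustrative assumption rather than a disclosed figure. The sketch also assumes each defect disables exactly one core and that the whole die area is divided evenly among cores.

```python
import math

WAFER_AREA_MM2 = 46_225        # from the article
NUM_CORES = 900_000            # from the article
DEFECT_DENSITY_PER_CM2 = 0.1   # ASSUMPTION: illustrative value, not disclosed data


def poisson_yield(area_mm2, d0_per_cm2):
    """Probability that an area contains zero defects: Y = exp(-A * D0)."""
    area_cm2 = area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * d0_per_cm2)


# Chance the entire die is defect-free (the "monolithic" case).
monolithic = poisson_yield(WAFER_AREA_MM2, DEFECT_DENSITY_PER_CM2)

# Same model applied to one small core instead of the whole die.
core_area_mm2 = WAFER_AREA_MM2 / NUM_CORES
core_yield = poisson_yield(core_area_mm2, DEFECT_DENSITY_PER_CM2)
expected_dead_cores = NUM_CORES * (1.0 - core_yield)

print(f"monolithic yield: {monolithic:.1e}")
print(f"core area: {core_area_mm2:.3f} mm^2")
print(f"expected dead cores per wafer: {expected_dead_cores:.0f}")
```

Under these assumptions a perfect wafer essentially never occurs, but the expected number of dead cores is a tiny fraction of 900,000. What the sketch cannot say is how many spare cores and how much routing overhead Cerebras provisions to absorb them, which is the undisclosed cost.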
This approach is technically coherent. NAND flash manufacturers have used analogous fault-mapping for decades; dies ship with known-bad blocks, and the controller remaps around them. The question isn't whether the principle is valid. The question is what the cost is and how that cost compares to chiplet alternatives at equivalent performance. Redundant cores occupy silicon area. The $18,500 wafer cost for TSMC N5 (a cutting-edge manufacturing process) is publicly available. What isn't public: the average number of functional cores per wafer after redundancy mapping, the distribution of that number across wafers, or the implied cost per functional petaflop before system integration. These numbers exist internally, and their absence from the IPO filing is entirely understandable; yield data is among the most competitively sensitive information a semiconductor company holds.
Steps to Evaluate Cerebras' True Competitive Position
- Verify Customer Validation: Check whether OpenAI and AWS actually deploy Cerebras hardware at scale in their production systems, not just pilot programs. This will confirm whether major cloud providers genuinely believe the technology delivers better economics than alternatives.
- Monitor Backlog Conversion: Track whether the $24.6 billion backlog converts to actual revenue over the next 12 to 24 months. A backlog is a promise; revenue is proof the technology works in practice.
- Assess Yield Transparency: Watch for any disclosure of yield data, cost per functional core, or total cost of ownership comparisons in future earnings reports or technical publications. This will resolve the central unanswered question about whether the architecture is economically viable.
- Evaluate Software Maturity: Examine whether the software ecosystem around Cerebras hardware matures beyond the initial OpenAI and AWS deployments. Adoption by smaller companies will indicate whether the technology is truly accessible or remains limited to well-resourced customers.
What Makes This Story Different From Previous Cerebras Coverage?
Almost every skeptical take on Cerebras was built on the assumption that the company was dependent on a single patron. That assumption no longer holds. The question worth asking now is fundamentally different: does the architecture actually work at scale, and is the niche it wins large enough to support a $23 billion company?
The combination of OpenAI and AWS as customers doesn't just diversify revenue; it shifts how the rest of the market perceives the technology. These are credible signal-senders. OpenAI runs inference at a scale where a 2-fold improvement in tokens per second per user is a direct product quality and margin metric. AWS integration means that when the service launches in H2 2026, it becomes accessible to every company already running on AWS without a new procurement process. That's distribution Cerebras couldn't have bought directly.
The engineering consequence of the wafer-scale approach is striking: a 175 billion parameter model can reportedly be trained with 565 lines of code on a Cerebras system versus roughly 20,000 lines on a 4,000-GPU cluster. That's not just a performance claim; it's a human capital argument that most benchmark comparisons ignore entirely. Simpler code means faster development cycles, fewer bugs, and lower engineering costs for customers building AI systems.
The IPO filing describes a company that looks nothing like the one the prevailing narrative assumed still existed. The single-patron story is largely over, and it was over before the filing made it official. What remains is a genuine technical question: whether a company built on a radically different architectural approach can sustain growth when the rest of the industry has explicitly rejected that approach. The answer will determine whether Cerebras becomes a foundational player in AI infrastructure or a niche vendor with impressive benchmarks but limited market reach.