DeepSeek V4 Went From Broken to 100x Faster in 43 Days. Here's How the Open-Source Community Did It.
When DeepSeek V4 launched on April 24, 2026, it arrived as a computational powerhouse on paper but a sluggish performer in practice. The 1.6-trillion-parameter model featured cutting-edge architecture and impressive benchmark scores, yet Day 0 inference was unpolished. What happened next reveals how the global open-source community can outpace traditional corporate optimization timelines. By Day 26, engineers had achieved a more than 100x performance improvement across multiple hardware platforms, transforming a brilliant but impractical model into a genuinely usable system.
What Made DeepSeek V4's Launch Performance So Poor?
DeepSeek V4 arrived in two variants: V4-Pro with 1.6 trillion total parameters and roughly 49 billion active parameters per token, and V4-Flash with 284 billion total parameters and 13 billion active. Both featured a 1-million-token context window, meaning they could process roughly 750,000 words at once. The model's architecture included innovations like Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) that drew genuine praise from developers.
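To put those parameter counts in perspective, the short Python sketch below computes what share of each variant's weights is active for any one token, using only the figures quoted above. The 0.75 words-per-token ratio is an assumed rule of thumb for English text, not a number published for this model, and the helper name is illustrative.

```python
# Figures quoted in the article; WORDS_PER_TOKEN is an assumed heuristic.
VARIANTS = {
    "V4-Pro": {"total_params": 1.6e12, "active_params": 49e9},
    "V4-Flash": {"total_params": 284e9, "active_params": 13e9},
}
CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75  # assumption: common rule of thumb for English text


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights that take part in one token's forward pass."""
    return active_params / total_params


for name, spec in VARIANTS.items():
    frac = active_fraction(spec["total_params"], spec["active_params"])
    print(f"{name}: {frac:.1%} of parameters active per token")

print(f"Context window: about {CONTEXT_TOKENS * WORDS_PER_TOKEN:,.0f} words")
```

Both variants activate only a few percent of their weights per token, which is why a mixture-of-experts model of this size is demanding on memory capacity while staying comparatively cheap in compute per token.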
But the infrastructure wasn't ready. CUDA vLLM and CUDA SGLang, the frameworks that run large language models on Nvidia GPUs, worked "out of the box" in the most literal sense: the model could load and run, but slowly. TensorRT-LLM, Nvidia's official inference engine, didn't even function properly at launch. The SemiAnalysis team had to submit a pull request to fix Nvidia's own open-source kernel code. AMD's ROCm framework also struggled with compatibility. The developer community consensus was blunt: DeepSeek V4 still trailed cutting-edge closed-source models by three to six months, particularly in complex reasoning tasks.
How Did the Open-Source Community Achieve 100x Speed Gains?
The acceleration came from several converging forces. First, framework optimizations landed on the main branches rapidly. vLLM and SGLang, the backbone frameworks of the global machine learning ecosystem, became the focus of intense optimization work. Both teams eventually launched their own companies, Inferact and RadixArk respectively, raising hundreds of millions of dollars to continue building open-source inference infrastructure. These weren't hobbyist projects; they were production-grade engines that prioritized iteration speed over feature polish.
Second, batch invariance and key-value cache compression changed the game. vLLM v0.22 introduced batch invariance, delivering a 28.9% latency improvement while preserving accuracy. Rust frontends replaced Python on the inference hot paths, eliminating bottlenecks in the code that runs most frequently. Most critically, thanks to KV cache compression via CSA and HCA, DeepSeek V4 required only 27% of the single-token inference compute and just 10% of the key-value cache of DeepSeek V3.2 in the 1-million-token context setting. That 10% figure represents the difference between a model that wrecks your cloud computing budget and one that scales affordably.
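The percentages above are easier to compare once converted into "times smaller" factors. The following is a minimal sketch of that arithmetic. The two ratios (27% and 10%) come from the paragraph above. The 28.9% figure is read here as a reduction in latency, which is an interpretation rather than something the source spells out. The baseline cache size and memory budget are purely hypothetical numbers chosen for illustration, and both function names are made up for this example.

```python
def reduction_factor(remaining_fraction: float) -> float:
    """How many times smaller a cost becomes when only this fraction of it remains."""
    if not 0 < remaining_fraction <= 1:
        raise ValueError("remaining_fraction must be in (0, 1]")
    return 1.0 / remaining_fraction


def sequences_that_fit(memory_budget_gb: float, kv_gb_per_sequence: float) -> int:
    """Number of whole sequences whose KV cache fits in the given memory budget."""
    return int(memory_budget_gb // kv_gb_per_sequence)


# Ratios quoted in the article (relative to DeepSeek V3.2 at 1M-token context).
FLOPS_REMAINING = 0.27
KV_REMAINING = 0.10
# Assumption: "28.9% latency improvement" means latency dropped by 28.9%.
LATENCY_REMAINING = 1 - 0.289

print(f"Compute per token: {reduction_factor(FLOPS_REMAINING):.2f}x smaller")
print(f"KV cache:          {reduction_factor(KV_REMAINING):.2f}x smaller")
print(f"Latency speedup:   {reduction_factor(LATENCY_REMAINING):.2f}x")

# Hypothetical illustration only: neither number below comes from the article.
BASELINE_KV_GB = 50.0     # assumed KV cache per long sequence before compression
MEMORY_BUDGET_GB = 100.0  # assumed memory left over for KV cache
compressed_kv_gb = BASELINE_KV_GB * KV_REMAINING
print("Sequences before:", sequences_that_fit(MEMORY_BUDGET_GB, BASELINE_KV_GB))
print("Sequences after: ", sequences_that_fit(MEMORY_BUDGET_GB, compressed_kv_gb))
```

Under these made-up numbers, the same memory budget serves roughly ten times as many long-context requests, which is the practical meaning of the 10% figure.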
Which Hardware Platforms Benefited Most From the Optimization Sprint?
The 43-day optimization period tested DeepSeek V4 across four distinct hardware platforms, each with its own performance story. The Nvidia GB300 NVL72 is not a traditional server but a supercomputer in a rack: 72 Nvidia Blackwell Ultra GPUs, 36 Grace CPUs, 37 terabytes of GPU memory, and 130 terabytes per second of NVLink bandwidth. When DeepSeek V4 launched, CoreWeave contributed two spare GB300 NVL72 racks to the open-source community, running around the clock to drive improvements. Compared to Hopper-based platforms, the GB300 NVL72 delivered up to a 50x overall increase in AI factory output performance, a 10x boost in user responsiveness, and a 5x improvement in throughput per megawatt.
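As a quick sanity check on the rack-level figures, the sketch below divides the quoted totals evenly across the 72 GPUs. The even split is an assumption made for illustration; it says nothing about how memory is physically distributed inside the rack.

```python
# Rack-level figures quoted in the article for the GB300 NVL72.
GPUS = 72
GRACE_CPUS = 36
RACK_MEMORY_TB = 37
NVLINK_TBPS = 130

# Assumption: totals are split evenly across GPUs for a rough per-device view.
gpus_per_cpu = GPUS / GRACE_CPUS
memory_per_gpu_gb = RACK_MEMORY_TB * 1000 / GPUS
nvlink_per_gpu_tbps = NVLINK_TBPS / GPUS

print(f"GPUs per Grace CPU: {gpus_per_cpu:.0f}")
print(f"Memory per GPU:     {memory_per_gpu_gb:.0f} GB")
print(f"NVLink per GPU:     {nvlink_per_gpu_tbps:.2f} TB/s")
```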
Nvidia's B200 served as the reliable middle child of the Blackwell family, with 180 gigabytes of HBM3E memory and 6.1 terabytes per second of bandwidth. For training, B200 remained competitive with AMD's MI355X. However, for inference, the gap widened. The GB200 NVL72 cluster, consisting of 72 B200 GPUs, delivered up to 28 times the throughput of a comparable MI355X cluster in one DeepSeek-R1 benchmark.
AMD's MI355X emerged as the surprising underdog. With 288 gigabytes of HBM3E memory compared to B200's 180 gigabytes, and 8 terabytes per second of bandwidth, the MI355X delivered roughly 20 to 30% higher inference throughput than B200 on DeepSeek-R1 and Llama 3 70B in vLLM and SGLang benchmarks. AMD claimed 1.4x higher throughput than B200 when serving DeepSeek-R1 at scale. However, this advantage was regime-dependent: for dense architectures and smaller mixture-of-experts models, B200 still led. And when scaling frontier-class mixture-of-experts models like DeepSeek-R1 beyond a single node, all 8-GPU systems hit a scaling ceiling due to communication bottlenecks.
How to Understand the Hardware Performance Differences
- Memory Capacity Matters for Large Models: AMD's MI355X advantage stemmed from its 288 gigabytes of memory versus B200's 180 gigabytes, allowing it to handle larger mixture-of-experts models more efficiently without offloading data to slower storage.
- Bandwidth Determines Real-World Speed: The MI355X's 8 terabytes per second bandwidth proved critical for mixture-of-experts inference, where data movement between GPU memory and compute cores becomes the bottleneck rather than raw computing power.
- Architecture Alignment Affects Optimization: Huawei's Ascend 950DT was co-designed in part for DeepSeek V4 inference, with the model's architecture and Huawei's accelerator roadmaps aligned from the start, demonstrating how hardware-software co-design can unlock performance gains.
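The capacity and bandwidth points above can be made concrete with a back-of-the-envelope calculation. The sketch below uses the memory and bandwidth figures quoted in this article and two simplifying assumptions that are not from the source: weights are stored at one byte per parameter, and decoding is purely limited by reading the active weights from memory once per token. It ignores KV cache, activations, sharding overhead, and interconnect traffic, so treat the results as rough bounds rather than predictions. The function names are illustrative.

```python
import math

# Per-device figures quoted in the article (Ascend bandwidth rounded from "nearly 4").
ACCELERATORS = {
    "B200": {"memory_gb": 180, "bandwidth_tbps": 6.1},
    "MI355X": {"memory_gb": 288, "bandwidth_tbps": 8.0},
    "Ascend 950DT": {"memory_gb": 144, "bandwidth_tbps": 4.0},
}
TOTAL_PARAMS = 1.6e12    # V4-Pro total parameters
ACTIVE_PARAMS = 49e9     # V4-Pro active parameters per token
BYTES_PER_PARAM = 1.0    # assumption: one byte per weight (FP8-style storage)


def min_devices_for_weights(total_params: float, bytes_per_param: float,
                            memory_gb: float) -> int:
    """Fewest devices whose combined memory holds the weights alone."""
    weight_gb = total_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / memory_gb)


def decode_tokens_per_second_bound(active_params: float, bytes_per_param: float,
                                   bandwidth_tbps: float) -> float:
    """Upper bound on single-sequence decode speed if only weight reads matter."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_tbps * 1e12 / bytes_per_token


for name, spec in ACCELERATORS.items():
    devices = min_devices_for_weights(TOTAL_PARAMS, BYTES_PER_PARAM, spec["memory_gb"])
    bound = decode_tokens_per_second_bound(ACTIVE_PARAMS, BYTES_PER_PARAM,
                                           spec["bandwidth_tbps"])
    print(f"{name}: at least {devices} devices for weights, "
          f"about {bound:.0f} tokens/s bandwidth bound")
```

Under these assumptions, larger per-device memory means fewer devices are needed just to hold the model, and higher bandwidth raises the ceiling on decode speed. Real throughput depends heavily on batching, kernels, and interconnects.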
Huawei's Ascend 950DT represented China's answer to Nvidia's dominance. The 950DT variant included 144 gigabytes of in-house HBM with nearly 4 terabytes per second of bandwidth, specifically targeting exascale FP8 workloads in Huawei's Atlas SuperPods. Its fit with DeepSeek V4 was deliberate, not an afterthought.
What Does This Performance Sprint Mean for AI Infrastructure?
The 43-day optimization cycle revealed a fundamental truth about open-source development: when you're working in the open, a bottleneck gets identified and its fix gets merged in days, not quarters. The InferenceX engineering team, under the technical leadership of HaiShaw, pulled multiple all-nighters to measure and improve DeepSeek V4's performance across every major framework. The result wasn't incremental; it was transformative.
This performance improvement matters because it demonstrates that brilliant model architecture alone isn't enough. The infrastructure that runs models, the frameworks that optimize them, and the community that debugs them are equally critical. DeepSeek V4 arrived as a Formula 1 car in a million pieces. The chassis was revolutionary, the engine specs were insane, but on race day, it wasn't turning laps. By Day 26, it was racing at full speed. That transformation came not from the model creators but from the global open-source ecosystem rallying together to solve a shared problem.