Xiaomi Just Did What Cerebras and Groq Spent Millions Building: 1,000-Token-Per-Second AI on Commodity GPUs
Xiaomi has achieved what specialized inference chip companies spent hundreds of millions of dollars building: ultra-fast AI model serving at 1,000 tokens per second, running on standard GPUs that any developer can rent from cloud providers today. The company's MiMo-V2.5-Pro-UltraSpeed service runs a 1-trillion-parameter model on a single 8-GPU commodity node and peaks near 1,200 tokens per second. That is roughly 15 times faster than the models behind ChatGPT and Claude, without a single custom chip.
Why Does Token Speed Matter So Much?
Tokens are the chunks of text that AI models read and write, roughly three-quarters of a word each. The speed at which a model generates them determines how fast users see responses, how many requests a server can handle at once, and whether the model is fast enough for real-time applications. To put Xiaomi's achievement in perspective, GPT-5.5 (what most ChatGPT users interact with) generates 68 tokens per second. Claude Opus 4.6 lands around 71 tokens per second. Gemini Flash hits 192 tokens per second. MiMo-V2.5-Pro-UltraSpeed does 1,000, on a model that matches Claude Opus on coding benchmarks.
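To make those figures concrete, here is a rough sketch that converts the quoted speeds into wait time for one long answer. The 1,500-word answer length is a hypothetical example, the function names are made up for illustration, and the calculation assumes a steady generation rate with no startup delay.

```python
# Generation speeds quoted above, in tokens per second.
SPEEDS_TPS = {
    "GPT-5.5": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
    "MiMo-V2.5-Pro-UltraSpeed": 1000,
}


def words_to_tokens(num_words: int) -> int:
    """Convert words to tokens with the rule of thumb of 0.75 words per token."""
    return round(num_words / 0.75)


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream num_tokens at a steady rate (startup delay ignored)."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    tokens = words_to_tokens(1500)  # hypothetical 1,500-word answer
    for name, tps in SPEEDS_TPS.items():
        seconds = generation_seconds(tokens, tps)
        print(f"{name}: {seconds:.1f} s for {tokens} tokens")
```

Under these assumptions, the same 2,000-token answer takes about half a minute at 68 tokens per second and about two seconds at 1,000.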
At 68 tokens per second, a 200-token response takes about three seconds to generate, which rules out applications with hard latency requirements. Fraud detection systems need to flag suspicious transactions in milliseconds. Real-time trading signals need to execute before market conditions shift. Live agent loops need to reason and respond without noticeable delay. At 1,000 tokens per second, the same response takes a fifth of a second, and these use cases become far more practical.
How Did Xiaomi Achieve This Without Custom Silicon?
The speedup comes from three coordinated techniques that Xiaomi calls "extreme model-system codesign." None of the three reaches 1,000 tokens per second on its own, but combined they do.
- FP4 Quantization: Instead of running the model at full numerical precision, Xiaomi compresses the expert layers (which make up most of the 1 trillion parameters) down to 4-bit. Memory footprint drops, bandwidth pressure drops, and speed increases. The critical detail is that only the expert layers are compressed while everything else stays at full precision, keeping quality loss near zero.
- DFlash Speculative Decoding: Normal speculative decoding has a small draft model guess the next few tokens one at a time, then the large model verifies them in parallel. DFlash skips sequential drafting entirely and fills a whole block of masked positions in a single forward pass. In coding tasks, the large model accepts an average of 6.3 out of 8 proposed tokens per verification round, meaning roughly six tokens are confirmed in one step instead of one.
- TileRT Inference Engine: This purpose-built inference engine keeps the entire compute pipeline continuously resident inside the GPU, eliminating per-operator launch overhead and execution gaps that normally slow down inference.
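The first two techniques lend themselves to a back-of-envelope sketch. The code below is not Xiaomi's implementation. It assumes that "full precision" means 16 bits per weight and that 95 percent of the parameters sit in expert layers (the article says only "most"), and it ignores the scaling factors that real 4-bit formats store alongside the weights. The verify_block function shows generic greedy speculative verification, not DFlash itself, and all function names are hypothetical.

```python
from typing import List


def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    return num_params * bits_per_param / 8 / 1e9


def mixed_precision_gigabytes(
    total_params: float,
    expert_fraction: float,
    expert_bits: float = 4,   # experts compressed to 4-bit
    other_bits: float = 16,   # assumption: everything else kept at 16-bit
) -> float:
    """Weight storage when only the expert layers are quantized."""
    expert = weight_gigabytes(total_params * expert_fraction, expert_bits)
    other = weight_gigabytes(total_params * (1 - expert_fraction), other_bits)
    return expert + other


def verify_block(draft: List[int], target: List[int]) -> List[int]:
    """Generic greedy verification of a drafted block.

    target[i] is the token the large model itself would emit after the prefix
    draft[:i], so target needs len(draft) + 1 entries. The longest matching
    prefix of the draft is kept, then the large model's own token is appended
    at the first mismatch (or after the block if everything matched).
    """
    accepted: List[int] = []
    for i, token in enumerate(draft):
        if token != target[i]:
            break
        accepted.append(token)
    accepted.append(target[len(accepted)])
    return accepted


if __name__ == "__main__":
    total = 1e12  # 1 trillion parameters
    print(f"All 16-bit: {weight_gigabytes(total, 16):.0f} GB")
    print(f"4-bit experts (95% assumed): {mixed_precision_gigabytes(total, 0.95):.0f} GB")
    # Draft of 4 tokens; the large model disagrees at the third position.
    print(verify_block([5, 9, 2, 7], [5, 9, 4, 1, 3]))
```

Under these assumptions the weights shrink from about 2,000 GB to about 575 GB, which is why bandwidth pressure drops. In the verification example, two drafted tokens are kept and the large model supplies the third, so one pass through the large model yields three tokens instead of one.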
What About Cerebras and Groq's Custom Chips?
Two well-funded companies built entire businesses around solving the inference speed bottleneck with custom silicon. Cerebras designed a wafer-scale chip the size of a dinner plate, packing 44 gigabytes of on-chip memory to eliminate the bandwidth bottleneck that slows GPU inference. It achieved 969 tokens per second on Meta's Llama 3.1 405B model, which is impressive, but that model is less than half the size of Xiaomi's 1-trillion-parameter MiMo-V2.5-Pro.
Groq's custom Language Processing Unit (LPU) architecture tops out around 300 to 750 tokens per second depending on the model. The critical limitation of both approaches is that neither runs on hardware you can rent from standard cloud providers like Amazon Web Services (AWS) or Microsoft Azure. Xiaomi's approach does, which fundamentally changes the economics of deploying ultra-fast inference at scale.
How Much Does This Cost, and When Can You Use It?
Xiaomi is pricing the UltraSpeed service at 3 times the standard MiMo-V2.5-Pro rate for roughly 10 times the generation speed. The API trial runs from June 9 to June 23, 2026, on an application basis with priority given to enterprise and professional developers.
For context on the underlying model's cost, MiMo-V2.5-Pro costs approximately $0.43 per million input tokens and $0.87 per million output tokens, compared to Claude Opus at $5 input and $25 output per million tokens. UltraSpeed accelerates that same MiMo-V2.5-Pro model, not a stripped-down version.
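A quick way to see what those rates mean for a bill is to price a sample workload. The sketch below assumes the 3x UltraSpeed multiplier applies equally to input and output tokens, which Xiaomi's announcement as described here does not spell out, and the workload of 2 million input and 500,000 output tokens is a hypothetical example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total cost in dollars for a given token count."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000

    def scaled(self, multiplier: float) -> "Pricing":
        """Return the same price list multiplied by a flat factor."""
        return Pricing(self.input_per_million * multiplier,
                       self.output_per_million * multiplier)


MIMO_PRO = Pricing(0.43, 0.87)       # approximate rates quoted above
CLAUDE_OPUS = Pricing(5.00, 25.00)   # rates quoted above
ULTRASPEED = MIMO_PRO.scaled(3)      # assumption: 3x on both input and output

if __name__ == "__main__":
    workload = (2_000_000, 500_000)  # hypothetical input and output token counts
    for name, pricing in [("MiMo-V2.5-Pro", MIMO_PRO),
                          ("UltraSpeed (assumed 3x)", ULTRASPEED),
                          ("Claude Opus", CLAUDE_OPUS)]:
        print(f"{name}: ${pricing.cost(*workload):.2f}")
```

On this sample workload the assumed UltraSpeed rate comes to just under $4, against roughly $1.30 for standard MiMo-V2.5-Pro and $22.50 for Claude Opus.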
Xiaomi has also open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face, and TileRT has open-sourced select modules on GitHub, allowing the community to test and verify the underlying technology.
What Does This Mean for the Inference Chip Industry?
If Xiaomi's speed claims hold up under independent scrutiny, the company has accomplished something that required hundreds of millions of dollars in custom silicon investment from Cerebras and Groq, using software running on standard hardware that any developer can access today. One important caveat is that independent third-party speed verification is not yet public. The numbers come from Xiaomi's own benchmarks and demos. Community testing through the open-sourced checkpoint will be the real verification process over the coming weeks.
The broader implication is significant for the AI infrastructure market. Xiaomi has been steadily building AI capability that most of the industry was not paying attention to. MiMo-V2.5-Pro already matched Claude Opus on coding benchmarks at a fraction of the cost before UltraSpeed was announced. If the speed claims hold, Xiaomi has demonstrated that software optimization and model-system codesign can compete with purpose-built hardware, at least for inference workloads.
Steps to Test Xiaomi's MiMo-V2.5-Pro Technology
- Apply for API Access: Submit an application to Xiaomi's MiMo-V2.5-Pro-UltraSpeed API trial between June 9 and June 23, 2026. Enterprise and professional developers receive priority consideration.
- Download the Open-Source Checkpoint: Access the MiMo-V2.5-Pro-FP4-DFlash model from Hugging Face to run inference locally and verify speed claims independently.
- Explore TileRT Modules: Review the open-sourced TileRT inference engine components on GitHub to understand how the optimization techniques work under the hood.
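Whichever route you take, the number to check is tokens per second measured on your own prompts. The helper below is a minimal sketch of such a measurement. It works with any iterable that yields tokens as they arrive, and fake_stream is a stand-in used only so the example runs on its own. How you obtain a real token stream from Xiaomi's API or from the checkpoint is not covered here.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_tokens_per_second(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Consume a token stream and report count, time to first token, and speed."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = clock()  # timestamp of the first token
        else:
            clock()  # keep one clock reading per token for a uniform cadence
        count += 1
    elapsed = clock() - start
    return {
        "tokens": count,
        "time_to_first_token_s": (first_token_at - start) if count else 0.0,
        "tokens_per_second": count / elapsed if elapsed > 0 else 0.0,
    }


def fake_stream(num_tokens: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a real model stream: yields one token every delay_s seconds."""
    for i in range(num_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    stats = measure_tokens_per_second(fake_stream(50, 0.001))
    print(stats)
```

Measuring over several prompts of different lengths, and reporting time to first token separately from steady-state speed, gives a fairer picture than a single headline number.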
The phone brand's entry into frontier-level AI inference signals a shift in where AI capability may come from next. Rather than waiting for specialized hardware makers to build custom chips, Xiaomi is betting that aggressive software optimization on commodity GPUs can deliver comparable or superior performance at lower cost and with wider accessibility.