DeepSeek V4 Pro Is Powerful and Cheap, But Still Playing Catch-Up to U.S. AI Leaders
DeepSeek V4 Pro, China's most advanced AI model, performs at the level of OpenAI's GPT-5, a model released eight months earlier, according to official U.S. government testing. The Center for Artificial Intelligence Standardization and Innovation (CAISI), part of the National Institute of Standards and Technology (NIST) under the U.S. Department of Commerce, released its evaluation in May 2026, revealing both the strengths and limitations of the Chinese AI company's latest offering.
The evaluation matters because it provides an independent, government-backed assessment of how Chinese AI compares to American models. DeepSeek released V4 Pro in late April 2026 with 1.6 trillion parameters, making it significantly more powerful than previous Chinese AI systems. However, the CAISI testing showed a notable performance gap that challenges some of DeepSeek's own claims about its capabilities.
How Do Independent AI Benchmarks Measure Real-World Performance?
CAISI conducted rigorous testing across multiple domains to evaluate DeepSeek V4 Pro fairly. Rather than relying on a single benchmark, the agency used nine different tests spanning five critical areas of AI capability. This approach, inspired by Item Response Theory, provides a more complete picture of what an AI model can actually do in practice.
- Cybersecurity Skills: CTF-Archive-Diamond measures practical hacking abilities, including the capacity to identify vulnerabilities and disrupt systems
- Software Engineering: SWE-Bench Verified and PortBench test programming capabilities and software portability across different platforms
- Scientific Reasoning: FrontierScience and GPQA-Diamond evaluate research-level scientific thinking and expert-level knowledge in specialized domains
- Abstract Reasoning: ARC-AGI-2 semi-private benchmark tests general problem-solving abilities that don't rely on specific training data
- Mathematics: OTIS-AIME-2025, PUMaC 2024, and SMT 2025 use extremely difficult competition-level math problems to assess reasoning under pressure
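An Item Response Theory-style aggregation of results like those above can be sketched with the standard two-parameter logistic model shown below. This is a generic IRT formulation for illustration, not CAISI's published methodology; the ability and difficulty values are hypothetical numbers on an arbitrary logit scale.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic IRT model: probability that a model with a
    given ability solves a test item of a given difficulty."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical values for illustration only.
ability = 1.2              # estimated ability of the model under test
items = [0.5, 1.0, 2.0]    # estimated difficulties of three test items

# Expected fraction of items solved, given those estimates.
expected_score = sum(p_correct(ability, d) for d in items) / len(items)
print(round(expected_score, 3))
```

The appeal of this approach is that item difficulties are estimated from how many systems solve each item, so a model's ability score is comparable across benchmarks of very different hardness rather than being a raw percent-correct.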
Why Does DeepSeek's Self-Reported Performance Differ From Government Testing?
DeepSeek claimed its V4 Pro performed comparably to Claude Opus 4.6 and GPT-5.4, models released just two months before V4 Pro's launch. However, CAISI's independent testing told a different story. The government agency found that V4 Pro actually matched GPT-5, which OpenAI released in August 2025, eight months earlier.
This discrepancy highlights a common pattern in AI development: companies often report benchmark scores that exceed what independent evaluators find. DeepSeek's self-reported scores were substantially higher than CAISI's results, suggesting either different testing methodologies or optimized performance on specific benchmarks that don't translate to broader capabilities.
Where Does DeepSeek V4 Pro Actually Win Against American AI?
While DeepSeek V4 Pro trails in raw performance, it dominates in one critical metric: cost-effectiveness. CAISI found that DeepSeek V4 Pro is 41 to 53 percent more cost-efficient than OpenAI's GPT-5.4 mini, the most affordable comparable American model.
The pricing picture is more nuanced than a single headline number. DeepSeek V4 Pro charges approximately $1.74 per million input tokens without caching, $0.0145 per million with caching, and about $3.48 per million output tokens. GPT-5.4 mini costs $0.75 per million uncached input tokens, $0.075 with caching, and $4.50 per million output tokens. DeepSeek is therefore more expensive for uncached input, but roughly five times cheaper for cached input and about 23 percent cheaper for output. For developers and organizations running cache-heavy, high-volume workloads, those differences compound quickly into significant savings.
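Using the per-million-token rates quoted above, a back-of-envelope comparison makes the workload dependence concrete. The monthly token volumes below are assumptions chosen to illustrate a cache-heavy usage pattern, not measurements.

```python
# Per-million-token rates quoted in the CAISI comparison (USD).
DEEPSEEK_V4_PRO = {"in_uncached": 1.74, "in_cached": 0.0145, "out": 3.48}
GPT_54_MINI = {"in_uncached": 0.75, "in_cached": 0.075, "out": 4.50}

def workload_cost(rates, uncached_m, cached_m, out_m):
    """Total cost in USD for a workload measured in millions of tokens."""
    return (rates["in_uncached"] * uncached_m
            + rates["in_cached"] * cached_m
            + rates["out"] * out_m)

# Hypothetical monthly workload: 10M uncached input tokens,
# 200M cached input tokens, 50M output tokens.
ds = workload_cost(DEEPSEEK_V4_PRO, 10, 200, 50)
oa = workload_cost(GPT_54_MINI, 10, 200, 50)
print(f"DeepSeek V4 Pro: ${ds:,.2f}")   # $194.30
print(f"GPT-5.4 mini:    ${oa:,.2f}")   # $247.50
```

Shifting the mix toward uncached input tips the comparison the other way, which is why per-workload modeling matters more than the headline rates.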
CAISI also noted that DeepSeek V4 Pro achieved a score approximately 200 points higher than Kimi K2.5, which previously held the record for the highest-scoring Chinese-made AI. In CAISI's benchmark methodology, a 200-point increase in overall score means the probability of solving a specific task is roughly three times higher, representing a meaningful leap in capability.
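Taking the article's rule of thumb at face value, that a 200-point gap corresponds to roughly a tripling of the chance of solving a task, the implied relationship between score difference and solve-rate multiplier is exponential. The function below is an extrapolation from that single stated data point, not CAISI's published formula.

```python
def solve_multiplier(score_delta, points_per_tripling=200):
    """Implied multiplier on the chance of solving a task, assuming the
    stated rule that a +200-point score gap means roughly a 3x higher
    chance. Extrapolation for other gaps is illustrative only."""
    return 3.0 ** (score_delta / points_per_tripling)

print(solve_multiplier(200))   # 3.0, the stated 200-point case
print(solve_multiplier(100))   # ~1.73, an interpolated half-gap
```

On this kind of scale, score gaps compound multiplicatively: two stacked 200-point gaps would imply roughly a 9x difference, which is why a 200-point jump over Kimi K2.5 is described as a meaningful leap rather than an incremental one.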
What Does This Mean for the AI Competition Between Nations?
The CAISI evaluation reveals a nuanced picture of AI development in 2026. Chinese AI is advancing rapidly and becoming increasingly cost-competitive, but the United States still maintains a performance lead. The eight-month gap suggests that American AI companies are innovating faster, though DeepSeek's lower costs could appeal to price-sensitive users and organizations with limited budgets.
The evaluation also exposed some technical challenges. CAISI noted that PortBench, one of the software portability tests, is not yet supported in its cost comparison methodology. Additionally, ARC-AGI-2 had technical issues when evaluating GPT-5.4 mini, preventing a complete cost-efficiency comparison across all benchmarks.
For businesses and developers, the takeaway is clear: if you need the absolute highest performance and can afford it, American models like GPT-5.4 still lead. But if you're budget-conscious and can tolerate slightly lower performance, DeepSeek V4 Pro offers compelling value. The competitive pressure from Chinese AI is forcing American companies to innovate faster and consider their pricing strategies more carefully.