DeepSeek V4 Is the Cheapest Frontier AI Model Yet, But It Still Can't Beat Claude on Code
DeepSeek V4 is the most affordable frontier-tier AI model available, with pricing that undercuts competitors by 100x on some tasks, but hands-on testing shows it still trails Claude and GPT-5.5 for demanding coding work. The Chinese AI lab released two versions on April 24, 2026: V4-Pro with 1.6 trillion parameters and V4-Flash with 284 billion parameters. Both come with a 1 million token context window (enough to process roughly 750,000 words at once) and MIT licensing, making them open-weight models anyone can download and run locally.
How Much Cheaper Is DeepSeek V4 Than Other AI Models?
The pricing gap is dramatic. V4-Flash costs just $0.14 per million input tokens, while V4-Pro runs $1.74 per million input tokens. Compare that to Claude Opus 4.7 at $25 per million tokens or GPT-5.5 at $30 per million tokens. In one developer's side-by-side test, a full code audit of a TypeScript endpoint cost $0.09 on V4-Pro, a task that would run an estimated $9 to $13 on Claude Opus 4.7. The headline numbers, with a back-of-the-envelope cost comparison sketched in code after the list:
- V4-Flash Input Cost: $0.14 per million tokens, the cheapest frontier-tier model available anywhere
- V4-Pro Input Cost: $1.74 per million tokens, cheaper than all major closed-source competitors
- Cache Hit Discount: 99% off on cached tokens, which dramatically reduces costs for workflows that reuse system prompts or long context windows
- Competitive Comparison: GPT-5.5 costs $30 per million tokens, Claude Opus 4.7 costs $25 per million tokens
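To make the gap concrete, here is a minimal cost sketch in Python. The per-million-token prices are the ones cited above; the `cached_fraction` handling assumes the 99% cache-hit discount applies linearly to cached input tokens, and output-token pricing (typically higher) is ignored.

```python
# Back-of-the-envelope input-cost comparison using the prices cited above.
# Output tokens (billed at higher rates) are ignored for simplicity.

PRICE_PER_M_INPUT = {
    "deepseek-v4-flash": 0.14,
    "deepseek-v4-pro": 1.74,
    "claude-opus-4.7": 25.00,
    "gpt-5.5": 30.00,
}

CACHE_DISCOUNT = 0.99  # DeepSeek's 99%-off rate for cached input tokens


def input_cost(model: str, tokens: int, cached_fraction: float = 0.0) -> float:
    """Estimated input cost in USD; `cached_fraction` is the share of
    tokens served from the prompt cache at the 99% discount."""
    rate = PRICE_PER_M_INPUT[model] / 1_000_000
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return fresh * rate + cached * rate * (1 - CACHE_DISCOUNT)


# A 200k-token codebase audit with no cache hits:
for model in PRICE_PER_M_INPUT:
    print(f"{model:>20}: ${input_cost(model, 200_000):.2f}")
# deepseek-v4-flash: $0.03, deepseek-v4-pro: $0.35,
# claude-opus-4.7: $5.00, gpt-5.5: $6.00
```

The shape of the result matches the audit anecdote: pennies on DeepSeek, dollars on the closed models, before output tokens widen the gap further.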
How Does V4 Actually Perform on Real Coding Tasks?
A professional developer tested V4-Pro on three real-world scenarios: auditing a React Router 7 codebase with TypeScript, building a poker simulation with complex logic, and generating web designs from scratch. The results were mixed. On the codebase audit, V4-Pro found some legitimate issues but also flagged false positives, such as style nitpicks and refactoring suggestions that would actually make code worse. It missed a couple of problems that both GPT-5.5 and Claude caught immediately.
On the poker simulation test, V4 produced working code with correct statistics and reasonable structure, but Claude and GPT-5.5 delivered cleaner separation between components and more idiomatic code. The tester described V4's output as "a competent junior engineer's first pass that you'd then refactor," while the competitors' versions felt like something a senior engineer would commit directly.
For web design, V4 showed interesting behavior. It produced a coffee roaster website that was safe and competent but used a cookie-cutter template approach. However, a pop culture shop design came out "genuinely good" with striking layout and confident typography. This suggests V4 defaults to safe templates unless the prompt subject pulls it toward distinctive design patterns.
What Do the Benchmarks Actually Say About V4's Capabilities?
V4-Pro performs well on some standardized tests but reveals weaknesses on harder evaluations. On SWE-Bench Verified (a coding benchmark), V4-Pro scored 80.6%, essentially tied with Claude Opus 4.6 at 80.8%. However, on SWE-Bench Pro, the more realistic and difficult version, V4-Pro landed around 55%, behind Claude Opus 4.7 at 64.3%, Kimi K2.6 at 58.6%, and GLM-5.1 at 58.4%.
The U.S. government's CAISI evaluation at NIST tested V4-Pro on non-public benchmarks and placed it closer to GPT-5 (about 8 months old) than to more recent models like GPT-5.4 or Opus 4.6. This points to some overfitting on public benchmarks, which is common across the industry but worth keeping in mind when projecting real-world performance.
One significant limitation emerged in the Artificial Analysis Omniscience evaluation: V4 hallucinates at a 94% rate when uncertain, meaning it provides answers even when it doesn't actually know the information. For workflows involving retrieval-augmented generation (RAG), a technique that grounds AI responses in external documents, this requires explicit safeguards.
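Prompt-level safeguards can't eliminate the tendency, but they can make ungrounded answers detectable. Here is a minimal sketch of one common pattern, an abstention token plus a citation check. Nothing in it is DeepSeek-specific: `call_model` is a hypothetical stand-in for whatever client code you use (for example, an OpenAI-compatible chat call against DeepSeek's API), not an actual DeepSeek SDK function.

```python
# A minimal abstention guardrail for RAG pipelines.

ABSTAIN_TOKEN = "NOT_IN_CONTEXT"


def grounded_prompt(question: str, passages: list[str]) -> str:
    """Build a prompt that forces answers to cite the retrieved passages."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Answer ONLY from the numbered passages below and cite them like [1].\n"
        f"If the passages do not contain the answer, reply exactly "
        f"'{ABSTAIN_TOKEN}'.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )


def answer_with_guardrail(question: str, passages: list[str], call_model) -> str | None:
    """Return a cited answer, or None if the model abstained or skipped citations."""
    reply = call_model(grounded_prompt(question, passages))
    if ABSTAIN_TOKEN in reply:
        return None  # surface "no answer" instead of a confident guess
    if not any(f"[{i}]" in reply for i in range(1, len(passages) + 1)):
        return None  # no citation at all: treat as ungrounded and reject
    return reply
```

Rejecting uncited answers trades some recall for precision, which is usually the right trade when the base model guesses 94% of the time it doesn't know.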
What Makes V4's Architecture Different From Previous DeepSeek Models?
V4 isn't just a scaled-up version of V3. DeepSeek built a hybrid attention system combining Clustered Sparse Attention (CSA) and Head-wise Clustered Attention (HCA) that uses about 27% of the per-token compute of V3.2 at 1 million token context. The lab also trained directly in FP4 (a lower-precision format) instead of quantizing the model after training, which improves efficiency.
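DeepSeek hasn't published enough detail here to reproduce CSA or HCA, but the general idea of clustered sparse attention is easy to illustrate: group keys into clusters and let each query score only the keys in its best-matching clusters. The NumPy toy below is a generic sketch of that idea, not DeepSeek's actual mechanism; the cluster-by-position shortcut and all the sizes are illustrative assumptions.

```python
# Toy clustered sparse attention: each query attends to a fraction of keys,
# which is why per-token compute falls at long context. NOT DeepSeek's CSA/HCA.
import numpy as np


def clustered_sparse_attention(q, K, V, n_clusters=8, top_clusters=2):
    n, d = K.shape
    # Crude clustering: bucket keys by position (real systems cluster by content).
    ids = np.arange(n) * n_clusters // n
    centroids = np.stack([K[ids == c].mean(axis=0) for c in range(n_clusters)])
    # Route the query to the few clusters whose centroids match it best.
    pick = np.argsort(centroids @ q)[-top_clusters:]
    mask = np.isin(ids, pick)
    Ks, Vs = K[mask], V[mask]
    # Ordinary softmax attention, but only over the selected keys.
    scores = Ks @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    print(f"scored {mask.sum()}/{n} keys ({mask.sum() / n:.0%} of dense compute)")
    return w @ Vs


rng = np.random.default_rng(0)
n, d = 4096, 64
out = clustered_sparse_attention(rng.normal(size=d), rng.normal(size=(n, d)),
                                 rng.normal(size=(n, d)))
# scored 1024/4096 keys (25% of dense compute)
```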
The reasoning capabilities also changed. Instead of choosing between separate chat and reasoner models like before, V4 offers three reasoning modes: Non-Think, Think High, and Think Max. Tool calls now work inside thinking mode, something the previous R1 model couldn't do. However, V4 lacks multimodal capabilities, meaning it only processes text input and output. For vision tasks, users need alternatives like Kimi K2.6 or Gemini.
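In practice, mode selection will presumably be a request-level switch. The sketch below uses the OpenAI-compatible client that DeepSeek's API already supports, but the model id and the `thinking` field are assumptions rather than confirmed parameter names; check DeepSeek's documentation before relying on them.

```python
# Hypothetical sketch of selecting a reasoning mode on DeepSeek's
# OpenAI-compatible API. The `thinking` field is an ASSUMED name, not
# a documented parameter; the model id is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="deepseek-chat",            # placeholder id for V4
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"thinking": "high"},  # assumed switch: non-think / high / max
)
print(resp.choices[0].message.content)
```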
Who Should Actually Use DeepSeek V4?
The tester's conclusion captures V4's positioning: "V4 is competent on everything, outstanding on nothing. Which is exactly the right shape for a value model." For high-volume work where cost matters more than perfection, V4 is hard to beat. At $0.14 per million input tokens for Flash, running thousands of routine tasks becomes economical. For demanding coding work, complex reasoning, or tasks where quality is non-negotiable, Claude Opus 4.7 and GPT-5.5 remain better choices despite their higher cost.
The context window advantage also matters. V4-Pro's 1 million token window matches the largest competitors and enables processing of entire codebases, long documents, or extended conversations without truncation. Combined with the 99% discount on cache hits, this makes V4 particularly attractive for agentic workflows that repeatedly send large system prompts or context.
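A quick worked example shows why the cache discount dominates agentic economics. Assume, hypothetically, an agent that resends a 150k-token system prompt across 50 turns, with the shared prefix hitting the cache on every turn after the first; the rate is V4-Pro's published input price.

```python
# Rough math on cache savings for an agent loop that resends a large
# system prompt every turn. Workload sizes are illustrative assumptions.

PRO_RATE = 1.74 / 1_000_000   # USD per input token, V4-Pro
CACHE_RATE = PRO_RATE * 0.01  # 99% discount on cache hits

system_prompt = 150_000  # tokens of codebase context resent each turn
per_turn_new = 2_000     # fresh tokens per turn
turns = 50

no_cache = turns * (system_prompt + per_turn_new) * PRO_RATE
with_cache = (system_prompt * PRO_RATE                    # first turn, cold
              + (turns - 1) * system_prompt * CACHE_RATE  # later turns, cached
              + turns * per_turn_new * PRO_RATE)
print(f"without cache: ${no_cache:.2f}, with cache: ${with_cache:.2f}")
# without cache: $13.22, with cache: $0.56
```

At that rate the prompt cache, not the base price, is doing most of the work of keeping long-context agent loops affordable.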