DeepSeek V4 Abandons Its Own Breakthrough: Why the AI Lab Ditched MLA for a Hybrid Approach
DeepSeek's latest model, V4, made a surprising architectural choice: it abandoned MLA (Multi-head Latent Attention), the attention mechanism that DeepSeek itself pioneered and used in models V2 and V3. The decision to switch to a hybrid attention system combining CSA (Compressed Sparse Attention) and HCA (Heavy Compressed Attention) caught the AI research community off guard, signaling that the field's assumptions about model architecture maturity may have been premature.
Just months before V4's release, the consensus among AI researchers was that advanced open-source model architectures had largely converged on MLA, with only incremental refinements expected from there. Other leading Chinese AI labs, including Kimi and Zhipu, had continued using MLA in their latest models. V4's pivot suggests that model architecture still has substantial room for improvement, challenging the notion that the field had settled on an optimal design.
Why Did DeepSeek Drop Its Own Innovation?
The technical reason for abandoning MLA reveals the complexity of building frontier AI systems. V4's new hybrid attention approach uses token-wise compression mechanisms that achieve compression ratios as aggressive as 4:1 or even 128:1. Had MLA been layered on top of these compression techniques, the implementation would have become extremely complex. The hybrid system alternates between CSA, which compresses sequences and identifies critical tokens from long contexts, and HCA, which aggressively compresses token information while maintaining a global view of the context.
"V4 abandoned MLA and returned to MQA (Multi-Query Attention), which is closer to the original Multi-Head Attention. Compared to V3's MLA, the new approach is a token-wise compression mechanism that achieves large-scale compression ratios of 4:1 or even 128:1 by mixing CSA and HCA. If MLA were retained on top of these compression layers, the implementation would become extremely complex. This is likely one reason V4 dropped MLA," explained Liu Yifeng, a PhD student at UCLA studying model architecture who has interned at Kimi and ByteDance Seed.
Liu Yifeng, PhD Student, UCLA
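The division of labor between the two mechanisms can be made concrete with a toy sketch. The code below is not DeepSeek's implementation: CSA and HCA have only been described at this high level, so every function name, shape, and selection rule here is an assumption. It simply illustrates the two roles, a sparse path that picks out critical tokens and a heavily pooled path that keeps a cheap global view.

```python
# Toy sketch of the two attention roles described above. NOT DeepSeek's
# implementation: CSA and HCA have only been described at a high level,
# so the names, shapes, and selection rules here are assumptions.
import torch
import torch.nn.functional as F

def csa_select(q, k, v, block=64, top_blocks=4):
    """Assumed CSA behavior: score coarse block summaries of the context,
    then attend only over the top-scoring blocks, i.e. the 'critical
    tokens' identified from a long context."""
    T, d = k.shape
    nb = T // block
    # Summarize each block by mean-pooling its keys (a common compression trick).
    k_blocks = k[: nb * block].view(nb, block, d).mean(dim=1)       # (nb, d)
    block_scores = (q @ k_blocks.T) / d ** 0.5                      # (nb,)
    keep = block_scores.topk(min(top_blocks, nb)).indices           # chosen blocks
    idx = (keep[:, None] * block + torch.arange(block)).flatten()   # token indices
    att = F.softmax((q @ k[idx].T) / d ** 0.5, dim=-1)
    return att @ v[idx]

def hca_global(q, k, v, ratio=128):
    """Assumed HCA behavior: pool the whole context down by an aggressive
    ratio (e.g. 128:1) and attend over the summaries, keeping a cheap
    global view of everything."""
    T, d = k.shape
    n = max(T // ratio, 1)
    k_c = k[: n * ratio].view(n, ratio, d).mean(dim=1)              # (n, d)
    v_c = v[: n * ratio].view(n, ratio, d).mean(dim=1)
    att = F.softmax((q @ k_c.T) / d ** 0.5, dim=-1)
    return att @ v_c

# A real model would alternate these roles across layers; summing here
# just exercises both paths for a single query vector.
T, d = 4096, 64
q, k, v = torch.randn(d), torch.randn(T, d), torch.randn(T, d)
out = csa_select(q, k, v) + hca_global(q, k, v)
```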
The shift illustrates a broader principle in AI development: system-level coupled optimization is harder than point innovation. DeepSeek introduced four tightly coupled innovations simultaneously in V4, each demanding extraordinary engineering depth to deploy together.
What Are the Four Major Innovations in V4?
- Hybrid Attention Mechanism: The new CSA and HCA combination replaces MLA, enabling more aggressive compression of token sequences while maintaining context awareness across long documents.
- mHC Residual Connections: An improved residual connection architecture that enhances how information flows through the model's layers, building on ByteDance Seed's earlier HC design.
- Muon Optimizer: A new optimization algorithm that has become a litmus test for model teams' engineering capability, requiring significant infrastructure investment to deploy at scale.
- FP4 Training: A lower-precision training approach that reduces computational requirements while maintaining model quality, pushing the boundaries of efficient large-scale training (a numeric sketch follows this list).
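To give a feel for what FP4 means at the numeric level, here is a generic fake-quantization sketch in PyTorch. It simulates the standard E2M1 FP4 format with per-block scaling; this illustrates the number format, not DeepSeek's actual training recipe, and the block size of 32 is an assumption (MX-style formats use 32-element blocks, but DeepSeek's choice is not public).

```python
# Illustrative simulation of FP4 (E2M1) quantization with per-block scaling.
# This is a generic fake-quantization sketch, not DeepSeek's training recipe.
import torch

# The 8 non-negative values representable in the E2M1 FP4 format.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Quantize-dequantize x to FP4, with one higher-precision scale per
    block of `block` elements, the usual trick for preserving dynamic range."""
    flat = x.flatten()
    pad = (-flat.numel()) % block
    flat = torch.cat([flat, flat.new_zeros(pad)]).view(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()  # per-block scale
    scale = scale.clamp(min=1e-12)
    scaled = flat / scale
    # Snap each magnitude to the nearest FP4 grid point, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    deq = FP4_GRID[idx] * scaled.sign() * scale
    return deq.flatten()[: x.numel()].view_as(x)

w = torch.randn(4, 8)
print((w - fp4_fake_quant(w)).abs().max())  # worst-case quantization error
```

With only 16 representable values per element, everything rides on those per-block scales; getting that bookkeeping right throughout a training loop, without destabilizing gradients, is where the engineering difficulty lies.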
These innovations work in concert rather than independently. The Muon optimizer, for instance, has become so technically demanding that it now serves as a marker of which AI labs have the infrastructure sophistication to compete at the frontier. Both Kimi and DeepSeek have invested heavily in deploying the Muon optimizer, and other leading labs are following suit.
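For readers curious why Muon is treated as an engineering litmus test, here is a minimal single-device sketch of the publicly described update rule: momentum, followed by approximate orthogonalization of the 2-D gradient matrix via a Newton-Schulz iteration. Production deployments add distributed sharding and numerics work that this sketch omits.

```python
# Minimal single-device sketch of the publicly described Muon update
# (momentum + Newton-Schulz orthogonalization). Production versions shard
# this across devices; none of that engineering is shown here.
import torch

def newton_schulz_orth(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)
    with a quintic Newton-Schulz iteration: just matrix multiplies."""
    a, b, c = 3.4445, -4.7750, 2.0315      # coefficients from the Muon reference code
    X = G / (G.norm() + 1e-7)              # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    """One Muon step for a 2-D weight matrix: momentum, then orthogonalize."""
    buf.mul_(momentum).add_(grad)
    update = newton_schulz_orth(buf)
    # Scale so the update magnitude is roughly shape-independent.
    update *= max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.add_(update, alpha=-lr)

W = torch.randn(256, 512)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
```

The Newton-Schulz step itself is GPU-friendly, but unlike element-wise optimizers such as Adam, it operates on whole weight matrices, so sharding it efficiently across thousands of accelerators is the infrastructure investment the article alludes to.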
How Are AI Labs Adapting to V4's Architecture?
The infrastructure implications of V4's changes are substantial. Open-source inference frameworks like SGLang, which optimize how AI models run on hardware, had to rebuild core components to support V4's new design. Prefix caching, speculative decoding, and other optimization techniques all required reworking to function with the hybrid attention scheme and HashTop-K MoE (a new routing strategy that uses token ID hash values for fixed expert assignment).
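Part of hash-based routing's appeal is that it is deterministic, which is exactly what makes optimizations like prefix caching tractable. Here is a toy sketch of the idea as the article describes it; the specific hash function and parameters below are assumptions, since HashTop-K's details are not public.

```python
# Toy sketch of hash-based fixed MoE routing as described in the article
# (expert assignment derived from the token ID's hash). The hash function
# and parameters are assumptions; HashTop-K's details are not public.
import torch

def hash_topk_route(token_ids: torch.Tensor, num_experts: int, k: int = 2):
    """Map each token ID to k experts deterministically via hashing.
    No learned gate: the same token ID always hits the same experts."""
    routes = []
    for i in range(k):
        # A simple multiplicative hash per routing slot; any fixed hash works.
        h = (token_ids * 2654435761 + i * 40503) % (2 ** 32)
        routes.append(h % num_experts)
    return torch.stack(routes, dim=-1)       # (seq_len, k) expert ids

ids = torch.tensor([101, 2009, 318, 101])    # note the repeated token 101
print(hash_topk_route(ids, num_experts=64))  # same ID -> same experts
```

Because the assignment depends only on the token ID, two requests sharing a prefix activate identical experts for that prefix, which is what lets an inference framework cache and reuse the corresponding computation.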
"DeepSeek remains the infrastructure whale. Every year, their releases give the infrastructure optimization community another year of work. Last year's MLA, DeepSeekMoE, and similar innovations kept us busy for a full year before open-source frameworks could handle them well. V4 switches to an entirely new hybrid attention scheme, meaning that prefix caching, speculative decoding, and related pipelines all need to be rebuilt," noted Zhao Chenyang, a core developer of SGLang and now at RadixArk AI.
Zhao Chenyang, Core Developer, SGLang; RadixArk AI
The engineering burden of supporting V4 extends to GPU kernel development. TileLang, a domain-specific language from Peking University, has emerged as the default tool for writing optimized GPU kernels in frontier AI labs, filling a gap between Triton and CUDA. This represents a shift in how Chinese AI labs approach infrastructure, with open-source tools becoming central to competitive advantage.
What Does V4's Scale Mean for Performance?
V4 is substantially larger than its predecessor, with 1.6 trillion parameters compared to V3's 671 billion. The model uses an extremely aggressive activation ratio of approximately 3%, meaning only about 3% of its parameters, roughly 48 billion of the 1.6 trillion, are active for any given input. This is the most aggressive activation ratio in the industry, though the efficiency gains are partially offset by increased per-query token consumption.
Early users report noticeable improvements in math reasoning, coding, and agent instruction following compared to V3, with fewer hallucinations. Coding ability, however, still lags behind closed-source models like Claude Opus 4.6 and is roughly on par with open-source peers such as Zhipu's GLM-5.1 and Kimi's K2.6. DeepSeek has priced V4 aggressively to encourage adoption, stacking a 90% discount for cached input processing on top of an original 25% promotional rate.
Why Is DeepSeek Shifting Its Narrative Away From Cost?
A significant strategic shift accompanies V4's release: DeepSeek no longer discloses training costs. In previous releases, the company emphasized cost efficiency as a core competitive advantage, publicizing how much cheaper it was to train DeepSeek models compared to competitors. V4 marks a pivot toward emphasizing pure model capability instead. This suggests DeepSeek has moved beyond competing primarily on cost and now positions itself as a capability leader.
The broader pattern emerging from V4's development reveals a divergence between Chinese and US AI labs. Chinese companies have converged on engineering optimization and cost-performance improvements, while US closed-source models pursue new capability frontiers and higher pricing strategies. Chinese open-source models, particularly from DeepSeek, Kimi, and others, are becoming the most active and committed investors in the open-source large model ecosystem.
V4's architectural choices suggest that the AI research community's assumptions about model design maturity were premature. The fact that DeepSeek abandoned its own signature innovation in favor of a more complex hybrid system indicates that frontier AI development remains in a phase of rapid architectural evolution, with system-level optimization challenges still outpacing point innovations.
" }