NVIDIA's New Speech Recognition Model Challenges OpenAI's Whisper With Real-Time Multilingual Transcription
NVIDIA has released Nemotron 3.5 ASR, an open-weights speech recognition model that transcribes 40 language variants in real time from a single checkpoint, offering a streaming alternative to OpenAI's Whisper that processes audio with minimal latency and no separate language-switching overhead. The 600-million-parameter model uses a cache-aware architecture that processes each audio frame exactly once, eliminating the computational waste of traditional buffered streaming approaches.
How Does Nemotron 3.5 ASR Differ From Whisper and Other Speech Recognition Tools?
The core innovation lies in Nemotron's streaming-native design. Unlike Whisper, which processes audio in offline batches and lacks real-time capabilities, Nemotron 3.5 ASR handles live audio transcription with latency ranging from 80 milliseconds for ultra-low-delay voice agents to 1.12 seconds for maximum accuracy, all from the same model checkpoint. The model achieves this through a Cache-Aware FastConformer encoder paired with an RNNT (Recurrent Neural Network Transducer) decoder that caches encoder states and reuses them as new audio arrives, eliminating redundant processing.
Compared to commercial alternatives like Deepgram's Nova-3, AssemblyAI's Universal-3 Pro, and ElevenLabs' Scribe v2, Nemotron 3.5 ASR offers a significant advantage: it is open-weights and self-hostable. This means organizations can deploy it on their own infrastructure without relying on proprietary APIs or paying per-minute usage fees. Deepgram and AssemblyAI charge usage-based pricing around $0.0077 per minute, while ElevenLabs charges approximately $0.28 per hour. Nemotron 3.5 ASR is free to self-host, though hosted versions are available through DeepInfra on a usage-based model.
What Languages Does the Model Support, and How Does Language Detection Work?
Nemotron 3.5 ASR covers 40 language-locales in a single checkpoint, including English, Spanish, German, French, Arabic, Japanese, Korean, Mandarin, Hindi, and Thai, plus several European and Nordic languages. This breadth rivals Whisper's roughly 99 languages, but with a critical difference: Nemotron achieves it through prompt-based language identification conditioning rather than requiring separate per-language models.
Users can set the target language explicitly for best accuracy, or enable automatic language detection by setting target_lang=auto. In auto mode, the model emits a language tag after terminal punctuation, allowing a single deployment to handle mixed-language traffic without a separate language identification component. This flexibility is particularly valuable for contact centers, healthcare settings, and international customer service teams that encounter multiple languages in a single session.
How to Deploy Nemotron 3.5 ASR for Your Use Case
- Choose Your Latency Setting: Select from five predefined latency modes at inference time without retraining. The 80-millisecond ultra-low mode suits voice agents requiring immediate responses, while the 1.12-second high-accuracy mode works best for batch transcription and archival.
- Enable Language Auto-Detection or Specify a Target Language: For mixed-language environments, use target_lang=auto to let the model identify the language automatically. For single-language deployments or when accuracy is critical, specify the target language explicitly to maximize performance.
- Fine-Tune for Your Domain or Accent: Because the weights are open, teams can fine-tune the base model for specific languages, accents, or domains using public corpora like Common Voice and FLEURS. NVIDIA demonstrated 31 to 32 percent relative error reductions on Greek and Bulgarian after fine-tuning.
- Deploy on Your Infrastructure or Use a Hosted Provider: Self-host the model on your own GPUs for complete control and zero per-minute costs, or use DeepInfra's hosted version with usage-based pricing.
What Performance Improvements Has NVIDIA Demonstrated?
NVIDIA published concrete fine-tuning results on Greek and Bulgarian, two languages not heavily represented in the base training data. When fine-tuned on public corpora and evaluated at the 80-millisecond latency setting, Greek word error rate (WER) dropped from 35 percent to 24 percent, a 32 percent relative improvement. Bulgarian WER fell from 22 percent to 15 percent, a 31 percent relative improvement. These results demonstrate that the open-weights model is practical for organizations seeking to optimize transcription quality for underrepresented languages.
The model also handles punctuation and capitalization natively, eliminating the need for a separate post-processing step that many competing systems require. This built-in capability reduces deployment complexity and produces production-ready text immediately.
What Are the Key Technical Advantages of the Cache-Aware Design?
Traditional buffered streaming approaches re-process overlapping audio windows at every step, repeating computation and adding latency. Nemotron 3.5 ASR avoids this inefficiency by caching encoder self-attention and convolution activations, then reusing those cached states as new audio frames arrive. Each audio frame is processed exactly once, with no overlap. On an NVIDIA H100 GPU, this cache-aware design reportedly achieves 17 times the concurrent streams of buffered approaches, a substantial efficiency gain for high-volume transcription workloads.
The latency-accuracy tradeoff is controlled by a single inference parameter, att_context_size, which maps to chunk sizes of 80 milliseconds, 160 milliseconds, 320 milliseconds, 560 milliseconds, and 1.12 seconds. Teams can adjust this setting at inference time without retraining the model, providing flexibility to balance responsiveness and accuracy based on real-world deployment requirements.
When Will the Production Version Be Available?
NVIDIA has announced a production NIM (NVIDIA Inference Microservice) with gRPC streaming support, but this version is not yet released. The open-weights model is available now on Hugging Face under the OpenMDW-1.1 license, allowing immediate self-hosting and experimentation.
The release of Nemotron 3.5 ASR signals a shift in the speech recognition landscape. While Whisper remains the dominant open-source model for offline transcription, Nemotron 3.5 ASR addresses a critical gap: real-time, multilingual transcription without the latency penalties or per-minute costs of proprietary APIs. For organizations building voice agents, live captioning systems, or multilingual customer service platforms, the model offers a compelling alternative that combines the flexibility of open weights with the performance characteristics of commercial systems.