NVIDIA's New 550B AI Model Cracks the Long-Running Agent Problem with Verifiable Rewards
NVIDIA has released Nemotron 3 Ultra, a 550-billion-parameter AI model built for long-running agents. Its goal is to keep agents accurate and fast as they plan, call tools, and reason across dozens of steps, without inference costs exploding. The model's post-training relies on RLVR (Reinforcement Learning with Verifiable Rewards), which trains agents across multiple environments simultaneously, a setup that gets harder to manage as agents grow more complex.
The core innovation lies in how NVIDIA tackled the efficiency problem. As AI agents run longer, they generate more tokens, which drives up computational costs. Nemotron 3 Ultra achieves up to 6 times higher inference throughput than comparable open-source large language models (LLMs) while maintaining comparable accuracy. In a setting with an 8,000-token input and a 64,000-token output, the model reaches 5.9 times the throughput of GLM-5.1, a competing 754-billion-parameter model.
What Makes RLVR Different From Traditional AI Training?
RLVR stands for Reinforcement Learning with Verifiable Rewards, meaning the reward comes from a check that can be run automatically, such as a math answer matching its reference or code passing its tests. Reward signals in reinforcement learning are often sparse and difficult to define consistently across tasks. NVIDIA's approach trains the model across many environments at once: terminal use, software engineering, search, math, code, and safety tasks. Each environment has its own reward structure, and the model must learn from all of them in the same training run.
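To make the idea of a verifiable reward concrete, here is a minimal Python sketch. It is not NVIDIA's implementation: the function names, the `VERIFIERS` table, and the exact-match and unit-test checks are illustrative assumptions, and a real environment would run submitted code in a sandbox.

```python
from typing import Callable, Dict, List, Tuple


def math_reward(answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def code_reward(source: str, func_name: str,
                cases: List[Tuple[tuple, object]]) -> float:
    """Return 1.0 only if the submitted function passes every test case."""
    namespace: dict = {}
    try:
        # Assumption for illustration: a real system would sandbox this call.
        exec(source, namespace)
        fn = namespace[func_name]
        passed = all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return 0.0
    return 1.0 if passed else 0.0


# Hypothetical registry: one automatic checker per environment.
VERIFIERS: Dict[str, Callable[..., float]] = {
    "math": math_reward,
    "code": code_reward,
}

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0)]

math_score = VERIFIERS["math"](" 42 ", "42")
good_score = VERIFIERS["code"](good, "add", cases)
bad_score = VERIFIERS["code"](bad, "add", cases)
```

Each checker returns a single pass/fail number per attempt, which is why rewards of this kind are sparse: the model learns whether the whole attempt worked, not which step went wrong.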
The challenge with multi-environment training is that the learning signal gets diluted as the number of environments grows. To solve this, NVIDIA trained more than ten domain-specialized teacher models, each with its own training pipeline. During a process called Multi-teacher On-Policy Distillation (MOPD), the student model generates its own rollouts across domains, and each rollout is scored by the matching teacher with dense, token-level guidance. This creates a much richer learning signal than sparse rewards alone could provide.
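The difference between a sparse reward and dense, token-level guidance can be sketched in a few lines of Python. The article does not state the scoring formula NVIDIA uses, so this example assumes one common choice from on-policy distillation: the per-token difference between teacher and student log-probabilities. The probability tables and the `dense_signal` function are hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical next-token probabilities; real models condition on context.
STUDENT: Dict[str, float] = {"a": 0.5, "b": 0.5}
TEACHERS: Dict[str, Dict[str, float]] = {
    "math": {"a": 0.9, "b": 0.1},
    "code": {"a": 0.2, "b": 0.8},
}


def dense_signal(rollout: List[str], domain: str) -> List[float]:
    """Score each student-sampled token with the teacher for its domain.

    The value log p_teacher - log p_student is positive where the teacher
    prefers the token more than the student does, negative otherwise.
    """
    teacher = TEACHERS[domain]
    return [math.log(teacher[tok]) - math.log(STUDENT[tok]) for tok in rollout]


rollout = ["a", "b", "a"]  # tokens the student generated itself (on-policy)
math_signal = dense_signal(rollout, "math")
code_signal = dense_signal(rollout, "code")
```

The same rollout yields one value per token, and the sign of each value depends on which domain teacher scores it. A sparse reward would give a single number for the whole rollout.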
How Does Nemotron 3 Ultra Achieve Its Speed Advantage?
The model uses a hybrid Mamba-Attention architecture instead of a pure Transformer design. Mamba layers handle long sequences with sub-quadratic scaling, meaning the computational cost doesn't explode as sequences get longer. A few Attention layers are kept for precise recall over large contexts. This hybrid approach is particularly important for agents because Mamba's per-step decoding cost stays constant as sequence length grows, which is why throughput gains widen on long, decode-heavy workloads.
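A toy cost model shows why the gap widens on decode-heavy workloads. This is a rough sketch, not a performance model of Nemotron 3 Ultra: the layer counts and the `state_cost` constant are hypothetical, and it ignores everything except the sequence-mixing layers.

```python
def attention_decode_cost(prompt_len: int, output_len: int) -> int:
    """Total attention cost while decoding.

    Step t (0-based) reads prompt_len + t cached tokens, so per-step cost
    grows with the sequence.
    """
    return output_len * prompt_len + output_len * (output_len - 1) // 2


def state_space_decode_cost(output_len: int, state_cost: int) -> int:
    """A state-space layer updates a fixed-size state: constant cost per step."""
    return output_len * state_cost


def speedup_over_pure_attention(prompt_len: int, output_len: int,
                                attn_layers: int = 6,      # hypothetical
                                mamba_layers: int = 42,    # hypothetical
                                state_cost: int = 4096) -> float:  # hypothetical
    """Ratio of pure-attention cost to hybrid cost for the same layer count."""
    attn = attention_decode_cost(prompt_len, output_len)
    pure = (attn_layers + mamba_layers) * attn
    hybrid = (attn_layers * attn
              + mamba_layers * state_space_decode_cost(output_len, state_cost))
    return pure / hybrid


short_run = speedup_over_pure_attention(8_000, 1_000)
long_run = speedup_over_pure_attention(8_000, 64_000)
```

Under these assumptions `long_run` exceeds `short_run`: the longer the output, the more the growing attention cost dominates, and the more the constant-cost layers save.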
Nemotron 3 Ultra also uses a Mixture-of-Experts (MoE) architecture, where only 55 billion of its 550 billion parameters are active per token. This design improves accuracy per active parameter compared to standard approaches. The model includes 512 experts per layer, with only the top 22 activated per token, keeping computational overhead manageable.
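The routing step can be illustrated with a small top-k selector. The article gives only the expert counts (512 per layer, 22 active), so the router below is a generic sketch: the random logits and the softmax over the selected experts are assumptions, not a description of NVIDIA's router.

```python
import math
import random
from typing import Dict, List

NUM_EXPERTS = 512  # experts per layer, from the article
TOP_K = 22         # experts activated per token, from the article


def top_k_route(logits: List[float], k: int) -> Dict[int, float]:
    """Pick the k highest-scoring experts and normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in scores
routing = top_k_route(router_logits, TOP_K)

active_expert_fraction = TOP_K / NUM_EXPERTS  # about 4.3% of experts per token
active_param_fraction = 55 / 550              # 10% of all parameters per token
```

The active parameter share (10 percent) is larger than the active expert share (about 4.3 percent), presumably because attention, Mamba, embedding, and any other non-expert weights run for every token.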
The model was pre-trained on 20 trillion text tokens and extended to handle a context window of 1 million tokens, roughly 750,000 English words at the common rule of thumb of about 0.75 words per token. This long-context capability is crucial for agents that need to reference earlier steps in their reasoning. On the RULER benchmark at 1 million tokens, Nemotron 3 Ultra scored 94.7, significantly outperforming larger comparison models that top out at 256,000-token contexts.
Steps in Nemotron 3 Ultra's Training Pipeline
- Supervised Fine-Tuning (SFT): The model first learns from 10 million human-curated examples across diverse tasks, establishing baseline behavior before reinforcement learning begins.
- RLVR Training: The model trains across multiple environments simultaneously using verifiable rewards, learning to handle sparse and environment-dependent reward signals across terminal, software engineering, search, math, code, and safety domains.
- Multi-teacher On-Policy Distillation (MOPD): Specialized teacher models score the student's rollouts with dense guidance, providing richer learning signals than sparse rewards alone, running asynchronously with pipelined updates.
- Iterative Refinement: New teachers are initialized from the improved student after each MOPD checkpoint, and their gains merge back into the next round, with NVIDIA running two full MOPD iterations for Nemotron 3 Ultra.
The post-training pipeline includes 1 million new reinforcement learning tasks and 15 new RL environments, bringing the cumulative totals for the open Nemotron releases to 50 million SFT samples, 2 million RL tasks, and 55 RL environments. This represents a significant expansion of the training data available for agent development.
What Are the Real-World Performance Gains?
On agentic benchmarks, Nemotron 3 Ultra demonstrates competitive performance. It scores 71.9 on SWE-Bench Verified, a benchmark for software engineering tasks, and 56.4 on Terminal Bench 2.1. On reasoning tasks, it achieves 570.0 on IOI 2025, which NVIDIA frames as top-3-human-level competitive programming performance. On AA-Omniscience, a benchmark measuring hallucination tendency, it records the highest non-hallucination score in its comparison set at 78.7, suggesting a lower tendency to answer when uncertain.
Perhaps most importantly for practical deployment, NVIDIA reports up to 30 percent lower cost to task completion on real-world benchmarks like SWE-Bench and Terminal Bench, achieved through fewer tokens per turn. The model also supports three reasoning modes: reasoning-off, regular, and medium-effort. Medium-effort uses about 2.5 times fewer tokens than regular mode at the cost of roughly a 7 percent accuracy drop, providing a useful efficiency lever for high-volume agent steps.
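Whether the medium-effort trade is worthwhile can be checked with simple arithmetic. The sketch below assumes that a failed attempt is simply retried, that attempts are independent, and that the 7 percent drop is relative rather than absolute. The baseline token count and accuracy are hypothetical placeholders; only the 2.5x and 7 percent figures come from the article.

```python
def tokens_per_success(tokens_per_attempt: float, accuracy: float) -> float:
    """Expected tokens spent per solved task when failed attempts are retried."""
    return tokens_per_attempt / accuracy


BASE_TOKENS = 10_000.0  # hypothetical tokens per attempt in regular mode
BASE_ACCURACY = 0.70    # hypothetical regular-mode success rate

regular = tokens_per_success(BASE_TOKENS, BASE_ACCURACY)
# Medium effort: 2.5x fewer tokens, accuracy scaled by 0.93 (7% relative drop).
medium = tokens_per_success(BASE_TOKENS / 2.5, BASE_ACCURACY * 0.93)

savings_ratio = regular / medium  # equals 2.5 * 0.93 for any baseline
```

Under these assumptions the ratio is 2.5 × 0.93 ≈ 2.3 regardless of the baseline, so medium effort still spends well under half the tokens per solved task. The picture changes if a failed step is expensive to detect or cannot be retried.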
NVIDIA also emphasizes harness robustness, training the model under multiple agent harnesses per task type rather than just one. SWE-Bench Verified scores stay between 65 percent and 70.4 percent across different deployment frameworks, indicating consistent behavior regardless of how the model is deployed.
How Does Quantization Enable Efficient Deployment?
The model uses NVFP4, a 4-bit datatype with two-dimensional block quantization on weights, operating at 5.03 bits-per-element. NVIDIA describes this as the largest-scale demonstration of stable, accurate NVFP4 training to date. The reduced weight footprint has significant deployment benefits: the W4A16 path (4-bit weights, 16-bit activations) leaves room to fit Multi-Token Prediction weights on a single 8-GPU H100 node, whereas an FP8 checkpoint could not fit without spanning two nodes.
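A back-of-the-envelope calculation shows where the headroom comes from. This sketch counts weight storage only and assumes the 80 GB H100 variant. It ignores the KV cache, activations, Multi-Token Prediction weights, and runtime overhead, so it is an estimate rather than a deployment plan.

```python
PARAMS = 550e9         # total parameters, from the article
GPU_MEMORY_GB = 80     # assumption: 80 GB H100 variant
GPUS_PER_NODE = 8


def weight_footprint_gb(bits_per_element: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes) at a given precision."""
    return PARAMS * bits_per_element / 8 / 1e9


node_capacity_gb = GPU_MEMORY_GB * GPUS_PER_NODE  # 640 GB per node
nvfp4_gb = weight_footprint_gb(5.03)              # about 346 GB
fp8_gb = weight_footprint_gb(8.0)                 # 550 GB

nvfp4_headroom_gb = node_capacity_gb - nvfp4_gb   # about 294 GB left over
fp8_headroom_gb = node_capacity_gb - fp8_gb       # 90 GB left over
```

At 5.03 bits per element the weights take roughly 346 GB, leaving close to 300 GB on the node for everything else. At 8 bits they take 550 GB, leaving about 90 GB, which is consistent with the article's statement that an FP8 checkpoint needs a second node.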
On NVIDIA's Blackwell hardware, the model runs with native FP4 math. On older Hopper hardware, it runs as W4A16 since Hopper lacks native FP4 tensor cores. NVIDIA found that accuracy saturated below this precision budget, meaning higher precision added no measurable gain.
The development process was not entirely smooth. NVIDIA documented two training loss divergences and treated them as useful engineering records. The first, near 8 trillion tokens, traced to moving output-layer gradient reduction from FP32 to BF16, where the Multi-Token Prediction gradient contribution was effectively lost in BF16's limited precision. Reverting to FP32 gradient reduction re-stabilized training. The second divergence, near 16 trillion tokens, had no confirmed root cause, but NVIDIA mitigated it by annealing the learning rate early and cutting the total token horizon to 20 trillion tokens.
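The first failure mode is easy to reproduce in miniature. The sketch below emulates BF16 and FP32 rounding with Python's `struct` module and adds a small contribution to a large running total many times. The values 1.0 and 0.001 are hypothetical stand-ins for a main gradient and a Multi-Token Prediction gradient, not numbers from NVIDIA's training run.

```python
import struct
from typing import Callable


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the top 16 bits of a float32: 8 exponent bits but only
    7 explicit mantissa bits. Not intended for NaN or infinity inputs.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias with tie-to-even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(base: float, small: float, count: int,
               cast: Callable[[float], float]) -> float:
    """Add `small` to `base` `count` times, rounding after every addition."""
    total = cast(base)
    for _ in range(count):
        total = cast(total + cast(small))
    return total


bf16_total = accumulate(1.0, 0.001, 64, to_bf16)  # small term never registers
fp32_total = accumulate(1.0, 0.001, 64, to_fp32)  # small term accumulates
```

Near 1.0 the spacing between adjacent BF16 values is 2^-7 (about 0.0078), so every addition of 0.001 rounds back to 1.0 and `bf16_total` never moves, while `fp32_total` ends close to 1.064. A small gradient term summed into a much larger one can vanish in the same way.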
Nemotron 3 Ultra represents a significant step forward in making long-running agents practical and cost-effective. By combining RLVR's multi-environment training approach with efficient architecture choices and aggressive quantization, NVIDIA has created a model that addresses one of the field's most pressing challenges: how to keep AI agents accurate and affordable as they grow more complex.
" }