NVIDIA's Blackwell Chips Run 20x More AI Agents Per Megawatt, Reshaping How Companies Deploy Artificial Intelligence
NVIDIA's latest Blackwell GPU platform can run up to 20 times more AI agents per megawatt than the company's previous-generation HGX H200 hardware, according to the first published results of AgentPerf, a new benchmark from Artificial Analysis designed specifically for agentic workloads. The distinction matters because AI agents, which chain together dozens or hundreds of model calls to complete complex tasks, consume computing resources very differently than single-response chatbot interactions.
What Is an AI Agent, and Why Does This Benchmark Matter?
An AI agent is fundamentally different from a chatbot. When you ask ChatGPT a question, it processes your input once and returns a response. An agent, by contrast, breaks a goal into many steps and keeps iterating until the task is complete. It might read files, write and edit code, execute commands, compile programs, search databases, and browse the web, with each step feeding context into the next one.
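The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's agent framework: `run_agent`, `scripted_model`, and the `TOOLS` table are hypothetical names, and the "model" is a hard-coded script standing in for a real LLM call.

```python
from typing import Callable

def run_agent(model: Callable[[str], str],
              tools: dict[str, Callable[[str], str]],
              goal: str,
              max_steps: int = 10) -> tuple[list[str], int]:
    """Call the model repeatedly, feeding each tool result back into the context."""
    context = [f"GOAL: {goal}"]
    calls = 0
    for _ in range(max_steps):
        action = model("\n".join(context))  # one model call per step
        calls += 1
        if action == "DONE":
            break
        tool_name, _sep, argument = action.partition(" ")
        result = tools[tool_name](argument)
        context.append(f"{action} -> {result}")  # context grows every step
    return context, calls

def scripted_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: picks the next action from a fixed script."""
    script = ["read main.py", "edit main.py", "run tests", "DONE"]
    step = prompt.count("\n")  # one context line is added per completed step
    return script[min(step, len(script) - 1)]

# Hypothetical tools; real agents would touch files, shells, or the web.
TOOLS = {
    "read": lambda path: f"contents of {path}",
    "edit": lambda path: f"patched {path}",
    "run": lambda target: f"{target} passed",
}

context, calls = run_agent(scripted_model, TOOLS, "fix the failing test")
print(calls)         # 4 model calls for one task
print(len(context))  # 4 context lines: the goal plus three tool results
```

Even in this toy version, one task costs four model calls, and every call sees a longer prompt than the one before it.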
Existing benchmarks measure how fast a model responds to a single request and how many simultaneous requests a system can handle. They miss the chained calls, growing context windows, and tool-call latencies that define what an agent actually does in production. AgentPerf closes that gap by measuring concurrent agents at service-level objectives of 20 and 60 tokens per second per agent, two thresholds meant to capture real production responsiveness.
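One way to read a "concurrent agents at a service-level objective" figure is as the largest concurrency level at which every agent still receives at least the target token rate. The sketch below assumes that reading. It is not the published AgentPerf methodology, and the sweep values are made-up numbers for illustration only.

```python
def max_agents_at_slo(measurements: dict[int, float],
                      slo_tokens_per_sec: float) -> int:
    """Return the largest tested concurrency whose per-agent rate meets the SLO.

    `measurements` maps concurrent agent count -> tokens/sec delivered per agent.
    Returns 0 if no tested concurrency level meets the SLO.
    """
    passing = [agents for agents, rate in measurements.items()
               if rate >= slo_tokens_per_sec]
    return max(passing, default=0)

# Hypothetical sweep (not benchmark data): per-agent speed falls as load rises.
sweep = {8: 95.0, 16: 72.0, 32: 61.0, 64: 38.0, 128: 22.0, 256: 11.0}

print(max_agents_at_slo(sweep, 60))  # 32
print(max_agents_at_slo(sweep, 20))  # 128
```

The looser 20 tokens-per-second threshold admits far more concurrent agents than the 60 tokens-per-second one, which is why a result has to be quoted together with its threshold.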
"Agentic AI is a fundamentally different workload than conversational AI. A single chat completion is a sprint: one large language model call, one response. An agent functions more like a relay: It breaks a goal into many steps and keeps going until the task is done," explained Shruti Koparkar of NVIDIA.
How Does Blackwell Achieve This Massive Efficiency Gain?
The 20x improvement does not come from a single chip breakthrough. Instead, NVIDIA attributes the gain to rack-scale codesign: the GB300 NVL72 platform connects 72 GPUs (graphics processing units) into a single rack-scale system that is optimized as one unit. That allows a large mixture-of-experts model like DeepSeek V4 Pro to distribute execution across experts without paying a heavy coordination penalty.
CUDA kernels, the functions that execute on the GPUs, overlap communication and compute so the cost of routing between experts is absorbed rather than added to latency. Higher in the software stack, NVIDIA TensorRT LLM separates input processing (prefill) from output generation (decode) so each can be optimized independently as concurrent agent sessions scale.
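The effect of overlapping communication with compute can be shown with an idealized timing model. This is a back-of-the-envelope sketch, not a description of NVIDIA's kernels: it assumes every layer has the same compute and communication cost and that the two can overlap perfectly. All durations are hypothetical.

```python
def serial_latency_us(compute_us: int, comm_us: int, layers: int) -> int:
    """Each layer computes, then waits for expert routing traffic to finish."""
    return layers * (compute_us + comm_us)

def overlapped_latency_us(compute_us: int, comm_us: int, layers: int) -> int:
    """Idealized overlap: each layer costs only the slower of the two phases."""
    return layers * max(compute_us, comm_us)

# Hypothetical per-layer costs in microseconds (not measured values).
COMPUTE_US, COMM_US, LAYERS = 1000, 400, 60

print(serial_latency_us(COMPUTE_US, COMM_US, LAYERS))      # 84000
print(overlapped_latency_us(COMPUTE_US, COMM_US, LAYERS))  # 60000
```

In this model the communication cost disappears from the total as long as it stays at or below the compute cost per layer. Once communication becomes the slower phase, it sets the latency instead.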
"The complexity isn't additive; it's multiplicative," noted Shruti Koparkar of NVIDIA.
Why Power Efficiency Matters More Than Speed for Agents
For enterprises building agent fleets, per-megawatt agent throughput translates directly into how much productive work a data center investment can deliver. Agents are long-running, multi-step processes that occupy capacity for the duration of a task, not just for a single response. Power, not silicon, is increasingly the binding constraint on agent deployments at scale.
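The arithmetic behind an agents-per-megawatt figure is simple to sketch. The system sizes and power draws below are made-up inputs, chosen only so that the ratio comes out to the reported 20x. They are not measured values for any NVIDIA system.

```python
def agents_per_megawatt(concurrent_agents: int, power_kw: float) -> float:
    """Normalize concurrent agents served by a system to one megawatt of power."""
    return concurrent_agents * 1000.0 / power_kw

def fleet_capacity(power_budget_mw: float, agents_per_mw: float) -> int:
    """Agents a site can run when power, not hardware supply, is the limit."""
    return int(power_budget_mw * agents_per_mw)

# Hypothetical inputs for illustration only.
older = agents_per_megawatt(concurrent_agents=40, power_kw=100.0)
newer = agents_per_megawatt(concurrent_agents=1000, power_kw=125.0)

print(older)                     # 400.0
print(newer)                     # 8000.0
print(newer / older)             # 20.0
print(fleet_capacity(5, older))  # 2000
print(fleet_capacity(5, newer))  # 40000
```

With a fixed 5 MW power budget in this example, the efficiency ratio carries straight through to the number of agents the site can host.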
The test workload used in AgentPerf is DeepSeek V4 Pro, a frontier mixture-of-experts model representative of the systems powering today's most capable agents. The benchmark runs it against real coding-agent trajectories sourced from public repositories spanning more than 12 programming languages, with the agent reading files, writing and editing code, executing commands, and iterating on results.
Ways to Understand the Real-World Impact of This Benchmark
- Production Deployment: Three inference providers, Baseten, DeepInfra, and Together AI, are already serving DeepSeek V4 Pro and other frontier models on Blackwell for production agentic applications, demonstrating that the benchmark results translate to real-world use cases.
- Coding Agent Applications: Together AI powers real-time inference for Cursor, an AI coding platform whose agents debug issues, generate features, and execute refactors while developers continue working, showcasing how agents reduce human workload.
- Business Process Automation: DeepInfra runs Pam.ai entirely on Blackwell. The AI workforce platform for car dealerships books service appointments, handles calls, and runs outbound sales campaigns, illustrating how agents can automate entire business functions.
What Comes Next for AI Hardware Benchmarking?
These AgentPerf results are only the first round of agentic benchmarking, and NVIDIA has an obvious interest in a benchmark on which its newest rack-scale system tops the chart against its own prior generation. The real competitive test will come when AMD's MI400-class systems and custom silicon from hyperscalers post numbers on the same benchmark, and when Artificial Analysis expands the workload set beyond coding agents.
If AgentPerf becomes the standard buyers use to size agent fleets, the metric NVIDIA cares about shifts from tokens per second on a single query to concurrent agents per megawatt, a framing that favors rack-scale integration and full-stack software optimization, which is exactly where NVIDIA's competitive advantage is widest. Expect every competing accelerator vendor to either run the benchmark or argue publicly that it measures the wrong thing.
NVIDIA is pushing into its next architecture, with the company saying Vera Rubin is now in full production. CEO Jensen Huang is scheduled to keynote GTC Taipei, signaling that the company's focus on agentic workloads will remain central to its product strategy going forward.