NVIDIA's New Security Layer Lets AI Run at Nearly Full Speed Without Sacrificing Protection
NVIDIA says it has solved a long-standing problem in enterprise AI: how to keep data and models secure during inference without crippling performance. The company's Confidential Computing (CC) feature, built into its latest Blackwell GPUs, embeds security directly into the hardware, allowing organizations to protect sensitive workloads at nearly the same speed as unsecured systems.
Why Does Security Matter for AI Inference?
As artificial intelligence becomes central to business operations, organizations face a critical tension. They need to deploy large language models (LLMs), which are AI systems trained on vast amounts of text to generate human-like responses, but they also need to protect proprietary model weights, customer data, and intellectual property from exposure during active use. This is especially urgent for regulated industries like healthcare and finance, where data breaches carry severe legal and financial consequences.
Traditional security approaches often slow down AI inference significantly, forcing companies to choose between protection and performance. NVIDIA's Confidential Computing sidesteps this dilemma by moving security to the silicon level, where it operates with minimal overhead.
How Does Hardware-Level Security Actually Work?
NVIDIA's approach spans three layers: the GPU hardware itself, the interconnect between multiple GPUs, and the system software. At the foundation, Blackwell GPUs are manufactured with a private signing key fused directly into the silicon. This key never gets exposed to software, firmware, or the host system, creating what NVIDIA calls a "hardware root of trust."
Before any sensitive workload runs, the system undergoes remote attestation. The NVIDIA Remote Attestation Service (NRAS) verifies that the GPU and CPU are in a known-good, unmodified state by checking a signed evidence bundle against a reference integrity manifest. Only after this verification passes can secrets like model decryption keys be deployed. Importantly, this attestation typically happens once at startup, not on every inference request, so it doesn't add latency to individual queries.
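The attest-once, then serve pattern described above can be sketched in a few lines of Python. This is an illustrative model only, not NRAS or any NVIDIA API: the names `verify_evidence`, `InferenceServer`, `REFERENCE_MANIFEST`, and the measurement values are hypothetical, and an HMAC with a demo key stands in for the asymmetric signature that a real device key would produce.

```python
import hashlib
import hmac

# Hypothetical stand-ins. A real deployment verifies an asymmetric signature
# chained to the GPU's fused device key, not a shared-secret HMAC.
DEMO_SIGNING_KEY = b"demo-only-key"
REFERENCE_MANIFEST = {"vbios": "a1b2", "driver": "c3d4"}


def sign(measurements: dict) -> str:
    """Produce a demo signature over the sorted measurements."""
    payload = repr(sorted(measurements.items())).encode()
    return hmac.new(DEMO_SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def verify_evidence(measurements: dict, signature: str) -> bool:
    """Accept only evidence that is authentic and matches the manifest."""
    authentic = hmac.compare_digest(sign(measurements), signature)
    return authentic and measurements == REFERENCE_MANIFEST


class InferenceServer:
    def __init__(self):
        self.model_key = None
        self.attestations = 0

    def start(self, measurements, signature, release_key):
        """Attest once at startup; release the key only on success."""
        self.attestations += 1
        if not verify_evidence(measurements, signature):
            raise PermissionError("attestation failed; key withheld")
        self.model_key = release_key()

    def infer(self, prompt: str) -> str:
        """Serve a request without re-running attestation."""
        if self.model_key is None:
            raise RuntimeError("server not attested")
        return f"response to {prompt!r}"


server = InferenceServer()
server.start(REFERENCE_MANIFEST, sign(REFERENCE_MANIFEST), lambda: b"model-key")
replies = [server.infer(p) for p in ("first", "second", "third")]
```

The point of the design is that `start` carries the verification cost once, while `infer` only checks that a key is already present, which is why attestation adds no per-request latency in this model.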
For multi-GPU setups, NVIDIA NVLink encryption protects data moving between GPUs, extending the security boundary across entire clusters. This architecture helps organizations secure their AI workloads and meet the data-protection requirements of regulations like GDPR and HIPAA without compromising on performance.
What Do the Performance Benchmarks Actually Show?
NVIDIA tested Confidential Computing on the HGX B300, a high-end multi-GPU server platform, using the Qwen 3.5 model with 397 billion parameters running at FP8 precision (an 8-bit floating-point format that reduces memory usage). Across varying concurrency levels, batch sizes, and token lengths, enabling CC produced minimal throughput and latency overhead. In most configurations, the performance hit was under 8%, and in the best cases the system retained up to 98% of the speed of unsecured inference.
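The two headline figures are two views of the same ratio: the fraction of baseline throughput retained with CC enabled, and its complement, the overhead. A short sketch of the arithmetic follows; the throughput values are made-up illustrations, not NVIDIA's measurements, and the function names are hypothetical.

```python
def retained_fraction(baseline_tps: float, cc_tps: float) -> float:
    """Fraction of baseline throughput kept when CC is enabled."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return cc_tps / baseline_tps


def overhead_percent(baseline_tps: float, cc_tps: float) -> float:
    """Throughput lost to CC, as a percentage of the baseline."""
    return (1.0 - retained_fraction(baseline_tps, cc_tps)) * 100.0


# Illustrative numbers only: 10,000 tokens/s unsecured vs. 9,800 with CC.
print(round(retained_fraction(10_000, 9_800), 2))  # 0.98
print(round(overhead_percent(10_000, 9_800), 1))   # 2.0
```

Under this definition, a configuration that retains 98% of baseline speed has a 2% overhead, and the 8% upper bound corresponds to retaining at least 92%.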
The benchmarks covered realistic enterprise scenarios, including input and output token lengths ranging from 1,024 to 8,192 tokens, concurrent requests from 4 to 256, and different batch sizes. This breadth of testing suggests that the security overhead stays low across diverse workload patterns.
How to Optimize AI Inference With Confidential Computing
NVIDIA and its partners have implemented several technical innovations to minimize the performance impact of security:
- CC-Safe Autotuning in FlashInfer: Replaces standard event timers with the GPU global timer register, allowing the system to accurately compare different kernel implementations and select the fastest one for each workload shape without being fooled by timing artifacts introduced by encryption.
- Async Device-to-Host Copy Worker in SGLang: Moves per-step token readback off the scheduler's critical path, restoring the ability to overlap computation and data copying, which Confidential Computing can otherwise make synchronous and slow.
- Piecewise CUDA Graph Support in SGLang: Adds GPU graph replay for prefill and mixed batches, reducing kernel launch overhead that is amplified in CC mode and would otherwise create bottlenecks.
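The autotuning idea in the first item, timing several candidate implementations with a trustworthy clock and keeping the fastest, can be sketched generically. This is not FlashInfer's code: `autotune` and its parameters are hypothetical, and the injectable `clock` argument stands in for the choice of timer source (here a simulated tick counter rather than a GPU timer register).

```python
import statistics
import time
from typing import Callable, Dict


def autotune(
    candidates: Dict[str, Callable[[], object]],
    clock: Callable[[], int] = time.perf_counter_ns,
    warmup: int = 2,
    repeats: int = 5,
) -> str:
    """Return the name of the candidate with the lowest median runtime."""
    timings = {}
    for name, fn in candidates.items():
        for _ in range(warmup):  # discard cold-start runs
            fn()
        samples = []
        for _ in range(repeats):
            start = clock()
            fn()
            samples.append(clock() - start)
        timings[name] = statistics.median(samples)
    return min(timings, key=timings.get)


# Simulated clock: each fake "kernel" advances a tick counter by its cost,
# so the measurement is deterministic and independent of host-side noise.
ticks = [0]


def fake_clock() -> int:
    return ticks[0]


def slow_kernel() -> None:
    ticks[0] += 50


def fast_kernel() -> None:
    ticks[0] += 10


best = autotune({"slow": slow_kernel, "fast": fast_kernel}, clock=fake_clock)
print(best)  # fast
```

Making the clock a parameter is the relevant design choice: if one timer is distorted by encryption-related artifacts, the selection logic can stay the same while a more reliable timer is swapped in.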
These optimizations address the two main sources of performance overhead in Confidential Computing: secure work submission latency (the cost of encrypting and launching GPU work) and reduced host-to-device bandwidth (the lower rate of encrypted transfers between CPU and GPU). By batching work more efficiently and reducing synchronization points, NVIDIA's framework partners have made CC practical for production deployments.
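To illustrate what "reducing synchronization points" means for token readback, here is a minimal sketch of moving a per-step copy onto a background worker so the scheduling loop does not wait for it. It uses plain Python threads and queues, not SGLang or CUDA; `AsyncReadbackWorker` and `copy_fn` are hypothetical names, and `list` stands in for a device-to-host copy.

```python
import queue
import threading


class AsyncReadbackWorker:
    """Runs copy_fn on a background thread, preserving submission order."""

    def __init__(self, copy_fn):
        self._copy_fn = copy_fn
        self._pending = queue.Queue()
        self._results = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more work
                break
            step, device_tokens = item
            self._results.put((step, self._copy_fn(device_tokens)))

    def submit(self, step, device_tokens):
        """Enqueue a readback and return immediately."""
        self._pending.put((step, device_tokens))

    def close(self):
        """Drain remaining work and return all (step, tokens) results."""
        self._pending.put(None)
        self._thread.join()
        out = []
        while not self._results.empty():
            out.append(self._results.get())
        return out


worker = AsyncReadbackWorker(copy_fn=list)  # list() stands in for the copy
for step in range(3):
    # The scheduler would launch the next step here instead of blocking.
    worker.submit(step, (step, step + 1))
results = worker.close()
print(results)  # [(0, [0, 1]), (1, [1, 2]), (2, [2, 3])]
```

Because `submit` only enqueues work, the loop that issues steps never blocks on the copy itself; the cost is that results arrive slightly later and must be matched back to their step.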
What Does This Mean for Enterprise AI Adoption?
The ability to run secure AI inference at near-native speeds removes a major barrier to enterprise adoption in regulated industries. Healthcare organizations can deploy AI diagnostic tools without exposing patient data. Financial institutions can use AI for fraud detection while protecting customer information. Telecom companies can build AI agents for customer service while keeping proprietary models and user interactions confidential.
NVIDIA's benchmarks suggest that organizations no longer need to choose between security and performance. With Confidential Computing on Blackwell GPUs, they can achieve both, making it feasible to deploy sensitive AI workloads in production environments while maintaining compliance with data protection regulations.