The Hidden Math Behind Running AI on Your Laptop: Why Predicting Speed Matters
A new prediction tool called LENS can forecast how quickly large language models will run on neural processing units (NPUs) with a mean error of just 2.15%, addressing a fundamental challenge that has kept engineers from optimizing AI inference on these chips. The breakthrough comes from researchers who recognized that existing prediction methods fail on NPUs because these specialized AI chips work fundamentally differently from graphics processing units (GPUs), the traditional workhorses of artificial intelligence.
Why Can't We Just Use the Old Prediction Methods?
For years, engineers have relied on prediction tools to estimate how fast AI models will run on different hardware configurations. These tools work well for GPUs, which power most AI training and inference today. But NPUs, processors designed specifically for running AI models on devices ranging from laptops and phones to servers, break the assumptions those tools depend on.
The problem stems from three fundamental mismatches between NPUs and existing prediction methodologies. First, manufacturers keep the internal architecture of commercial NPUs secret, making it impossible to simulate their behavior the way researchers do with GPUs. Second, NPU compilers apply optimizations in ways that are hard to anticipate and differ dramatically from GPU compilers; GPU-based prediction methods applied to NPUs can be off by as much as 493%. Third, NPUs use a bucketing strategy in which inputs are padded to predefined lengths, so latency jumps discontinuously at bucket boundaries instead of following a smooth, predictable curve.
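To make the third mismatch concrete, here is a minimal sketch of bucketing. The bucket boundaries, the function names, and the per-token cost are all made up for illustration; real NPU compilers pick their own bucket sizes, and real latency is not a simple multiple of padded length.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries in tokens; real compilers choose their own.
BUCKETS = [128, 256, 512, 1024, 2048]

def bucket_for(input_len, buckets=BUCKETS):
    """Return the smallest bucket length that can hold input_len tokens."""
    if input_len < 1:
        raise ValueError("input_len must be positive")
    i = bisect_left(buckets, input_len)
    if i == len(buckets):
        raise ValueError(f"input of {input_len} tokens exceeds the largest bucket")
    return buckets[i]

def toy_prefill_latency_ms(input_len, ms_per_token=0.05):
    """Toy model (an assumption): cost follows the padded length, not the true one."""
    return bucket_for(input_len) * ms_per_token

if __name__ == "__main__":
    for n in (100, 128, 129, 256, 257):
        print(n, bucket_for(n), toy_prefill_latency_ms(n))
```

In this toy model a 128-token prompt and a 129-token prompt land in different buckets, so one extra token doubles the padded length, while 129 and 256 tokens cost the same. That step shape is what defeats predictors that assume a smooth curve.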
How Does LENS Actually Work?
Rather than trying to understand the internal workings of NPUs, LENS takes a radically simpler approach: it treats NPUs as black boxes and measures only what's observable from the outside. The tool profiles each bucket, which is a group of input sizes that get compiled together, using just two end-to-end measurements. From these minimal measurements, LENS can predict latency for any combination of input and output lengths within that bucket.
This approach works because of a key insight: all inputs within a bucket share the same compiled binary, so their latency patterns follow predictable rules once you understand how the bucket behaves. By reinterpreting the bucketing structure as a natural unit of measurement rather than a problem to solve, the researchers transformed a constraint into an opportunity.
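The article does not spell out LENS's internal model, so the following is only a sketch of how two measurements could pin down a bucket. It assumes, purely for illustration, that end-to-end latency inside a bucket is a fixed cost plus a constant cost per generated token. The names BucketProfile, fit_bucket, and predict_ms are hypothetical and are not taken from LENS.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BucketProfile:
    base_ms: float       # fixed cost, e.g. prefill on the padded input
    per_token_ms: float  # incremental cost per generated token

def fit_bucket(m1, m2):
    """Fit the assumed affine model from two (output_len, latency_ms) measurements."""
    (n1, t1), (n2, t2) = m1, m2
    if n1 == n2:
        raise ValueError("the two measurements need different output lengths")
    per_token = (t2 - t1) / (n2 - n1)
    base = t1 - per_token * n1
    return BucketProfile(base_ms=base, per_token_ms=per_token)

def predict_ms(profile, output_len):
    """Predict end-to-end latency for any output length served by this bucket."""
    return profile.base_ms + profile.per_token_ms * output_len

if __name__ == "__main__":
    # Two made-up end-to-end measurements for one bucket.
    profile = fit_bucket((1, 30.0), (101, 530.0))
    print(profile, predict_ms(profile, 50))
```

Two points are exactly enough to determine a model with two unknowns, which is why a two-measurement budget per bucket is plausible under this assumption. If latency inside a bucket were not close to linear in output length, more measurements or a different model would be needed.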
Why This Matters for the AI Hardware Race
NPUs have emerged as production infrastructure for running AI models efficiently. Companies like Anthropic use NPUs to serve Claude, their large language model, and Google uses them for Gemini. But deploying these models requires exploring a vast configuration space: which accelerator to use, what batch size to run, how to parallelize the workload. Without accurate latency prediction, engineers have to test every single configuration manually, which is impractical and time-consuming.
The fundamental reason NPUs exist is that large language models expose a structural mismatch with GPU architecture. When an AI model generates text, it splits into two phases with very different computational needs. The prefill phase processes the entire input prompt at once, which GPUs handle well. But the decode phase generates output tokens one at a time, reading the entire accumulated cache of previous computations from memory at every step. This makes decode memory-bound, leaving GPU compute cores underutilized. Additionally, GPUs force matrix multiplication and other operations to execute sequentially on the same processing units, further limiting efficiency.
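A rough, roofline-style calculation shows why decode ends up memory-bound. The sketch below is a simplification under stated assumptions: it counts only dense-layer weights (about two floating-point operations per parameter per token, with each parameter read once per forward pass at two bytes), it ignores the cache reads described above, and the hardware figures are invented round numbers, not any real chip's specifications.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs performed per byte of weights read in one forward pass.

    Each parameter contributes roughly one multiply and one add (2 FLOPs)
    per token, and is read from memory once per pass however many tokens
    are processed together.
    """
    return 2.0 * tokens_per_pass / bytes_per_param

def is_memory_bound(tokens_per_pass, peak_flops, mem_bandwidth_bytes):
    """True if the pass does too little work per byte to keep compute busy."""
    ridge = peak_flops / mem_bandwidth_bytes  # FLOPs per byte the chip can sustain
    return arithmetic_intensity(tokens_per_pass) < ridge

if __name__ == "__main__":
    PEAK_FLOPS = 100e12  # hypothetical: 100 TFLOP/s
    BANDWIDTH = 1e12     # hypothetical: 1 TB/s
    print("decode, 1 token:", is_memory_bound(1, PEAK_FLOPS, BANDWIDTH))
    print("prefill, 1024 tokens:", is_memory_bound(1024, PEAK_FLOPS, BANDWIDTH))
```

With these invented numbers the chip can sustain 100 operations per byte, but a single-token decode step performs only about one, so the compute units mostly wait on memory. A 1,024-token prefill amortizes each weight read over many tokens and lands on the compute-bound side. Accounting for cache reads would push decode further toward memory-bound.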
NPUs solve both problems through architectural choices specialized for inference. They use dataflow architectures that maximize data reuse on-chip, alleviating the memory bandwidth bottleneck. They also employ physically separate engines for different operation types, allowing heterogeneous kernels to execute in parallel rather than sequentially.
How to Use LENS for NPU Configuration Optimization
- Profile Each Bucket Minimally: Instead of exhaustively measuring every configuration, take just two end-to-end measurements per bucket to establish the latency pattern for that input size range.
- Compose Predictions Across Buckets: Use LENS to predict latency for arbitrary input-output length combinations by composing the bucket-level measurements, avoiding the need to recompile for every possible configuration.
- Validate Across Vendors and Models: Test predictions against NPUs from multiple manufacturers and different large language models to ensure the methodology generalizes across the ecosystem.
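The steps above can be sketched as a small search over configurations. Everything here is hypothetical: the configuration names, the bucket lengths, the profile numbers, and the assumption that each bucket reduces to a fixed cost plus a per-token cost. It illustrates the workflow and is not LENS's actual interface.

```python
from bisect import bisect_left

# Hypothetical profiles: for each configuration, bucket length maps to the
# (base_ms, per_token_ms) pair that a two-measurement fit might produce.
PROFILES = {
    "config_a": {256: (20.0, 4.0), 1024: (60.0, 4.5)},
    "config_b": {256: (35.0, 3.0), 1024: (90.0, 3.2)},
}

def predict_ms(profile, input_len, output_len):
    """Route the request to its bucket, then apply that bucket's profile."""
    lengths = sorted(profile)
    i = bisect_left(lengths, input_len)
    if i == len(lengths):
        raise ValueError("input exceeds the largest profiled bucket")
    base_ms, per_token_ms = profile[lengths[i]]
    return base_ms + per_token_ms * output_len

def best_config(profiles, workload):
    """Pick the configuration with the lowest total predicted latency.

    workload is a list of (input_len, output_len) requests.
    """
    def total(name):
        return sum(predict_ms(profiles[name], i, o) for i, o in workload)
    return min(profiles, key=total)

if __name__ == "__main__":
    print(best_config(PROFILES, [(100, 10)]))   # short outputs
    print(best_config(PROFILES, [(100, 200)]))  # long outputs
```

With these made-up numbers, the configuration with the lower fixed cost wins on short outputs and the one with the lower per-token cost wins on long outputs, so the best choice depends on the workload. Once every bucket is profiled, comparing configurations is a table lookup and a sum, with no further hardware runs.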
The researchers validated LENS across NPUs from multiple vendors, several different large language models, and diverse workloads, achieving a mean prediction error of 2.15%. They also compared LENS against two methodologically related baselines to confirm the validity of the approach.
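For readers who want to run the same kind of check on their own hardware, a mean percentage error can be computed as below. The article does not say exactly how the 2.15% figure is defined; this sketch assumes the common mean absolute percentage error, and the function name and sample values are illustrative only.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean absolute percentage error between predicted and measured latencies."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)

if __name__ == "__main__":
    # Made-up latencies in milliseconds.
    print(mean_abs_pct_error([102.0, 98.0], [100.0, 100.0]))
```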
Two case studies in the research showed that the best NPU configuration cannot be worked out without direct measurement. Intuition and rules of thumb are not enough; engineers need tools like LENS, which support the search through per-bucket profiling. This finding underscores why latency prediction is not just a convenience but a necessity for anyone deploying AI models on NPUs.
As NPUs become increasingly central to AI deployment, from data centers to consumer devices, the ability to predict performance accurately without understanding proprietary hardware internals becomes more valuable. LENS represents a pragmatic solution to a problem that has blocked wider adoption of neural processing units, making it easier for engineers to optimize AI inference where it matters most: on the devices where models actually run.