Google's LiteRT-LM Brings Powerful AI to Your Phone Without the Cloud
Google has released LiteRT-LM, a new runtime engine that enables smartphones, tablets, and web browsers to run sophisticated AI models directly on-device without sending data to cloud servers. The technology powers Google's Gemma 4 language model across Android, iOS, and the web. It reaches inference speeds of up to 76 tokens per second on a MacBook Pro, and an optimization called multi-token prediction can make generation 2.2 times faster.
What Makes On-Device AI Faster Than Cloud Processing?
LiteRT-LM targets a fundamental bottleneck in AI inference: the time spent moving billions of model parameters from memory to processing units. Running the model locally on the device's GPU, NPU (neural processing unit), or CPU also eliminates network latency and the overhead of sending requests to distant data centers. On Android devices with Snapdragon 8 Elite processors, specialized apps can achieve around 22 tokens per second using the Hexagon NPU, compared with 10 to 18 tokens per second when running on the CPU alone.
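To put those rates in everyday terms, the short Python sketch below converts tokens per second into the wait for a single reply. The 300-token reply length is an assumption chosen only for illustration, and the calculation ignores prefill and startup time.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to decode num_tokens at a steady rate (ignores prefill and startup)."""
    return num_tokens / tokens_per_second


REPLY_TOKENS = 300  # assumed reply length, not a figure from the article

# Rates are the Snapdragon 8 Elite figures quoted above.
for label, rate in [("Hexagon NPU", 22), ("CPU, upper bound", 18), ("CPU, lower bound", 10)]:
    print(f"{label}: {generation_seconds(REPLY_TOKENS, rate):.1f} s")

# Hexagon NPU: 13.6 s
# CPU, upper bound: 16.7 s
# CPU, lower bound: 30.0 s
```

At the low end of the CPU range, the same reply takes more than twice as long as it does on the NPU.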
The runtime also introduces advanced session management, allowing users to pause conversations and resume them later without losing context. This feature works by saving and restoring the KV cache, a data structure that stores the attention keys and values already computed for previous tokens in a conversation. By preserving this state locally, LiteRT-LM avoids recomputing it and skips the expensive prefill phase when users return to an ongoing conversation.
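The pause-and-resume idea can be illustrated with a toy Python sketch. This is not LiteRT-LM's API: `ToySession`, `prefill`, `save`, and `restore` are hypothetical names, and the cache holds placeholder entries rather than real key/value tensors. It shows only that restoring a saved cache means earlier tokens are not processed again.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Toy stand-in for an inference session (hypothetical, not LiteRT-LM's API)."""
    kv_cache: list = field(default_factory=list)  # one placeholder entry per token
    tokens_processed: int = 0  # counts prefill work actually performed

    def prefill(self, tokens):
        # A real runtime would compute key/value tensors per layer here;
        # this sketch stores a placeholder record for each token instead.
        for tok in tokens:
            self.kv_cache.append({"token": tok, "position": len(self.kv_cache)})
            self.tokens_processed += 1

    def save(self) -> str:
        # Serialize the cache so the conversation can be resumed later.
        return json.dumps(self.kv_cache)

    @classmethod
    def restore(cls, blob: str) -> "ToySession":
        # Rebuild the session from saved state without re-running prefill.
        return cls(kv_cache=json.loads(blob))


history = ["Hello", ",", "how", "are", "you", "?"]
session = ToySession()
session.prefill(history)
blob = session.save()  # user pauses the conversation

resumed = ToySession.restore(blob)   # user returns later
resumed.prefill(["Fine", "thanks"])  # only the new turn is processed

print(len(resumed.kv_cache))     # 8: six restored entries plus two new ones
print(resumed.tokens_processed)  # 2: the earlier six tokens were not recomputed
```

Without the saved cache, the resumed session would have to prefill all eight tokens before generating a reply.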
How Does LiteRT-LM Optimize Memory on Constrained Devices?
Running a 2.58-gigabyte AI model on a phone with limited RAM requires aggressive optimization. LiteRT-LM achieves this through several techniques: keeping per-layer embeddings out of active memory, loading image and audio encoders only when they are needed, and caching weights. The result is striking: the Gemma 4 E2B model operates with just 607 megabytes of physical memory on Apple mobile CPUs, less than a quarter of the model's size.
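One of those techniques, loading encoders only when needed, follows a common lazy-loading pattern. The Python sketch below shows the pattern in general terms; `LazyEncoders` and the stub loaders are hypothetical and do not reflect how LiteRT-LM implements it.

```python
class LazyEncoders:
    """Loads optional encoders on first use (illustrative; loaders are stubs)."""

    def __init__(self, loaders):
        self._loaders = loaders  # modality name -> zero-argument loader function
        self._loaded = {}        # encoders currently resident in memory

    def get(self, modality):
        # Load the encoder only the first time this modality is requested.
        if modality not in self._loaded:
            self._loaded[modality] = self._loaders[modality]()
        return self._loaded[modality]

    def resident(self):
        # Names of the encoders that have actually been loaded so far.
        return sorted(self._loaded)


encoders = LazyEncoders({
    "image": lambda: "image-encoder-weights",  # stub standing in for a real load
    "audio": lambda: "audio-encoder-weights",
})

print(encoders.resident())  # []: a text-only chat loads neither encoder
encoders.get("image")       # the user attaches a photo
print(encoders.resident())  # ['image']: the audio encoder stays unloaded
```

A text-only conversation never pays the memory cost of either encoder, which is the point of the technique.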
This efficiency matters because it determines which devices can run which models. On Android phones with 6 gigabytes of RAM or more, users can run 3-billion-parameter models like Phi-4 Mini. Devices with 8 gigabytes of RAM can handle larger models, while phones with less than 6 gigabytes should stick to 1.7-billion-parameter models to avoid crashes and slowdowns.
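Those RAM tiers amount to a simple lookup, sketched below in Python. The function name and return strings are made up for illustration; the thresholds are the ones given in the paragraph above.

```python
def recommended_model_tier(ram_gb: float) -> str:
    """Map device RAM to the model tier suggested in the article."""
    if ram_gb < 6:
        return "1.7-billion-parameter models"
    if ram_gb < 8:
        return "3-billion-parameter models such as Phi-4 Mini"
    return "larger models"


for ram in (4, 6, 8, 12):
    print(f"{ram} GB RAM: {recommended_model_tier(ram)}")
```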
What Are the Practical Benefits for Users and Developers?
On-device AI offers three major advantages over cloud-based alternatives: privacy, speed, and reliability. User conversations never leave the device, eliminating concerns about data collection or surveillance. Inference starts without a network round trip. And the system works offline, making it useful in areas with poor or unreliable connectivity.
For developers, LiteRT-LM provides cross-platform consistency. The same optimized inference pipeline runs on Android (via Kotlin and C++), iOS (via a new Swift API), and web browsers (via WebAssembly). This means developers can build once and deploy across all major platforms without rewriting code for each operating system.
How to Get Started With Local AI on Your Device
- Android with Snapdragon 8 Elite: Use MLC Chat for the fastest performance, as it's the only app with verified support for the Hexagon NPU, delivering speeds around 22 tokens per second on mid-range models.
- Android with other processors: Choose PocketPal AI for the best balance of speed, user interface quality, and access to the full GGUF model ecosystem, which includes thousands of open-source models.
- iPhone or iPad: Look for apps built with LiteRT-LM's new Swift API, which will provide optimized performance on Apple silicon chips like the M4 Max and A-series processors.
- Web browsers: Expect LiteRT-LM support in Chrome and other browsers via WebAssembly, enabling AI chat and analysis directly in the browser without installing apps.
- Privacy-focused users: Choose F-Droid apps like Maid, which operate without Google Play Services and work on de-Googled Android builds like GrapheneOS and CalyxOS.
What Challenges Still Exist for On-Device AI?
Despite rapid progress, local AI on phones faces two major obstacles. First, Android's aggressive background process management kills inference tasks mid-generation on many phones, especially Samsung, OnePlus, and Xiaomi devices. Users must manually lock AI apps in the recent apps tray and whitelist them in battery optimization settings to prevent interruptions. Second, model storage remains cumbersome because each GGUF model file ranges from 1 to 8 gigabytes, and Android's fragmented storage system forces models into app-specific directories rather than a centralized location.
Hardware fragmentation also limits performance. Google's Tensor G5 processor in the Pixel 9 Pro does not expose its NPU to third-party apps, forcing all six major local AI applications to run CPU-only, delivering only 10 to 18 tokens per second compared to 22 tokens per second on Snapdragon devices.
Why Is Google Pushing Local AI Into Chrome?
Google has begun automatically installing a 4-gigabyte local language model called Gemini Nano, the "Nano" version of its Gemini LLM, in Chrome without explicit user consent, storing it in a folder called OptGuideOnDeviceModel. The model powers the Prompt API and enables AI features directly in the browser. Users who wish to disable this installation can navigate to chrome://flags, search for "optimization-guide-on-device-model," set it to Disabled, and restart Chrome, which will then delete the weights.bin file.
"When it comes to bringing advanced AI to the edge, Google AI Edge's LiteRT-LM delivers one of the most powerful and optimized experiences for deploying Gemma 4 across platforms," stated Tenghui Zhu, Staff Software Engineer at Google.
The broader trend reflects a shift in AI deployment strategy. Rather than relying entirely on cloud servers, major tech companies are embedding AI capabilities directly into devices. This approach reduces server costs, improves user privacy, and enables AI features to work offline. However, it also raises questions about storage consumption, energy use, and user choice, particularly when installations occur without explicit opt-in.
As of June 2026, LiteRT-LM represents the most mature platform for running production-grade AI models on consumer devices. The technology powers Google's own AI Edge Gallery app on Android and iOS, demonstrating that complex, multi-step AI tasks like reasoning, tool use, and function calling are now practical on phones. For developers and users interested in privacy-preserving, offline-capable AI, the infrastructure is finally ready.