Hermes Agent Hits 224 Billion Daily Tokens: Why Self-Improving AI Running Locally Changes Everything
Hermes Agent, built by Nous Research and now endorsed by NVIDIA, represents a fundamental shift in how AI assistants learn and improve over time. Most AI agents reset completely between sessions, losing context and learned patterns. Hermes does the opposite: it writes skill files after every complex task, refines them through use, and builds a persistent model of how you work across sessions. On May 10, it reached number one on OpenRouter's global agent rankings, processing 224 billion tokens per day, more than any other agent on the platform.
What Makes Hermes Different From Other AI Agents?
The distinction that matters is not speed or benchmark scores. It is whether an agent gets better the more you use it. Hermes improves through three interlocking mechanisms that cloud-hosted assistants, which reset on every API call, structurally cannot match.
- Automatic Skill Files: When Hermes completes a multi-step task requiring five or more tool calls, it automatically writes a skill file to disk with no configuration needed. The next time you hand it a similar task, it searches that file library using full-text search and retrieves the relevant procedure, executing faster and with fewer tokens.
- Self-Refining Procedures: Those skills are not static. The agent updates them as it finds better approaches, meaning you are not managing a growing list of prompts manually; the agent curates its own toolkit.
- Cross-Session Memory: Hermes builds a cross-session user model via Honcho, a memory backend that reasons about your working patterns after each conversation. Stable preferences, project history, and context carry forward rather than disappearing when the session ends.
The result is an agent that compounds in value the more it is used. This self-improvement loop is not theoretical; it is shipping in production today.
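The skill-file loop described above can be illustrated with a minimal sketch. This is not Hermes's actual implementation: the function names, the one-file-per-skill layout, and the word-overlap ranking are all assumptions made for illustration. Only the five-tool-call threshold and the idea of searching stored skills by their text come from the description above.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path
from typing import List, Optional, Tuple

MIN_TOOL_CALLS = 5  # threshold taken from the article: "five or more tool calls"


def tokenize(text: str) -> List[str]:
    """Split text into lowercase alphanumeric words for naive matching."""
    return re.findall(r"[a-z0-9]+", text.lower())


def maybe_write_skill(skill_dir: Path, name: str, steps: List[str],
                      tool_calls: int) -> Optional[Path]:
    """Persist a procedure to disk only if the task was complex enough.

    Hypothetical helper: the file format here is an assumption.
    """
    if tool_calls < MIN_TOOL_CALLS:
        return None
    path = skill_dir / f"{name}.md"
    path.write_text("\n".join(steps), encoding="utf-8")
    return path


def search_skills(skill_dir: Path, query: str) -> List[Tuple[str, int]]:
    """Rank stored skills by how often the query words occur in name + body."""
    query_terms = set(tokenize(query))
    ranked = []
    for path in sorted(skill_dir.glob("*.md")):
        text = path.stem + " " + path.read_text(encoding="utf-8")
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            ranked.append((path.stem, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Usage: store one complex task, skip one simple task, then look up a skill.
with tempfile.TemporaryDirectory() as tmp:
    skills = Path(tmp)
    maybe_write_skill(skills, "deploy_docker_service",
                      ["build image", "push image", "restart service"],
                      tool_calls=6)
    maybe_write_skill(skills, "say_hello", ["print greeting"], tool_calls=1)
    print(search_skills(skills, "push docker image"))
```

A real system would likely use a proper full-text index rather than rescanning every file, but the shape of the loop is the same: write once after a complex task, retrieve by text on the next similar request.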
How to Set Up Hermes Agent on Your Hardware
NVIDIA's endorsement is not incidental; it is what makes the local angle credible at scale. The RTX AI Garage feature, published May 13, 2026, pairs Hermes with Qwen 3.6, a family of models that changes the math on local inference. Qwen 3.6's 27-billion-parameter model matches the accuracy of previous-generation 400-billion-parameter models at one-sixteenth the memory footprint. The 35-billion-parameter model fits in roughly 20 gigabytes of VRAM. On RTX PRO hardware, that translates to token generation three times faster than a baseline CPU setup.
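The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes weight storage is simply parameters times bits per parameter, ignores KV cache and activations, and uses a 16-bit baseline; none of these precision details are stated in the source, so treat the outputs as rough estimates only.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: ignores KV cache, activations, and runtime overhead.
    """
    return params_billions * bits_per_param / 8


def implied_bits_per_param(params_billions: float, memory_gb: float) -> float:
    """Bits per parameter implied by a given memory footprint."""
    return memory_gb * 8 / params_billions


# A 400B model at an assumed 16 bits per weight, and one-sixteenth of that.
baseline_gb = weight_memory_gb(400, 16)
print(baseline_gb, baseline_gb / 16)

# A 27B model at the same assumed 16-bit precision.
print(weight_memory_gb(27, 16))

# Precision implied by "35B parameters in roughly 20 GB".
print(round(implied_bits_per_param(35, 20), 2))
```

Under these assumptions, a 27-billion-parameter model at 16 bits lands in the same ballpark as one-sixteenth of a 400-billion-parameter model, and fitting 35 billion parameters in about 20 gigabytes implies a quantized format of between 4 and 5 bits per weight.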
- Installation Method: Installation is a single command on Linux, macOS, WSL2, and Android via Termux: curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash. A desktop GUI installer is also available for Windows and macOS.
- Interface Launch: Once installed, run the text-based user interface with the command: hermes --tui. Hermes requires a model with at least a 64,000-token context window, which is roughly 50,000 words of working memory for multi-step tool calls.
- Model Connectivity: Hermes connects to Ollama, LM Studio, OpenRouter, AWS Bedrock, and NVIDIA NIM out of the box, giving you flexibility in where your model runs.
- Setup Timeline: Total time from zero to a functional, memory-backed, skill-learning agent is roughly twenty to thirty minutes, according to the official quickstart documentation.
For developers who already own RTX hardware, this is not a future aspiration. The inference performance is here now, on hardware sitting on desks today.
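The 64,000-token context requirement from the setup steps can be turned into a simple preflight check before pointing Hermes at a model. This is a sketch only: the model names and context sizes below are hypothetical placeholders, and the 0.75 words-per-token ratio is a common rule of thumb for English text, not a figure from the Hermes documentation.

```python
from typing import Dict, List

MIN_CONTEXT_TOKENS = 64_000  # minimum context window stated in the article
WORDS_PER_TOKEN = 0.75       # assumption: rough rule of thumb for English


def approx_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return int(tokens * WORDS_PER_TOKEN)


def meets_context_requirement(context_tokens: int) -> bool:
    """True if the model's context window is large enough for Hermes."""
    return context_tokens >= MIN_CONTEXT_TOKENS


def usable_models(context_by_model: Dict[str, int]) -> List[str]:
    """Filter a mapping of model name -> context size to the usable ones."""
    return [name for name, ctx in context_by_model.items()
            if meets_context_requirement(ctx)]


# Hypothetical local models and their context windows, for illustration.
local_models = {"example-small": 32_768, "example-large": 131_072}
print(usable_models(local_models))
print(approx_words(MIN_CONTEXT_TOKENS))
```

With this ratio, 64,000 tokens works out to 48,000 words, consistent with the "roughly 50,000 words" figure quoted above.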
Why Hardware Matters for Self-Improving Agents
At the higher end, NVIDIA DGX Spark, with 128 gigabytes of unified memory and one petaflop of AI compute, can sustain all-day agentic workflows running 120-billion-parameter mixture-of-experts models. Hermes runs as a persistent background process on DGX Spark, handling tasks continuously rather than as one-off invocations. This persistent operation is critical to the self-improvement loop; the agent needs to stay running to observe patterns across your work.
Beyond the built-in skill system, a separate project called hermes-agent-self-evolution applies DSPy and GEPA (Genetic-Pareto Prompt Evolution) to optimize skill files, tool descriptions, and system prompts offline. It reads execution traces to understand why tasks failed, not just that they did, and proposes improvements via pull requests against your local Hermes configuration. No GPU training is required. Each optimization run costs roughly two to ten dollars in API calls and works with as few as three examples. The approach was presented as an ICLR 2026 Oral, which lends it peer-reviewed credibility.
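The "Pareto" part of Genetic-Pareto evolution can be illustrated with a generic Pareto filter over candidate skill variants. The sketch below is a textbook non-domination check, not the code of hermes-agent-self-evolution, DSPy, or GEPA; the candidate names and per-example scores are invented for illustration. It keeps any variant that no other variant beats on every example, which is why a small set of examples can still preserve diverse candidates.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better once."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, own in scores.items():
        beaten = any(dominates(other, own)
                     for other_name, other in scores.items()
                     if other_name != name)
        if not beaten:
            front.append(name)
    return front


# Hypothetical scores for three skill variants on three evaluation examples.
candidates = {
    "skill_v1": [0.9, 0.2, 0.5],
    "skill_v2": [0.6, 0.8, 0.5],
    "skill_v3": [0.5, 0.2, 0.4],
}
print(pareto_front(candidates))
```

Here skill_v3 is dropped because skill_v1 matches or beats it on every example, while skill_v1 and skill_v2 both survive because each wins on a different example.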
What Do the Numbers Tell Us About Hermes's Real-World Adoption?
Hermes's trajectory is striking. The project has accumulated 140,000 GitHub stars in under three months, attracted more than 1,000 contributors, and processes 224 billion daily tokens on OpenRouter. Token throughput on a third-party routing layer is a harder metric to inflate than stars: developers are routing real workloads through it, which signals genuine adoption rather than hype.
The broader implication is that self-improving local agents are no longer a research concept. Hermes is shipping one in production, running on hardware that a significant portion of the developer community already owns, at a cost well below a monthly cloud AI subscription. If you have been waiting for local agents to reach parity with cloud offerings, this is the inflection point worth paying attention to.