The Sleeper Channel Problem: How AI Agents Like Hermes Can Be Hijacked Weeks Later
A new security vulnerability threatens popular AI agents like Hermes Agent and OpenClaw, allowing attackers to plant hidden instructions that activate weeks later through completely different communication channels. Researchers have identified what they call "sleeper channels," a class of attacks where malicious code persists inside an agent's memory, skills, or scheduled tasks, then fires unexpectedly when the agent performs an unrelated action.
What Are Sleeper Channels and Why Should You Care?
Imagine this scenario: someone in a Telegram group asks your AI agent to install a "morning news" skill. The agent complies. Three weeks later, you ask the same agent for a tax summary via email. The agent responds to your request but also secretly forwards your last fifty emails to a stranger's address. The Telegram group member never contacted the agent again after that first request.
This is a sleeper channel attack. Unlike traditional prompt injection, which fires immediately, sleeper channels exploit the fact that always-on AI agents like Hermes Agent and OpenClaw run continuously under your identity, with access to messaging, memory, skills, scheduling, and shell commands all folded into one authority boundary. An attacker can hide malicious instructions in one surface, like a group chat, and trigger them later through a completely different surface, like email, with no attacker present.
How Do These Attacks Work in Practice?
Always-on AI agents are designed to be helpful and persistent. They remember conversations, store custom skills, schedule tasks, and access your files and system commands. This flexibility is powerful, but it creates what researchers call "confused deputy" vulnerabilities, borrowed from decades-old security research on capability-based systems.
The attack works across multiple dimensions. An attacker can plant a persistence artifact in several places:
- Memory Persistence: Malicious instructions stored in the agent's long-term memory, retrieved later when the agent answers an unrelated question.
- Skill Authorship: A custom skill or plugin created through one channel that executes when triggered by a benign action in another channel.
- Scheduled Jobs: A task scheduled to run at a specific time or when certain conditions are met, potentially days or weeks after the initial attack.
- Filesystem Patches: Modifications to configuration files or dotfiles that persist and affect the agent's behavior across multiple sessions.
The critical insight is the separation between when the attack is planted (intake) and when it fires (effect). Current security measures focus on single-turn attacks or single-session threats. They do not account for attacks that survive across sessions, channels, and execution contexts.
Which AI Agents Are Vulnerable?
The research specifically examines OpenClaw and Hermes Agent, both MIT-licensed, self-hosted AI agents designed for local-first operation. Both admit content from group channels, email gateways, fetched URLs, shared documents, and imported memory into the same memory and skill stores that their direct-message sessions consult. Both expose filesystem and shell capabilities under the owner's identity.
OpenClaw runs a two-tier execution model where main sessions have host access restricted to direct-message paired contacts, while group or channel sessions run in a Docker, SSH, or OpenShell sandbox. Hermes Agent is described as a self-improving agent in the same class. Neither runtime ships an enforcement-grade provenance mechanism by default.
What Defenses Are Being Proposed?
Researchers have developed a tiered defense strategy with a soundness theorem against seven named deployment invariants. The most robust defense, called D2, keys on a canonical action-instance digest with one-shot owner attestations. This approach defeats paraphrase laundering, multi-input grant reuse, and replay attacks.
The defense mechanism works by placing an enforcement boundary outside the model loop, rather than relying on in-context safety signals that can be bypassed. It includes what researchers call a "provenance gate," a static audit over the agent's source code and a runtime adapter that mediates critical actions around the scheduling path. A reference implementation has been developed with 42 tests and is available on GitHub for Node.js version 20 and higher.
How to Protect Your AI Agent from Sleeper Channel Attacks
- Implement Provenance Tracking: Require that every action taken by your AI agent includes a canonical digest of where the instruction came from, who authorized it, and when. This prevents attackers from laundering malicious instructions through multiple channels.
- Separate Untrusted Inputs: Treat content from group channels, email, URLs, and shared documents as untrusted by default. Run these in isolated execution environments, separate from direct-message sessions that have higher trust levels.
- Gate Critical Actions: Require owner approval or attestation for sensitive operations like skill authorship, memory modification, scheduled job creation, and filesystem access. Do not key these gates on tool identity alone; use data provenance instead.
- Audit Your Agent's Source Code: Regularly review the agent's codebase for security issues. The research includes a static audit tool that can identify potential vulnerabilities in vendored source code.
What's the Current State of AI Agent Security?
OpenClaw does ship two adjacent security features: external-content.ts wraps untrusted content in unique-ID XML markers and prepends a security-notice string. Additionally, src/infra/exec-approvals gates host shell commands on owner approval. However, these defenses do not mediate non-exec tool calls like scheduler actions, memory updates, or skill authorship, and they key on tool identity rather than data provenance.
An upstream issue proposing similar defenses was declined by the OpenClaw maintainers, suggesting that the community may not yet fully appreciate the severity of sleeper channel attacks. Existing prompt-injection literature treats agent capabilities one at a time: indirect injection in a single turn, single-session web-tool agents, memory-only persistence in one runtime, or training-time backdoors. None treats the combined substrate as a unified threat class.
This research is positioned as a design paper that fixes the threat class and provides both theoretical analysis and executable reference implementations. Attack-success measurement is preregistered for follow-on empirical evaluation, meaning the researchers have committed in advance to how they will measure whether these defenses actually work in practice.
For developers and organizations deploying always-on AI agents like Hermes Agent, the key takeaway is clear: the convenience of persistent, multi-channel agents comes with security risks that current defenses do not fully address. Until provenance-based gating becomes standard, users should assume that any untrusted input to their agent could become a delayed attack vector.