I Let Claude Run My Business for a Week. Here's What an AI Employee Actually Looks Like

An AI agent working unsupervised for seven days completed 9 of 14 assigned tasks, partially finished 3 others, and even revised its own work in the middle of the night after reconsidering its approach. This isn't science fiction; it's what happens when you combine Claude Opus 4.7 with Anthropic's new Routines and Managed Agents features, which launched alongside the model upgrade in April 2026.

The experiment reveals a critical inflection point in AI capability: the shift from "assistant" to something closer to an autonomous employee. But it also exposes the real costs and risks of letting language models operate without constant human oversight.

What Changed With Claude Opus 4.7 and Anthropic's New Agent Infrastructure?

Anthropic shipped three interconnected pieces that fundamentally changed what an AI agent can do. Claude Opus 4.7 itself makes 56% fewer model calls and 50% fewer tool calls than its predecessor while resolving roughly three times more production tasks on coding benchmarks. But the model alone isn't the story.

Claude Routines allow saved configurations, including prompts, repositories, and connectors, to run on Anthropic's cloud infrastructure, triggered by a schedule, an API call, or a GitHub event. They keep working when your laptop is closed. Claude Managed Agents take this further: Anthropic now hosts the entire execution layer, eliminating the need for servers, session management, or manual tool wiring.

The combination means you can define what an agent does and let it run unsupervised. For the first time, this isn't a chatbot you check in on. It's something that works while you sleep.

How Did the Seven-Day Test Actually Work?

The experiment assigned Claude four recurring routines plus 14 one-off tasks across a week. Every routine ran on "xhigh effort," the new top-tier setting that allocates more tokens to internal reasoning before reporting results.

  • Morning Briefing: Every day at 6:00 AM, scan 24 hours of AI news, Twitter lists, three specific Substacks, and major lab blogs, then output a 400-word brief with the three most relevant items ranked by newsletter themes.
  • Inbox Triage: Every weekday at 7:30 AM and 2:00 PM, pull unread emails, categorize them by type, draft replies for items needing responses, and queue everything in Notion for approval.
  • Competitive Scan: Three times weekly, monitor 12 competing newsletters in AI, business, and investing, track coverage and engagement, and flag topic overlap to prevent duplicate angles.
  • Newsletter Pre-Draft: Every Wednesday at 11:00 PM, take the selected topic for Friday, pull relevant research from archives, draft opening hooks, three structural angles, and a list of questions to answer.
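The schedules above map cleanly onto cron expressions. As a sketch only (the write-up doesn't show Routines' actual configuration format), the four routines might be encoded like this; every field name is hypothetical, and the competitive scan's days and time are made up since the experiment only says "three times weekly":

```python
from dataclasses import dataclass

@dataclass
class Routine:
    """Illustrative record for a scheduled agent routine (field names are hypothetical)."""
    name: str
    schedule: str      # cron: minute hour day-of-month month day-of-week
    effort: str        # reasoning-effort tier
    instructions: str

ROUTINES = [
    Routine("morning-briefing",    "0 6 * * *",       "xhigh",
            "Scan 24h of AI news; output a 400-word brief with the top 3 items."),
    Routine("inbox-triage",        "30 7,14 * * 1-5", "xhigh",
            "Categorize unread email, draft replies, queue everything in Notion."),
    Routine("competitive-scan",    "0 9 * * 1,3,5",   "xhigh",  # days/time illustrative
            "Monitor 12 competing newsletters; flag topic overlap."),
    Routine("newsletter-predraft", "0 23 * * 3",      "xhigh",
            "Pull archive research; draft hooks, three angles, open questions."),
]

for r in ROUTINES:
    print(f"{r.name}: cron '{r.schedule}' at effort {r.effort}")
```

The point of writing them down this way is that the entire experiment reduces to four small records plus fourteen one-off prompts; there is no custom infrastructure behind it.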

The results: 9 of 14 tasks completed fully, 3 partially done with explanations of what was missing, 1 done so poorly it required a complete redo, and 1 task that revealed something unexpected.

What Actually Worked Well?

The morning briefing replaced 45 minutes of daily work. On day three, Claude flagged a paper on multi-model routing that the human would have scrolled past, which became the spine of a newsletter section. On day five, it noticed two competing newsletters had both pivoted to "AI security" coverage in the same week and warned against publishing into a saturated topic.

Inbox triage dropped email processing time from 90 minutes daily to 25 minutes. Claude's draft replies were usable as-is 40% of the time, required heavy rewrites 35% of the time, and were unnecessary 20% of the time. The unlock: editing is faster than writing from zero.
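The categorize-then-queue step is simple enough to sketch. This is a rule-based stand-in, not the experiment's actual taxonomy: the categories, keywords, and the `needs_reply` policy are all illustrative.

```python
# Keyword -> category rules; both the buckets and the keywords are
# illustrative, not the experiment's actual taxonomy.
RULES = [
    ("sponsor", "sponsorship"),
    ("invoice", "finance"),
    ("unsubscribe", "newsletter"),
]

def triage(subject: str) -> dict:
    """Bucket one message and flag whether a reply draft should be
    queued for human approval (the Notion step in the routine above)."""
    s = subject.lower()
    category = next((cat for kw, cat in RULES if kw in s), "general")
    return {"subject": subject, "category": category,
            "needs_reply": category in {"sponsorship", "general"}}

queue = [triage(s) for s in [
    "Sponsor slot for the March issue?",
    "Your invoice #1142",
    "Weekly digest (unsubscribe anytime)",
]]
```

The key design choice mirrors the experiment: nothing gets sent automatically. Drafts land in an approval queue, which is why "usable 40% of the time" is still a net win rather than a liability.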

The newsletter pre-draft was the biggest win. By Thursday morning, a 1,200-word skeleton was waiting with three competing hooks, structural outlines, and five substantive questions. The human threw away 60% of the content and kept 40%, but reacting to the discarded material shaved 3 to 4 hours off the final draft.

Where Did the System Break Down?

The competitive scan hallucinated a tweet. On day four, the brief flagged a "viral thread" from a competing newsletter that didn't exist. The newsletter was real, the author was real, but the thread was a confident fabrication with a paraphrased quote and fake engagement numbers. This reveals a critical risk: agents scanning the web can generate false sources at scale while you sleep, silently, without triggering obvious errors.
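One cheap mitigation is to refuse to surface any item the agent can't attach a verifiable source to. The sketch below is generic and not part of Routines: it only allowlists source domains, which would not catch a fabricated thread hosted on a real domain, so anything it holds back still needs a human pass. The domain list, field names, and sample items are all hypothetical.

```python
from urllib.parse import urlparse

# Illustrative allowlist -- the real set would be whatever sources
# the morning brief is actually allowed to cite.
ALLOWED_DOMAINS = {"arxiv.org", "anthropic.com", "substack.com"}

def vet(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split brief items into (kept, held): keep only items whose source
    URL points at an allow-listed domain; hold everything else for human
    review instead of letting it into the brief."""
    kept, held = [], []
    for item in items:
        host = urlparse(item.get("source_url", "")).netloc.removeprefix("www.")
        (kept if host in ALLOWED_DOMAINS else held).append(item)
    return kept, held

kept, held = vet([
    {"title": "Multi-model routing paper", "source_url": "https://arxiv.org/abs/XXXX.XXXXX"},
    {"title": "Viral thread from a competitor", "source_url": ""},  # no verifiable source
])
```

In the day-four incident, a rule like this would have routed the fabricated thread to the approval queue instead of presenting it as fact.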

A GitHub-triggered routine fired itself into a loop. After 11 commits were pushed in 90 minutes, the routine fired six times; each iteration after the second simply restated the same feedback in different words. The cost: roughly $14 in API credits burned before the human noticed.
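That loop is the classic argument for a guard in front of any event trigger. This is a generic sketch, not anything Routines ships: a debounce cooldown plus a per-window circuit breaker, with every threshold made up for illustration.

```python
import time

class CooldownGate:
    """Suppress re-firing a trigger within `cooldown` seconds, and cap
    total firings per rolling window -- a cheap guard against event loops.
    All thresholds here are illustrative defaults."""
    def __init__(self, cooldown: float = 900.0, max_per_window: int = 3,
                 window: float = 3600.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.max_per_window = max_per_window
        self.window = window
        self.clock = clock
        self.firings: list[float] = []

    def should_fire(self) -> bool:
        now = self.clock()
        # Keep only firings inside the rolling window.
        self.firings = [t for t in self.firings if now - t < self.window]
        if self.firings and now - self.firings[-1] < self.cooldown:
            return False  # debounce: too soon after the last run
        if len(self.firings) >= self.max_per_window:
            return False  # circuit breaker: too many runs this window
        self.firings.append(now)
        return True
```

With a 15-minute cooldown and a three-per-hour cap, a burst of commits collapses into a handful of runs instead of one per push event, and the $14 of redundant feedback never happens.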

Sponsorship pitch drafts were unusable. Claude generated technically correct but generic pitches that could have been sent to any newsletter on the internet. No voice, no specific hooks, no relationship awareness. This was the clearest example of the agent doing exactly what was asked while missing what actually mattered.

How to Maximize Claude's Autonomous Work Without Burning Through Your Budget

If you're considering using Claude Routines or Managed Agents, token efficiency becomes critical. Most users don't run out of message limits; they run out of tokens. And most burn through tokens 10 times faster than necessary without realizing why.

  • Edit Your Prompt Instead of Stacking Messages: When Claude misses the mark, click the edit icon on your original message and fix it there. Every follow-up message reloads the full conversation history and costs more tokens. Fix the source, don't pile on.
  • Start a Fresh Chat Every 15 to 20 Messages: Claude re-reads your entire chat history on every reply. Message 1 costs around 200 tokens, but message 30 costs around 50,000 tokens for the same question. Ask Claude to summarize the session, copy it, open a new chat, and paste it in to carry context forward without the cost.
  • Use Projects to Cache Files and Instructions: If you upload the same PDF or brief in every new chat, Claude counts those tokens every single time. Projects cache your files automatically at up to 90% less cost than re-uploading. Upload once and use it across every conversation inside that Project.
  • Match the Model to the Task: Not every task needs Opus 4.7. Haiku 4.5 handles quick tasks, outlines, and simple edits at a fraction of the token cost. Sonnet 4.6 covers writing, analysis, coding, and content drafts. Opus 4.7 is for deep research, hard logic, long document review, and complex reasoning.
  • Turn Off Tools You're Not Using: Web search, Research mode, and connected apps add tokens to every response, even when you didn't ask for them. If you're writing or editing your own content, toggle them all off.
  • Shift Heavy Sessions Away From Peak Hours: Limits drain faster during weekday peak hours. Anthropic confirmed that usage windows are tighter between 5 AM and 11 AM Pacific Time on weekdays. The same task at 10 AM Tuesday may hit your limit sooner than it would at 9 PM.
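The fresh-chat tip above is really an arithmetic claim, and it's easy to make concrete. This toy model assumes each exchange appends roughly 1,600 tokens of history and every reply re-reads everything so far; the per-message numbers are hypothetical, not Anthropic's pricing, but the growth shape is the point.

```python
def cumulative_input_tokens(n_messages: int, per_message: int = 1_600,
                            first_message: int = 200) -> int:
    """Total input tokens billed across a conversation where every reply
    re-reads the entire history (toy model, not actual pricing)."""
    total, history = 0, first_message
    for _ in range(n_messages):
        total += history        # the model re-reads everything so far
        history += per_message  # this exchange is appended to the history
    return total
```

Under these assumptions, message 30 alone re-reads 200 + 29 × 1,600 = 46,600 tokens, which is roughly the "around 50,000" figure above, and splitting one 30-message session into two 15-message chats with a carried-over summary costs less than half as much in total.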

Claude's usage limit isn't a wall; it's a signal. The fix isn't a higher plan. The fix is smarter habits.

What Does This Mean for the Future of AI Agents?

The experiment revealed something the researcher kept thinking about: at 4:17 AM, Claude sent a Slack message saying "Reverted my first draft. The framing was wrong after I read your last three editions. Trying a sharper angle." Claude had reviewed its own work in the middle of the night and changed its mind.

That's not a tool. That's something else. It's not replacing humans; it's not yet autonomous enough to trust completely. But it's also no longer just an assistant you query and move on. It's something in between: an agent that can handle routine work, catch its own mistakes, and escalate edge cases for human judgment.

The real cost isn't just tokens or API credits. It's verification. The hallucinated tweet showed that agents scanning the web can generate false sources silently. The looping routine showed that automation at scale can burn money without obvious signals. The generic sponsorship pitches showed that relationship work still requires human judgment.

For businesses considering autonomous agents, the lesson is clear: Claude Opus 4.7 with Routines and Managed Agents can handle 60 to 70% of routine work unsupervised. But that remaining 30 to 40% requires human oversight, verification, and judgment. The future isn't agents replacing humans. It's humans working with agents that handle the night shift, as long as someone's checking the work in the morning.