After 600 Hours of AI Coding, a Developer Reveals What Actually Works (and What Doesn't)
After testing dozens of AI coding tools and frameworks across more than 600 hours of intensive development work, one engineer concluded that the most effective approach isn't about complexity, but about engineering discipline combined with the right model choices. The findings challenge the prevailing narrative that more agents, orchestrators, and wrappers lead to better results.
Why Don't Most Multi-Agent Setups Work in Practice?
The conventional wisdom in AI-assisted development suggests that layering multiple specialized agents together (a planner, an executor, a reviewer, and a judge) should produce better outcomes. In theory, this division of labor makes sense. In practice, it creates problems. According to the developer's detailed testing, running multiple large language models (LLMs) in parallel increases cost, latency, and the potential for errors without delivering proportional gains.
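To make the cost and latency argument concrete, here is a toy model, not something taken from the developer's testing. Every number in it is hypothetical, and it makes two deliberately simple assumptions: stages hand work to each other in sequence, so latency adds up, and each stage can independently introduce an error that later stages do not repair. Real pipelines may recover some errors through review, so treat the result as an illustration of how overhead compounds rather than as a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One LLM call in a pipeline. All figures are hypothetical."""
    name: str
    cost_usd: float     # assumed cost per task
    latency_s: float    # assumed wall-clock seconds per task
    reliability: float  # assumed probability the stage adds no error


def pipeline_totals(stages):
    """Return (total cost, total latency, chance of an error-free run).

    Costs and latencies are summed; reliabilities are multiplied,
    which assumes independent failures and no error correction.
    """
    cost = sum(s.cost_usd for s in stages)
    latency = sum(s.latency_s for s in stages)
    success = 1.0
    for s in stages:
        success *= s.reliability
    return cost, latency, success


single_agent = [Stage("coder", 0.10, 30.0, 0.90)]
four_stage_chain = [
    Stage("planner", 0.10, 30.0, 0.90),
    Stage("executor", 0.10, 30.0, 0.90),
    Stage("reviewer", 0.10, 30.0, 0.90),
    Stage("judge", 0.10, 30.0, 0.90),
]

for label, stages in (("single", single_agent), ("chain", four_stage_chain)):
    cost, latency, success = pipeline_totals(stages)
    print(f"{label}: ${cost:.2f}, {latency:.0f}s, {success:.0%} error-free")
```

Under these assumptions the four-stage chain costs and waits four times as much as the single agent, while its chance of an error-free run falls. The extra stages would need to catch a large share of mistakes just to break even.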
The real bottleneck isn't the number of agents available. It's the engineering process itself. The developer emphasized that traditional software engineering practices matter more than AI architecture complexity. These include writing tests, maintaining clean code, using continuous integration, pair programming, constant refactoring, keeping code small, automating deployments, and getting fast feedback.
For developers doing what the engineer calls "Agile Vibe Coding" (spending several hours a day with a real coding agent on substantial projects), the best cost-benefit options remain Claude Opus 4.7 and GPT 5.5 on subsidized Pro, Plus, or Max plans. Nothing comes close when the job involves consistent, long-duration coding sessions.
How Can You Safely Run AI Coding Agents on Your System?
- Use Process Sandboxing: The ai-jail tool creates a restricted environment where agents can access only the files and directories they need. On Linux, it uses bubblewrap and Landlock; on macOS, it uses sandbox-exec. This prevents agents from accessing sensitive system files or accidentally modifying unrelated parts of your computer.
- Combine Sandboxing with Fast Permissions: The developer removes confirmation friction inside the agent itself but maintains a fence at the operating-system level. The agent can work quickly within the project directory, but the host filesystem remains essentially read-only except for explicitly mapped directories.
- Layer Your Backups: No single tool replaces comprehensive backups. The developer uses automatic Btrfs snapshots, restic backups to a network-attached storage device, offsite archiving to AWS Glacier, remote Git repositories, and Bitwarden for secrets management. If something goes wrong, recovery is possible.
- Protect Secrets Separately: Store sensitive information like API keys, tokens, and passwords in a dedicated secrets manager outside the project directory. This prevents accidental leaks to version control and makes recovery manageable if a local directory is compromised.
- Use Version Control as a Safety Net: Frequent commits to GitHub or a private Gitea instance mean the worst-case scenario is losing the local project directory, not the entire codebase. Cloning again and recovering from the remote is straightforward.
The developer clarified that process sandboxing is not military-grade security. Kernel bugs and side channels exist, and macOS has inherent limitations. However, for a typical development workflow, it provides a practical layer of protection without the overhead of a disposable virtual machine.
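The "fast inside, fenced outside" idea can be sketched with bubblewrap directly. The snippet below is not how ai-jail is implemented; it is a minimal illustration that only assembles a `bwrap` command line and prints it. The function name `build_bwrap_argv`, the example project path, and the `agent-cli` command are hypothetical. The flags themselves (`--ro-bind`, `--bind`, `--dev`, `--proc`, `--tmpfs`, `--chdir`, `--unshare-all`, `--share-net`, `--die-with-parent`) are standard bubblewrap options.

```python
import os
import shlex


def build_bwrap_argv(project_dir, agent_cmd, allow_network=True):
    """Assemble a bubblewrap command that fences an agent into one project.

    The host root is mounted read-only and only project_dir is writable.
    This is a sketch under stated assumptions, not ai-jail's actual logic.
    """
    project = os.path.abspath(project_dir)
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",        # whole host visible, but read-only
        "--dev", "/dev",              # minimal device nodes
        "--proc", "/proc",            # fresh /proc for the sandbox
        "--tmpfs", "/tmp",            # private scratch space
        "--bind", project, project,   # the one writable directory
        "--chdir", project,
        "--unshare-all",              # new namespaces (pid, ipc, net, ...)
        "--die-with-parent",          # kill the agent if the launcher exits
    ]
    if allow_network:
        argv.append("--share-net")    # agents usually need API access
    # bwrap treats the first non-option argument as the command to run.
    return argv + list(agent_cmd)


# Hypothetical project path and agent command, for illustration only.
argv = build_bwrap_argv("/srv/demo-project", ["agent-cli", "--fast"])
print(shlex.join(argv))
```

Note that a read-only bind of `/` still lets the agent read files outside the project, including anything in the home directory. A stricter variant would bind only the specific paths the agent needs, which is another reason to keep secrets out of reach as described above.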
Contrary to folklore about AI agents destroying host systems, the developer's experience over 600 hours of intensive use suggests that catastrophic accidents are rare when you maintain basic engineering discipline. Agents like Claude and GPT typically ask for confirmation before executing dangerous commands. If you say yes without reading, that's a user error, not an AI failure.
What About Cheaper Alternatives Like DeepSeek and Kimi?
Open-source and lower-cost models like DeepSeek v4 and Kimi 2.6 are genuinely cheaper and usable for many tasks. However, they still stumble on continuity, long refactors, comprehensive test coverage, and understanding entire projects at once. Running open-source models locally is fun for one-off tasks and learning, but the latency, context limitations, quality inconsistencies, setup maintenance, and memory requirements make it impractical for all-day, every-day coding work.
The developer tested mixing two models in a planner-plus-executor setup and found the complexity didn't justify the cost or latency trade-offs. For consistent productivity in extended coding sessions, Claude Opus and GPT remain the practical standard, especially when accessed through subsidized monthly plans rather than pay-per-token pricing.
What Tools Actually Proved Useful After 600 Hours?
Beyond the sandboxing approach, the developer highlighted ai-memory as a second critical tool. This system helps coding agents maintain context across long sessions and multiple projects, addressing one of the key limitations of even the most capable models. The specifics of how ai-memory works were detailed in a separate post, but the core insight is that agents need external memory systems to maintain continuity when working on complex, multi-day projects.
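The article does not describe how ai-memory works internally, so the following is not its design. It is a minimal sketch of the general idea of external memory: notes are written to a file outside the model's context window and retrieved later by project and keyword. The class name `NoteStore`, its methods, and the JSON Lines file layout are all assumptions made for illustration.

```python
import json
import tempfile
import time
from pathlib import Path


class NoteStore:
    """Append-only note file that outlives any single agent session.

    Hypothetical sketch of external memory; not ai-memory's actual design.
    """

    def __init__(self, path):
        self.path = Path(path)

    def remember(self, project, text):
        """Append one note as a JSON line."""
        record = {"ts": time.time(), "project": project, "text": text}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    def recall(self, project, keyword="", limit=5):
        """Return up to `limit` most recent matching notes, oldest first."""
        if not self.path.exists():
            return []
        matches = []
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                if (record["project"] == project
                        and keyword.lower() in record["text"].lower()):
                    matches.append(record["text"])
        return matches[-limit:]


# Demo in a throwaway directory; a real setup would use a stable path.
with tempfile.TemporaryDirectory() as tmp:
    store = NoteStore(Path(tmp) / "notes.jsonl")
    store.remember("blog", "Tests run with pytest; CI is required before merge.")
    store.remember("blog", "Deploys are automated from the main branch.")
    store.remember("other", "Unrelated project note.")
    for note in store.recall("blog", keyword="deploy"):
        print(note)
```

At the start of a new session, an agent wrapper could call `recall` and prepend the results to the prompt, which is one simple way to carry decisions across days without relying on the model's own context.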
The developer emphasized that most of the projects shared on GitHub are proofs of concept and experiments, not production-ready software. FrankMD received significant community contributions, testing, and bug fixes, making it more stable than earlier versions. However, the broader point stands: these tools were born from a "vibe-coding lab," not a three-year product roadmap with enterprise support and service-level agreements.
For a typical developer, the recommendation is simple: focus on Claude Code, GPT Codex, or OpenCode. That's sufficient. While alternatives like Gemini CLI, Cursor, Windsurf, and others exist, the cost-benefit analysis consistently favors the two leading model families, Claude and GPT, when you're doing heavy programming work for hours at a time.
The broader lesson is that AI-assisted development isn't about having the most sophisticated architecture or the most agents. It's about combining the right model with solid engineering practices, practical safety layers, and realistic expectations about what these tools can and cannot do reliably.