AI Agents Are Failing at Real-World Tasks: Here's What Amazon and Huawei Just Discovered
The world's most advanced AI models are struggling with tasks that seem simple on the surface: managing your email, calendar, and files at the same time, and acting before being asked. A new benchmark from Huawei researchers shows that GPT-5.5, one of the most powerful language models available, scored just 34.5% when asked to function as an always-on personal assistant in a realistic digital environment. Claude Opus 4.7 performed even worse at 31.8%. Meanwhile, Amazon announced 68 new research awards focused on agentic AI (AI agents that can take autonomous action), signaling that the industry recognizes this as one of the most pressing challenges ahead.
Why Are AI Agents Failing at Tasks They Should Handle Easily?
The problem isn't that AI models lack intelligence. The issue is far more fundamental: real-world personal assistant tasks require coordinating across multiple interconnected services simultaneously, something current AI agents struggle to do reliably. Huawei's new benchmark, called Claw-Anything, doesn't ask AI to answer trivia questions or summarize text. Instead, it simulates a complete digital life and asks AI assistants to manage it across long-horizon event streams and multiple interdependent backend services.
The complexity is substantial. Tasks in the benchmark involve an average of 10.1 interdependent services, with some scenarios reaching up to 18 different systems. The benchmark includes 200 human-verified task environments with an average of 191,700 context words per environment. It evaluates both graphical user interface and command line interface interactions across multiple devices, and it tests proactive behavior: can the AI notice that something needs doing before you ask?
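To make "proactive behavior" concrete, here is a minimal Python sketch of one check an assistant might run unprompted: scanning a merged event stream for items from different services that overlap in time. The `Event` schema, the service names, and `find_conflicts` are illustrative assumptions, not part of the Claw-Anything benchmark, which involves far more services and context.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One item in the assistant's event stream (hypothetical schema)."""
    service: str      # e.g. "calendar" or "email"
    summary: str
    start: datetime
    end: datetime


def find_conflicts(events: list[Event]) -> list[tuple[Event, Event]]:
    """Return pairs of events from different services whose time ranges overlap."""
    ordered = sorted(events, key=lambda e: e.start)
    conflicts = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # sorted by start, so nothing later can overlap `first`
            if first.service != second.service:
                conflicts.append((first, second))
    return conflicts


# Illustrative data: a calendar meeting and an appointment found in email.
events = [
    Event("calendar", "Team sync",
          datetime(2026, 6, 1, 10, 0), datetime(2026, 6, 1, 11, 0)),
    Event("email", "Dentist appointment",
          datetime(2026, 6, 1, 10, 30), datetime(2026, 6, 1, 11, 30)),
    Event("calendar", "Lunch",
          datetime(2026, 6, 1, 12, 0), datetime(2026, 6, 1, 13, 0)),
]

for first, second in find_conflicts(events):
    print(f"{first.summary} ({first.service}) overlaps "
          f"{second.summary} ({second.service})")
```

A real assistant would have to extract such events from unstructured mail and files before any comparison is possible, which is where much of the difficulty lies.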
Beyond raw model capability, enterprise leaders are identifying another critical bottleneck: the underlying architecture supporting AI agents. At an OpenGov Asia event in Thailand on May 26, 2026, technology leaders emphasized that AI agents are only as powerful as the systems, workflows and data environments that support them. Many organizations still operate across siloed applications, inconsistent governance and fragmented ecosystems that hinder AI at scale.
"Without unifying, without resiliency and without governance that you can trust, your AI agent is nothing," stated Anothai Wettayakorn, Managing Director and Technology Leader for IBM Thailand.
What Would It Take to Fix AI Agent Performance?
The Huawei research team built an automated pipeline that generated 2,000 training environments for fine-tuning AI models on these complex assistant tasks. The results offer a glimmer of hope: Qwen3.5-27B, a smaller open-source model, showed a 23.7% performance improvement after being fine-tuned on successful task trajectories from these environments. This suggests that specialized training on domain-specific successful examples can meaningfully improve performance, even for smaller models.
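The idea of fine-tuning on successful task trajectories can be sketched in a few lines of Python: keep only the runs that completed their task and turn each into a prompt/completion record. This is a hedged illustration, not the Huawei team's pipeline; the `Trajectory` fields, the record layout, and the JSON Lines output are all assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One recorded agent run (hypothetical schema)."""
    task: str           # the instruction or situation the agent faced
    steps: list[str]    # the actions the agent took, in order
    success: bool       # whether the run was judged to complete the task


def to_sft_records(trajectories: list[Trajectory]) -> list[dict[str, str]]:
    """Keep successful runs only and convert them to prompt/completion pairs."""
    return [
        {"prompt": t.task, "completion": "\n".join(t.steps)}
        for t in trajectories
        if t.success
    ]


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record) for record in records)


# Illustrative runs: only the first one succeeded.
runs = [
    Trajectory("Reschedule the 10:00 sync to avoid the dentist visit",
               ["calendar.find('Team sync')", "calendar.move(to='14:00')"],
               success=True),
    Trajectory("Archive last month's invoices",
               ["files.search('invoice')"],
               success=False),
]

print(to_jsonl(to_sft_records(runs)))
```

Filtering on success means the model only ever imitates runs that worked, at the cost of discarding whatever could be learned from failures.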
The broader OpenClaw ecosystem, which includes related benchmarks like ClawBench and WildClawBench that test similar multi-step practical tasks, shows top AI models scoring somewhere between 33% and 62%. The variation suggests that different types of tasks present different challenges, and that targeted fine-tuning could unlock significant gains.
Steps to Build AI-Ready Enterprise Architecture
- Unified Integration Strategy: Move away from reactive integration approaches and fragmented toolsets toward end-to-end visibility and unified oversight frameworks that allow AI agents to access consistent information flows across the organization.
- Governance and Data Trust: Establish consistent information flows and stronger governance controls that create trusted environments where AI agents can securely access high-quality, governed data while maintaining transparency and oversight.
- Real-Time Connectivity Layer: Build a resilient, intelligent connectivity layer that supports real-time information exchange across hybrid and multi-cloud environments, so that autonomous agents can act on current data rather than waiting on delayed updates.
- Domain-Specific Fine-Tuning: Invest in curating high-quality training data from actual successful task trajectories within your industry, allowing AI models to learn from real-world examples rather than relying solely on general-purpose training.
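The governance step above can be illustrated with a small Python sketch: a single gateway through which agents read data, which enforces an access policy and records every attempt. `DataGateway`, the agent names, and the dataset names are hypothetical; a production system would rely on an existing identity and policy service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DataGateway:
    """Single entry point for agent data access (illustrative only)."""
    policy: dict[str, set[str]]       # agent name -> datasets it may read
    datasets: dict[str, list[str]]    # dataset name -> rows
    audit_log: list[tuple[str, str, bool]] = field(default_factory=list)

    def read(self, agent: str, dataset: str) -> list[str]:
        """Return a copy of the dataset if policy allows it; log every attempt."""
        allowed = dataset in self.policy.get(agent, set())
        self.audit_log.append((agent, dataset, allowed))
        if not allowed:
            raise PermissionError(f"{agent} may not read {dataset}")
        return list(self.datasets[dataset])


gateway = DataGateway(
    policy={"scheduler-agent": {"calendar"}},
    datasets={"calendar": ["Team sync 10:00"], "payroll": ["(restricted)"]},
)

print(gateway.read("scheduler-agent", "calendar"))

try:
    gateway.read("scheduler-agent", "payroll")
except PermissionError as error:
    print("denied:", error)

print(gateway.audit_log)
```

Routing every read through one place is what gives the oversight described above: denied requests are logged as well as granted ones, so reviewers can see what an agent attempted, not just what it obtained.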
Amazon's research awards program is doubling down on this challenge. The company announced 68 award recipients representing 49 universities in 11 countries, with specific focus areas including agentic AI security, automated reasoning, and cybersecurity. Recipients have access to more than 700 Amazon public datasets and can utilize AWS AI/ML services and tools through promotional credits, signaling that the industry sees solving agentic AI challenges as a collaborative, global effort.
"AI is reshaping cybersecurity faster than ever in advancing how we detect threats and defend systems. At the same time, agentic AI requires stronger guarantees of safety, robustness, and trustworthiness," explained Wei Ding, Applied Science Manager for GuardDuty at AWS.
The research also carries implications for specialized domains like cryptocurrency and decentralized finance. The benchmark specifically tests the kind of complex, multi-step, multi-service coordination that crypto AI agents would need to perform reliably: managing DeFi portfolios across multiple protocols, monitoring governance proposals, rebalancing based on market conditions, and bridging assets between chains. For any organization building AI agents into production systems, the lesson is clear: model capability alone is insufficient. Success requires rethinking enterprise architecture, governance, and data infrastructure from the ground up.