Cisco's Claude Code Agent Just Beat AI's Top Prompt Optimizer by 14 Points

FrontierNews.ai AI Research Desk

Cisco AI has released FAPO, a Claude Code-driven system that automatically optimizes multi-step language model pipelines, beating the previous state-of-the-art prompt optimizer by an average of 14 percentage points across 18 model-benchmark comparisons. The system addresses a persistent pain point in AI development: getting prompts right is notoriously difficult, and small wording changes can swing accuracy by 20 percent or more. When a complex pipeline fails, developers typically have to manually inspect intermediate outputs to find where things went wrong.

FAPO stands for Fully Automated Prompt Optimization. Rather than requiring human trial-and-error, the system uses Claude Code agents to orchestrate a closed-loop optimization process. You supply a dataset of test cases and an initial prompt, and FAPO evaluates, classifies failures, proposes variants, validates them, and iterates automatically until it reaches your target accuracy.

How Does FAPO Actually Work?

The system operates through six stages in each optimization cycle. First, it evaluates the entire chain on your dataset and collects per-case scores. Next, it attributes failures by root cause, using both rule-based heuristics and LLM analysis to understand what went wrong. Then it proposes a variant targeting the dominant failure cluster, which an independent agent reviews for scope compliance and data leakage. The system accepts the variant only if it improves on the previous best; otherwise, it rejects it and tries again.
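The cycle described above can be sketched in a few lines of Python. This is an illustrative outline only, not FAPO's actual code: the names `EvalResult`, `optimize`, and the `toy_*` helpers are hypothetical, and the toy scorer simply rewards prompts that contain certain phrases.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalResult:
    accuracy: float            # aggregate score in [0, 1]
    failures: Dict[str, int]   # failure category -> number of failing cases


def optimize(
    initial: str,
    evaluate: Callable[[str], EvalResult],
    propose: Callable[[str, str], str],
    review: Callable[[str, str], bool],
    target: float,
    max_iters: int = 10,
) -> Tuple[str, float, List[str]]:
    """Hypothetical closed loop: evaluate, attribute, propose, review, accept or reject."""
    best, best_result = initial, evaluate(initial)
    log: List[str] = []
    for i in range(1, max_iters + 1):
        if best_result.accuracy >= target or not best_result.failures:
            break
        # Attribute: pick the dominant failure cluster.
        dominant = max(best_result.failures, key=best_result.failures.get)
        candidate = propose(best, dominant)
        # Independent review happens before the candidate is evaluated.
        if not review(best, candidate):
            log.append(f"iter {i}: reviewer rejected {dominant} fix")
            continue
        result = evaluate(candidate)
        # Accept only if the candidate beats the previous best.
        if result.accuracy > best_result.accuracy:
            best, best_result = candidate, result
            log.append(f"iter {i}: accepted {dominant} fix")
        else:
            log.append(f"iter {i}: rejected {dominant} fix (no improvement)")
    return best, best_result.accuracy, log


# Toy stand-ins (assumptions for illustration, not real FAPO components).
def toy_evaluate(prompt: str) -> EvalResult:
    failures: Dict[str, int] = {}
    if "JSON" not in prompt:
        failures["format"] = 3
    if "step by step" not in prompt:
        failures["reasoning"] = 2
    return EvalResult(1.0 - sum(failures.values()) / 10, failures)


def toy_propose(prompt: str, category: str) -> str:
    fixes = {"format": " Answer in JSON.", "reasoning": " Think step by step."}
    return prompt + fixes[category]


def toy_review(old: str, new: str) -> bool:
    return new.startswith(old)  # crude scope check: the edit may only append


best, accuracy, log = optimize(
    "Answer the question.", toy_evaluate, toy_propose, toy_review, target=0.9
)
print(best, accuracy, log)
```

In this toy run the loop fixes the larger failure cluster first, then the smaller one, and stops once the target is met.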

What makes FAPO different from earlier approaches is that it works at three escalating levels of complexity. Prompt edits are tried first because they are the cheapest. If those don't work, the system moves to parameter changes, adjusting configuration values like retrieval depth or temperature. Only if those fail does it escalate to structural changes, such as adding a self-reflection node or switching to a ReAct reasoning pattern. This staged approach prevents unnecessary complexity.
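A minimal sketch of that escalation ladder follows, assuming a hypothetical `try_level` callback that returns the best score found at a given level. The level names and the scores in the example are invented for illustration and do not come from Cisco's results.

```python
from typing import Callable, List, Optional, Tuple

# Cheapest intervention first, most invasive last (names are assumptions).
LEVELS: List[str] = ["prompt", "parameter", "structure"]


def escalate(
    baseline: float,
    try_level: Callable[[str], Optional[float]],
) -> Tuple[Optional[str], float, List[str]]:
    """Try each level in order and stop at the first one that beats the baseline."""
    attempted: List[str] = []
    for level in LEVELS:
        attempted.append(level)
        score = try_level(level)
        if score is not None and score > baseline:
            return level, score, attempted
    return None, baseline, attempted


# Hypothetical outcome: only a structural change helps.
fake_scores = {"prompt": 0.40, "parameter": 0.40, "structure": 0.62}
level, score, attempted = escalate(0.40, fake_scores.get)
print(level, score, attempted)
```

Because the loop returns as soon as a level improves on the baseline, structural changes are reached only after the cheaper options have been tried.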

What Types of Failures Can FAPO Identify and Fix?

The system classifies failures into four distinct categories, each pointing to a different fix. Retrieval failures occur when the pipeline returns empty or irrelevant content. Cascading failures happen when an early step produces empty output, breaking everything downstream. Format failures hide the correct answer inside text the scorer cannot parse. Reasoning failures occur when good inputs still produce a wrong conclusion. Format and reasoning issues are addressable through prompt changes, while retrieval and cascade issues typically require structural fixes.

Retrieval Failures: Empty or irrelevant content returned by the pipeline, typically requiring structural changes like improved document retrieval methods.
Cascading Failures: Early steps producing empty output that breaks downstream processing, fixed through pipeline restructuring.
Format Failures: Correct answers hidden in unparseable text, addressable through prompt refinement and clearer output instructions.
Reasoning Failures: Good inputs still producing wrong conclusions, typically fixed through prompt optimization or adding reasoning steps.
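The four categories lend themselves to a simple rule-based triage, sketched below. This covers only the heuristic half of what the article describes (FAPO also uses LLM analysis), and the `Trace` fields, the `classify` function, and the rule ordering are assumptions made for the example. The check for irrelevant retrieved content is omitted because it would need a relevance model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    retrieved: List[str]          # documents returned by the retrieval step
    step_outputs: List[str]       # output of each pipeline step, in order
    final_output: str             # raw text of the last step
    expected: str                 # gold answer
    parsed_answer: Optional[str]  # what the scorer extracted, None if unparseable


# Fix level suggested by the article for each category.
SUGGESTED_FIX = {
    "retrieval": "structure",
    "cascade": "structure",
    "format": "prompt",
    "reasoning": "prompt",
}


def classify(trace: Trace) -> str:
    """Hypothetical heuristic: return 'pass' or one of the four failure categories."""
    if trace.parsed_answer == trace.expected:
        return "pass"
    # Retrieval: nothing usable came back (empty-content case only).
    if not any(doc.strip() for doc in trace.retrieved):
        return "retrieval"
    # Cascade: an earlier step produced empty output.
    if any(not out.strip() for out in trace.step_outputs[:-1]):
        return "cascade"
    # Format: the right answer is in the text, but the scorer could not extract it.
    if trace.expected.lower() in trace.final_output.lower():
        return "format"
    # Otherwise the inputs were fine and the conclusion was wrong.
    return "reasoning"


example = Trace(
    retrieved=["Paris is the capital of France."],
    step_outputs=["capital of France?", "The answer is probably Paris, I think."],
    final_output="The answer is probably Paris, I think.",
    expected="Paris",
    parsed_answer=None,
)
print(classify(example), SUGGESTED_FIX[classify(example)])
```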

How Did FAPO Perform Against Existing Methods?

Cisco evaluated FAPO against GEPA (Genetic-Pareto), the previous state-of-the-art prompt optimization method. GEPA evolves prompts through reflective search, using a reflector model to propose revisions and Pareto-based selection to decide which candidates survive, but it's limited to prompt-level changes only. FAPO, by contrast, can escalate to structural changes when it detects bottlenecks.

The comparison spanned six benchmarks and tested three different language models: GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B. Claude Opus 4.6 served as both FAPO's orchestrator and GEPA's reflector. FAPO won 15 of 18 model-benchmark comparisons, with a mean gain of 14.1 percentage points over GEPA. On the two benchmarks where FAPO escalated to pipeline changes (HoVer and IFBench), it won all six model-benchmark pairs with a mean gain of 33.8 percentage points. AIME was the only benchmark where GEPA led, by 3.1 percentage points, but that gap falls within the standard deviation of stochastic trials.

In concrete terms, on a multi-hop question-answering task, a chain that started at 39.3 percent accuracy rose to 70.3 percent validation exact match across two iterations. On the IFBench instruction-following benchmark, FAPO reached 80.7 percent test accuracy. On HotpotQA, FAPO achieved 68.3 percent test accuracy versus GEPA's 61.8 percent, a margin of 6.5 percentage points.

What Safeguards Prevent Overfitting?

FAPO includes multiple guardrails to prevent the optimizer from gaming the system. It inspects only training-split cases during optimization, while validation and test sets expose only aggregate scores, never individual cases. Every variant is saved as a new immutable file and never edited in place, creating a full audit trail. An independent reviewer checks each proposal before it runs, ensuring scope compliance and preventing data leakage.
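Two of these guardrails are easy to picture in code. The sketch below is an assumption-laden illustration, not FAPO's implementation: `save_variant`, `HeldOutSplit`, and the `variant_NNN.txt` naming scheme are hypothetical.

```python
import stat
import tempfile
from pathlib import Path
from typing import Iterable, List


def save_variant(directory: Path, text: str) -> Path:
    """Write each variant to a new numbered file and never overwrite an old one."""
    existing = sorted(directory.glob("variant_*.txt"))
    path = directory / f"variant_{len(existing) + 1:03d}.txt"
    # Mode "x" raises FileExistsError instead of silently replacing a file.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(text)
    # Mark the file read-only as an extra hint that it should not be edited.
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path


class HeldOutSplit:
    """Exposes only an aggregate score for validation or test data."""

    def __init__(self, per_case_scores: Iterable[float]) -> None:
        # Leading underscore is a convention, not enforcement, in Python.
        self._scores: List[float] = list(per_case_scores)

    def mean_score(self) -> float:
        return sum(self._scores) / len(self._scores)


with tempfile.TemporaryDirectory() as tmp:
    first = save_variant(Path(tmp), "Answer the question.")
    second = save_variant(Path(tmp), "Answer the question in JSON.")
    print(first.name, second.name)

validation = HeldOutSplit([1.0, 0.0, 1.0, 1.0])
print(validation.mean_score())
```

Keeping every variant as a separate file is what makes the audit trail possible: any accepted or rejected proposal can be compared against its predecessor later.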

How Can Developers Get Started?

The system is open source under the Apache 2.0 license and also supports OpenAI's Codex as an alternative optimization agent. The fastest path to using FAPO is to let Claude Code create the tenant files automatically. You describe your task in plain English and provide a JSONL dataset where each line contains a test case with a case ID, task type, context, expected output, and metadata. From there, Claude can scaffold the initial prompt, chain definition, and scorer. The core engine, called hephaestus, is domain-agnostic and handles evaluation, chain execution, and scoring. Out of the box, FAPO supports three providers: OpenAI, Baseten, and SageMaker.
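To make the dataset format concrete, here is a rough sketch of loading and checking a JSONL file with the five pieces of information the article lists. The exact field names (`case_id`, `task_type`, `context`, `expected_output`, `metadata`) are guesses, as are the sample records. Check the project's documentation for the real schema.

```python
import json
from typing import Dict, List

# Assumed field names; the article names the fields only in prose.
REQUIRED_FIELDS = ("case_id", "task_type", "context", "expected_output", "metadata")


def parse_jsonl(text: str) -> List[Dict]:
    """Parse one JSON object per line and verify the assumed required fields."""
    cases: List[Dict] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        cases.append(record)
    return cases


# Invented sample records, for illustration only.
sample = "\n".join(
    json.dumps(case)
    for case in [
        {
            "case_id": "qa-001",
            "task_type": "multi_hop_qa",
            "context": "Which country is the Eiffel Tower in?",
            "expected_output": "France",
            "metadata": {"split": "train"},
        },
        {
            "case_id": "qa-002",
            "task_type": "multi_hop_qa",
            "context": "What is the capital of that country?",
            "expected_output": "Paris",
            "metadata": {"split": "train"},
        },
    ]
)

cases = parse_jsonl(sample)
print(len(cases), cases[0]["case_id"])
```

Validating the file up front gives a clear error with a line number, rather than a failure partway through an optimization run.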

The system works with multi-step LLM pipelines across diverse tasks. Real-world use cases include multi-hop question answering, instruction following, classification tasks, and ReAct agents that use tools. In each case, FAPO can identify whether the bottleneck is in retrieval, reasoning, formatting, or pipeline structure, then propose and validate fixes automatically.

For teams building reliable LLM applications at scale, FAPO represents a significant shift from manual prompt engineering to automated optimization. By combining Claude Code's orchestration capabilities with step-level failure attribution, the system addresses one of the most time-consuming aspects of deploying language models in production.
