FrontierNews.ai

A 20B Search Agent Just Beat Most Open-Source Models at Finding Relevant Documents

A new retrieval agent called Harness-1 averages 73% accuracy across eight document retrieval benchmarks by offloading routine bookkeeping to a stateful environment, letting the model focus on semantic search decisions. The 20-billion-parameter model, built on the open-source gpt-oss-20b foundation, was trained with reinforcement learning and released publicly by researchers from UC Berkeley, the University of Illinois Urbana-Champaign, and Chroma.

Why Does Separating Search Logic from Bookkeeping Matter?

Most search agents today pack everything into one growing transcript: deciding what to search for, remembering what they found, tracking which evidence matters, and knowing when to stop. This forces a single neural network to optimize search strategy and routine administrative tasks simultaneously, which researchers argue is inefficient.

Harness-1 takes a different approach called "stateful cognitive offloading." The model handles only the semantic decisions, while a structured environment manages the bookkeeping. Think of it like giving a researcher an assistant who maintains organized notes, candidate pools, and verification records, so the researcher can focus entirely on what to investigate next.

How Does the Harness Actually Work?

The system operates as a loop. At each turn, the environment renders a compact view of the current search state and recent actions. The model emits one structured command. The harness executes it, updates its internal state, and presents the next observation. This cycle repeats until the model decides to stop searching.
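The loop can be sketched in a few lines of Python. This is a minimal illustration and not the released harness code: SearchState, render, execute, and run_episode are hypothetical names, and the 40-turn default simply mirrors the turn cap the article reports for training.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SearchState:
    """Hypothetical stand-in for the harness's internal state."""
    history: list[str] = field(default_factory=list)
    done: bool = False


def render(state: SearchState, recent: int = 3) -> str:
    """Build a compact observation: a turn counter plus only the latest actions."""
    return f"turn={len(state.history)} recent={state.history[-recent:]}"


def execute(state: SearchState, command: str) -> None:
    """Apply one structured command and update the state in place."""
    state.history.append(command)
    if command == "end_search":
        state.done = True


def run_episode(policy: Callable[[str], str], max_turns: int = 40) -> SearchState:
    """Alternate between rendering, one model command, and execution."""
    state = SearchState()
    for _ in range(max_turns):
        observation = render(state)    # environment shows a compact view
        command = policy(observation)  # model emits exactly one command
        execute(state, command)        # harness runs it and updates state
        if state.done:                 # model chose to stop searching
            break
    return state
```

Here policy is any function that maps an observation string to a command string, so a scripted stub can stand in for the model while testing the loop.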

The environment maintains several pieces of state that the model can reference without carrying them in its own context:

  • Candidate Pool: Compressed and deduplicated documents retrieved so far, with redundant content removed automatically
  • Curated Set: The final ranked documents, capped at 30 items and tagged as very high, high, fair, or low importance
  • Evidence Graph: Extracted entities, dates, and relationships that show which documents mention the same people or events
  • Verification Cache: A record of claims the model has already checked, preventing redundant verification
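One way to picture these four pieces of state is as plain Python containers. The sketch below is an assumption about shape only: HarnessState and its method names are hypothetical, deduplication is reduced to an exact-text check, and the real harness may store things quite differently. The 30-item cap and the four importance tags come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable

IMPORTANCE_LEVELS = ("very high", "high", "fair", "low")  # best first
CURATED_CAP = 30


@dataclass
class HarnessState:
    candidate_pool: dict[str, str] = field(default_factory=dict)       # doc_id -> compressed text
    curated: dict[str, str] = field(default_factory=dict)              # doc_id -> importance tag
    evidence_graph: dict[str, set[str]] = field(default_factory=dict)  # entity -> doc_ids
    verification_cache: dict[str, bool] = field(default_factory=dict)  # claim -> verdict

    def add_candidate(self, doc_id: str, text: str) -> bool:
        """Add a document unless its id or exact text is already pooled."""
        if doc_id in self.candidate_pool or text in self.candidate_pool.values():
            return False
        self.candidate_pool[doc_id] = text
        return True

    def curate(self, doc_id: str, importance: str) -> None:
        """Tag a document, enforcing the tag vocabulary and the size cap."""
        if importance not in IMPORTANCE_LEVELS:
            raise ValueError(f"unknown importance: {importance}")
        if doc_id not in self.curated and len(self.curated) >= CURATED_CAP:
            raise ValueError("curated set is full")
        self.curated[doc_id] = importance

    def ranked(self) -> list[str]:
        """Return curated doc ids ordered from most to least important."""
        return sorted(self.curated, key=lambda d: IMPORTANCE_LEVELS.index(self.curated[d]))

    def link(self, entity: str, doc_id: str) -> None:
        """Record that a document mentions an entity."""
        self.evidence_graph.setdefault(entity, set()).add(doc_id)

    def verify(self, claim: str, checker: Callable[[str], bool]) -> bool:
        """Check a claim once; repeated calls reuse the cached verdict."""
        if claim not in self.verification_cache:
            self.verification_cache[claim] = checker(claim)
        return self.verification_cache[claim]
```

Because all of this lives in the environment, the model only needs to see a rendered summary of it at each turn.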

The model works through eight tools: fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search results are automatically compressed using sentence-level ranking, keeping only the top four sentences from each document.
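The article does not say how sentences are scored, so the following sketch swaps in a simple stand-in: count how many query words each sentence shares with the query, keep the four best, and restore document order. The function name compress and the word-overlap scoring are assumptions for illustration only.

```python
import re


def compress(document: str, query: str, top_k: int = 4) -> str:
    """Keep the top_k sentences by query-word overlap, in original order."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_terms = set(re.findall(r"\w+", query.lower()))

    def score(sentence: str) -> int:
        return len(query_terms & set(re.findall(r"\w+", sentence.lower())))

    # Stable sort: ties keep their original relative order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:top_k])  # restore document order for readability
    return " ".join(sentences[i] for i in keep)
```

A production system would more likely score sentences with a learned reranker or embeddings, but the keep-the-top-four shape stays the same.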

What Are the Performance Results?

Harness-1 reached an average accuracy of 73% on eight benchmarks spanning web search, financial documents, patents, and multi-hop question-answering tasks. This beat the next-best open-source competitor, Tongyi DeepResearch 30B, by 11.4 percentage points. Among all models tested, only Opus-4.6, a frontier closed-source model, scored higher on average.

The most striking finding emerged in the transfer tests. The model was trained on four benchmark families, with reinforcement learning applied only to SEC (Securities and Exchange Commission) financial filings. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks drawn from domains outside the training data, it gained 17.0 points. That is roughly 2.2 times the in-domain gain (17.0 / 7.9 ≈ 2.15), which suggests the learned search operations generalize well beyond the training domain.

How Was Harness-1 Trained?

Training was split into two phases. First, supervised fine-tuning taught the model to operate the harness interface correctly, using 899 trajectories filtered from a larger dataset. This stage used LoRA (low-rank adaptation) at rank 32 and ran for three epochs. Then reinforcement learning improved the search decisions themselves, using on-policy training with a 40-turn cap per episode and a reward given only at the end of each episode.
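The reported hyperparameters and the end-of-episode reward can be written down as a small sketch. SFTConfig, RLConfig, and terminal_rewards are hypothetical names, and how the final score is computed is not described in the article, so it is passed in as a plain number here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """Supervised fine-tuning settings as reported in the article."""
    lora_rank: int = 32
    epochs: int = 3
    num_trajectories: int = 899


@dataclass(frozen=True)
class RLConfig:
    """Reinforcement learning settings as reported in the article."""
    max_turns: int = 40
    on_policy: bool = True


def terminal_rewards(num_turns: int, final_score: float,
                     cfg: RLConfig = RLConfig()) -> list[float]:
    """Give zero reward on every turn except the last one of the episode."""
    if not 1 <= num_turns <= cfg.max_turns:
        raise ValueError(f"episode length must be between 1 and {cfg.max_turns}")
    return [0.0] * (num_turns - 1) + [final_score]
```

For a three-turn episode that ends with a score of 0.8, terminal_rewards(3, 0.8) returns [0.0, 0.0, 0.8], so the policy only learns from how the whole search turned out.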

A critical design choice addressed the cold-start problem. When the model makes its first successful search, the harness automatically seeds the curated set with eight reranked results at fair importance. The model then promotes strong documents and removes weak ones, which turns the task from building a set from scratch into refining one. This warm start proved essential to getting the policy to train well.
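A minimal sketch of that warm start, assuming the curated set is a mapping from document id to importance tag. The function names warm_start, promote, and remove are hypothetical, and the reranking step is assumed to have already produced an ordered list of ids.

```python
SEED_SIZE = 8             # number of reranked results seeded, per the article
SEED_IMPORTANCE = "fair"  # starting tag for every seeded document


def warm_start(curated: dict[str, str], reranked_ids: list[str]) -> dict[str, str]:
    """Seed an empty curated set from the first successful search only."""
    if curated or not reranked_ids:
        return dict(curated)  # already seeded, or the search returned nothing
    return {doc_id: SEED_IMPORTANCE for doc_id in reranked_ids[:SEED_SIZE]}


def promote(curated: dict[str, str], doc_id: str, importance: str = "high") -> dict[str, str]:
    """Return a copy with one document retagged at a higher importance."""
    updated = dict(curated)
    updated[doc_id] = importance
    return updated


def remove(curated: dict[str, str], doc_id: str) -> dict[str, str]:
    """Return a copy without the given document."""
    return {d: tag for d, tag in curated.items() if d != doc_id}
```

After seeding, the model's job is a sequence of promote and remove calls on an already plausible set, which is an easier starting point than an empty one.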

Steps to Deploy Harness-1 in Your Workflow

  • Download the Weights: The model weights and harness code are publicly released and can be served using common runtimes like vLLM, SGLang, or the Hugging Face Transformers library
  • Integrate as a Retrieval Stage: Use Harness-1 as a retrieval subagent that produces a ranked set of documents for a downstream answering model, rather than as a standalone question-answerer
  • Apply to Evidence-Seeking Tasks: The model excels at literature review, patent analysis, financial-filing research, fact-checking, and modular retrieval-augmented generation (RAG) workflows where documents must support an answer
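The retrieval-stage pattern in the second step can be sketched without committing to any particular serving runtime. In the example below, retrieve stands in for a call to a served Harness-1 instance and generate for the downstream answering model. Both are hypothetical callables that you would supply yourself, and the prompt wording is only an illustration.

```python
from typing import Callable

RankedDocs = list[tuple[str, str]]  # (doc_id, text), best first


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str], RankedDocs],  # hypothetical wrapper around the retrieval agent
    generate: Callable[[str], str],         # hypothetical wrapper around the answering model
    max_docs: int = 30,                     # matches the curated-set cap described earlier
) -> str:
    """Run retrieval first, then hand the ranked documents to a separate generator."""
    docs = retrieve(question)[:max_docs]
    context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    prompt = (
        "Answer using only the documents below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Keeping the two stages behind plain function boundaries makes it straightforward to swap the generator or evaluate the retrieval step on its own.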

What Are the Practical Applications?

The research team identified four primary use cases. Literature and patent review benefit from the evidence graph and curated set, which organize many sources into a structured format. Financial-filing analysis can recover specific facts, like an executive transition date, across multiple regulatory documents. Multi-hop fact-checking uses the fan_out_search and verify tools to resolve ambiguous entities before committing to an answer. Modular RAG feeds the curated set to a frozen generator, and better retrieval sets yield higher answer accuracy.

The model was trained on only 4,352 unique training items, far fewer than several comparable baselines, suggesting the stateful approach is data-efficient. Ablation tests confirmed the harness mechanisms matter: disabling all of them dropped recall by 12.2% on one benchmark, showing that the trained policy alone cannot rank documents effectively without the structured environment.

What Are the Limitations?

The evidence graph relies on regex extraction to find proper nouns, years, and dates, rather than full entity linking, which may miss some relationships. The verify tool is implemented as an LLM proxy that can make errors on ambiguous claims. Sentence-level compression may drop context tied to discourse structure, potentially losing nuance in how ideas connect across a document. The research team also reported point estimates without full confidence intervals, so exact uncertainty bounds are unclear.

Harness-1 represents a shift in how retrieval agents are designed. By moving bookkeeping into the environment and leaving semantic decisions to the policy, the approach achieves strong performance on diverse benchmarks while remaining interpretable and modular. The open release of weights and code means developers can integrate it into existing Hugging Face and open-source workflows immediately.