How AI Agents Are Learning to Do Real Science, With Less Human Guidance
The bottleneck in AI-powered scientific discovery is shifting from how we tell agents what to do to how we design the spaces where they work. A new study reveals that as large language models (LLMs) become more capable, the real challenge isn't prescribing detailed workflows but engineering environments that encourage productive exploration while preventing agents from gaming the system.
What Is Environment Engineering for AI Research?
Environment engineering means building the digital spaces, constraints, and tools that shape how AI agents behave during research. Think of it like setting up a lab for a talented PhD student: you don't micromanage every decision, but you do provide accountability systems, accurate feedback, collaboration tools, and mentor oversight. Researchers at Tsinghua University introduced EurekAgent, an agent system that applies this principle to autonomous scientific discovery.
The system coordinates off-the-shelf AI agents along four design dimensions: permissions engineering to prevent research-integrity violations, artifact engineering to structure solutions and logs as shared progress memory, budget engineering to enable cost-aware exploration within compute boundaries, and human-in-the-loop engineering to support easy human supervision and intervention. This approach leaves agents free to choose their own research strategies while operating within guardrails that suppress harmful behaviors like evaluation tampering and artifact manipulation.
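To make the four dimensions concrete, here is a minimal Python sketch of what such guardrails could look like. It is not EurekAgent's implementation: the class name, field names, directory paths, and dollar limit are all hypothetical, chosen only to illustrate a write allowlist (permissions), a spending ledger (artifacts and budget), and a pause flag (human oversight).

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the budget."""


@dataclass
class ResearchEnvironment:
    # Hypothetical defaults for illustration only.
    writable_dirs: tuple = ("workspace/solutions", "workspace/logs")
    budget_usd: float = 20.0
    spent_usd: float = 0.0
    paused: bool = False  # a human supervisor can flip this flag
    ledger: list = field(default_factory=list)  # shared record of spending

    def can_write(self, path):
        """Permissions: allow writes only under allowlisted directories."""
        parts = PurePosixPath(path).parts
        if ".." in parts:  # reject attempts to climb out of the workspace
            return False
        for allowed in self.writable_dirs:
            prefix = PurePosixPath(allowed).parts
            if parts[:len(prefix)] == prefix:
                return True
        return False

    def charge(self, cost_usd, note):
        """Budget: record a cost, refusing it if paused or over the limit."""
        if self.paused:
            raise RuntimeError("environment paused by a human supervisor")
        if self.spent_usd + cost_usd > self.budget_usd:
            raise BudgetExceeded(f"{note}: would exceed ${self.budget_usd:.2f}")
        self.spent_usd += cost_usd
        self.ledger.append((note, cost_usd))

    def remaining(self):
        """Budget left for further exploration, in dollars."""
        return self.budget_usd - self.spent_usd
```

In this sketch the agent never sees a prescribed workflow. It only finds that writes to the evaluator are refused and that expensive actions stop once the budget runs out.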
How Does EurekAgent Achieve State-of-the-Art Results?
- Mathematics Tasks: EurekAgent discovered new state-of-the-art solutions for circle packing problems, surpassing the best previously reported AI and human results at an average API cost below $17, including a 26-circle packing breakthrough that cost just $11.
- Kernel Engineering: The system achieved new state-of-the-art performance on kernel optimization tasks, which involve tuning low-level compute code for speed and efficiency.
- Machine Learning Engineering: EurekAgent ranked first on the evaluated subset of MLE-Bench, a benchmark that measures how well AI systems handle machine learning engineering tasks such as training and improving models.
The results matter because they demonstrate that environment design can unlock capabilities already present in general-purpose AI agents. The researchers note that systems like Claude Code and Codex, when given a clear research task and an optimizable metric, can already discover new scientific solutions without specialized research workflows. The key insight is that as these agents become more capable, the limiting factor shifts from agent intelligence to environment design.
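What does an "optimizable metric" look like in practice? The Python sketch below scores a circle packing under an assumed formulation: non-overlapping circles inside the unit square, scored by the sum of their radii. The article does not state which circle packing variant EurekAgent tackled, so the objective, the function name, and the tolerance are assumptions for illustration.

```python
import math


def score_packing(circles, tol=1e-9):
    """Return the sum of radii if the packing is valid, otherwise None.

    circles: list of (x, y, r) tuples. Assumed rules (not from the study):
    every circle lies inside the unit square and no two circles overlap.
    """
    for x, y, r in circles:
        if r <= 0:
            return None
        # Containment: the circle must not cross any side of the square.
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return None
    for i in range(len(circles)):
        xi, yi, ri = circles[i]
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            # Overlap: centers closer than the sum of the two radii.
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return None
    return sum(r for _, _, r in circles)
```

A checker like this gives an agent accurate feedback on every attempt, and because an invalid packing gets no score at all, there is nothing to gain from submitting a solution that breaks the rules.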
Why Is This Different From Previous AI Research Systems?
Most existing autonomous research systems rely on prescribing specific workflows. Evolutionary systems maintain populations of candidate programs and use feedback to guide mutation and selection. Machine learning systems organize exploration around solution trees and role-specialized agents. More recent systems introduce structured debate, periodic self-review, and self-learning modules. While effective, these designs encode strong assumptions about how research should proceed.
EurekAgent takes a different approach. Rather than telling agents exactly how to conduct research, it creates an environment where agents can explore freely while being constrained by permissions, budgets, and oversight mechanisms. This mirrors how human scientists work: they have autonomy in their methods but operate within institutional review boards, funding limits, and peer review systems.
What Are the Practical Implications for Scientific Research?
The shift toward environment engineering has real consequences for how AI accelerates discovery. When agents can operate with lower API costs and fewer human instructions, scientific research becomes more scalable. The $11 circle-packing result suggests that breakthrough discoveries don't always require massive computational budgets; they can come from well-designed environments.
This also addresses a critical problem in autonomous research: reliability. Scientific discovery requires rigor, reproducibility, and inspectability. Agents may contaminate evaluations, manipulate artifacts, or fail to follow procedural constraints. Reward hacking and observability failures have already been reported in agentic research systems. By engineering environments that suppress these harmful behaviors while amplifying productive ones, researchers can build agents that are both capable and trustworthy.
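One simple way to detect evaluation tampering or artifact manipulation is to fingerprint protected files before and after an agent runs. The Python sketch below does this with SHA-256 digests from the standard library. It is an illustrative assumption, not a mechanism described in the study, and the file names in the usage note are hypothetical.

```python
import hashlib


def snapshot(files):
    """Map each file name to the SHA-256 hex digest of its contents.

    files: dict of name -> bytes. In practice the bytes would be read
    from disk before and after an agent session.
    """
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def changed_files(before, after):
    """Return the sorted names added, removed, or modified between snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))
```

If a hypothetical evaluate.py shows up in the output of changed_files after a run, the reported score should not be trusted until a human has inspected the change.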
How Are Undergraduate Researchers Contributing to AI for Science?
The shift toward AI-driven scientific discovery is also reshaping academic pathways. Ronit Kumar Choudhary, a 21-year-old third-year student at Newton School of Technology, co-authored a paper on retrosynthesis accepted at the AI for Science Workshop at ICML 2026, one of the world's leading machine learning conferences.
Retrosynthesis is a fundamental challenge in computational chemistry: given a target molecule, the task is to work backward to the sequence of chemical reactions needed to synthesize it. Solving this efficiently has major implications for drug development and materials innovation. Choudhary's paper, titled "RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking," proposes a two-stage approach combining a Transformer-based model for generating candidate synthesis routes with a ranking system that evaluates and prioritizes the most promising options.
The model performs strongly on the USPTO-50K benchmark, reaching 55.00% top-1 accuracy, 86.18% top-10 accuracy, and 99.86% validity at top-1. The reranking module raises top-1 accuracy further to 59.4%, a gain of 4.4 percentage points. In practical terms, the system helps chemists narrow a large set of possible pathways to the most promising ones, improving the speed and reliability of research in computational chemistry.
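Top-k accuracy and reranking are easy to pin down in code. The Python sketch below shows how the metric is typically computed and why reordering candidates can lift top-1 accuracy without changing top-10. It is not the RETROSPECT code: the function names are hypothetical, and the scoring function stands in for whatever ranking model the paper actually uses.

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of examples whose target appears in the first k candidates.

    predictions: list of ranked candidate lists, best candidate first.
    targets: list of correct answers, one per example.
    """
    hits = sum(1 for cands, t in zip(predictions, targets) if t in cands[:k])
    return hits / len(targets)


def rerank(candidates, score_fn):
    """Reorder candidates so the highest-scoring one comes first.

    score_fn is a placeholder for a learned ranking model.
    """
    return sorted(candidates, key=score_fn, reverse=True)
```

Reranking only reorders the candidates the generator already produced. It can move a correct answer from fifth place to first, but it cannot rescue an example where the correct route was never generated.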
Choudhary's journey reflects a broader trend in India's AI ecosystem. He moved from discovering AI through coursework to building applied projects, securing a paid AI/ML internship at Mstack AI, and contributing to a global AI-for-science research forum within a short span of time. During a focused 45-day internship sprint, he contributed to the development of RETROSPECT under the guidance of the research team, gaining exposure to real-world AI research workflows at an early stage in his career.
How Are Companies Scaling Research Capabilities With Multi-Model Systems?
Beyond academic research, companies are also engineering environments that coordinate multiple AI models for complex research tasks. Perplexity moved its Deep Research feature into Computer, a multi-model orchestration system that routes research subtasks across 20 or more frontier models.
Deep Research in Computer breaks hard questions into subtasks and routes them to the model best suited for each task. A legal reasoning model handles contract review, a data model handles spreadsheet variance checks, and a writing model handles the final draft. The system uses "Search as Code," which lets the model write code that runs thousands of retrieval steps in parallel, tailored to each question. This differs from a fixed pipeline that runs the same steps every time; code-driven search lets the system branch, compare, and refine as it learns.
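The two ideas in this description, routing subtasks by type and running many retrieval steps in parallel, can be sketched in a few lines of Python. Nothing here reflects Perplexity's actual API: the model names, the routing table, and the fake_retrieve stand-in are hypothetical, and only the standard-library ThreadPoolExecutor is real.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table: subtask kind -> model name.
ROUTES = {
    "legal": "legal-reasoning-model",
    "data": "data-model",
    "writing": "writing-model",
}


def route(subtask_kind, default="general-model"):
    """Pick the model assumed to be best suited to a kind of subtask."""
    return ROUTES.get(subtask_kind, default)


def fake_retrieve(query):
    """Stand-in for a real search call; returns a labeled placeholder."""
    return f"result for {query!r}"


def parallel_search(queries, retrieve=fake_retrieve, max_workers=8):
    """Run one retrieval per query concurrently, preserving query order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(retrieve, queries))
```

Because the list of queries is built by code rather than fixed in advance, a system along these lines could generate follow-up queries from early results and search again, which is the branching behavior a fixed pipeline lacks.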
The performance gains are substantial. On BrowseComp, a benchmark that tests an agent's ability to find hard-to-locate information through browsing, accuracy jumped from 40.7% to 83.8%, a gain of 43.1 percentage points. On Humanity's Last Exam, which covers expert questions across many academic subjects, performance rose from 36.4% to 50.5%, a gain of 14.1 points. The system reads internal files and the live web, cites every claim inline, and produces ready-to-use deliverables including reports, briefs, decks, dashboards, and live spreadsheets.
What Should Researchers and Organizations Know About These Advances?
- Environment Design Matters More Than Workflow Prescription: As AI agents become more capable, the bottleneck shifts from telling agents what to do to designing the spaces where they operate, including permissions, budgets, oversight, and artifact management systems.
- Cost-Effective Breakthroughs Are Possible: State-of-the-art scientific discoveries don't always require massive computational budgets; EurekAgent's circle-packing results came at an average API cost below $17.
- Undergraduate Researchers Are Entering Frontier AI Earlier: Students are increasingly engaging with cutting-edge AI research during their undergraduate years, supported by internships and open research collaboration, reshaping traditional academic pathways in machine learning and scientific research.
- Multi-Model Systems Outperform Single-Model Approaches: Perplexity's reported benchmark gains suggest that routing specialized research subtasks to different models can deliver better accuracy than sending every task to a single model.
The convergence of these trends suggests that autonomous scientific discovery is becoming more accessible, more reliable, and more integrated into both academic and commercial research pipelines. The focus is shifting from building smarter agents to building smarter environments where agents can operate with autonomy, accountability, and oversight.