FrontierNews.ai

DeepSeek Researcher Uses AI to Write a 46-Page Survey on AI Research Agents

DeepSeek researcher Deli Chen has completed a comprehensive 46-page academic survey on autonomous research agents by collaborating with two AI systems, demonstrating how rapidly AI is shifting from research tool to independent researcher. The paper, titled "From Copilots to Colleagues: A Survey of Autonomous Research Agents," was written with DeepSeek-V4-Pro handling text generation and GPT-Image2 creating visualizations, with Chen contributing approximately 1% of the content himself.

This unconventional collaboration reveals a striking reality about AI's evolving role in scientific work. The project consumed 648,000 tokens across 108 rounds of agent interaction, spanned 6 calendar days, and produced 2,234 lines of LaTeX. Yet Chen estimates his total "CPU time truly spent on thinking" was less than 2 hours. The first draft alone took 76 minutes to generate. All 103 references were verified, and the paper grew from 45 to 46 pages, including 7 figures and 4 tables.
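To put those figures in proportion, the short calculation below derives a few per-round and per-page averages. The inputs are the numbers quoted above; the derived ratios are plain arithmetic and were not reported by Chen. The "thinking share" compares his stated upper bound of 2 hours with the 6 calendar days, which is a different measure from the roughly 1% of content he says he wrote.

```python
# Inputs quoted in the article.
TOTAL_TOKENS = 648_000
ROUNDS = 108
CALENDAR_DAYS = 6
LATEX_LINES = 2_234
PAGES = 46
THINKING_HOURS_UPPER_BOUND = 2  # Chen's estimate was "less than 2 hours"

# Derived averages (illustrative only, not reported in the survey).
tokens_per_round = TOTAL_TOKENS / ROUNDS
rounds_per_day = ROUNDS / CALENDAR_DAYS
latex_lines_per_page = LATEX_LINES / PAGES
thinking_share = THINKING_HOURS_UPPER_BOUND / (CALENDAR_DAYS * 24)

print(f"{tokens_per_round:.0f} tokens per round")
print(f"{rounds_per_day:.0f} rounds per calendar day")
print(f"{latex_lines_per_page:.1f} LaTeX lines per page")
print(f"at most {thinking_share:.1%} of calendar time spent thinking")
```

The averages work out to 6,000 tokens per round and 18 rounds per day, with human thinking time at no more than about 1.4% of the elapsed calendar time.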

Chen, a core contributor to DeepSeek's V1, V2, V3, V4, R1, DeepSeek-Coder, and DeepSeek-MoE models, framed this as a personal research project rather than an official company contribution. He emphasized that the views expressed do not represent DeepSeek's positions. Even so, the experiment itself carries significant implications for how scientific research is conducted.

What Are Autonomous Research Agents and Why Do They Matter?

Chen's survey systematically analyzes 17 mainstream autonomous research agent systems and covers more than 95 papers in the field. The core concept is transformative: given a scientific research goal, AI can independently complete the entire cycle, from hypothesis formulation through experimental design, code execution, and result analysis to paper writing, without requiring human approval at each step.

The speed of this transformation is remarkable. In just 18 months, the share of real GitHub issues that AI systems resolve on SWE-bench, a benchmark of software engineering capability, climbed from less than 5% to over 70%. Some systems can now produce complete academic papers at a cost of approximately $15 per paper, and those papers have passed initial human review. Others have discovered new mathematical structures beyond known boundaries without human guidance.

This represents a fundamental shift in AI's role. Traditional AI tools function as research assistants, helping with literature searches, organizing tables, and executing code, but they require human direction at each step. The new generation of autonomous agents operates differently: they formulate hypotheses, design experiments, execute code, analyze results, write reports, and even self-review and iterate without needing human approval at intermediate checkpoints.

How Does Chen's Five-Level Classification System Work?

Chen's survey introduces a classification framework for autonomy levels, analogous to the SAE (Society of Automotive Engineers) standard for vehicle automation. This framework helps establish order in what he describes as a "chaotic landscape" of emerging autonomous research systems.

  • Level 1 (Auto-Completion): The most common state, where tools like GitHub Copilot predict the next line of code but users maintain complete control. Productivity rises by 30% to 55%, but the tool has no true autonomy.
  • Level 2 (Task Execution): The level most people experience with ChatGPT and Claude daily. AI decomposes tasks and calls tools but requires user approval at each step, with humans making strategic decisions.
  • Level 3 (Multi-Step Autonomy with Checkpoints): Today's mainstream agentic coding tools, such as Claude Code and Cursor Agent. AI independently executes dozens of operations between predetermined checkpoints and seeks confirmation only when a step would exceed its preset scope.
  • Level 4 (End-to-End Full Automation): The current technological frontier, where systems like Devin, SWE-Agent, and AI Scientist work independently for hours or days and produce complete results. Users only evaluate final outputs.
  • Level 5 (Autonomous Setting of Research Agenda): Still a vision, where systems not only execute research but choose what problems to investigate, allocate resources, and accumulate knowledge over weeks to months. No existing system has fully achieved this level, though Google's Co-Scientist shows partial autonomous hypothesis generation and DeepMind's FunSearch has discovered real mathematical knowledge through iterative program search.

Chen's classification depicts a clear evolutionary path from "helping you work" to "thinking for you," and identifies the technological gaps between each level.
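One way to read the five levels is as an ordered scale decided by a few yes/no questions about who approves what. The sketch below encodes that reading in Python. The class names, the `AgentProfile` fields, and the decision order are illustrative assumptions drawn from the level descriptions above, not formal criteria taken from the survey.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """The five levels as summarized in the article, ordered by autonomy."""
    AUTO_COMPLETION = 1
    TASK_EXECUTION = 2
    CHECKPOINTED_AUTONOMY = 3
    END_TO_END_AUTOMATION = 4
    AGENDA_SETTING = 5


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical description of a system; field names are illustrative."""
    decomposes_tasks: bool          # plans and calls tools, beyond next-line prediction
    needs_approval_each_step: bool  # a human signs off on every action
    pauses_at_checkpoints: bool     # runs many steps, then confirms at preset points
    chooses_own_problems: bool      # decides what to investigate


def classify(profile: AgentProfile) -> AutonomyLevel:
    """Map a profile to a level using a simplified reading of the framework."""
    if not profile.decomposes_tasks:
        return AutonomyLevel.AUTO_COMPLETION
    if profile.needs_approval_each_step:
        return AutonomyLevel.TASK_EXECUTION
    if profile.pauses_at_checkpoints:
        return AutonomyLevel.CHECKPOINTED_AUTONOMY
    if profile.chooses_own_problems:
        return AutonomyLevel.AGENDA_SETTING
    return AutonomyLevel.END_TO_END_AUTOMATION


# A tool that runs many steps on its own and only stops at checkpoints.
checkpointed = AgentProfile(True, False, True, False)
print(classify(checkpointed).name)
```

Because the levels form an `IntEnum`, they can be compared directly, for example to check whether a system sits at or above Level 4.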

What Are the Practical Implications for Researchers and Enterprises?

Chen's observation about "crazy inflation in computer science papers" suggests that autonomous research agents could fundamentally alter academic productivity and publication rates. Work that previously required at least a month can now be completed in days or hours, raising questions about quality control, verification, and the nature of scientific contribution.

For enterprises considering DeepSeek deployment, privacy and data residency remain critical considerations. DeepSeek's privacy policy states that personal data may be collected, processed, and stored directly in the People's Republic of China. Indian users and businesses should avoid pasting personal data, customer records, employee information, confidential documents, credentials, source code, contracts, KYC files, health data, financial data, or legal material into the public DeepSeek app unless the use case has been formally assessed.

For organizations seeking local deployment options, consumer-grade NVIDIA RTX GPUs can run smaller DeepSeek models effectively. An RTX 3060 with 12GB of VRAM is practical for DeepSeek R1 7B and 8B quantized models, an RTX 4090 with 24GB is well-suited for 32B quantized inference, and an RTX 5090 with 32GB provides more room for 32B workloads and limited 70B experiments. However, full DeepSeek V4-class or R1 671B deployments remain workstation, multi-GPU, or server-class projects rather than single-card desktop tasks.
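The GPU guidance above amounts to a small lookup from available VRAM to the largest quantized model class that is practical. A minimal sketch of that lookup follows. The tier values come from the paragraph above; the table and function names are hypothetical, and the result is rough guidance for quantized models rather than a guarantee that a given model will load.

```python
from typing import Optional

# (minimum VRAM in GB, example card, largest practical quantized model class)
# Tier values are the ones quoted in the article.
GPU_TIERS = [
    (12, "RTX 3060", "7B-8B"),
    (24, "RTX 4090", "32B"),
    (32, "RTX 5090", "32B, limited 70B experiments"),
]


def largest_practical_model(vram_gb: float) -> Optional[str]:
    """Return the model class of the highest tier the card meets, if any."""
    best = None
    for min_vram, _card, model_class in GPU_TIERS:
        if vram_gb >= min_vram:
            best = model_class
    return best


for vram in (8, 12, 16, 24, 32):
    print(vram, "GB ->", largest_practical_model(vram))
```

Cards that fall between tiers, such as a 16GB card, map to the tier below them. Cards under 12GB return `None`, because the article gives no guidance for them.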

How to Deploy DeepSeek Models Locally on Consumer GPUs

  • Choose the Right Model Size: Most local users run smaller DeepSeek R1 Distill models in quantized formats through tools like Ollama, llama.cpp, LM Studio, vLLM, or text-generation-webui, not the full 671B parameter model.
  • Match VRAM to Your Target: RTX 3060 12GB handles 7B and 8B quantized models comfortably; RTX 4090 24GB is excellent for 32B quantized inference; RTX 5090 32GB runs 32B comfortably and leaves room for limited 70B experiments, but full V4-class workloads remain beyond a single card.
  • Account for Context and Overhead: Local inference requires memory for model weights, runtime overhead, and the KV cache used for context. A model may load at short context lengths but fail or slow significantly when context increases, so plan accordingly.
  • Use Quantization to Reduce Memory: Q4 quantization produces smaller models than Q8 or FP16, though quality and speed vary by model, framework, and task. This is essential for fitting larger models into consumer VRAM.
  • Assess Your Use Case: For generic, non-personal, non-confidential tasks, public DeepSeek services are acceptable. For sensitive data, self-hosted or locally hosted models on organization-controlled infrastructure provide better data residency and governance.
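The memory points in the list above can be made concrete with a back-of-the-envelope estimate: weights, plus KV cache, plus a fixed runtime allowance. The sketch below is a rough model under stated assumptions, not a measurement. The 4.5 bits per weight for Q4, the 1 GiB overhead, the 16-bit KV cache, and the 32B model shape (64 layers, 8 KV heads, head dimension 128) are all illustrative values that do not come from the article, and real usage varies by framework.

```python
GIB = 2**30


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache: one key and one value vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / GIB


def total_gib(params_billion: float, bits_per_weight: float, layers: int,
              kv_heads: int, head_dim: int, context_tokens: int,
              overhead_gib: float = 1.0) -> float:
    """Weights + KV cache + a flat runtime allowance (assumed, not measured)."""
    return (weight_gib(params_billion, bits_per_weight)
            + kv_cache_gib(layers, kv_heads, head_dim, context_tokens)
            + overhead_gib)


# Assumed shape for a 32B model at roughly 4.5 bits per weight (Q4-style).
ASSUMED_32B = dict(params_billion=32, bits_per_weight=4.5,
                   layers=64, kv_heads=8, head_dim=128)

for context in (4_096, 32_768):
    need = total_gib(context_tokens=context, **ASSUMED_32B)
    verdict = "fits" if need <= 24 else "does not fit"
    print(f"context {context:>6}: {need:.1f} GiB, {verdict} in 24 GiB")
```

Under these assumptions the weights take about 16.8 GiB, and the KV cache grows from 1 GiB at 4,096 tokens to 8 GiB at 32,768 tokens. The same model that fits a 24GB card at short context therefore no longer fits at long context, which is the failure mode the list warns about. Treating the card's 24GB as 24 GiB is a simplification.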

The distinction between deployment models matters significantly. A business using a self-hosted DeepSeek model on its own infrastructure faces different privacy and security considerations than an individual using the public web app. Cloud providers, resellers, and API implementations each introduce different data processing roles and responsibilities.

Chen's survey and his unconventional authorship approach underscore a broader truth: AI research agents are no longer theoretical. They are operational, measurable, and reshaping how scientific work gets done. The question is no longer whether autonomous research agents exist, but how quickly they will advance through the autonomy levels and what safeguards, governance, and quality controls will be necessary as they do.