Andrej Karpathy's AutoResearch Hit 66,000 GitHub Stars in Weeks. Here's Why Engineers Are Obsessed.
AutoResearch is an autonomous machine learning experimentation system that Andrej Karpathy released in March 2026, achieving 66,000 GitHub stars and 9,600 forks within a month. The tool works by pointing a coding agent at a minimal language model training setup, then letting it run an indefinite loop: read the code, propose a change, run a 5-minute training job, measure results, commit improvements, and roll back failures. No human interaction required. You set it running at night, and by morning you have a complete git history of validated experiments.
What Makes AutoResearch Different from Standard Automation Scripts?
Traditional machine learning automation requires engineers to define the search space explicitly: try these hyperparameters, try these architectures, in this order. AutoResearch operates fundamentally differently. The agent reads the training code, forms a hypothesis using its knowledge of deep learning literature, makes a targeted code change, and evaluates the result. The search space is whatever the agent can think to try.
Karpathy baked one additional research philosophy into the system: a tiny improvement that adds ugly complexity isn't worth keeping, but deleting code while maintaining performance is always a win. This design choice reflects how human researchers actually think about code, not just how algorithms optimize metrics.
How Does the Ratchet Loop Prevent Regression?
The core mechanism that makes AutoResearch work is what the community calls the "ratchet loop," which uses git rollback to ensure one-way improvement. The agent runs on a dedicated git branch. After each experiment, if the validation metric improves, the change is committed and becomes the new baseline. If the metric stays the same or worsens, git instantly reverts the change. The codebase can only move forward; no regression ever persists.
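The mechanics are easy to picture in a few lines. The following is a minimal sketch, not the project's actual code: `run_experiment` and `ratchet_step` are illustrative names for "launch the 5-minute training job and return val_bpb" and "keep or discard the agent's edit."

```python
import subprocess

def git(*args):
    """Run a git command in the experiment branch's working tree."""
    subprocess.run(["git", *args], check=True)

def ratchet_step(run_experiment, best_bpb):
    """One ratchet iteration: keep the agent's edit only if val_bpb improves.

    Assumes the agent has already edited train.py, and that run_experiment()
    launches the fixed-budget training job and returns validation bits per
    byte (lower is better).
    """
    new_bpb = run_experiment()
    if new_bpb < best_bpb:
        # Improvement: commit the edit so it becomes the new baseline.
        git("commit", "-am", f"val_bpb {best_bpb:.4f} -> {new_bpb:.4f}")
        return new_bpb
    # Same or worse: discard the edit and stay at the previous baseline.
    git("checkout", "--", "train.py")
    return best_bpb
```

Everything interesting lives in the comparison: there is no scoring model, no replay buffer, just "did the number go down, yes or no."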
This design has a known structural limitation: the agent cannot take a step backward to set up a larger gain later. Human researchers routinely reason that a change will hurt performance in the short term but enable a bigger improvement downstream. The ratchet prevents this kind of strategic sacrifice. It's an optimization pressure that finds local improvements reliably, not one that explores broadly.
What Were the Real-World Results?
Karpathy's own two-day extended run produced 700 experiments, with the agent stacking 20 additive improvements that dropped the "Time to GPT-2" benchmark from 2.02 hours to 1.80 hours. The metric used is validation bits per byte (val_bpb), which measures how efficiently the model compresses text. Lower is better, and the metric is vocabulary-size-independent, meaning the agent can try different tokenizer configurations without breaking the measurement.
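Bits per byte is just a change of units on the cross-entropy loss. A small sketch of the conversion, assuming you already have the summed validation loss in nats and the byte length of the validation text (the variable names are illustrative):

```python
import math

def val_bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats, over all validation tokens)
    into bits per byte of the underlying text.

    Dividing by raw byte count rather than token count is what makes the
    number tokenizer-independent: shrinking or growing the vocabulary
    changes the token count but never the denominator.
    """
    total_bits = total_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes
```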
The community response was immediate and substantial. Within a month of release, the awesome-autoresearch curated list tracked dozens of forks and derivative projects applied to domains outside machine learning, suggesting the underlying pattern has broader applicability.
How to Set Up and Run AutoResearch
- Prepare Your Codebase: The system operates on a stripped-down single-GPU language model training implementation called nanochat, with the key file train.py containing approximately 630 lines of Python. The scope is deliberately small: one file, one GPU, one metric.
- Create Your Instruction File: The human's entire contribution to the running system is a Markdown file called program.md, which tells the agent what it's allowed to modify (only train.py), what it must never do (modify prepare.py, which handles data and evaluation), and how to interpret results.
- Let It Run Indefinitely: The official program.md explicitly states in all caps: "Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped."
- Review the Git History: In the morning, you have a complete audit trail showing every successful experiment, exactly what the agent changed, and why each change improved the metric.
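Reviewing the run can be as simple as reading the branch's log. A minimal sketch using plain git via subprocess; the branch name "autoresearch" and the one-commit-per-kept-experiment convention are assumptions for illustration, not guarantees of the project:

```python
import subprocess

def experiment_log(branch: str = "autoresearch") -> list[str]:
    """Return one summary block per committed (i.e., successful) experiment.

    If the agent commits every kept change to a dedicated branch, the
    branch log is the audit trail of validated improvements.
    """
    out = subprocess.run(
        ["git", "log", "--oneline", "--stat", branch],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

for line in experiment_log():
    print(line)
```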
The current repository requires a single NVIDIA GPU, explicitly tested on H100. The README states without qualification: "This code currently requires that you have a single NVIDIA GPU." CPU, Apple Silicon (MPS), and AMD GPU paths are mentioned as "in principle possible" but not implemented, as adding them would "bloat the code." Community forks exist for lower-compute platforms, but those are not the main project.
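Before kicking off an overnight run, it is worth failing fast if the environment does not match that requirement. A tiny check, assuming PyTorch (which this style of training code typically uses):

```python
import torch

# The repo targets a single NVIDIA GPU; stop early if that's not what we have.
assert torch.cuda.is_available(), "A CUDA-capable NVIDIA GPU is required"
props = torch.cuda.get_device_properties(0)
print(f"Using {props.name} ({props.total_memory / 1e9:.0f} GB)")
```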
What Can AutoResearch Change, and What Is Off-Limits?
AutoResearch can modify everything in train.py: model architecture, optimizer choice, hyperparameters, training loop logic, batch size, and learning rate schedule. The constraint is that prepare.py remains read-only, preserving the integrity of the evaluation function and metric. Every experiment trains for exactly 5 minutes of wall-clock time, regardless of what the agent changes. Whether it tries a small model with a huge batch size or a large model with fewer steps, the time cost is identical.
This creates two important properties. First, experiments are directly comparable; architecture changes, optimizer changes, and hyperparameter changes are all evaluated on the same basis. Second, the system finds the optimal model for your specific hardware. An H100 and a consumer GPU will produce different results, as the agent discovers what works best on the machine it's running on, not what works best on paper. The tradeoff is that results aren't comparable across machines. If Karpathy runs AutoResearch on an H100 and you run it on a 3090, the winning configurations will differ, and the val_bpb numbers cannot be directly compared.
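The fixed wall-clock budget is easiest to picture as a loop condition rather than a step count. A minimal sketch of the idea, not the project's actual training loop; `train_step` and `evaluate` are stand-in callables:

```python
import time

TRAIN_SECONDS = 5 * 60  # every experiment gets the same wall-clock budget

def train_for_budget(train_step, evaluate):
    """Run training steps until the time budget is spent, then evaluate.

    Because the budget is wall-clock time, a small fast model simply gets
    more steps than a large slow one; the cost of every experiment is
    identical, which is what makes their val_bpb results comparable
    on a given machine.
    """
    start = time.monotonic()
    while time.monotonic() - start < TRAIN_SECONDS:
        train_step()
    return evaluate()
```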
"There are no state graphs, no orchestration frameworks, no tool schemas. The agentic loop is English text. The agent's context window is the state machine. Git is the version control and rollback mechanism. This is a design bet that a sufficiently capable LLM, given clear written instructions and the ability to run code, doesn't need additional scaffolding,"
Andrej Karpathy, as described in AutoResearch documentation
The system represents a fundamentally different approach to machine learning optimization. AutoML systems such as Optuna, Ray Tune, and neural architecture search frameworks work by searching a predefined configuration space; AutoResearch instead lets the agent propose code changes using its knowledge of the ML literature. There is no predefined search space. The agent can propose adding new features, removing unnecessary complexity, or restructuring the training loop entirely.
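The contrast is easiest to see next to a conventional AutoML setup. With Optuna, for example, the human enumerates every dimension the search is allowed to touch up front; roughly like this sketch, where `run_training` is a hypothetical stand-in for a real training run:

```python
import optuna

def run_training(lr: float, n_layer: int) -> float:
    """Stand-in for a real training run; returns the validation metric."""
    return 1.0  # placeholder so the sketch runs; a real objective would train here

def objective(trial):
    # The human defines, in advance, every knob the optimizer may turn.
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    n_layer = trial.suggest_int("n_layer", 4, 12)
    return run_training(lr=lr, n_layer=n_layer)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
```

AutoResearch has no analogue of those enumerated suggestions: its "search space" is whatever edit to train.py the agent can justify and the ratchet can verify.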
The release of AutoResearch signals a shift in how researchers think about the relationship between humans and AI agents. Rather than treating agents as tools that execute predefined tasks, Karpathy's design treats them as research collaborators that can form hypotheses, run experiments, and learn from results, all while maintaining human oversight through clear written instructions and git-based version control.
" }