Logo
FrontierNews.ai

Open R1: How Hugging Face Just Proved DeepSeek's Reasoning Model Can Be Rebuilt From Scratch

Hugging Face has successfully reproduced DeepSeek-R1's breakthrough reasoning capabilities using only open-source tools, public datasets, and standard computing hardware. The project, called Open R1, released a fully functional 7-billion-parameter model that performs as well as or better than DeepSeek's own distilled version on several major benchmarks. This matters because it proves that one of AI's most impressive recent advances wasn't locked behind proprietary secrets or exotic infrastructure, but rather a recipe that the broader research community can now study, modify, and build upon.

When DeepSeek released R1 in January 2025, it demonstrated a major leap forward in how AI models reason through complex problems step by step. But the Chinese lab kept the training data, reward models, and reinforcement learning pipeline under wraps. Researchers could use the finished model, but they couldn't understand or reproduce how it was built. Open R1 closes that gap entirely.

What Exactly Did Hugging Face Rebuild?

The Open R1 project broke the reproduction challenge into three phases. The first phase, now complete, focused on replicating the distilled versions of R1 by extracting high-quality reasoning examples from DeepSeek-R1 itself. The second phase aims to recreate R1-Zero, the base model that learns reasoning purely through reinforcement learning without any supervised training. The third would demonstrate that the entire pipeline works end-to-end.

The centerpiece of the completed first phase is the Mixture-of-Thoughts dataset, released in May 2025. It contains 350,000 reasoning traces distilled from DeepSeek-R1, covering mathematics, coding, and science problems. Crucially, each trace is verified, meaning the dataset isn't just raw model outputs but curated examples that actually produce correct answers. This dataset teaches language models to reason through problems step by step, much like a student showing their work.

The team trained their model, OpenR1-Distill-7B, on the same base model as DeepSeek's distilled version: Qwen2.5-Math-7B. The training ran on eight H100 GPUs with 80 gigabytes of memory each, using supervised fine-tuning with the Mixture-of-Thoughts dataset. The results are striking. On AIME 2024, a challenging math competition benchmark, OpenR1-Distill-7B scored 52.7 compared to 51.3 for DeepSeek's model. On GPQA Diamond, a science reasoning test, it scored 52.8 versus 52.4. On LiveCodeBench v5, a coding benchmark, it scored 39.4 versus 37.4.

How Does This Change What Researchers Can Do?

The practical impact is substantial. Any researcher with access to eight high-end GPUs can now reproduce these results using publicly available code and datasets. The project released not just the trained model weights but also the complete training scripts, configuration files, and detailed setup instructions. This transparency means the research community can experiment with variations, test improvements, and build new applications without waiting for proprietary labs to release new versions.

The team also released auxiliary datasets that extend the approach. The CodeForces-CoTs dataset, containing 10,000 competitive programming problems with 100,000 solutions distilled from R1, allows smaller models to outperform Claude 3.7 Sonnet on extremely difficult olympiad-level problems. A 32-billion-parameter model trained on this data actually outperforms DeepSeek-R1 itself on these benchmarks. The OpenR1-Math-220k dataset, with 220,000 math reasoning traces, produces models that match DeepSeek's performance on pure mathematics tasks.

Why Is This Reproduction So Technically Difficult?

Reproducing DeepSeek's results required solving dozens of engineering challenges that research papers typically gloss over. DeepSeek's technical report describes a multi-stage pipeline involving cold-start data collection, reinforcement learning with rule-based rewards, rejection sampling, and distillation. Each stage demands careful implementation. The training scripts use GRPO (Group Relative Policy Optimization), a specialized training technique that scales across multiple computing nodes with support for sandboxed code execution. The repository includes Slurm scripts for multi-node training, YAML configuration files for different model sizes, and detailed instructions for setting up the exact software versions required.

The team made deliberate choices about implementation details. They use ChatML as the default chat template during training, even for base models like Llama that don't natively support it. They had to override the chat template for DeepSeek's distilled models because those models omit the contents of the reasoning tags, which interferes with the format reward function. These are the kinds of practical decisions that separate a working reproduction from a failed attempt.

Steps to Understand Open R1's Impact on AI Development

  • Reproducibility Verification: The project demonstrates that DeepSeek's reasoning breakthrough isn't dependent on proprietary infrastructure or secret training data, but rather a replicable methodology that works on standard hardware available to most research institutions.
  • Open Dataset Availability: The release of 350,000 verified reasoning traces, competitive programming solutions, and math problem datasets gives researchers concrete training material to build their own reasoning models without relying on proprietary sources.
  • Transparent Training Pipeline: Complete training scripts, configuration files, and setup instructions allow any team with sufficient computing resources to reproduce the results and experiment with variations or improvements to the approach.

What Remains Unproven?

The open question now is whether the second phase will succeed. Reproducing R1-Zero, the pure reinforcement learning pipeline that learns reasoning without any supervised training, requires curating large-scale datasets that provide reliable reward signals. DeepSeek used rule-based reward functions that check whether answers are correct and properly formatted, rather than learned reward models. This approach is simpler in theory but still demands the right data at scale. The Open R1 project has not announced a timeline for tackling this second phase.

The significance of what Open R1 has accomplished is this: it took a model that the community could only use and transformed it into a pipeline that the community can study, modify, and build upon. The Mixture-of-Thoughts dataset and the OpenR1-Distill-7B model weights are available on Hugging Face. The training scripts are in the public repository. The gap between what frontier AI labs can build internally and what the open-source community can replicate is visibly shrinking.

" }