The Great Open-Source AI Divide: Why 5.6 Million Projects Hide a Deployment Reality Check
Open-source AI is exploding in supply but concentrated in actual use. Stanford's 2026 AI Index reports 5.6 million open-source AI projects across GitHub and Hugging Face, yet a scan of 50 million live domains reveals that Botpress, the most-deployed open-source AI framework, runs on just 1,558 websites compared to 52,682 for OpenAI's closed API. That roughly 34-to-1 ratio exposes a critical gap between what developers build and what organizations actually ship to customers.
Why Does the Project Count Explode While Deployments Stay Flat?
The explosion in open-source AI repositories is real, but most never leave the lab. Of the 5.6 million projects counted by Stanford, only 206,880 (about 3.7 percent) have earned 10 or more stars, a low bar for community interest. Hugging Face tells a similar story on the model side: roughly half of all models on the hub have fewer than 200 total downloads, while the top 0.01 percent account for nearly half of all downloads.
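The two headline figures can be re-derived from the raw counts quoted in this article. The snippet below is only an arithmetic check; the variable names are illustrative, and the inputs are the numbers as reported, not independently verified data.

```python
# Counts as quoted in the article (not independently verified).
openai_domains = 52_682
botpress_domains = 1_558
total_projects = 5_600_000
projects_with_10_stars = 206_880

# How many OpenAI API domains exist per Botpress domain.
deployment_ratio = openai_domains / botpress_domains

# Share of counted projects that reached 10 or more stars, in percent.
starred_share = projects_with_10_stars / total_projects * 100

print(f"OpenAI-to-Botpress domain ratio: {deployment_ratio:.1f} to 1")
print(f"Projects with 10+ stars: {starred_share:.1f}%")
```

The ratio works out to about 33.8 to 1, which rounds to the 34-to-1 figure, and the starred share to about 3.7 percent.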
The primary driver of this explosion is AI agents themselves. GitHub's Octoverse 2025 report found that about 60 percent of the year's fastest-growing projects were AI-focused, with more than 1.1 million public repositories pulling in a large-language-model SDK, up 178 percent year over year. AI agents participate in 14.9 percent of all public pull requests, and some agent projects, such as Aider, report that more than 70 percent of their own code is written by AI.
Yet here is the paradox: these agents accelerate supply in terminals, CI pipelines, and code editors, not on public websites. They widen the divide between projects created and projects deployed to real users.
Who Actually Deploys Open-Source AI, and at What Scale?
The companies running open-source AI frameworks are overwhelmingly small. Of the 899 companies matched to Botpress deployments via LinkedIn, 64 percent have 10 or fewer employees, and 81 percent have fewer than 50. This is not an enterprise phenomenon. It is an indie and startup story.
Closed APIs, by contrast, dominate the visible web. MIT Sloan research cited in the TechnologyChecker analysis found that closed models take roughly 80 percent of all model usage. OpenAI's API is detectable on 52,682 domains, a scale that dwarfs any open-source competitor.
The momentum, however, is shifting toward open-source. Hugging Face passed 2 million public models in 2026, and model uploads more than tripled between 2023 and 2025, reaching 332,000 in a single quarter. In May 2026, IBM and Red Hat pledged $5 billion to open-source AI, signaling that enterprise backing is arriving even if enterprise deployment has not yet followed at scale.
How Are Developers Building AI Systems Today?
For engineers entering the AI space, the architecture has become clearer. A backend engineer's mental model of modern AI breaks into three layers: large language models (LLMs) as the reasoning engine, retrieval-augmented generation (RAG) to connect those models to real-time data, and AI agents to orchestrate multi-step tasks.
LLMs like OpenAI's GPT-4o, Google Gemini, Anthropic's Claude, and Meta's Llama 3 are trained on fixed datasets and, on their own, cannot access information beyond that training data. RAG addresses this by retrieving relevant external data and adding it to the prompt before the model responds, turning a model that only knows what it studied into one that can draw on current information.
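The retrieve-then-prompt idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production pipeline: the `DOCUMENTS` list, the bag-of-words cosine scoring, and the function names are all hypothetical stand-ins (real systems typically use learned embeddings and a vector database), and the final call to a model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base standing in for "external data".
DOCUMENTS = [
    "Botpress was detected on 1,558 websites in the domain scan.",
    "OpenAI's API was detected on 52,682 domains in the same scan.",
    "Hugging Face passed 2 million public models in 2026.",
]

def tokenize(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, documents: list, k: int = 1) -> list:
    """Return the k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, documents: list, k: int = 1) -> str:
    """Place the retrieved context ahead of the question for the model."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How many websites run Botpress?", DOCUMENTS))
```

The prompt produced here would then be sent to whichever LLM the team uses; the model answers from the retrieved text rather than from its training data alone.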
The Model Context Protocol (MCP), introduced by Anthropic as an open-source standard, acts as the connector between AI models and external tools, databases, and file systems. Think of it as the USB-C port of AI: a standardized way for models to securely fetch data from GitHub, Google Drive, Slack, or internal databases without custom integrations for each source.
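To make the "standardized connector" idea concrete, here is a rough sketch of what a tool invocation can look like on the wire. MCP messages follow JSON-RPC 2.0, and tool invocations use a `tools/call` method; the tool name `search_issues` and its arguments below are hypothetical, and a real client would normally use an official SDK that also handles initialization, capability negotiation, and transport.

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool and arguments, for illustration only.
request = make_tool_call(1, "search_issues", {"repo": "example/repo", "query": "bug"})
print(json.dumps(request, indent=2))
```

Because every server accepts the same message shape, the client code does not change when the data source behind the tool does.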
An AI agent combines four components:
- The Brain: An LLM provides core reasoning and language understanding, but it is fundamentally reactive; you give it a prompt, and it responds.
- Planning and Reflection: Agents use patterns like ReAct (Reason and Act) to think out loud, breaking complex goals into steps and adjusting if early attempts fail.
- Memory: Short-term memory tracks what the agent has done in the current session, while long-term memory via vector databases remembers preferences and past interactions across sessions.
- Tools: APIs and MCP connections give agents hands to execute plans, whether writing code, querying databases, reading files, or sending emails.
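The four components above can be sketched as a small loop. This is a simplified illustration in the spirit of ReAct, not any framework's actual implementation: `fake_llm` is a scripted stand-in for a real model call, `lookup_deployments` is a hypothetical tool, and long-term memory (for example, a vector database) is omitted.

```python
from typing import Callable

# Tools: plain functions the agent may call (hypothetical example).
def lookup_deployments(name: str) -> str:
    data = {"botpress": "1,558 websites", "openai": "52,682 domains"}
    return data.get(name.lower(), "unknown")

TOOLS: dict[str, Callable[[str], str]] = {"lookup_deployments": lookup_deployments}

def fake_llm(goal: str, memory: list[str]) -> dict:
    """Stand-in for the LLM 'brain': picks the next step from goal and memory."""
    if not memory:
        return {"thought": "I need the deployment figure first.",
                "action": "lookup_deployments", "input": "Botpress"}
    return {"thought": "I have what I need.",
            "action": "finish", "input": f"Botpress runs on {memory[-1]}."}

def run_agent(goal: str, max_steps: int = 5) -> str:
    memory: list[str] = []  # short-term memory: observations from this session
    for _ in range(max_steps):
        step = fake_llm(goal, memory)           # reason: decide what to do next
        if step["action"] == "finish":
            return step["input"]                # final answer
        observation = TOOLS[step["action"]](step["input"])  # act: call the tool
        memory.append(observation)              # remember the result
    return "Stopped: step limit reached."

print(run_agent("How widely is Botpress deployed?"))
```

The step limit is a deliberate design choice: because the model decides when to stop, a cap keeps a confused agent from looping indefinitely.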
What Does This Mean for Teams Building AI Systems?
The gap between open-source supply and deployment reveals that building with AI is not the same as shipping with AI. A developer can clone Aider, OpenHands, or SWE-agent from GitHub today and use them in a terminal or CI pipeline. But getting those tools into production on a public website, serving customers at scale, requires infrastructure, security, and operational maturity that most indie teams do not yet have.
For organizations evaluating document parsing and data extraction, the landscape has also matured. Tools like LlamaParse, Docling, Google Document AI, Amazon Textract, and Azure AI Document Intelligence now offer specialized capabilities for complex PDFs, tables, and handwritten forms. The choice depends on whether teams prioritize local execution for privacy, cloud-scale processing, or integration with existing cloud platforms like Google Cloud, AWS, or Azure.
The real story is not that open-source AI is failing. It is that open-source AI is becoming the commons while closed APIs remain the default for production systems. As IBM and Red Hat's $5 billion commitment signals, enterprise adoption of open-source AI is coming, but it will take time for deployment infrastructure to catch up with the explosion in projects and models.