How Stable Diffusion Became a Billion-Dollar Company Without Asking Permission
Stable Diffusion and other AI image generators were trained on billions of images collected from the internet without permission from, or payment to, the photographers and artists who made them. The dataset behind these tools, called LAION, pulled images from Getty Images, Flickr, Behance, Pinterest, and countless websites between 2008 and 2021. If you posted photographs online during that period, there is a significant chance your work ended up in the training data that powered a company soon valued at $1 billion.
What Went Wrong in the AI Training Pipeline?
The ethical failure behind Stable Diffusion's creation was not the result of malicious intent or corporate greed. Instead, it emerged from a system with no guardrails. Academic researchers were exploring a legitimate research question about image synthesis using LAION, a dataset assembled for research purposes. The creators of LAION explicitly warned against using the data for commercial products. The dataset itself was built under academic fair use assumptions, the same legal framework that allows university libraries to archive content for research purposes.
The problem was that these academic tools were released as open public resources with no access controls. Anyone could download the data and models, whether they were researchers, students, or startups. When Stability AI released Stable Diffusion 1.4 in 2022, the company's profile changed overnight. Within two months, the firm was valued at $1 billion. The commercialization happened faster than anyone anticipated, and by the time questions arose about whether fair use assumptions still applied, billion-dollar valuations were already in place.
How Did Academic Research Become a Commercial Powerhouse?
The researchers who built Stable Diffusion were not sitting around plotting how to mine photographers' work for profit. They were academics trying to answer a research question about image synthesis. Stability AI stepped up to pay the computing costs of training new models, positioning itself as a patron of open-source AI research rather than a product company. The transformation from research project to commercial venture then happened almost as quickly as the same shift across the broader AI industry.
This pattern mirrors what occurred with large language models (LLMs), which are AI systems trained on vast amounts of text to generate human-like responses. LLMs were also built on datasets collected without permission. Common Crawl, a non-profit archive created to make web data freely available to researchers, collected data from 3.1 billion web pages, including blog posts, forum comments, Reddit threads, personal websites, newspaper articles, and nearly 200,000 copyrighted books. No one asked for permission, no one was compensated, and privacy laws cannot remove your content once training is complete.
What Are the Key Ethical Problems With AI Training Data?
- Lack of Consent: Billions of images and text passages were collected and used to train AI models without the knowledge or permission of photographers, writers, journalists, and other creators whose work was included.
- No Compensation: Once training is complete, individual creators cannot be compensated because their work has been broken down into statistical weights within neural networks, making it impossible to identify discrete contributions.
- Irreversible Integration: Privacy laws cannot remove your content once it has been baked into deployed models, even if you block future data collection from your website or accounts.
- Speed Outpaced Regulation: Venture capital funding moved at internet speed while legal and regulatory frameworks could not keep up, allowing commercialization to happen before ethical questions were adequately addressed.
The underlying issue is that the neural networks powering Stable Diffusion and similar tools were built on the accumulated wealth of human knowledge and creativity. Once an image or piece of writing is broken down into the statistical weights of a neural network, individual works do not retain discrete value. Determining who should be paid and how to divide compensation across billions of contributions becomes practically impossible.
What Solutions Are Being Proposed?
Senator Bernie Sanders recently proposed a different answer to the question of who should benefit from AI-generated wealth. His argument centers on the idea that the foundation of AI is humanity's collective intelligence. Sanders stated that AI models were built on "our books, songs, artwork, journalism, computer code, scientific research, videos, conversations, images, and ideas spanning generations." His proposed American AI Sovereign Wealth Fund Act would create a public fund by imposing a one-time 50 percent tax on the stock of OpenAI, Anthropic, and other major AI companies, giving ordinary Americans voting rights, board representation, and eventually dividend checks.
Whether or not you agree with the specific mechanism, the underlying logic is difficult to dismiss. If the models were built on our collective output, perhaps the wealth they generate should not flow exclusively to a handful of people and their investors. This proposal represents one attempt to address the fundamental question of fairness that emerged from Stable Diffusion's rapid rise from academic research to a billion-dollar company built on data used without consent.
The story of Stable Diffusion reveals a critical gap between the speed of technological innovation and the pace of ethical and legal frameworks. Academic researchers created tools with good intentions, but the absence of guardrails in the system allowed those tools to be commercialized in ways that raised serious questions about consent, compensation, and the distribution of wealth generated by AI. As the AI industry continues to grow, these questions will likely become even more urgent.