FrontierNews.ai

21 Million Copyrighted Songs Are Circulating in AI Training Datasets: Here's What's at Stake

Four datasets holding more than 21 million music recordings are circulating among artificial intelligence developers, with tracks from major artists like Taylor Swift, Billie Eilish, and the Beatles mixed alongside work from independent musicians. The collections were identified by The Atlantic's Alex Reisner and represent a significant flashpoint in the ongoing battle between the music industry and generative AI companies over training data, copyright, and artist compensation.

What Are These Music Datasets and Where Did They Come From?

The four datasets vary dramatically in size and origin. Two contain roughly 100,000 recordings each, while the other two are substantially larger, holding approximately 9 million and 12 million tracks respectively. Together, they span decades of music history and include household names alongside tens of thousands of lesser-known independent artists.
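As a quick sanity check on the headline figure, the approximate sizes above can be summed. This is a minimal sketch in Python: the dictionary keys are placeholder labels chosen for illustration, not official dataset names, and the counts are the rounded figures quoted above rather than exact totals.

```python
# Approximate track counts as quoted in the article (rounded, not exact).
# The keys are illustrative placeholder labels, not official dataset names.
approx_sizes = {
    "smaller_dataset_a": 100_000,
    "smaller_dataset_b": 100_000,
    "larger_dataset_a": 9_000_000,
    "larger_dataset_b": 12_000_000,
}


def total_tracks(sizes: dict[str, int]) -> int:
    """Sum the approximate track counts across all datasets."""
    return sum(sizes.values())


total = total_tracks(approx_sizes)
print(f"{total:,} tracks")  # 21,200,000 tracks
```

The rounded inputs give about 21.2 million, consistent with the "more than 21 million" figure. The sum does not account for any tracks that appear in more than one dataset.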

Two of the datasets have publicly documented origins. The largest is LAION-DISCO-12M, a collection of more than 12 million tracks released in November 2024 by LAION, a German non-profit organization that compiles open datasets for artificial intelligence research. LAION is also behind the dataset used to train Stability AI's Stable Diffusion image generator. The organization explicitly states the music collection was "released for research purposes" and is intended for use "in academic settings," but warns against deploying its datasets commercially or using them in their original form to create finished products.

The Free Music Archive represents one of the roughly 100,000-track datasets. Published by academic researchers in 2017 as a resource for music-information-retrieval research, it draws on a library directed by WFMU, a freeform radio station in the United States. The catalog consists of tracks that artists released under Creative Commons licenses, which governed free distribution of the music long before generative AI tools began using such material for training.

The Atlantic reported that Google and Stability AI have used tracks from the Free Music Archive. All four datasets have been downloaded several thousand times, but because AI companies rarely disclose their training data, it is not publicly known which companies have used most of them.

How Are Musicians and Record Labels Fighting Back?

The music industry's response has been swift and multifaceted. AI music companies including Suno and Udio are now grappling with at least 12 lawsuits, according to The Atlantic. The litigation began in June 2024, when the Recording Industry Association of America (RIAA), acting on behalf of Universal Music Group, Sony Music Entertainment, and Warner Music Group, sued both companies for what it called "mass infringement" of copyright.

However, the legal landscape has shifted significantly. Universal Music Group settled with Udio in October 2025, announcing a "compensatory legal settlement" plus new recorded-music and publishing licenses for a jointly developed AI platform set to launch in 2026. Under that deal, Udio's service is being moved into what UMG called a "walled garden," with fingerprinting and filtering applied before the new platform launches.

Warner Music Group reached its own settlement and licensing deal with Udio in November 2025, and days later became the first major label to settle with Suno. The Warner-Suno agreement, which the companies called a "first-of-its-kind partnership," also saw Suno acquire the concert-discovery platform Songkick from WMG. As part of that deal, Suno said it would launch "new, more advanced and licensed models" in 2026 and deprecate its current models, with downloads on its free tier replaced by playback and sharing.

Udio has since signed further licensing deals with the independent-label body Merlin in January 2026 and with Kobalt in April 2026. Sony Music remains the only major label still litigating against both Suno and Udio, while Germany's GEMA and Denmark's Koda are also suing Suno.

What Financial Impact Are Musicians Actually Experiencing?

The economic stakes are substantial. A study commissioned by CISAC, the global body for authors' societies, attempted to quantify the losses. The research, carried out by PMP Strategy, estimated that generative AI could take 24 percent of music creators' revenues by 2028. That equates to a cumulative loss of 10 billion euros (approximately $10.5 billion) for creators between 2023 and 2028, with annual losses rising to 4 billion euros by the end of that period. The figures exclude record companies and publishers.

At the individual level, the impact can be devastating. The instrumental duo The American Dollar alleged in a May 2026 lawsuit that Suno had cut its licensing revenue by nearly 80 percent. Their licensing revenue "has been nearly eliminated since the first version of Suno AI was made available to the public," the complaint stated.

The flood of AI-generated music is most visible on streaming services. Deezer reported in April 2026 that it was receiving close to 75,000 fully AI-generated tracks a day, more than 44 percent of all new music uploaded to the platform. That was up from 60,000 a day in January 2026 and just 10,000 when Deezer launched its detection tool in January 2025.
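The Deezer figures also imply a total upload volume and a growth rate that the report does not state directly. The following is a rough sketch in Python, assuming the reported numbers are exact. Because Deezer describes them as "close to" 75,000 and "more than" 44 percent, the implied total is better read as a ceiling than as a precise count. The function names are illustrative.

```python
def implied_total_uploads(ai_tracks_per_day: float, ai_share: float) -> float:
    """Back out total daily uploads from the AI track count and its share."""
    if not 0 < ai_share <= 1:
        raise ValueError("ai_share must be in (0, 1]")
    return ai_tracks_per_day / ai_share


def growth_factor(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


# Figures reported by Deezer, treated here as exact for illustration.
total_per_day = implied_total_uploads(75_000, 0.44)
print(round(total_per_day))           # roughly 170,000 uploads a day in total
print(growth_factor(10_000, 75_000))  # 7.5, i.e. 7.5x since January 2025
print(growth_factor(60_000, 75_000))  # 1.25, i.e. 25 percent since January 2026
```

On these assumptions, Deezer would be receiving on the order of 170,000 new tracks a day in total, and the daily volume of fully AI-generated uploads has grown 7.5-fold in about 15 months.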

How Are Streaming Platforms Responding to AI-Generated Music?

Consumption of AI-generated music remains low, at 1 to 3 percent of total streams, with 85 percent of those streams flagged as fraudulent and demonetized, according to Deezer. Despite the low consumption rates, the volume presents a significant challenge for the industry.
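Combining the two Deezer percentages gives a sense of how small the legitimate audience for AI-generated music currently is. This is a minimal sketch in Python, assuming the 85 percent fraud rate applies uniformly across the 1 to 3 percent range; the function name is illustrative.

```python
def non_fraudulent_ai_share(ai_stream_share: float, fraud_rate: float) -> float:
    """Share of all streams that are AI-generated and not flagged as fraud."""
    return ai_stream_share * (1.0 - fraud_rate)


# Deezer's reported range: AI music is 1 to 3 percent of total streams,
# and 85 percent of those streams are flagged as fraudulent.
low = non_fraudulent_ai_share(0.01, 0.85)
high = non_fraudulent_ai_share(0.03, 0.85)
print(f"{low:.2%} to {high:.2%} of total streams")  # 0.15% to 0.45% of total streams
```

Under that assumption, streams of AI-generated music that are not flagged as fraudulent make up well under half a percent of all listening on the platform.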

Streaming platforms are implementing detection and disclosure measures. Deezer says it was the first streaming platform to detect and tag synthetic tracks at the platform level. Alexis Lanternier, CEO of Deezer, stated that "AI-generated music is now far from a marginal phenomenon." Qobuz, Apple Music, and Spotify have since introduced their own AI-tagging or disclosure measures.

Steps to Understand the AI Music Training Data Landscape

  • Recognize the Dataset Scale: Four major datasets containing over 21 million tracks are actively circulating among AI developers, with two datasets each holding roughly 100,000 recordings and two others containing approximately 9 million and 12 million tracks respectively.
  • Understand the Legal Framework: Major record labels including Universal Music Group and Warner Music Group have shifted from litigation to licensing agreements with AI music companies, while Sony Music continues court battles and independent artists are bringing their own cases.
  • Monitor Financial Impact: Research estimates generative AI could reduce music creators' revenues by 24 percent by 2028, and one instrumental duo alleges that its licensing revenue has fallen by nearly 80 percent since Suno became publicly available.
  • Track Platform Detection Efforts: Streaming services including Deezer, Spotify, Apple Music, and Qobuz are implementing AI-detection and disclosure measures to identify and tag synthetic tracks, though AI-generated music currently represents only 1 to 3 percent of total streams.

The situation reflects a broader tension in the AI industry. Organizations like LAION say their datasets are intended for academic research and explicitly warn against commercial use, yet these collections have been downloaded thousands of times, and at least one, the Free Music Archive, has reportedly been used by Google and Stability AI. The datasets themselves often contain only links to publicly available content rather than the audio files, creating a gray area around responsibility and liability.

Independent artists and musicians' unions have also brought their own cases. The American Federation of Musicians, for example, has sued UMG and Warner, alleging that its members' recordings were licensed to Suno and Udio without compensation or credit. These lawsuits reflect a growing recognition that the current licensing and compensation frameworks were not designed for the scale and speed at which AI systems can absorb and repurpose creative work.

The financial terms of most Suno and Udio settlements have not been disclosed. In fact, Suno is fighting in court to keep the terms of its Warner settlement away from UMG and Sony, which remain active plaintiffs against the company. This opacity makes it difficult for independent artists and smaller labels to understand what compensation models might be available to them or what protections they should demand.