FrontierNews.ai

The Hidden Training Data Behind AI Music Generators Like Suno

AI music generators like Suno are trained on enormous collections of songs, including millions of tracks from major artists, often downloaded without proper licensing or artist consent. A recent investigation uncovered four giant datasets circulating within the AI development community, containing between 100,000 and 12 million songs each, downloaded thousands of times by developers building commercial music-generation tools.

How Are AI Music Generators Trained on Existing Songs?

AI music generators work similarly to text-generating AI systems. They break down training content into tiny audio snippets and learn the patterns of how those pieces fit together. When given a prompt, the model predicts what should come next, generating new audio that mimics the style of its training data. The problem is that this process sometimes produces music that closely resembles songs from the training set.
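The next-piece prediction described above can be illustrated with a deliberately tiny sketch. It is not how Suno or any commercial system is built: real generators use large neural networks, whereas this toy only counts which token follows which. The integer "audio tokens", the two training "songs", and the function names `train_bigram` and `generate` are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict

def train_bigram(sequences):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, length, rng):
    """Build a sequence by repeatedly sampling a likely next token."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this token
        tokens = list(followers)
        weights = [followers[t] for t in tokens]
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out

# Hypothetical "audio tokens": integers standing in for short snippets.
training_songs = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 3, 9, 10, 11],
]
model = train_bigram(training_songs)
generated = generate(model, start=1, length=6, rng=random.Random(0))
print(generated)
```

In this toy, the output either reproduces the first training song exactly or splices its opening onto the ending of the second, because token 3 is the only point where the two songs overlap. That is a simplified picture of why a model that has learned the patterns of its training data can produce audio that closely resembles a song it was trained on.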

The scale of training data is staggering. One dataset contains 12 million tracks, which would take 91 years to listen to from start to finish. Another holds 9 million songs. The two smaller datasets each contain more than 100,000 tracks. These collections include hits from major pop artists such as Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, Elvis Costello, Sheryl Crow, and the Beatles, alongside jazz legends like Miles Davis and John Zorn, plus tens of thousands of minor artists across all genres.
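The 91-year figure can be sanity-checked with simple arithmetic. The article does not state an average track length, so the four-minute value below is an assumption, and `listening_years` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 60 * 24 * 365

def listening_years(track_count, avg_minutes_per_track):
    """Years of nonstop playback for a collection of tracks."""
    return track_count * avg_minutes_per_track / MINUTES_PER_YEAR

# Assumption: an average track runs about four minutes.
ASSUMED_AVG_MINUTES = 4.0

years_12m = listening_years(12_000_000, ASSUMED_AVG_MINUTES)
years_9m = listening_years(9_000_000, ASSUMED_AVG_MINUTES)
print(f"12 million tracks: about {years_12m:.0f} years")
print(f"9 million tracks: about {years_9m:.0f} years")
```

Under that assumption the 12-million-track dataset comes to roughly 91 years of continuous listening, which matches the figure quoted above; a different average track length would shift the result proportionally.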

Where Do These Datasets Come From?

Three of the four discovered datasets are distributed as lists of links to songs on YouTube or Spotify. AI developers download the actual audio using automated tools that can bypass logins, advertisements, and payment mechanisms designed to earn money or subscribers for creators. Using these tools violates the streaming platforms' terms of service. The fourth dataset, the Free Music Archive collection, is distributed directly as MP3 files.

Google has publicly acknowledged using one of these datasets, specifically more than 100,000 songs downloaded from the Free Music Archive, a site that allows free streaming for personal listening but requires payment for commercial use. Stability AI has also used songs from the same dataset. However, because AI companies keep their training data sources secret, claiming they are proprietary information, it remains unclear who else has used the other datasets.

What Evidence Shows AI Music Generators Reproduce Existing Songs?

The consequences of this training approach are becoming visible. Suno, one of the most popular AI music generators, has produced tracks that strongly resemble Michael Jackson's "Thriller," Ed Sheeran's "Shape of You," Chuck Berry's "Johnny B. Goode," Bill Haley and His Comets' "Rock Around the Clock," and B.B. King's "The Thrill Is Gone." These examples come from a lawsuit filed by major record labels against Suno.

In one striking case, Olympic-bound figure skaters performed to an AI-generated song that contained recognizable lyrics from the New Radicals' 1998 hit "You Get What You Give." The AI system had converted the song into Bon Jovi-style arena rock but kept lines like "Every night we smash a Mercedes-Benz." The New Radicals' track appears in two of the discovered datasets.

"The platform uses safeguards to protect against unauthorized distribution, impersonation and manipulations," said Rachel Racusen, a spokesperson for Suno.

Suno's chief product officer stated in a LinkedIn post that reproductions of training data "should not happen," but the company declined to comment on specific tracks or acknowledge the lawsuit.

How Do AI Companies Justify Using Unlicensed Music?

AI companies defend their right to train models on unlicensed music by invoking "fair use" under copyright law. They argue that training AI models does not harm the market for creators' work. However, this is a complex legal claim, and its validity likely depends on the specifics of how an AI system is trained and deployed.

The scale of training data used by major companies is enormous. In 2022, Google trained a model on 44 million tracks, totaling 42 years of music. Suno stated in a 2024 court filing that it trained its models on "essentially all music files of reasonable quality" that it could download from the internet. In 2020, OpenAI scraped 1.2 million songs from the web to train a model called Jukebox that was explicitly designed for generating variations on existing music.
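The Google figures look small next to the 91-year dataset described earlier, so it helps to work out what they imply per track. The sketch below uses only the two numbers quoted above; `implied_seconds_per_track` is a hypothetical helper name, and the interpretation that follows is an inference, not something the article states.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def implied_seconds_per_track(total_years, track_count):
    """Average audio length per track implied by a total duration."""
    return total_years * SECONDS_PER_YEAR / track_count

# Figures quoted in the article: 44 million tracks, 42 years of music.
google_avg = implied_seconds_per_track(42, 44_000_000)
print(f"Implied average: about {google_avg:.0f} seconds per track")
```

The result is roughly 30 seconds per track, which would be consistent with training on short clips rather than full-length songs.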

Key Facts About the AI Music Training Landscape

  • Dataset Scale: The four discovered datasets range from 100,000 to 12 million tracks; the largest would take 91 years to play from start to finish.
  • Source Platforms: Training data comes from YouTube, Spotify, and the Free Music Archive, often downloaded using automated tools that bypass platform restrictions.
  • Artist Coverage: Datasets include major commercial artists like Taylor Swift and the Beatles alongside jazz legends and tens of thousands of minor artists across all genres.
  • Legal Defense: Companies claim fair use protections for training, arguing that AI models do not harm the market for original creators' work.
  • Reproduction Issues: AI-generated music sometimes closely resembles songs from training data, as evidenced by lawsuits from major record labels.

What Is the Impact on Streaming Platforms?

The ease of generating AI music has made it ubiquitous on streaming services. Last September, Spotify removed 75 million "spammy" AI-generated tracks from its service. The streaming platform Deezer recently reported that nearly half of the tracks it receives daily are AI generated. Unlike Spotify, Deezer excludes AI-generated tracks from its algorithmic recommendations and labels albums that include AI tracks, though it does not display labels for individual tracks. Spotify does not label AI-generated music on its platform, nor do YouTube or Amazon Music.

Google is uniquely positioned to take advantage of AI music generation because of its massive existing audience. The tech giant has begun embedding the technology into its products. Google's Gemini AI assistant can now generate 30-second music tracks based on a user's uploaded text, photos, or video. Google also encourages video makers on YouTube to use AI-generated backing tracks rather than licensing music from real musicians. For YouTubers who have gotten in trouble by using copyrighted music inappropriately, Google recently added a "Replace Song" button that will replace the music in their video with an AI-generated track.

Suno and its competitor Udio function as listening platforms much like Spotify or YouTube. Users can describe the music they want to hear, and the sites generate a track in seconds. The songs are mostly mundane but can sound realistic enough that many listeners might struggle to recognize them as AI generated.

To keep their products from generating songs that duplicate existing music, AI companies use detection software. However, neither Suno nor Udio fully prevents users from generating tracks that resemble existing songs, as the ongoing lawsuits indicate.