Report Finds Tens of Millions of Songs in AI Training Datasets

The finding provides concrete evidence of the scale of copyrighted music in AI training data, directly supporting the copyright lawsuits against Suno and Udio filed by major record labels.

Reporting from 1 source: GIGAZINE.

The Atlantic identified four music datasets used for AI training that together contain tens of millions of songs, including copyrighted works by artists such as Taylor Swift and The Beatles. The datasets, most of them distributed as lists of links to YouTube and Spotify, have been downloaded thousands of times. Google and Stability AI have stated that they used them.

The Atlantic's Alex Reisner found four music datasets shared in the AI development community. One dataset contains 12 million songs, another 9 million, and the remaining two each hold over 100,000 songs. The collections include works by Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, Elvis Costello, Sheryl Crow, and The Beatles. Three of the four datasets are collections of links to songs on YouTube or Spotify. AI developers download the songs with automated tools that bypass logins and ads, which violates the platforms' terms of service. The fourth dataset comes from the copyright-free Free Music Archive. Google and Stability AI have acknowledged in research papers that they used these datasets. The Atlantic published a searchable database of the songs involved.

Synthesized by Yomimono from the 1 cited source below, including Japanese-language reporting where cited, then editorially reviewed before publishing.

Sources

GIGAZINE: Tens of millions of songs are being distributed as datasets usable for AI training and are serving as fodder for AI-generated music, report finds
