The murky world of AI training data sets exposed

A new study by the Data Provenance Initiative reveals troubling practices in creating and sharing data sets used to train artificial intelligence systems. These data sets are crucial for developing advanced AI capabilities, but many fail to properly credit sources or lack licensing information, raising legal and ethical concerns.

According to an Oct. 25 report in The Washington Post, the research audited over 1,800 popular data sets from leading AI sites like Hugging Face, GitHub, and Papers With Code. Around 70% either did not specify licensing terms or carried permissions that were mislabeled compared to their creators’ intentions. This leaves AI developers in the dark about the copyright limitations or requirements that may apply when they use these data sets, and without the information they need to comply.

“People couldn’t do the right thing, even if they wanted to,” said Sara Hooker, co-author of the report. The murky licensing demonstrates broader problems in the fast-paced world of AI development, where researchers feel pressure to skip steps like documenting sources as they rush to release new data sets.

Mishandling creators’ licensing terms and permissions has far-reaching consequences

The implications are far-reaching, as these data sets power advanced AI systems like chatbots and language models, including Meta’s Llama and OpenAI’s GPT models. Tech giants face lawsuits over text scraped from books and websites without permission. Critics argue AI companies should pay sources like Reddit for their data, but licensing issues create roadblocks.

Behind the scenes, AI researchers “launder” data by obscuring origins, trying to eliminate restrictions. Leading AI labs reportedly prohibit re-using their models’ outputs for competing AIs but allow some noncommercial uses. However, proper licensing documentation is lacking.

The study aimed to peer inside this opaque ecosystem fueling the AI gold rush. Its interactive tools don’t dictate policies but help inform developers, lawyers, and policymakers. The analysis revealed that most data comes from academia, with Wikipedia and Reddit as top sources. However, data representing Global South languages still comes mainly from North American and European creators and websites.

“Data set creation is typically the least glorified part of the research cycle and deserves attribution because it takes so much work,” said Hooker. By highlighting the need for better practices, the research is a step toward more transparent and ethical AI. But substantial work remains to illuminate the dark side of the data fueling AI’s relentless march into the future.

Featured Image Credit: Photo by Shuki Harel; Pexels; Thank you!

Radek Zielinski
Tech Journalist

Radek Zielinski is an experienced technology and financial journalist with a passion for cybersecurity and futurology.
