OpenAI and Google accused of using YouTube transcripts for AI

TL;DR

  • OpenAI and Google allegedly transcribe YouTube videos for AI training.
  • Report suggests over one million hours of videos transcribed.
  • Potential violation of copyright and YouTube's terms discussed.

OpenAI and Google have reportedly transcribed YouTube videos to harvest text for their AI models, potentially violating creators’ copyrights.

According to an investigation by The New York Times, the tech giants allegedly cut corners to access as much data as possible to train their AI models.

OpenAI researchers are said to have created a speech recognition tool called Whisper, which allows audio transcription from YouTube videos. This can yield new conversational text that would make an AI system smarter.

The inquiry cites several sources who claim that more than one million hours of YouTube videos have been transcribed, despite discussions about how doing so could violate YouTube’s rules. The transcripts were then fed into GPT-4, the advanced AI system powering the most recent version of the ChatGPT chatbot. Google, the parent company of YouTube, was also reported to have transcribed videos to train its own AI models.

OpenAI president Greg Brockman was also personally involved in collecting the videos that were used, the Times writes.

OpenAI’s alleged use of YouTube videos could also breach Google’s policies, which prohibit using its content for “independent” applications and accessing its videos by “automated means” such as robots, botnets, or scrapers.

Google responds to claims

Google told ReadWrite that it had seen “unconfirmed reports” regarding the news. It added, however, that OpenAI and Microsoft would have to answer whether they employ such practices.

“Both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content, and we have a long history of employing technical and legal measures to prevent it. We take action when we have a clear legal or technical basis to do so,” the statement continued.

The company admitted that Google’s models are trained on some YouTube content, in accordance with its agreements with YouTube creators.

The tech firm also updated its privacy policy in July 2023, but said the change did not expand the types of data that Google can use to train its AI models. “Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate.

“This update simply added Bard as an additional example of products that may be developed using such data, and used the more broadly understood term ‘AI models’ rather than ‘language models’. We did not start training on additional types of data based on this language change. It was a change for clarity,” they explained.

In terms of consumer data, Google said it had been clear it did not use its Workspace data to train or improve the underlying generative AI and large language models that power Gemini, Search, and other systems outside of Workspace without explicit permission.

Are tech companies running out of training data?

The report also suggests that OpenAI had depleted its supplies of useful data in 2021, and as a result, discussed transcribing podcasts, audiobooks and YouTube videos to train its next-generation model. By then, it is said that they had mined the computer code repository GitHub, and used up databases of chess moves and data describing high school tests and homework assignments from the website Quizlet.

The Times claims that Google’s legal department asked the company’s privacy team to modify the wording of its policy to broaden the scope of what it could do with consumer data, including data from office tools like Google Docs.

According to the Times, Meta is also facing a shortage of available training data, and in recordings reviewed by the publication, its AI team was heard discussing the unauthorized use of copyrighted materials in an effort to keep pace with OpenAI. Having exhausted “almost every available English-language book, essay, poem and news article on the internet,” the company reportedly contemplated measures such as acquiring book licenses or outright purchasing a major publishing house.

Last week, YouTube CEO Neal Mohan said that using videos on the platform to train an AI model would be a “clear violation” of YouTube’s terms and conditions. His comments came after OpenAI’s CTO said she “didn’t know” whether the company’s Sora video generator was trained on YouTube videos.

Advanced systems created by OpenAI, Google, and others need vast expanses of information to learn. This need is depleting the reservoir of high-quality public data on the internet, especially as certain data owners restrict AI companies’ access. The Wall Street Journal states that there is a 90 per cent chance the demand for high-quality data will outstrip supply by 2028.

OpenAI, Google, and Meta have been approached for further comment.

Featured image: Canva

Suswati Basu
Tech journalist

Suswati Basu is a multilingual, award-winning editor and the founder of the intersectional literature channel, How To Be Books. She was shortlisted for the Guardian Mary Stott Prize and longlisted for the Guardian International Development Journalism Award. With 18 years of experience in the media industry, Suswati has held significant roles such as head of audience and deputy editor for NationalWorld news, digital editor for Channel 4 News and ITV News. She has also contributed to the Guardian and received training at the BBC. As an audience, trends, and SEO specialist, she has participated in panel events alongside Google. Her…
