Thanks to Google, YouTube Belongs to Everyone

Even OpenAI and other AI competitors.

Illustration by Nikhil Kumar

According to a recent report, OpenAI trained GPT-4 on more than a million hours of YouTube video transcripts produced with Whisper, its speech-to-text model. The company has been desperately trying to gather as much data as possible to train its AI models.

This report comes right after a recent interview with OpenAI CTO Mira Murati that did the rounds on the internet. In the video, Murati looked tongue-tied and was unable to specify what data the company used to train Sora, its latest video generation model. The company has been operating in the dangerous territory of AI copyright for quite some time now.

The problem here is that YouTube does not allow AI companies to download videos and transcripts. Neal Mohan, the CEO of YouTube, said that using its videos to train AI models would breach the platform’s terms of service, though he could not be sure whether OpenAI had indeed used them. “It would be a violation,” he added.

“From a creator’s perspective, when they upload their hard work to our platform, they have certain expectations,” Mohan said in an interview. “Lots of creators have different sorts of licensing contracts in terms of their content on our platform.”

OpenAI is not alone

Meanwhile, Mohan claimed that Google too has used portions of YouTube videos to train its Gemini model, and said this complied with the usage policy. Interestingly, the company tweaked the language of its privacy policy to expand what it could do with the data, which is quite shady.

It has been established several times over the past year that YouTube is a gold mine of data for training any multimodal AI model. The catch is that no one except Google, which owns the platform, is allowed to train on YouTube videos.

The New York Times reported that OpenAI exhausted all useful text data in 2021 and has since been desperately trying to get its hands on any data possible. Though Murati said that Sora was trained on publicly available data, it cannot be pinpointed whether that meant YouTube, Facebook, Instagram, or all of them combined. But now, according to the report, at least GPT-4 was trained on YouTube transcripts.

Speaking of Facebook and Instagram, parent company Meta held internal discussions last year about potentially acquiring Simon & Schuster, a publishing house, with the aim of obtaining longer-form content. This information was gleaned from recordings of internal meetings.

This is similar to OpenAI partnering with several news agencies. Google, on the other hand, believes that, as the dominant search engine, it has the right to scrape all the information off the internet. It recently partnered with Reddit for access to its Data API.

Even the entire universe of the internet is not enough for these data-hungry AI models. 

‘Better to ask for forgiveness than permission’

Meta has also discussed the possibility of aggregating copyrighted content from various online sources, despite potential legal repercussions. Participants in those discussions expressed concerns that negotiating licenses with publishers, artists, musicians, and news outlets would be time-consuming.

The demand for data is so great that even licensing copyrighted material is not enough. “The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data,” said Sy Damle, a lawyer who represents Andreessen Horowitz.

OpenAI CEO Sam Altman has been quite vocal about the need for data for AI models, and has said that training would use up all the available data on the internet. This eventually led the company to transcribe YouTube content such as audiobooks and podcasts for high-quality data and information.

Several Google employees were aware that OpenAI used YouTube videos to train its AI models but did not speak up because Google was doing the same. It would have been hypocritical for the company to object. So the future is simple: either no one or everyone will be using YouTube videos to train AI models.

What this would do to creators is still an open question. Altman has clearly said that he wants to compensate artists and creators, but the process isn’t clear to him either. For now, it is all about training on YouTube’s gold mine of data, and then paying off hefty fines (if and when imposed).

But now that YouTube’s data has already been exhausted on GPT-4 and Gemini, we wonder what these companies will train their upcoming models, such as GPT-5, on. They will find a way, legal or illegal, and figure it out later.

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.