A few days ago, Google seemingly put out a warning to OpenAI, saying OpenAI isn’t allowed to train its models on YouTube data.
Days later, the NYT reported that OpenAI, Meta, and Google have all ignored such rules to train their own models.
When AI companies are asked what data they train their models on, the answer is usually something vague about “publicly available data.”
In an interview with The Wall Street Journal, OpenAI’s chief technology officer Mira Murati said she “wasn’t sure” if data from YouTube or other social platforms was used to train Sora. That’s like a chef saying they don’t know what’s in the dish they’re serving you.
When pressed further, she said, “I’m just not gonna go into the details of the data that was used, but it was publicly available data or licensed data.”
But if the answer’s as simple as publicly available data, why are AI companies always so evasive?
Back in November 2023, Ed Newton-Rex, who led Stability AI’s audio team, resigned, stating that he didn’t “agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use.’”
AI companies often use the “fair use” exemption to justify training their models on copyrighted material. However, Newton-Rex argues that creators’ works do suffer as a result of the duplicative content AI models produce.
This is why publishers like The New York Times have terms of service that explicitly prohibit AI companies from using their content to train AI models. But with no federal AI law, enforcing those terms becomes quite the task.
It’s a task the NYT has taken on, filing a lawsuit against OpenAI in December. The publisher joins authors and comedians who have also sued the AI giant for copyright infringement.
OpenAI maintains that they’ve done nothing wrong and have always used publicly available and licensed content.
Newton-Rex told Axios that “publicly available” is a term beloved by AI companies and used to confuse people.
He says it doesn’t mean the creator has given permission to use the content; it just means the content wasn’t illegally obtained through hacking or a similar method.
A recent NYT article reports that the big AI players – OpenAI, Google, Meta – are all cutting corners when it comes to collecting training data.
Should we take a second to feign shock?
Just a few weeks ago, newly released documents from a class-action lawsuit against Meta revealed that back in 2016, Zuckerberg and his team allegedly discussed how to intercept app traffic from Snapchat users.
They executed on this strategy, and later did the same thing with YouTube and Amazon users.
Doing so gave them access to sensitive data, such as usernames, passwords, and app activity.
More recently, Meta employees discussed obtaining copyrighted data despite the risk of lawsuits, in order to avoid the lengthy process of procuring licenses, according to the NYT article.
Google and OpenAI also reportedly engaged in questionable data-gathering practices as a way to stay ahead in the AI race.
What’s happening now is that AI companies are running out of fresh, high-quality data on the internet to train on.
So, AI companies are scrambling to secure big banks of data, either by offering large licensing deals or using other not-so-straight-and-narrow methods.
So, how can creators fight back? Many are going the lawsuit route, but it’s not an easy one.
In February, a federal judge dismissed the majority of the copyright infringement claims made by a group of authors including Ta-Nehisi Coates and Sarah Silverman.
This setback creates a difficult precedent for creatives relying on the law to protect their works.
In the coming months and years, we’ll likely see landmark cases and legislation that will shape how creators approach sharing their work and how AI companies collect their data.