A few days ago, Google seemingly put out a warning to OpenAI, saying OpenAI isn’t allowed to train its models on YouTube data.
Days later, the NYT reported that OpenAI, Meta, and Google have all ignored such rules to train their own models.
When AI companies are asked what data they train their models on, the answer is usually something vague about “publicly available data.”
In an interview with The Wall Street Journal, OpenAI’s chief technology officer Mira Murati said she “wasn’t sure” if data from YouTube or other social platforms was used to train Sora. That’s like a chef saying they don’t know what’s in the dish they’re serving you.
When pressed further, she said, “I’m just not gonna go into the details of the data that was used, but it was publicly available data or licensed data.”
But if the answer’s as simple as publicly available data, why are AI companies always so evasive?
Back in November 2023, Ed Newton-Rex, who led Stability AI’s audio team, resigned, stating that he didn’t “agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use.’”
AI companies often use the “fair use” exemption to justify training their models on copyrighted material. However, Newton-Rex argues that creators’ works do suffer as a result of the duplicative content AI models produce.
This is why publishers like The New York Times have terms of service that explicitly prohibit AI companies from using their content to train AI models. But with no federal AI law, enforcing those terms becomes quite the task.
It’s a task the NYT has taken on, filing a lawsuit against OpenAI in December. The publisher joins authors and comedians who have also sued the AI giant for copyright infringement.
OpenAI maintains that they’ve done nothing wrong and have always used publicly available and licensed content.
Newton-Rex told Axios that “publicly available” is a term beloved by AI companies and used to confuse people.
He says it doesn’t mean the creator has given permission to use the content; it just means the content wasn’t illegally obtained through hacking or a similar method.
A recent NYT article reports that the big AI players – OpenAI, Google, Meta – are all cutting corners when it comes to collecting training data.
Should we take a second to feign shock?
Just a few weeks ago, newly released documents from a class-action lawsuit against Meta revealed that back in 2016, Zuckerberg and his team allegedly discussed how to intercept app traffic from Snapchat users.
They executed on this strategy, and later did the same thing with YouTube and Amazon users.
Doing so gave them access to sensitive data, such as usernames, passwords, and app activity.
More recently, Meta employees discussed obtaining copyrighted data despite the risk of lawsuits, in order to avoid the lengthy process of procuring licenses, according to the NYT article.
Google and OpenAI also reportedly engaged in questionable data-gathering practices as a way to stay ahead in the AI race.
What’s happening now is that AI companies are running out of fresh, high-quality data on the internet to train on.
So, AI companies are scrambling to secure big banks of data, either by offering large licensing deals or using other not-so-straight-and-narrow methods.
So, how can creators fight back? Many are going the lawsuit route, but it’s not an easy one.
In February, a federal judge dismissed the majority of the copyright infringement claims made by a group of authors including Ta-Nehisi Coates and Sarah Silverman.
This setback creates a difficult precedent for creatives relying on the law to protect their works.
In the coming months and years, we’ll likely see landmark cases and legislation that will shape how creators approach sharing their work and how AI companies collect their data.