What does 'publicly available' training data mean to AI companies?

Written by: Martina Bretous

Updated: 04/09/24

A few days ago, Google seemingly put out a warning to OpenAI, saying the company isn’t allowed to train its models on YouTube data.

Days later, the NYT reported that OpenAI, Meta, and Google had all ignored the rules to train their own models.

How AI Companies Dance Around Training Data Questions

When AI companies are asked what data they train their models on, the answer is usually a vague reference to “publicly available data.”

In an interview with The Wall Street Journal, OpenAI’s chief technology officer Mira Murati said she “wasn’t sure” if data from YouTube or other social platforms was used to train Sora. That’s like a chef saying they don’t know what’s in the dish they’re serving you.

When pressed further, she said, “I’m just not gonna go into the details of the data that was used, but it was publicly available data or licensed data.”

But if the answer’s as simple as publicly available data, why are AI companies always so evasive?

Back in November 2023, Ed Newton-Rex, who led Stability AI’s audio team, resigned, stating that he didn’t “agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use.’”

AI companies often invoke the “fair use” doctrine to justify training their models on copyrighted material. However, Newton-Rex argues that creators’ works do suffer as a result of the duplicative content made by AI models.

This is why publishers like The New York Times have terms of service that explicitly prohibit AI companies from using their content to train AI models. But with no federal AI law, enforcing this term becomes quite the task.

It’s one the NYT has taken on against OpenAI, filing a lawsuit against the company in December 2023. The publisher joins authors and comedians who have also sued the AI giant for copyright infringement.

OpenAI maintains that they’ve done nothing wrong and have always used publicly available and licensed content.

Newton-Rex told Axios that “publicly available” is a term beloved by AI companies and used to confuse people.

He says it doesn’t mean the creator has given permission to use the content; it just means the content wasn’t obtained illegally via hacking or a similar method.

Why AI Companies Might Take Shortcuts

A recent NYT article reports that the big AI players – OpenAI, Google, Meta – are all cutting corners when it comes to collecting training data.

Should we take a second to feign shock?

Just a few weeks ago, newly released documents from a class-action lawsuit against Meta revealed that back in 2016, Zuckerberg and his team allegedly discussed how to intercept app traffic from Snapchat users.

They executed on this strategy, and later did the same thing with YouTube and Amazon users.

Doing so gave them access to sensitive data, such as usernames, passwords, and app activity.

More recently, according to the NYT article, employees at Meta discussed obtaining copyrighted data despite the risk of lawsuits in order to avoid the lengthy process of procuring licenses.

Google and OpenAI reportedly also took part in questionable data gathering processes, as a way to stay ahead in the AI race.

What’s happening today is that the supply of fresh data available on the internet for training is dwindling.

As a result, AI companies are scrambling to secure big banks of data, either by offering large licensing deals or using other not-so-straight-and-narrow methods.

So, how can creators fight back? Many are going the lawsuit route, but it’s not an easy one.

In February 2024, a federal judge dismissed the majority of the copyright infringement claims made by a group of authors including Ta-Nehisi Coates and Sarah Silverman.

This setback creates a difficult precedent for creatives relying on the law to protect their works.

In the coming months and years, we’ll likely see landmark cases and legislation that will shape how creators approach sharing their work and how AI companies collect their data.

Topics:

Artificial Intelligence
