About a week ago, Bloomberg reported that Reddit had signed a huge licensing deal ahead of its IPO, allowing an unnamed company to train its AI models on Reddit's data.
A new report says that company is Google, although neither party has confirmed it. If true, this would be Reddit's first content deal.
Since the AI race started, getting access to large, quality datasets has been a top priority.
AI models are trained on data – the more data a model is trained on, the better its output. But beyond quantity, there's also quality: AI companies want access to high-quality data that their competitors ideally can't get.
This is where publishers like Reddit come in.
For a long time, OpenAI and other AI companies were freely roaming through publishers’ data. That was until publishers like The New York Times and Reddit caught on.
Last April, Reddit said, “If you want access to an 18-year deep well of data, you’re going to have to pay up.”
The NYT, on the other hand, just said, “no.” (And they’re suing OpenAI for allegedly still doing it.)
Now close to a year later, Google, Apple, and OpenAI have all signed licensing agreements with huge publishers worth $100+ million.
The latest to join is Reddit, which reportedly signed with Google in a deal worth $60 million annually. The deal likely includes an exclusivity clause ensuring that only Google has access to this data, though that hasn't been confirmed.
Ahead of the IPO, Reddit CEO Steve Huffman shared that the company had signed licensing deals worth over $200 million.
“Reddit’s vast and unmatched archive of real, timely, and relevant human conversation on literally any topic is an invaluable dataset for a variety of purposes, including search, AI training, and research,” wrote Huffman in the company's S-1 filing.
This would also be a huge win for Google, which has been trying to dethrone OpenAI for years.
Some see licensing deals as a win-win: Publishers get paid for their data while AI companies get access to large, quality datasets.
However, it also comes with some setbacks.
Social media platforms like Reddit and X are community forums where people can write just about anything: conspiracy theories, misinformation, and hateful rhetoric.
And although Reddit has content moderators and policies, the site only introduced a ban on hate speech 15 years after it was founded.
Is that what AI models should be trained on?
AI companies can clean their data to filter out this type of content, but there's no clear standard that every model is held to. So, as a consumer, I won't know what data a model was trained on or how well it's been “cleaned.”
So, it raises the question: Should some websites be off the table when it comes to training AI models? And what guardrails are in place to ensure models aren't regurgitating the darkest content on the internet?
These answers are still up in the air.