Yikes, Internal Microsoft Data Leaked While Sharing AI Training Data

Martina Bretous

Ever show someone a specific picture in your Photos album and they start swiping left and right, seeing things they weren’t supposed to? That’s kind of what happened with Microsoft last week.


Their AI research team published AI training data on GitHub, a cloud-based platform where developers store and manage code. Turns out, they also granted public access to 38 terabytes of sensitive information.

Here’s how the error was discovered.


During a routine scan for exposed data, the research team at Wiz, a cloud security startup, found Microsoft's GitHub repository, which included a URL for accessing and downloading AI models for image recognition.

Unbeknownst to Microsoft, the link also granted users access to the entire storage account, including more than 30k internal Teams messages from over 300 employees, passwords, secret keys, computer backups, and other personal data.

[Image: Microsoft's leaked internal Teams messages, exposed through the GitHub repository]

But that's not all: users also had full control over the data itself, meaning they could delete or overwrite files. Any bad actor could have injected malicious code into the models, creating a domino effect impacting every user who downloaded them. Big yikes.

To understand how it happened, we have to get technical for a second.

Azure is Microsoft's cloud computing platform, and its storage service uses Shared Access Signature (SAS) tokens. Think of these tokens as keys that grant access to Azure storage resources, with customizable permissions and expiration dates.

In Microsoft’s case, a token was accidentally included in this publicly accessible URL.
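To make the mechanics concrete, here's a minimal sketch in Python using the azure-storage-blob SDK. The storage account, key, and container names are hypothetical placeholders, and this token is deliberately scoped to read/list with a one-hour expiry. The leaked token was far broader, but the principle is the same: whoever holds the URL gets whatever permissions and lifetime were baked into the token when it was signed.

```python
# Minimal sketch of how an Azure SAS token ends up inside a shareable URL.
# Account name, key, and container are hypothetical placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

ACCOUNT_NAME = "examplestorage"        # hypothetical storage account
ACCOUNT_KEY = "<storage-account-key>"  # never commit this
CONTAINER = "ai-training-data"         # hypothetical container

# A narrowly scoped token: read/list only, expires in one hour.
sas_token = generate_container_sas(
    account_name=ACCOUNT_NAME,
    container_name=CONTAINER,
    account_key=ACCOUNT_KEY,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

# The token is just a query string. Anyone who has this URL gets the
# permissions baked into the token -- no login required.
download_url = f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}?{sas_token}"
print(download_url)
```

The danger is that the token is self-contained: once a URL like this is pasted into a README or a public repo, there's no login step standing between it and anyone who finds it.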

Wiz alerted Microsoft to the issue on June 22, and the token was revoked two days later. Microsoft's Security Research and Defense team says customer data wasn't exposed, and neither was data from other Microsoft services.

Microsoft has also expanded GitHub's secret scanning service to monitor for exposed SAS tokens.

What’s Wiz’s recommendation? Limit the use of SAS tokens, because they’re difficult to monitor.
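For teams that keep using SAS tokens, even a naive check can catch obvious leaks before they reach a public repo. Here's an illustrative sketch (a simple heuristic, not GitHub's actual secret scanning service) that flags lines containing the sv= and sig= query parameters that Azure SAS tokens carry:

```python
# Minimal sketch: flag likely Azure SAS tokens before they reach a public repo.
# An illustrative heuristic only, not GitHub's secret scanning service.
import re
import sys
from pathlib import Path

# Azure SAS query strings include a service version (sv=) and a signature (sig=).
SAS_PATTERN = re.compile(
    r"\bsv=\d{4}-\d{2}-\d{2}\b.*?\bsig=[A-Za-z0-9%+/=]+", re.IGNORECASE
)

def scan(root: str) -> list[tuple[Path, int]]:
    """Return (file, line_number) pairs where a SAS-like token appears."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SAS_PATTERN.search(line):
                hits.append((path, lineno))
    return hits

if __name__ == "__main__":
    findings = scan(sys.argv[1] if len(sys.argv) > 1 else ".")
    for path, lineno in findings:
        print(f"Possible SAS token in {path}:{lineno}")
    sys.exit(1 if findings else 0)
```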

As for the bigger takeaway, this incident makes the case for AI research teams and security teams to work together more closely.

Organizations training AI models are working with far higher volumes of data than ever before. With that volume comes a need for more robust security checks to prevent breaches.

