On Wednesday, Dharmesh stood on stage in front of a crowd of 24,000 people at INBOUND and introduced The Customer Code, one of the core tenets being: “Own your mistakes.” This week, we put that principle to use.
Early Thursday morning EDT we experienced an outage that affected some of our Marketing Hub Enterprise customers. Our engineering team resolved the underlying issue fairly quickly, then spent the remainder of the day resolving the effects of the outage. They met this morning to conduct an analysis of what happened and what we can learn from it.
We are sorry, and I want to provide more detail about what caused this issue and how we are going to prevent it in the future.
The Root Cause
HubSpot just rolled out a substantial number of new features to our Marketing Hub Enterprise customers. With this rollout, we also want to make it possible for customers at the starter and professional levels to try the full enterprise feature set on a self-serve basis.
On Wednesday, our engineering team began the required infrastructure work that would eventually support the improved enterprise trial experience. The plan was to launch the new trial experience before the end of September.
While making the first of these changes, we inaccurately tagged all existing Marketing Hub Enterprise portals as trial portals. This in itself did not cause a problem, as those portals still had their enterprise features and access.
However, many of those portals had originally started as trial portals, and when they were created they were given a trial expiration date by the trial system. That expiration date was never removed when the portal was upgraded to a Marketing Hub Enterprise portal. Again, this was not a problem in itself and is something that has existed in our system for some time.
But on Thursday morning, a daily process designed to turn off expired trials came across these Marketing Hub Enterprise portals, which were now tagged as trials with expiration dates in the past. The process determined that they were expired trials and downgraded them to free portals, removing all enterprise-level functionality from those portals.
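To make the interaction concrete, here is a minimal sketch of how a stale expiration date and a mistaken trial tag can combine in a daily expiry job. It is an illustration only, not HubSpot's actual code: the `Portal` fields, the function names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Portal:
    # Hypothetical fields, chosen only to mirror the description above.
    portal_id: int
    tier: str
    is_trial: bool
    trial_expires: Optional[date] = None


def is_expired_trial(portal: Portal, today: date) -> bool:
    # A portal counts as an expired trial only if it is tagged as a trial
    # AND carries an expiration date in the past.
    return (
        portal.is_trial
        and portal.trial_expires is not None
        and portal.trial_expires < today
    )


def run_daily_expiry(portals: List[Portal], today: date) -> List[int]:
    # Downgrade every expired trial to the free tier; return the affected IDs.
    downgraded = []
    for portal in portals:
        if is_expired_trial(portal, today):
            portal.tier = "free"
            downgraded.append(portal.portal_id)
    return downgraded


# A paid portal that began as a trial and kept its old expiration date.
paid = Portal(1, "enterprise", is_trial=False, trial_expires=date(2017, 3, 1))
today = date(2018, 9, 6)  # illustrative run date

before_tagging = run_daily_expiry([paid], today)  # stale date alone is harmless
paid.is_trial = True                              # the mistaken tagging
after_tagging = run_daily_expiry([paid], today)   # now the job downgrades it
```

Neither condition triggers the downgrade on its own in this sketch; only the combination does, which is why each one looked harmless in isolation.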
The affected portals immediately lost their automation, CMS hosting, ads, email, and other functionality that is part of the enterprise tier. The most serious result of this was that customers’ websites, blogs, landing pages, and forms hosted on HubSpot immediately stopped working. Forms that customers had embedded in external pages did not experience any downtime, and leads from those forms were still captured.
After internal monitoring systems alerted us to outages within our content system, our engineering team began working to identify and resolve the problem. The issue was quickly identified, and our top priority was to restore these portals to the full functionality they had before this error. This process was completed by approximately 7:00 AM EDT Thursday morning.
Phase two was fixing the consequences of the downgrade process. Since it is not common for a downgraded portal to upgrade without human intervention, our system was not prepared to seamlessly and automatically “reupgrade” many of these “new” enterprise portals. As a result, our affected customers experienced lost site settings and pages, and found themselves with disconnected domains and workflows that weren’t executing.
This clean-up work was the most challenging for our team, and it’s where our customers felt the most pain as they waited for key pieces of their marketing stack (like websites and landing pages) to be restored. And as they waited, many of our customers’ customers were seeing an error message that made it appear as if the fault was with our customers, not us (more on how we’re solving for this below).
For the rest of the morning and into the afternoon, we worked to fix all of these issues and restore the sites to the state they were in before the downgrade.
So, why did this take so long? Don’t we have backups, and shouldn’t this have been a simple matter of rolling back by an hour or so?
Yes, we do have backups, and we did use some of them.
If we had simply reverted to backups, however, HubSpot customers that were unaffected by the issue would have lost all of the work completed in the four hours between the changes going live and our discovering the issue, and we had no way to calculate the potential impact this would have on their businesses. So we decided not to rely on backups alone.
Instead, we needed to take a much more surgical approach. We needed to figure out how to revert specific customers’ data back while preserving all other customers’ work. And we had to do this across many different areas of the product while our customers were also working hard to get things up and running for their customers.
The recovery process involved our engineering team building and deploying a series of targeted scripts while also working closely with our customer support team to identify and resolve issues alongside affected customers.
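The idea of a targeted restore can be sketched as follows. This is a simplified illustration under assumed data shapes, not the scripts our team actually ran: `live`, `snapshot`, and `restore_subset` are hypothetical names, and real portal data spans many systems rather than one dictionary.

```python
from copy import deepcopy
from typing import Any, Dict, Iterable


def restore_subset(
    live: Dict[int, Dict[str, Any]],
    snapshot: Dict[int, Dict[str, Any]],
    affected_ids: Iterable[int],
) -> Dict[int, Dict[str, Any]]:
    # Start from the live data so that unaffected portals keep recent work.
    restored = dict(live)
    for portal_id in affected_ids:
        # Only affected portals are rolled back to their snapshot state.
        if portal_id in snapshot:
            restored[portal_id] = deepcopy(snapshot[portal_id])
    return restored


# Portal 1 was downgraded by mistake; portal 2 kept working after the backup.
live = {
    1: {"tier": "free", "pages": []},
    2: {"tier": "professional", "pages": ["new-post"]},
}
snapshot = {
    1: {"tier": "enterprise", "pages": ["home"]},
    2: {"tier": "professional", "pages": []},
}

result = restore_subset(live, snapshot, affected_ids=[1])
```

In this sketch portal 1 returns to its enterprise state while portal 2 keeps the page it published after the snapshot was taken, which a full rollback would have discarded.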
First, I want to be very clear that the website downtime some of our customers experienced is unacceptable. This doesn’t just impact our customers. It impacts their customers and prospects as well.
Something that we emphasize at HubSpot is that we’re going to make mistakes, and that’s natural — but when we make mistakes, we make sure we learn from them to avoid repeating similar issues in the future. Our engineering team is doing everything possible to ensure that this will not happen again.
There are several big changes we’re making to help prevent this in the future. First, in the short term, we’ve removed all the old trial expiration dates from our system so that paid portals can no longer be considered expired. In addition, we’re sunsetting our old system for managing trials and moving to a more modern system that manages trial status in its own database. As part of this modernization, we’re going to simplify how we handle deactivation and make sure we only take simple, reversible actions until a much longer grace period has elapsed. We’re also going to improve our systems to detect anomalous changes to our product configuration code and halt any modifications that make too many changes at one time.
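One way to picture the last of these safeguards is a size check that runs before a configuration change is applied. The sketch below is an assumption about how such a guard could look, not a description of our implementation; the function name, the exception, and the 1% threshold are all hypothetical.

```python
from typing import Iterable


class BulkChangeHalted(Exception):
    """Raised when a single change would touch too many portals at once."""


def check_change_size(
    affected_ids: Iterable[int],
    total_portals: int,
    max_fraction: float = 0.01,  # hypothetical threshold: 1% of all portals
) -> float:
    if total_portals <= 0:
        raise ValueError("total_portals must be positive")
    # Count each portal once, even if it appears several times in the change.
    fraction = len(set(affected_ids)) / total_portals
    if fraction > max_fraction:
        raise BulkChangeHalted(
            f"change touches {fraction:.1%} of portals, limit is {max_fraction:.1%}"
        )
    return fraction


# A small change passes the check.
small = check_change_size([101, 102], total_portals=1000)

# A change that re-tags half of all portals is stopped for human review.
try:
    check_change_size(range(500), total_portals=1000)
    halted = False
except BulkChangeHalted:
    halted = True
```

A guard like this does not decide whether a change is correct. It only forces a pause when the blast radius is unusually large, so that a person can confirm the change is intended.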
Second, we’re going to invest in tools that can quickly revert data for a subset of customers and ensure that systems that update data automatically carefully consider the current state of the data before making any changes. We’re going to work to improve how we handle reactivation so that if a portal loses access due to a legitimate cancellation, expired trial, or bug, we’ll be able to recover that portal back to its full state much more gracefully and quickly.
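As a rough sketch of what "considering the current state of the data" could mean for an automated downgrade, the check below refuses to act on any portal with an active paid subscription, whatever its trial tag says. The field names, including `paid_subscription_active`, are hypothetical and stand in for whatever the billing system of record would report.

```python
from datetime import date
from typing import Any, Dict


def should_downgrade(portal: Dict[str, Any], today: date) -> bool:
    # Check the current state first: a paying customer is never downgraded
    # by the trial expiry job, even if trial metadata looks expired.
    if portal.get("paid_subscription_active", False):
        return False
    expires = portal.get("trial_expires")
    return bool(portal.get("is_trial", False)) and expires is not None and expires < today


today = date(2018, 9, 6)  # illustrative run date

# Mis-tagged paid portal with a stale expiration date: left alone.
paid_portal = {
    "is_trial": True,
    "trial_expires": date(2017, 3, 1),
    "paid_subscription_active": True,
}

# Genuine trial that has lapsed: eligible for downgrade.
lapsed_trial = {
    "is_trial": True,
    "trial_expires": date(2018, 8, 1),
    "paid_subscription_active": False,
}

paid_result = should_downgrade(paid_portal, today)
trial_result = should_downgrade(lapsed_trial, today)
```

The design choice here is to make the destructive action depend on more than one signal, so that a single bad tag cannot remove a paying customer's access.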
Third, and this is a small but important change, we’re going to fix how we communicate errors to our customers’ customers so that no error message ever makes our mistake appear to be our customers’ mistake.
Again, we’re sorry.
If you’re experiencing any additional issues related to this outage, please call HubSpot Support at 1-888-HUBSPOT x3.
Originally published Sep 7, 2018 3:06:37 PM, updated September 12, 2018