Last week, HubSpot had a significant issue with our service due to the failure of one of the infrastructure systems that supports many parts of our platform. For those affected, we are so sorry for the pain this caused.
We’ve spent the last week focusing on recovery and talking with affected customers. Today I want to provide more information about what led to the issue, the ensuing events, our response, and what we’re changing going forward.
We want to be as transparent as possible with you, so we’re sharing a lot of details about how our systems are set up and the series of events that exacerbated the outage. If you’d rather just know what we’re changing as a result, you can skip down to the final section.
Background on Our Systems
We use Apache Kafka, an open source software project used by many internet companies, to keep HubSpot systems in sync with each other. In many respects, it's a notification system. Updates about actions taken in HubSpot are sent to Kafka, and other systems then read those updates to stay up-to-date.
For example, when a customer clicks a CTA, a message gets written to Kafka. Another system can then read this message and use it to update your analytics dashboard. Because Kafka is a distributed system made up of multiple servers, it uses another open source software system, ZooKeeper, to coordinate between servers in a cluster. A cluster is a collection of related servers working together.
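To make the CTA example concrete, here is a minimal sketch of the write-then-read pattern. It is a toy in-memory model, not Kafka's actual API: the `TopicLog` and `DashboardConsumer` classes, the message fields, and the CTA name are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log standing in for a single Kafka topic (illustrative only)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Write a message to the end of the log and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self._messages[offset:]


class DashboardConsumer:
    """Hypothetical reader that keeps per-CTA click counts up to date."""

    def __init__(self, log):
        self._log = log
        self._next_offset = 0  # position of the first unread message
        self.clicks = defaultdict(int)

    def poll(self):
        """Process any messages written since the last poll; return how many."""
        new_messages = self._log.read_from(self._next_offset)
        for message in new_messages:
            self.clicks[message["cta_id"]] += 1
        self._next_offset += len(new_messages)
        return len(new_messages)


log = TopicLog()
dashboard = DashboardConsumer(log)

# Two clicks on the same (made-up) CTA are written to the log...
log.append({"event": "cta_click", "cta_id": "spring-webinar"})
log.append({"event": "cta_click", "cta_id": "spring-webinar"})

# ...and the dashboard catches up the next time it reads.
dashboard.poll()
print(dashboard.clicks["spring-webinar"])  # 2
```

The writer and the reader never talk to each other directly. The reader only tracks how far into the log it has read, which is why a reader that falls behind can catch up later.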
Kafka and ZooKeeper are both designed to safeguard against crashes. Within our infrastructure, we run multiple Kafka clusters. If a few servers within a cluster crash, the system is designed to continue to function with no interruption for customers.
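As a rough illustration of how a cluster survives a few crashed servers, the sketch below uses the majority rule that ZooKeeper relies on: the cluster keeps working as long as more than half of its servers are healthy. The cluster sizes are assumptions for the example, not a description of HubSpot's actual setup, and Kafka's own fault tolerance works differently (it relies on replicated copies of the data).

```python
def has_quorum(total_servers, healthy_servers):
    """True while a strict majority of the cluster is still healthy."""
    return healthy_servers > total_servers // 2


def tolerated_failures(total_servers):
    """Largest number of crashed servers that still leaves a majority."""
    return (total_servers - 1) // 2


# Assumed cluster sizes, for illustration only.
for size in (3, 5):
    print(size, "servers can lose", tolerated_failures(size))
```

Under this rule, a five-server cluster keeps functioning with two servers down but stops once a third fails.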
So, If There Are Safeguards, Why Did Things Break Down?
Three things happened on March 28th that led to the issues: a ZooKeeper crash, a related Kafka crash, and an unsuccessful Kafka restart. I will first walk through those events and then follow up with what we’ve learned and what we’ll do in the future to prevent it from happening again.
The issues started when one of the ZooKeeper clusters managing our primary Kafka cluster became overwhelmed by a high volume of requests that had backed up in our system. We were unaware of the strain this was putting on the system until parts of ZooKeeper began to crash.
Normally, ZooKeeper is able to recover quickly, even with multiple crashes, and Kafka is not affected. However, in this case, the ZooKeeper system did not recover for several minutes, which caused Kafka to enter a degraded state.
Under normal circumstances, Kafka will return to a normal state once ZooKeeper recovers, but that did not happen as it should have. Although we were able to identify the issue with ZooKeeper and restore the ZooKeeper clusters to healthy operation, the primary Kafka cluster did not return to normal, and many of the Kafka servers crashed or failed.
Unsuccessful Kafka Restart
Before we were able to fully restart Kafka, ZooKeeper experienced a second outage. As a result, when we went to restart Kafka again, it failed to start cleanly, and we encountered multiple bugs in its startup and crash recovery process. Because of these bugs, the servers restarted very slowly and corrupted data as they recovered.
To safeguard against corrupted data, we shifted our response and began manually recovering the data directly from the servers. This was a time-consuming process that required backing up and copying many terabytes of files. In our opinion, it was the right thing to do to ensure that no additional data would become corrupted; it did, however, make for a regrettably long recovery.
The first crash happened at 11:10 AM EDT on March 28th. By 9:32 PM EDT on March 29th, all functionality was available again, but the data restoration work from the outage continued into this week. No data from before the outage was lost. A small number of customers lost form submissions that came in during the outage period. We're reaching out individually to those customers.
We’re Making Some Key Changes
We know that this issue, along with the slow recovery, was painful for many of our customers. As users of the platform ourselves, we experienced that pain first-hand across our own marketing, sales, and service teams. That fact isn’t meant as consolation for our customers, but I wanted you to know our solutions will be informed by both your feedback and that first-hand experience. We're making changes across the company to better protect our customers from issues of this scale and impact in the future.
Here’s what we’re going to do in response to this outage:
Limit the impact of crashes by splitting up more key clusters within individual infrastructure components like Kafka and ZooKeeper. As mentioned above, both Kafka and ZooKeeper are made up of clusters of servers working together. This crash affected one of the most important clusters within Kafka. As a result of this experience, we're going to split up that cluster so that an infrastructure outage of this size and scope won’t happen again if a single cluster goes down. Doing so will not only restrict issues to smaller portions of the product, it will also make recovery much faster.
Invest heavily in our reliability team. HubSpot has a team that is almost entirely focused on testing and upgrading underlying systems to eliminate bugs as they're discovered. “Almost entirely” is entirely not enough. As of today, we have significantly increased resources to ensure that we have a fully dedicated team solely focused on reliability, upgrades, and testing. The dedicated team will oversee new standards, frequencies, and resources to ensure that we're consistently evaluating our key infrastructure systems for code fixes and critical patches without gaps.
Increase audits and scenario testing for massive failures. HubSpot regularly runs what is referred to by engineering teams as “chaos testing.” The goal of this testing is to create a test environment where you can purposefully and safely bring down servers to test the system response. This sort of testing has happened extensively at HubSpot but not as holistically as it should have for Kafka.
Our new reliability team’s first initiative will be a total audit of how we use systems like Kafka, including a thorough assessment of how we can recover data more quickly in cases where it's not immediately accessible from Kafka. We'll also make significant investments to test our critical infrastructure systems more regularly and more comprehensively, including running scenarios on technical recovery from catastrophic failures of this sort.
Commit to better communication. Our technical response is only one aspect of managing an incident as significant as this one. We also need to make improvements to the way we communicate with you in the most urgent hours of an outage. Over the course of the outage and recovery, we posted 32 updates to our status page, but in those earliest, most critical hours, a desire to be accurate held us back from providing much detail beyond the fact that we were still working on the problem. We understand that’s not helpful to a customer who is trying to determine what to do next, and we’ll strive to do better moving forward.
We may not always have all the details settled, but we promise to provide more consistent updates on the status page and give you more information about which tools are affected. For any issue spanning multiple hours, we also commit to including in each status post a reliable and specific timeframe for the next update. When an individual issue is posted, you can subscribe to be alerted to any updates that follow.
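To illustrate the first change on that list, splitting up key clusters, here is a small sketch of how the split limits the reach of a single failure. The feature names and cluster names are hypothetical and do not reflect how HubSpot's product is actually mapped to its infrastructure.

```python
def affected_features(feature_to_cluster, failed_cluster):
    """List the features that depend on the cluster that went down."""
    return sorted(
        feature
        for feature, cluster in feature_to_cluster.items()
        if cluster == failed_cluster
    )


# Hypothetical layout before the change: everything on one primary cluster.
single_cluster = {
    "analytics": "primary",
    "email": "primary",
    "forms": "primary",
    "workflows": "primary",
}

# Hypothetical layout after the change: the same features spread across clusters.
split_clusters = {
    "analytics": "cluster-a",
    "email": "cluster-b",
    "forms": "cluster-c",
    "workflows": "cluster-a",
}

print(affected_features(single_cluster, "primary"))
print(affected_features(split_clusters, "cluster-a"))
```

In the single-cluster layout, one failure touches every feature. In the split layout, the same failure touches only the features assigned to that cluster, and there is less data to restore afterward.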
Every day you wake up and place your trust in HubSpot to run the tools that help grow your business, and we’re so thankful for that. Your trust means so much to us, and we hate that last week we put you in a position to question it. We are sorry. We will get better and grow better for you.
Originally published Apr 4, 2019 1:10:31 PM, updated April 22 2019