As a continuous delivery service, we form an integral part of our customers’ workflow. Any interruption to our service directly disrupts their work, so we need to resolve any service interruption immediately.
By building our system to be very resilient, with immutable infrastructure at the core, we’re able to react to most interruptions quickly. Additionally, we’ve implemented automatic health checks that remove systems from production as soon as the load or network latency becomes too high.
This is very important when deploying into cloud systems because the cloud behaves more unpredictably than your own hardware. You want to take this into account when building your system so your team can be focused on building your product at all times.
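To make the health check idea a bit more concrete, here is a minimal sketch in Python. It is not our actual implementation: the thresholds, the `HostSample` type, and the function names are all assumptions made up for illustration. It only shows the core decision of keeping a host in rotation or pulling it based on load and latency.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real limits are not stated in this post.
MAX_LOAD_PER_CPU = 1.5
MAX_LATENCY_MS = 250.0


@dataclass
class HostSample:
    name: str
    load_per_cpu: float  # e.g. 1-minute load average divided by CPU count
    latency_ms: float    # e.g. round-trip time to a dependency


def is_healthy(sample: HostSample) -> bool:
    """A host is healthy only if both load and latency are within limits."""
    return (sample.load_per_cpu <= MAX_LOAD_PER_CPU
            and sample.latency_ms <= MAX_LATENCY_MS)


def partition_hosts(samples):
    """Split hosts into (keep_in_rotation, remove_from_rotation) name lists."""
    keep, remove = [], []
    for sample in samples:
        (keep if is_healthy(sample) else remove).append(sample.name)
    return keep, remove
```

With immutable infrastructure, a host that lands in the second list would typically be replaced rather than repaired in place.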
A recent interview I did with Statuspage prompted me to write in more detail about how we respond to incidents. In this post, I’ll outline our processes, tools, and the document templates we use to make sure we have a repeatable workflow.
Downtime recognition must always be based on metrics in your system. If downtime is only recognized when your team or customers see it in the production application, the negative impact has already happened.
To make sure we’ve got great insights into our infrastructure, we use Librato Metrics, Pingdom, and New Relic to collect various application and server metrics. Every system has alerts set that trigger PagerDuty incidents.
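As a rough illustration of what a metric-based alert can look like, here is a small Python sketch. The `AlertRule` type, the metric name in the usage note, and the "several consecutive breaches" rule are assumptions for this example; they are not taken from any of the monitoring tools mentioned above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    metric: str        # hypothetical metric name
    threshold: float   # a sample above this value counts as a breach
    min_breaches: int  # consecutive breaching samples required to page

    def __post_init__(self):
        if self.min_breaches < 1:
            raise ValueError("min_breaches must be at least 1")


def should_page(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Return True if the most recent `min_breaches` samples all exceed the threshold."""
    if len(samples) < rule.min_breaches:
        return False
    recent = samples[-rule.min_breaches:]
    return all(value > rule.threshold for value in recent)
```

For example, a rule like `AlertRule("build_queue_wait_seconds", 60.0, 3)` would page only after three breaching samples in a row, which avoids waking somebody up for a single spike.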
Our on-call hours
We use PagerDuty to manage on-call times and notify our team of any service interruption. We’re spread out between Vienna, Boston, and a few other remote locations, so we can cover almost 24 hours without cutting into anybody’s night. A person woken up in the middle of the night will not perform at their best during the outage. Plus, it ruins their productivity for the next day.
Our local coverage times are as follows:
- Europe: 4 a.m. – 4 p.m. CEST
- Boston: 10 a.m. – 10 p.m. EST
Both are reasonable times for each city. In the worst case, somebody might be woken up in the morning in Vienna, though that’s at a time when we get much less traffic, so service interruptions are less likely. From Boston, we can cover the whole US workday, from East to West Coast, so nobody has to be woken up during this most crucial time of the day for us.
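The coverage windows above can be checked with a few lines of Python. This is only a sketch: it reads the abbreviations literally as fixed offsets (CEST as UTC+2, EST as UTC-5) and ignores daylight saving changes, and the names `SHIFTS`, `regions_on_call`, and `uncovered_utc_hours` are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Fixed offsets, taken literally from the abbreviations in the list above.
CEST = timezone(timedelta(hours=2), "CEST")
EST = timezone(timedelta(hours=-5), "EST")

# (region, timezone, first covered local hour, first uncovered local hour)
SHIFTS = [
    ("Europe", CEST, 4, 16),
    ("Boston", EST, 10, 22),
]


def regions_on_call(moment_utc: datetime) -> List[str]:
    """Return the regions whose local coverage window contains the given instant."""
    covering = []
    for region, tz, start_hour, end_hour in SHIFTS:
        local_hour = moment_utc.astimezone(tz).hour
        if start_hour <= local_hour < end_hour:
            covering.append(region)
    return covering


def uncovered_utc_hours() -> List[int]:
    """List the UTC hours of a day during which no region is inside its window."""
    day = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary date; offsets are fixed
    return [h for h in range(24) if not regions_on_call(day + timedelta(hours=h))]
```

Under these fixed-offset assumptions, the two windows overlap for one hour and leave one hour without local coverage, which fits the "almost 24 hours" described earlier. If Boston is observing daylight time, the windows shift by an hour relative to each other.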
We’ve recently switched to weekly schedules, so you’re on call for a week in your timezone. Weekends are covered completely by one person, so it’s rare that you have to be on call on a weekend.
To make sure there’s always somebody covering even if somebody misses a pager call, we’ve set up a secondary schedule where even non-technical people are on call.
Our primary versus secondary schedules
The primary schedule always consists of people who can fix our infrastructure; the secondary is made up of most of the rest of the team. This has several advantages.
First of all, we share on-call duty between most people in the team. Even if one of the developers doesn’t see the page, somebody else picks up and can ping other developers and talk to customers.
Having only the development team on a schedule would put an unfair burden on the dev team. It would also make it more difficult to resolve the issue and communicate with customers at the same time.
Through this secondary schedule system, we always have somebody we can pull in to take over customer communication. That can mean calling the person on secondary support at night as well if that’s necessary. It usually isn’t necessary to pull in the secondary, but it’s definitely a great fallback to have.
In case a pager call falls through the first and second layer, Jim, our VPE, is on a third layer, and I am on a fourth layer. This makes sure the escalation goes all the way through the technical team at Codeship if necessary (thankfully it hasn’t been for a long time). While growing our team, we’re constantly iterating on our coverage hours to make sure we have great coverage at all times.
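The four layers described above boil down to a simple ordered fallback. Here is a minimal Python sketch of that logic. The layer labels and the `acknowledges` callback are hypothetical; in practice the paging tool handles this, including the timeout before moving to the next layer.

```python
from typing import Callable, Optional, Sequence

# Layer order follows the description above; the labels are illustrative only.
ESCALATION_LAYERS = ("primary", "secondary", "vpe", "author")


def escalate(layers: Sequence[str],
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Page each layer in order and return the first one that acknowledges.

    `acknowledges` stands in for "page this layer and wait for the
    acknowledgement timeout". Returns None if nobody responds.
    """
    for layer in layers:
        if acknowledges(layer):
            return layer
    return None
```

The point of the ordering is that a missed page never ends the chain early: each layer is only reached if every layer before it stayed silent.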
Downtime Remediation and Follow Up
As soon as the issue is discovered, the development team will decide who will look into it. Our customer success team will decide who will take over communication for this specific issue and will work with the assigned developer to communicate regularly with our customers.
By defining those two people, we have a clear line of communication between the developers who know the current status of the issue and customer success who can answer support requests and send updates through Twitter or our Statuspage.
Steps while fixing the issue
The first and most important step is to update our Statuspage. We want to make sure we’re always communicating immediately with our customers if we have a service interruption.
We’re always trying to ask ourselves, “How would we expect important services that we use to communicate with us in a downtime?” and “How can we build more long-term trust through our downtime communication?”
The way you communicate with your customers during a downtime will decide if they trust you more or less in the future. A single interruption doesn’t necessarily erode trust, but a downtime with bad communication definitely will. Customers expect full transparency when issues come up, including a detailed description of what the issue is and how it will be resolved.
While fixing the issue, we will regularly update our customers through various channels, and the development team and customer success are in constant contact with each other.
As we’re currently growing our engineering team, we will probably move to a setup where the roles are even more clearly defined. As soon as you have several teams working on different services, where downtime might impact several services at once, you need to have a higher-level management role in place to manage the downtime effort.
Heroku released a very interesting post last year about their incident response with the Incident Command System. It’s definitely an interesting approach that we’re thinking about for the future.
As soon as an issue is fixed, we start work on a debriefing about the service interruption. We’ve found structured debriefings to be an incredibly helpful tool to get an overview of the whole issue and determine future steps we need to take.
Because we’re spread out between Europe and the US, compiling this debrief is typically done asynchronously on GitHub with people commenting or committing and pushing changes directly onto the debriefing branch. Meetings or calls are done if necessary, but typically don’t have to be part of a debriefing.
We have a GitHub repository called Operations that contains our debriefings as well as a template for them. I’ve copied the template into a GitHub Gist.
The template contains an executive summary so anyone can get a quick overview of what happened. We can share this summary internally.
The debriefing template then dives into different aspects of the issue, from detailed technical descriptions to customer communication. We also capture any steps we can take to improve our resolution efforts in the future.
We want to make sure this document accurately reflects what has happened and digs deep without pushing blame onto anyone. Only when we analyze exactly what happened in our technology and process are we able to improve. But we won’t be able to dig deep and make sure everyone opens up about their mistakes if we then turn around and immediately use it to punish people.
One very important lesson we’ve learned about our template is to leave irrelevant sections of the template empty instead of removing them for a particular debriefing. Removing irrelevant sections makes it difficult to compare debriefings to one another. It can also cause somebody not directly involved with the downtime to wonder if something is missing or getting hidden. Debriefings should be as transparent as possible to maintain your team’s trust in the process.
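A rule like "leave sections empty, don’t remove them" is easy to check automatically. The following Python sketch shows one way to do it. It assumes the debriefing uses Markdown-style `#` headings, and every section name except the executive summary is a hypothetical placeholder, not the content of our actual template.

```python
# Hypothetical section names; only the executive summary is named in this post.
TEMPLATE_SECTIONS = [
    "Executive Summary",
    "Technical Details",
    "Customer Communication",
    "Future Improvements",
    "To Dos",
]


def missing_sections(debriefing_text: str, sections=TEMPLATE_SECTIONS):
    """Return template sections whose heading does not appear in the debriefing.

    A section with a heading but no body counts as present, which matches
    the rule of leaving irrelevant sections empty instead of removing them.
    """
    headings = {
        line.lstrip("#").strip().lower()
        for line in debriefing_text.splitlines()
        if line.startswith("#")
    }
    return [s for s in sections if s.lower() not in headings]
```

A check like this could run on the debriefing branch before it is merged, so a dropped section gets noticed while the debriefing is still being written.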
The debriefings also capture to-dos, so it’s easy for anyone to follow up and see if we followed through with the steps we needed to take to improve.
These debriefings also form the core data we use to write any public follow-up blog post or any other customer communication we’re doing after the resolution.
By having a well-defined process that goes from detection of an issue to customer communication to solving the issue to debriefing, we can focus all of our energy on resolving an issue once it comes up instead of scrambling to find an ad hoc process.
In fact, we want this process to be so second nature to everyone that we wrote a document outlining those steps and hung it up in some of our restrooms. You can download the document here.
We also include the document as part of our onboarding activities. This ensures that everyone who joins knows exactly how to resolve an issue once they become part of the PagerDuty schedule.
While we’re definitely going to change and formalize our process as we grow, we feel we’ve got the right balance of process and flexibility for the current size of our team.
Let us know in the comments if you have a different process for finding or resolving issues in your system.