Postmortem - Partial Outage on July 22nd, 2015
In the spirit of maintaining a relationship of transparency and trust with our customers, here is a rundown of the events that took place yesterday and affected a number of regions. We sincerely apologize to those affected and have taken steps to ensure this particular issue cannot happen again.
On July 22nd, 2015, we inadvertently took a subset of projects offline. The most heavily impacted regions were Joyent us-east-1 and Joyent eu-ams-1. The underlying cause was a bug in a newly launched internal auditing tool. The tool's job is to continually scan our infrastructure for servos that are no longer associated with a project and deprovision them. Unfortunately, the tool incorrectly identified a subset of live servos as orphaned and deprovisioned them. We restored the affected projects by issuing a stop/start command to each one, which triggers a re-provision, but the projects were down until then.
The effects of the bug were confined to our previous generation of application hosts. The bug was caused by a slight, overlooked difference in how we store information about which servos are on each host. Although the tool went through code review and staging testing, and we successfully rolled it out to multiple smaller regions, the bug did not manifest until the tool reached the large Joyent us-east-1 region.
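To make this failure mode concrete, here is a minimal sketch of how an orphan scan might guard against it. This is not our actual tool. The names `Host`, `servo_ids_on_host`, `find_orphans`, and `max_orphan_fraction` are hypothetical, and the two record formats are invented purely for illustration, since the real storage difference is not described here. The sketch assumes the difference lies between host generations, normalizes every host's records into plain servo IDs before comparing them to the live set, and refuses to proceed when an implausibly large share of servos looks orphaned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    generation: str       # hypothetical labels: "previous" or "current"
    servo_records: tuple  # raw records; format depends on the generation


def servo_ids_on_host(host):
    """Normalize a host's raw records into a set of plain servo IDs."""
    if host.generation == "current":
        # Hypothetical format: records are already bare servo IDs.
        return set(host.servo_records)
    if host.generation == "previous":
        # Hypothetical format: records look like "<project>/<servo id>".
        return {record.split("/", 1)[1] for record in host.servo_records}
    # Refuse to guess: an unknown format must never lead to deletions.
    raise ValueError(f"unknown host generation: {host.generation!r}")


def find_orphans(hosts, live_servo_ids, max_orphan_fraction=0.1):
    """Return servo IDs found on hosts that belong to no live project.

    Raises RuntimeError instead of returning a result when the orphaned
    share exceeds max_orphan_fraction, so a parsing bug cannot silently
    turn into a mass deprovision.
    """
    seen = set()
    for host in hosts:
        seen |= servo_ids_on_host(host)
    orphans = seen - set(live_servo_ids)
    if seen and len(orphans) / len(seen) > max_orphan_fraction:
        raise RuntimeError(
            f"{len(orphans)} of {len(seen)} servos look orphaned; "
            "refusing to continue without manual review"
        )
    return orphans
```

The threshold is the important design choice in this sketch: a scan that suddenly wants to remove most of a region is more likely to be wrong than right, so it stops and asks for a human instead of acting.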
We already have tighter release and review policies in place for changes that impact customer uptime. For example, changes to the load balancers are tightly controlled. This tool should have gone through that more stringent review because it had the potential to affect uptime. We have changed our policy to better identify tools that can affect uptime and to ensure they pass through the appropriate release procedure.
We apologize again if you were one of the people affected. Maintaining uptime is one of our most important goals as a product and as a company, and we did not meet that goal yesterday. We have learned from this and, as a result, will move forward with a more stable environment for our customers and their applications.