Summary
On June 7th at 17:08 UTC, a new version of our Controller software was rolled out. Shortly after the rollout completed, our on-call team received automated exception alerts and began investigating. The new version had inadvertently included database cleanup changes that were out of sync with the deployed code, and the team decided to roll back the deployment. The rollback itself proceeded smoothly, but the previous code version depended on database fields that the cleanup had removed, and the incident started at 17:24 UTC.
A fix was prepared and rolled out starting at 17:29 UTC. Deployment completed on the first cluster at 17:38 UTC and proceeded to the remaining clusters once we verified that the error state had been resolved. Deployment was completed on all clusters at 17:48 UTC.
Post-incident Analysis
We initially thought the incident had impacted only Admin Console users; however, the following systems were impacted:
Root Cause
An error in our deployment process logic led to a mismatch between the deployed code and the database schema.
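To illustrate the kind of mismatch described here, the following is a minimal sketch of a compatibility check that could run before a deployment or a rollback. It is not our actual deployment tooling: the Release type, the helper functions, and the field names (such as legacy_flag) are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical record: a code version and the database fields it uses."""
    version: str
    required_fields: frozenset[str]


def missing_fields(release: Release, live_schema: set[str]) -> set[str]:
    """Return the fields the release needs that the live schema does not have."""
    return set(release.required_fields) - live_schema


def safe_to_deploy(release: Release, live_schema: set[str]) -> bool:
    """A release (or rollback target) is safe only if no required field is missing."""
    return not missing_fields(release, live_schema)


# Assumed scenario: a cleanup has already dropped "legacy_flag" from the schema.
schema_after_cleanup = {"id", "name"}
previous = Release("v1", frozenset({"id", "name", "legacy_flag"}))
current = Release("v2", frozenset({"id", "name"}))
```

Under these assumptions, safe_to_deploy(previous, schema_after_cleanup) returns False, so a check of this kind would flag the rollback target as incompatible with the cleaned-up schema before the rollback proceeds.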
Corrective Actions