Components impacted
Relay clusters in us-east4 (Ashburn, Virginia) and europe-west6 (Zurich, Switzerland)
Summary
We’ve recently been working on adding spot instance scaling to our Relay cluster infrastructure. On Feb 14th at 01:00 UTC, we initiated this upgrade process via a Terraform change to all of our Relay clusters globally, which took approximately 45 minutes to complete.
At 01:40 UTC, we noticed a decrease in the number of connections in two clusters (us-east4 and europe-west6) and started an investigation. We also engaged our cloud infrastructure provider proactively to rule out a regional cloud provider issue.
At 02:04 UTC, we disabled pods in the two affected clusters, which caused connected Clients and Connectors to re-connect to the next closest Relay cluster. Overall connection metrics were seen to be normal across the redistributed connections.
On further investigation, we determined that an error in our Terraform configuration affecting network firewall rules had caused the issue. At 03:30 UTC we corrected the error and redeployed the affected clusters, which resolved the issue.
Root cause
An error in the deployed Terraform configuration removed a critical network tag, which was required to set the correct network firewall rules within our Relay cluster. The result was that Relay clusters were discoverable, but not reachable, leading to the deadlock state experienced by the Clients and Connectors attempting to attach to the affected clusters.
This issue only affected two of our Relay clusters because of a configuration difference that was dependent on the overall sizing of these clusters. This ultimately hid the issue during testing in our staging environment because this cluster sizing difference, which in turn leads to different configuration outcomes, was not accurately reflected.
Corrective actions
We are taking the following short term actions, some of which are already completed, to avoid this problem in the future:
We also have medium term plans to add multi-cluster connectivity to our Connectors to handle regional Relay cluster problems automatically.