Relay2 (us-east4) and Relay4 (eu-west6) Issue

Incident Report for Twingate

Postmortem

Components impacted

Relay clusters in us-east4 (Ashburn, Virginia) and europe-west6 (Zurich, Switzerland)

‌

Summary

We’ve recently been working on adding spot instance scaling to our Relay cluster infrastructure. On Feb 14th at 01:00 UTC, we initiated this upgrade process via a Terraform change to all of our Relay clusters globally, which took approximately 45 minutes to complete.

At 01:40 UTC, we noticed a decrease in the number of connections in two clusters (us-east4 and europe-west6) and started an investigation. We also engaged our cloud infrastructure provider proactively to rule out a regional cloud provider issue.

At 02:04 UTC, we disabled pods in the two affected clusters, which caused connected Clients and Connectors to re-connect to the next closest Relay cluster. Overall connection metrics were seen to be normal across the redistributed connections.

On further investigation, we determined that an error in our Terraform configuration affecting network firewall rules had caused the issue. At 03:30 UTC we corrected the error and redeployed the affected clusters, which resolved the issue.

‌

Root cause

An error in the deployed Terraform configuration removed a critical network tag, which was required to set the correct network firewall rules within our Relay cluster. The result was that Relay clusters were discoverable, but not reachable, leading to the deadlock state experienced by the Clients and Connectors attempting to attach to the affected clusters.

This issue only affected two of our Relay clusters because of a configuration difference that was dependent on the overall sizing of these clusters. This ultimately hid the issue during testing in our staging environment because this cluster sizing difference, which in turn leads to different configuration outcomes, was not accurately reflected.

‌

Corrective actions

We are taking the following short term actions, some of which are already completed, to avoid this problem in the future:

Accurately reflect Relay cluster sizing and configuration differences in our staging environment.
Auto-create plans for all environments, including all configuration variations in product, with feature development branches.
Enhance Relay health checks to ensure clusters are non-discoverable to Connectors and Clients if the necessary network firewall tag is not in place.
Research and implement staggered rollouts with Terraform for our Relay cluster infrastructure.

We also have medium term plans to add multi-cluster connectivity to our Connectors to handle regional Relay cluster problems automatically.

Posted Feb 17, 2023 - 16:47 UTC

Resolved

We have identified the issue and a fix is implemented. Both relays clusters are healthy and processing requests.

Posted Feb 14, 2023 - 03:34 UTC

Identified

We found issues with 2 of our relay clusters, one in US (region: Virgina and one in Europe (region: Zurich). While we are working on bringing them up, the connectors and clients should have automatically reconnected to the other relay clusters, which may cause some slowness for some customers closer to those clusters since now they need to connect to other clusters.

Posted Feb 14, 2023 - 02:37 UTC

Investigating

We see issues with one of our relay clusters (US relay cluster 2) and investigating.

Posted Feb 14, 2023 - 02:06 UTC

This incident affected: Data Plane (Americas Relays, Peer-to-Peer).