US East Coast Relay issue
Incident Report for Twingate
Postmortem

Components impacted

Relay

Connector

Summary

On this date we had an outage during routine maintenance of our relay infrastructure. The issue started at 4am UTC and was resolved within 2 hours, requiring some customers to restart their connectors in order to re-establish connectivity to our relay infrastructure.

Root cause

In our investigation we determined that the connector received a malformed response from the relay cluster during its maintenance cycle. The malformed response in question contains the address of a particular relay node to which the connector is instructed to connect. This malformed response resulted in the connector retrying access to a non-existing relay node without failing over to another relay cluster.

Corrective actions

After correcting the specific issue that caused the malformed response, we modified both the relay and connector logic so that failover now happens automatically any time that a malformed response is received. We also modified our maintenance procedures to add additional health checks to prevent malformed responses. Finally, we took the opportunity to enhance how the failover logic works to incorporate multiple levels of relay redundancy in the connector's initial configuration that it receives after authentication.

Posted Jul 29, 2021 - 18:39 UTC

Resolved
The affected Relay cluster is now fully operational.
Posted Oct 22, 2020 - 06:56 UTC
Update
We are monitoring as our Relay cluster is coming back online. Any affected Connectors that did not automatically reconnect may require a restart in order to resolve any connectivity issues.
Posted Oct 22, 2020 - 06:33 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 22, 2020 - 06:32 UTC
Investigating
We are currently investigating the issue.
Posted Oct 22, 2020 - 03:00 UTC
This incident affected: Data Plane (Americas Relays).