Issue connecting to Twingate
Incident Report for Twingate
Postmortem

Summary

At approximately 14:26 UTC on January 19th, we observed an increase in latency between our API layer and our backend database system. Within a few minutes, this spike in latency developed into an outage in which 90% of requests returned one of two error responses: a 500 (Internal Server Error) or a 502 (Bad Gateway), depending on where in our system the failure occurred. These errors were caused by timeouts between our API layer and the database and persisted until approximately 15:36 UTC.

During the outage, both our private and public APIs were affected. These APIs are used to drive most of the functionality that end users and administrators experience in Twingate. Specifically, this means that customers’ admin consoles were not accessible, the public API was not responsive to requests, Clients and Connectors were unable to initiate authentication, and existing connections were eventually dropped without the ability to re-authenticate.

Root cause

The root cause was a significant degradation in database performance caused by a spike in CPU utilization, which increased latency across the system. Although our API layer remained available to respond to requests, each request took significantly longer to complete, saturating the API layer's connections; additional requests were then rejected and surfaced to the requestor as 500 or 502 errors.
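To make this failure mode concrete, the short sketch below (a hypothetical Python illustration, not Twingate's actual code) shows how a bounded database connection pool behaves when every query suddenly slows down: the few pooled connections stay occupied, new requests time out waiting for one, and the API layer surfaces those timeouts as 5xx errors even though the API processes themselves are still running. The pool size, timeout, and latency values are invented for illustration.

# Hypothetical illustration of connection-pool saturation under slow queries.
import concurrent.futures
import queue
import time

POOL_SIZE = 4             # invented number of pooled DB connections
CHECKOUT_TIMEOUT = 0.2    # how long a request waits for a free connection (seconds)
SLOW_QUERY_SECONDS = 1.0  # simulated per-query latency during the CPU spike

pool = queue.Queue()
for conn_id in range(POOL_SIZE):
    pool.put(conn_id)

def handle_request(request_id: int) -> str:
    """Simulate one API request that needs a database connection."""
    try:
        conn = pool.get(timeout=CHECKOUT_TIMEOUT)
    except queue.Empty:
        # No connection became available in time: the request is rejected,
        # which the caller would see as a 5xx error.
        return f"request {request_id}: 5xx rejection (pool exhausted)"
    try:
        time.sleep(SLOW_QUERY_SECONDS)  # the "slow query" while the database is overloaded
        return f"request {request_id}: 200 OK (used connection {conn})"
    finally:
        pool.put(conn)

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    for line in executor.map(handle_request, range(16)):
        print(line)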

Corrective actions

To mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements:

  • Completed: We have doubled the server size of the master database cluster to prevent utilization spikes from disrupting our ability to continue serving requests.
  • Short term: We are working on introducing zonal database read replicas, which will improve distribution of system load and remove the master database as a single point of failure; a simplified routing sketch follows this list. These improvements will also allow our service to maintain partial connectivity when the master database is unavailable.
  • Medium term: We are implementing changes to the Client connection session management to maintain connectivity in cases when backend services are unreachable; a simplified sketch of this pattern also follows this list. This will introduce an additional layer of resiliency to our system beyond the changes described above.
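As a rough illustration of the read-replica work above, the following sketch (hypothetical, assuming a single writable primary with per-zone read replicas; the endpoint and zone names are invented) shows how reads can be routed to a replica in the caller's zone while writes still go to the primary, so read traffic keeps flowing even when the primary is unreachable.

# Hypothetical read/write routing across a primary and zonal read replicas.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    zone: str
    writable: bool

PRIMARY = Endpoint("db-primary", zone="zone-b", writable=True)
REPLICAS = [
    Endpoint("db-replica-a", zone="zone-a", writable=False),
    Endpoint("db-replica-b", zone="zone-b", writable=False),
    Endpoint("db-replica-c", zone="zone-c", writable=False),
]

def route(query_is_write: bool, caller_zone: str, primary_up: bool) -> Endpoint:
    """Pick a database endpoint for a query."""
    if query_is_write:
        if not primary_up:
            raise RuntimeError("primary unavailable: writes must wait or fail")
        return PRIMARY
    # Prefer a replica in the caller's zone to reduce cross-zone latency;
    # fall back to any replica so reads survive a primary outage.
    same_zone = [r for r in REPLICAS if r.zone == caller_zone]
    return (same_zone or REPLICAS)[0]

# Example: a read still succeeds while the primary is down.
print(route(query_is_write=False, caller_zone="zone-c", primary_up=False).name)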
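Similarly, the sketch below illustrates the general grace-period pattern behind the Client session-management change, under the assumption of a token-based session; it is not Twingate's implementation. The idea is that an established connection is kept alive for a bounded period while re-authentication against the backend is retried, instead of being dropped as soon as the backend becomes unreachable.

# Hypothetical grace-period handling for an established client session.
import time
from dataclasses import dataclass

GRACE_PERIOD_SECONDS = 15 * 60  # invented allowance while the backend is unreachable

@dataclass
class Session:
    token: str
    renewed_at: float  # epoch seconds of the last successful renewal

def renew(session: Session, backend_reachable: bool) -> Session:
    """Try to renew the session; on failure, keep the current one for now."""
    if backend_reachable:
        return Session(token="fresh-token", renewed_at=time.time())
    return session  # renewal failed; the caller decides below whether it is still usable

def connection_should_stay_up(session: Session, now: float) -> bool:
    """Existing connections stay up while we are inside the grace period."""
    return (now - session.renewed_at) <= GRACE_PERIOD_SECONDS

# Example: the backend has been unreachable for 10 minutes; the connection survives.
session = Session(token="cached-token", renewed_at=time.time() - 10 * 60)
session = renew(session, backend_reachable=False)
print(connection_should_stay_up(session, now=time.time()))  # True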
Posted Jan 23, 2022 - 07:20 UTC

Resolved
We are continuing to monitor the system, and it remains stable and available. We are closing out this incident and will follow up with a postmortem here.
Posted Jan 19, 2022 - 17:16 UTC
Monitoring
We have re-established connectivity and Twingate services have been restored. We are continuing to monitor our systems.
Posted Jan 19, 2022 - 15:52 UTC
Update
Our engineers have isolated the problem to a network connectivity issue between our application servers and our database infrastructure. Our team is working to restore network connectivity and we will continue to post regular updates.
Posted Jan 19, 2022 - 15:23 UTC
Investigating
We are aware of an incident affecting our production system and are currently actively investigating the issue. We will be posting regular updates pertaining to this incident.
Posted Jan 19, 2022 - 14:47 UTC
This incident affected: Data Plane (Americas Relays), Control Plane (Connector Heartbeat), and Management (Admin Console).