Twingate Service Incident
Incident Report for Twingate
Postmortem

Summary

On January 24 at 19:58 UTC, our on-call team began receiving automated alerts about system performance degradation. The team started an investigation and, by 20:02 UTC, the degradation had escalated to the point where some Twingate Clients began to experience request timeouts. On a request timeout, the Client retries the request, which triggered additional requests to our infrastructure.

Due to the overall performance degradation, the increase in inbound requests overloaded the system to the point where internal health check requests also began to fail. This resulted in system components being marked as offline, further reducing the capacity available to respond to requests. Autoscaling of the serving infrastructure did occur, but the added capacity was insufficient to compensate for the system’s overall decrease in performance combined with the additional request workload.

Between 20:05 UTC and 20:45 UTC, we determined that the performance degradation was isolated to our authorization engine and did not affect other system capabilities. At 20:47 UTC, we promoted our physically separate standby cluster to share load with the existing cluster in an active-active configuration.

Both clusters began serving traffic at 20:48 UTC, and we saw some improvement in authorization engine throughput, but individual requests were still taking much longer than normal. Noticing that certain tenants were placing a higher load on the authorization engine, the team next separated those tenants’ traffic onto an isolated replica cluster to provide a surplus of processing bandwidth. System load returned to normal on the main cluster, and the traffic was gradually recombined between 21:13 UTC and 21:48 UTC, at which point the system had fully recovered.

Post-incident Analysis

In our analysis of all tenant traffic during the incident, we determined that for tenants running the latest Connector and most up-to-date Client applications, fewer than 10% of users experienced any downtime related to Resource access. Many users were unaware of this incident because their connections remained active thanks to changes we implemented last year and introduced in Client and Connector updates. The severity experienced during this incident was therefore highly correlated with whether a given tenant’s Clients and Connectors were up to date. However, this version disparity also affected the severity of the incident as a whole; we discuss this in both the root cause and corrective actions below.

Root Cause

This incident occurred because two independent events took place simultaneously, and their impact was made worse by deployed Connectors and Clients with out-of-date behavior.

Specifically:

  1. A temporary anomaly in our infrastructure provider’s load balancer caused a short-term but very significant (greater than 10 seconds) increase in request latency. This in turn triggered Client request retry behavior, increasing the overall load on the system in a short time span.
  2. Independently of the above event, a large number of computationally costly changes were triggered in our authorization engine through non-anomalous tenant activity. This increased the processing time for authorization requests.
  3. A sufficient number of Connectors and Clients deployed across our tenant base do not have the most up-to-date logic in place for handling connection degradation. Clients and Connectors released before approximately May 2022 do not back off their retry requests, leading to an overwhelmingly large volume of requests to our system from a relatively small number of deployed Clients and Connectors (a sketch of this backoff behavior follows below). This exacerbated both (1) and (2).

We are confident that if any of the above three conditions were not true, this incident would not have occurred.
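
For illustration, the retry behavior introduced in newer Clients and Connectors amounts to capped exponential backoff with jitter, so that retries spread out over time rather than piling onto a degraded system all at once. The following is a minimal Python sketch under assumed names; it is not Twingate’s actual Client code, and the use of the requests library is for illustration only:

    import random
    import time

    import requests  # assumed HTTP client, for illustration only

    def request_with_backoff(url, max_attempts=5, base_delay=1.0, max_delay=60.0):
        """Retry a failed request with capped exponential backoff and full jitter.

        Components released before this behavior was introduced retried
        immediately on timeout, multiplying load during degradation; jittered
        backoff spreads the retries out over time instead.
        """
        for attempt in range(max_attempts):
            try:
                return requests.get(url, timeout=10)
            except requests.RequestException:
                if attempt == max_attempts - 1:
                    raise
                # Sleep a random duration up to the capped exponential delay.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))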

Corrective Actions

Our corrective actions focus on addressing the three contributing factors above. In short, we will be: making upgrades and configuration changes to our infrastructure provider’s load balancers; improving authorization engine performance; and forcing upgrades of out-of-date deployed components. Many of these tasks were already underway before the incident, and we will accelerate the completion of some related tasks. A detailed breakdown is provided below.

Immediate

We have already taken the following immediate corrective actions:

  1. Increased authorization engine capacity and distributed the load between multiple clusters located in different geographic regions
  2. Isolated authorization requests to a dedicated deployment
  3. Increased backend and health check timeouts to more appropriately match the potential for increased authorization request latency (see the sketch after this list)
  4. Upgraded our infrastructure provider’s load balancer to improve container awareness
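
As an illustration of item 3 above, a health check needs to tolerate the worst-case latency of the backend it probes; otherwise a slow but still healthy instance is marked offline, and capacity shrinks exactly when it is most needed. The sketch below is a minimal Python illustration using assumed names and an assumed timeout value, not our actual health-check configuration:

    import requests  # assumed HTTP client, for illustration only

    # Illustrative value only: the timeout should exceed the worst-case latency
    # seen for authorization requests (greater than 10 seconds during this
    # incident), so transient slowness does not remove instances from rotation.
    HEALTH_CHECK_TIMEOUT_SECONDS = 30

    def instance_is_healthy(health_url: str) -> bool:
        """Return True if the instance answers its health endpoint in time."""
        try:
            response = requests.get(health_url, timeout=HEALTH_CHECK_TIMEOUT_SECONDS)
            return response.ok
        except requests.RequestException:
            return False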

Short Term

  1. Complete a significant upgrade of our authorization engine. This includes removing a subsystem that was identified as the bottleneck during this and previous incidents. This project began in 2022 Q4, and we expect this replacement upgrade to be complete by early 2023 Q2.
  2. Introduce additional deployment isolation for different request types so that a failure in one part of the system does not affect other subsystems. This approach worked very well during the incident, and we will be further standardizing it across the product.
  3. Introduce additional logging to help accelerate future troubleshooting.

Medium & Long Term

  1. Gradually move more parts of our application servers from synchronous request processing to asynchronous processing (see the sketch below this list).
  2. Consider the use of a sidecar proxy in front of our application servers.
  3. Consider the use of an improved load based auto-scaling mechanism for the authorization engine.
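
As a rough illustration of item 1 above, asynchronous processing lets an application server keep serving other requests while a slow authorization call is in flight, instead of tying up a synchronous worker for the full duration. The following Python asyncio sketch uses hypothetical function names and is not our application code:

    import asyncio

    def check_authorization_sync(request_payload):
        """Stand-in for an existing blocking call into the authorization engine."""
        # Blocking network or database work would happen here.
        return {"allowed": True, "request": request_payload}

    async def handle_authorization(request_payload):
        """Run the blocking call on a thread pool via run_in_executor so the
        event loop stays free to serve other requests while this one waits."""
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, check_authorization_sync, request_payload)

    async def main():
        # Several slow authorization checks can now overlap instead of queueing.
        results = await asyncio.gather(*(handle_authorization({"id": i}) for i in range(3)))
        print(results)

    if __name__ == "__main__":
        asyncio.run(main())
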
Posted Jan 28, 2023 - 02:10 UTC

Resolved
We are marking the issue as resolved. The system is working as expected with healthy metrics.
Posted Jan 25, 2023 - 01:03 UTC
Update
Public-API (Admin-API) has been brought back up as well. While all the metrics for the service look healthy, we will continue to monitor them.
Posted Jan 24, 2023 - 22:32 UTC
Monitoring
We have identified the issue, and the Twingate system has been healthy since 1:48 pm PST. We are still monitoring the issue.
Public-API is being kept down for the time being.
Posted Jan 24, 2023 - 22:05 UTC
Update
We are continuing to investigate this issue.
Posted Jan 24, 2023 - 21:30 UTC
Update
Twingate engineering is still working on identifying the root cause of the issue. We'll continue to provide updates as we find out more.
- Public-API is disabled.
- Admins and logins should work.
- We are still seeing issues with Authorization.
Posted Jan 24, 2023 - 21:20 UTC
Update
Twingate Engineering is fully engaged and we are still investigating the issue. We'll provide further updates as soon as we can.
Posted Jan 24, 2023 - 20:44 UTC
Investigating
We are currently investigating this issue.
Posted Jan 24, 2023 - 20:09 UTC
This incident affected: Control Plane (Authentication, Authorization, Connector Heartbeat), Management (Public API, Admin Console), and Homepage (Downloads).