Summary
On January 24 at 19:58 UTC, our on-call team began receiving automated alerts about system performance degradation. The team began an investigation and, by 20:02 UTC, the degradation had escalated to the point where some Twingate Clients began to experience request timeouts. A Client that encounters a request timeout retries the request, so these timeouts triggered additional requests to our infrastructure.
Due to the overall performance degradation, this increase in inbound requests overloaded the system to the point where internal health check requests also began to fail. As a result, system components were marked as offline, further reducing the capacity available to respond to requests. Our serving infrastructure did autoscale, but the added capacity was insufficient to compensate for both the system's degraded performance and the additional retry workload.
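To make the amplification loop above concrete, here is a minimal, hypothetical sketch in Python. The names and numbers are invented for illustration and do not reflect the Twingate Client's actual retry policy; the sketch only shows how a retry-on-timeout rule multiplies inbound load once a backend slows down.

```python
import random

# Hypothetical model: each timed-out request is retried up to MAX_RETRIES
# times, so the request volume reaching an already-slow backend grows
# sharply as the timeout rate climbs.
MAX_RETRIES = 3

def requests_sent(timeout_rate: float, clients: int = 10_000) -> int:
    """Count total attempts for one round of requests from `clients`."""
    total = 0
    for _ in range(clients):
        attempts = 1
        # Each attempt independently times out with probability
        # `timeout_rate`; every timeout triggers one retry, up to the cap.
        while attempts <= MAX_RETRIES and random.random() < timeout_rate:
            attempts += 1
        total += attempts
    return total

# Healthy system (2% timeouts) vs. degraded system (80% timeouts):
print(requests_sent(0.02))  # ~10,200 attempts: negligible amplification
print(requests_sent(0.80))  # ~29,500 attempts: inbound load nearly triples
```

In other words, the slower the system became, the more requests it received, which is the feedback loop that also pushed health checks past their deadlines.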
Between 20:05 UTC and 20:45 UTC, we determined that the performance degradation was confined to our authorization engine and did not affect other system capabilities. At 20:47 UTC, we promoted our physically separate standby cluster to share load with the existing cluster in an active-active configuration.
Both clusters began serving traffic at 20:48 UTC. Authorization engine throughput improved somewhat, but individual requests were still taking much longer than normal. Noticing that certain tenants were placing a disproportionately high load on the authorization engine, the team then separated those tenants' traffic onto an isolated replica cluster to create a surplus of processing bandwidth. System load on the main cluster returned to normal, and the traffic was gradually recombined between 21:13 UTC and 21:48 UTC, at which point the system had fully recovered.
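As a rough illustration of the traffic separation step, the sketch below routes a deny-listed set of tenant IDs to a dedicated replica cluster. All names and endpoints here are hypothetical and greatly simplified relative to a production routing layer.

```python
# Hypothetical sketch of tenant-based traffic isolation: requests from
# high-load tenants are pinned to a separate replica cluster so that the
# main cluster's authorization engine regains headroom.
MAIN_CLUSTER = "https://authz-main.internal.example"
ISOLATED_REPLICA = "https://authz-replica.internal.example"

# Tenants observed to generate disproportionate authorization load.
isolated_tenants = {"tenant-123", "tenant-456"}

def route_request(tenant_id: str) -> str:
    """Return the cluster endpoint that should serve this tenant's request."""
    if tenant_id in isolated_tenants:
        return ISOLATED_REPLICA
    return MAIN_CLUSTER

# Recombining traffic later amounts to draining the set, which can be done
# gradually, one tenant at a time, while watching main-cluster load.
assert route_request("tenant-123") == ISOLATED_REPLICA
assert route_request("tenant-999") == MAIN_CLUSTER
```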
Post-incident Analysis
In our analysis of all tenant traffic during the incident, we determined that for tenants running the latest Connector and up-to-date Client applications, fewer than 10% of users experienced any downtime related to Resource access. Many users were unaware of the incident entirely: their connections remained active thanks to changes we implemented last year and introduced in Client and Connector updates. The severity a tenant experienced was therefore highly correlated with whether that tenant's Clients and Connectors were up to date. This version disparity also worsened the severity of the incident as a whole, as we discuss in both the root cause and corrective actions below.
Root Cause
This incident was caused by two independent events that occurred simultaneously, and its impact was compounded by deployed Connectors and Clients with out-of-date behaviors.
Specifically:
- A load balancer issue at our infrastructure provider degraded request handling across the system.
- Our authorization engine's performance degraded sharply under the resulting load.
- Deployed Connectors and Clients with out-of-date behaviors amplified the load through request retries.
We are confident that if any of the above three conditions had not been true, this incident would not have occurred.
Corrective Actions
Our corrective actions address the three contributing factors above. In short, we will be: making upgrades and configuration changes to our infrastructure provider's load balancers; improving authorization engine performance; and forcing upgrades of out-of-date deployed components. Many of these tasks were already underway before the incident, and we are accelerating the completion of some related tasks. A detailed breakdown is provided below.
Immediate
We have already taken the following immediate corrective actions:
Short Term
Medium & Long Term