Summary
At approximately 02:26 UTC on January 19th, we observed an increase in latency between our API layer and our backend database. Within a few minutes, this latency spike developed into an outage in which 90% of requests failed with either a 500 (Internal Server Error) or a 502 (Bad Gateway) response, depending on where in our system the error occurred. These errors were caused by timeouts between our API layer and the database and persisted until approximately 03:36 UTC.
During the outage, both our private and public APIs were affected. These APIs drive most of the functionality that end users and administrators experience in Twingate. Specifically, customers’ admin consoles were inaccessible, the public API did not respond to requests, Clients and Connectors were unable to initiate authentication, and existing connections were eventually dropped without the ability to re-authenticate.
Root cause
The root cause was a significant degradation in database performance caused by a spike in CPU utilization, which increased latency across the system. Although our API layer remained available to respond to requests, each request took significantly longer to complete, saturating the API layer’s connection capacity. Once saturated, the API layer rejected additional requests, which surfaced as 500 or 502 errors to the requestor.
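To illustrate the failure mode, the sketch below shows one common way to bound API-to-database connections so that slow queries fail fast instead of tying up the whole service. It assumes a Python service using SQLAlchemy against PostgreSQL; the stack, connection string, and every parameter value are illustrative assumptions, not details taken from this report.

# Minimal sketch: bounded pool plus server-side statement timeout,
# so a degraded database produces fast, explicit errors rather than
# connection saturation. All names and values here are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://api:secret@db.internal/app",  # hypothetical DSN
    pool_size=20,      # steady-state connections per API instance
    max_overflow=10,   # extra connections allowed for short bursts
    pool_timeout=2,    # seconds to wait for a free connection before raising
    connect_args={
        # Abort any statement running longer than 5s on the server side,
        # freeing the connection instead of holding it while the
        # database is degraded.
        "options": "-c statement_timeout=5000",
    },
)

# When the pool is exhausted or a statement times out, the caller gets a
# prompt exception it can map to an error response with backoff, rather
# than holding connections open until an upstream proxy returns a 502.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

The key design point is that both limits are finite and small: without them, slow queries accumulate open connections until the API layer can accept no new work, which is exactly the saturation described above.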
Corrective actions
To mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements: