Twingate Service Incident - Aug 19, 2023
Incident Report for Twingate
Postmortem

Summary

On August 19, 2023, at 7:51 AM UTC, Twingate received alerts about issues with its login services. Within a few minutes, the Twingate engineering team began investigating and quickly identified that the backend was seeing excessive timeouts from a 3rd-party API, which prevented it from processing other requests, such as authentication. After some initial fixes were unsuccessful, Twingate contacted the 3rd party and disabled support for the real-time updates that rely on these specific 3rd-party API calls. As a result, the issues began resolving at 8:10 AM UTC. Most services recovered quickly, and full resolution occurred at 8:15 AM UTC.

The vendor later confirmed and fixed the issue, and Twingate re-enabled the real-time update feature shortly afterward, on the same day (August 19).

Root cause

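Requests to a 3rd-party API were timing out excessively. These stalled calls exhausted the Twingate backend's capacity and prevented it from processing other requests, such as authentication.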
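
To illustrate the failure mode, the following is a minimal Python sketch of one common way to keep a slow dependency from exhausting a backend: a per-call deadline combined with a cap on concurrent calls. It is not Twingate's implementation, which this report does not describe. The names `BoundedVendorClient`, `VendorUnavailable`, `slow_vendor`, and `healthy_vendor`, and the limits used, are all hypothetical.

```python
import asyncio


class VendorUnavailable(Exception):
    """The vendor call was shed or exceeded its deadline."""


class BoundedVendorClient:
    """Caps how many vendor calls may be in flight and how long each may run."""

    def __init__(self, max_in_flight: int, timeout_s: float) -> None:
        self._slots = asyncio.Semaphore(max_in_flight)
        self._timeout_s = timeout_s

    async def call(self, vendor_call):
        # Fail fast when every slot is taken, instead of queueing behind slow calls.
        if self._slots.locked():
            raise VendorUnavailable("shed: too many vendor calls in flight")
        async with self._slots:
            try:
                return await asyncio.wait_for(vendor_call(), self._timeout_s)
            except asyncio.TimeoutError:
                raise VendorUnavailable("timed out") from None


async def demo():
    # Hypothetical limits: 2 concurrent vendor calls, 50 ms deadline each.
    client = BoundedVendorClient(max_in_flight=2, timeout_s=0.05)

    async def slow_vendor():  # stands in for the degraded 3rd-party API
        await asyncio.sleep(1)

    async def healthy_vendor():
        return "ok"

    outcomes = await asyncio.gather(
        *(client.call(slow_vendor) for _ in range(4)), return_exceptions=True
    )
    # Slots are free again once the slow calls hit their deadline.
    after = await client.call(healthy_vendor)
    return sorted(str(o) for o in outcomes), after
```

Running `asyncio.run(demo())` issues four calls against the slow stand-in: two occupy the available slots until their deadline passes, and the other two are rejected immediately, so the rest of the process stays free to serve unrelated requests.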

Post-incident Analysis

Twingate had already separated most services into their own deployments, which allowed those services to keep functioning throughout the incident. As a result, only some users who needed to authenticate or re-authenticate were affected; users who had authenticated before the incident were not impacted.

Post-incident log analysis showed that the incident started at 7:49 AM UTC, two minutes before the first alerts, and that services fully recovered at 8:15 AM UTC.

Corrective actions

Short Term:

  • Separate Authentication and real-time services into their own deployments - COMPLETED

Medium / Long Term:

  • Reevaluate and optimize timeout values for various backend and 3rd-party services
  • Simplify the internal Twingate process for enabling and disabling features
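
As an illustration of the last item, a feature "kill switch" can be sketched in Python as below. This is only an assumed shape for such a mechanism, not Twingate's internal process: the `FeatureFlags` class, the `push_update` helper, and the flag name `realtime_updates` are hypothetical, and a production system would more likely read flags from shared configuration than from process memory.

```python
import threading


class FeatureFlags:
    """In-memory flag store; a real deployment would likely use shared config."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._flags = dict(defaults)

    def set(self, name: str, enabled: bool) -> None:
        with self._lock:
            self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        with self._lock:
            # Unknown flags default to off, so a typo cannot enable a code path.
            return self._flags.get(name, False)


def push_update(flags: FeatureFlags, send_realtime) -> bool:
    """Send a real-time update unless the feature is switched off."""
    if not flags.is_enabled("realtime_updates"):
        return False  # skip the 3rd-party call entirely
    send_realtime()
    return True
```

With a switch like this, calling `flags.set("realtime_updates", False)` stops new calls to the affected dependency without a redeploy, and setting it back to `True` restores the feature.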
Posted Aug 23, 2023 - 06:13 UTC

Resolved
This incident has been resolved. We'll publish an RCA as soon as we can.
Posted Aug 19, 2023 - 08:21 UTC
Update
We are continuing to investigate this issue.
Posted Aug 19, 2023 - 08:01 UTC
Investigating
We are seeing issues with the Twingate service and are investigating.
Posted Aug 19, 2023 - 08:01 UTC
This incident affected: Control Plane (Authentication, Authorization, Connector Heartbeat) and Management (Public API, Admin Console).