Twingate Public RCA: March 21, 2025 Authentication/Authorization Incident
Summary
On March 21, 2025, between 21:10 and 21:41 UTC, a subset of authentication and authorization requests to Twingate services experienced elevated error rates. The issue was mitigated by 21:41 UTC and services returned to normal operation.
What Happened
Twingate runs an active-active architecture across multiple Google Kubernetes Engine (GKE) clusters, balanced by a Global Load Balancer (GLB). At the onset of the incident, we observed anomalies in one of our clusters, and to protect overall system health, we proactively scaled down deployments in that cluster. This action shifted traffic to the other clusters in the topology, primarily to one that typically handles a lighter load and had been sized accordingly.
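To illustrate why the lighter cluster absorbed a disproportionate share of traffic, here is a minimal sketch (the cluster names and capacity weights are hypothetical, not Twingate's actual topology) of how load redistributes when one member of an active-active pool is removed, assuming the load balancer splits traffic in proportion to cluster capacity:

```python
# Hypothetical sketch, not Twingate's actual configuration: how removing one
# cluster from an active-active pool shifts load onto the remaining clusters.
def load_shares(capacities: dict[str, float]) -> dict[str, float]:
    """Return each cluster's fraction of total traffic, assuming traffic
    is distributed in proportion to cluster capacity."""
    total = sum(capacities.values())
    return {name: cap / total for name, cap in capacities.items()}

# Three clusters; cluster-c is sized for a lighter share of traffic.
before = load_shares({"cluster-a": 40, "cluster-b": 40, "cluster-c": 20})

# Scaling cluster-a down shifts its traffic to the others; cluster-c's
# share jumps from 20% to ~33% until autoscaling adds capacity.
after = load_shares({"cluster-b": 40, "cluster-c": 20})
```

Under these assumed weights, the smaller cluster's traffic share grows by roughly two thirds the moment its peer is drained, which is why autoscaling lag matters in the sequence described next.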
The target cluster began autoscaling as expected, but the increased traffic caused elevated error rates, which triggered our retry mechanisms across services. While these retries helped many requests succeed, they also increased the overall system load. Simultaneously, the cluster underwent a cloud provider-initiated update operation that caused pod restarts and reduced capacity.
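The retry amplification described above can be sketched as follows. This is an illustrative example, not Twingate's implementation: a capped retry loop with exponential backoff and full jitter, the standard way to keep retries from multiplying load during an incident (all function names and parameters here are hypothetical):

```python
import random
import time

def call_with_retries(request, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call `request`, retrying on failure with capped exponential backoff.

    Illustrative sketch, not Twingate's implementation. Bounding attempts
    and jittering the delay limits load amplification: in the worst case,
    each client request turns into at most `max_attempts` backend requests.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: retrying further only adds load
            # Full jitter: sleep a random amount up to the exponential cap,
            # so many clients failing at once don't retry in lockstep.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The key property is the bound: with `max_attempts=3`, a burst of failures can at most triple the offered load, rather than growing without limit as errors and retries feed each other.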
To stabilize the system, we reintroduced capacity in the previously affected cluster, rebalancing the traffic across regions. Once this occurred, error rates subsided and retries diminished.
Root Cause
Anomalous behavior from our cloud provider, including unexpected request timeouts at the load balancer level and instability during a cluster update, led to a cascade of retry traffic that temporarily overwhelmed parts of the system. We are actively investigating both the unexpected timeout configuration and the behavior of the cluster during the update with our cloud provider.
Corrective Actions
Short-Term: