Twingate Public RCA: March 21, 2025 Authentication/Authorization Incident
Summary
On March 21, 2025, between 21:10 and 21:41 UTC, a subset of authentication and authorization requests to Twingate services experienced elevated error rates. The issue was mitigated by 21:41 UTC and services returned to normal operation.
What Happened
Twingate runs an active-active architecture across multiple Google Kubernetes Engine (GKE) clusters, balanced by a Global Load Balancer (GLB). At the onset of the incident, we observed anomalies in one of our clusters, and to protect overall system health, we proactively scaled down deployments in that cluster. This action shifted traffic to the other clusters in the topology, primarily to one that typically handles a lighter load and had been sized accordingly.
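To illustrate why the lighter cluster absorbed a disproportionate share of traffic, here is a minimal sketch (the cluster names and capacity weights are hypothetical, not Twingate's actual topology) of how load redistributes when one member of an active-active pool is removed, assuming the load balancer splits traffic in proportion to cluster capacity:

```python
# Hypothetical sketch, not Twingate's actual configuration: how removing one
# cluster from an active-active pool shifts load onto the remaining clusters.
def load_shares(capacities: dict[str, float]) -> dict[str, float]:
    """Return each cluster's fraction of total traffic, assuming traffic
    is distributed in proportion to cluster capacity."""
    total = sum(capacities.values())
    return {name: cap / total for name, cap in capacities.items()}

# Three clusters; cluster-c is sized for a lighter share of traffic.
before = load_shares({"cluster-a": 40, "cluster-b": 40, "cluster-c": 20})

# Scaling cluster-a down shifts its traffic to the others; cluster-c's
# share jumps from 20% to ~33% until autoscaling adds capacity.
after = load_shares({"cluster-b": 40, "cluster-c": 20})
```

Under these assumed weights, the smaller cluster's traffic share grows by roughly two thirds the moment its peer is drained, which is why autoscaling lag matters in the sequence described next.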
The target cluster began autoscaling as expected, but the increased traffic caused elevated error rates, which triggered our retry mechanisms across services. While these retries helped many requests succeed, they also increased the overall system load. Simultaneously, the cluster underwent a cloud provider-initiated update operation that caused pod restarts and reduced capacity.
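The retry amplification described above can be sketched as follows. This is an illustrative example, not Twingate's implementation: a capped retry loop with exponential backoff and full jitter, the standard way to keep retries from multiplying load during an incident (all function names and parameters here are hypothetical):

```python
import random
import time

def call_with_retries(request, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call `request`, retrying on failure with capped exponential backoff.

    Illustrative sketch, not Twingate's implementation. Bounding attempts
    and jittering the delay limits load amplification: in the worst case,
    each client request turns into at most `max_attempts` backend requests.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: retrying further only adds load
            # Full jitter: sleep a random amount up to the exponential cap,
            # so many clients failing at once don't retry in lockstep.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The key property is the bound: with `max_attempts=3`, a burst of failures can at most triple the offered load, rather than growing without limit as errors and retries feed each other.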
To stabilize the system, we reintroduced capacity in the previously affected cluster, rebalancing the traffic across regions. Once this occurred, error rates subsided and retries diminished.
Root Cause
Anomalous behavior from our cloud provider, including unexpected request timeouts at the load balancer level and instability during a cluster update, led to a cascade of retry traffic that temporarily overwhelmed parts of the system. We are actively investigating both the unexpected timeout configuration and the behavior of the cluster during the update with our cloud provider.
Corrective Actions
Short-Term: