Twingate Service Down

Incident Report for Twingate

Postmortem

Incident Duration

UTC Time: May 28, 2025, from 20:45 to 21:14 (29 minutes)

Components Impacted

Control Plane Services

  • Authentication (Enterprise & Social)
  • Multi-Factor Authentication (MFA)
  • Authorization
  • Connector Heartbeat

Management Plane Services

  • Identity Provider Sync
  • Public API
  • Admin Console
  • Real-Time Updates
  • Client Log Upload
  • 3rd Party Integrations
  • Network Dashboards
  • DNS Filtering Dashboards

Summary

On May 28, 2025, Twingate experienced a service disruption that impacted multiple core services for approximately 29 minutes. The issue stemmed from a misconfiguration introduced during a rollout that updated routing rules in our global load balancer.

A new routing rule was unintentionally prioritized, resulting in most API traffic being directed to a single backend. This overwhelmed the target service, leading to elevated error rates. Lower environments did not expose this issue due to their reduced traffic levels.

The incident was promptly detected and the rollback was initiated quickly. However, recovery was delayed due to infrastructure dependencies that also relied on affected components. A manual intervention using local tooling ultimately restored service.

Root Cause

A routing misconfiguration caused a traffic imbalance that overwhelmed a backend service, leading to widespread API failures.

Resolution Timeline

  • 20:45 UTC – Incident begins. API traffic misrouted.
  • 20:48 UTC – Incident detected and triage begins.
  • 20:52 UTC – Rollback initiated.
  • 21:05 UTC – Automated rollback process encounters delays.
  • 21:09 UTC – Engineers execute manual fix.
  • 21:14 UTC – Services fully recovered.

Corrective and Preventative Actions

Short-Term (Completed)

  • ✅ Routing rule corrected.
  • ✅ Alerts added to detect backend traffic anomalies in lower environments.
  • ✅ Improved resource scaling thresholds.

Mid-Term (In Progress)

  • ⚙️ Decouple infrastructure tooling from service dependencies to ensure faster recovery options.

Long-Term (Planned)

  • 🔄 Introduce backend sharding to isolate customer workloads and reduce incident impact.

We sincerely apologize for the disruption and are taking actions to improve our systems' resilience. We appreciate your continued support and trust.

Posted Jun 16, 2025 - 18:27 UTC

Resolved

This incident is resolved now. We'll provide a post-mortem as soon as we have details.
Posted May 28, 2025 - 21:32 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 28, 2025 - 21:16 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted May 28, 2025 - 20:53 UTC
This incident affected: Control Plane (Authentication - Enterprise, Authentication - Social, MFA, Authorization, Connector Heartbeat) and Management (Identity Providers Sync, Public API, Admin Console, Realtime Updates, Client Log Upload, 3rd Party Integrations, Network Dashboards, DNS Filtering Dashboards).