Summary
On September 21 at 4:14am UTC, Twingate released a new controller version as part of improvements to the authorization engine. This release included both code changes and a data migration. The change caused an unexpectedly large increase in load on the system, which only fully manifested several hours later, once cached data began expiring at approximately 7:54am UTC. At that point, Twingate customers started to see failures when initiating access to resources and when logging in.
Based on early evidence of intermittent responses and increased network latency, our initial suspicion was that the failures were related to infrastructure problems. Increasing backend application capacity and other mitigation efforts were unsuccessful. These results, combined with information from our cloud vendor's support team, led us to shift our focus away from infrastructure and toward the application layer.
Our next step, at approximately 10:00am UTC, was to roll back the recent software changes to the controller, including the data migration performed as part of the update. This rollback, which included a supervised reversal of the data migration, was completed at approximately 10:30am UTC.
After rolling back the software changes and data migration, we observed improvements in the behavior of the system: both network latency and cache hit ratios improved significantly, but did not return to normal operational levels. With the software and data rolled back to a known state, we continued investigating and initiated a failover to our standby cluster to fully rule out any infrastructure issues.
At 11:24am UTC we initiated the failover process to our standby cluster, which completed at 11:27am UTC. At this point the system fully recovered with normal operational metrics.
Root cause
After detailed investigation, we identified two separate issues in the application layer that interacted with each other. First, a code bug caused the system to re-evaluate the permissions of all users at the same time, generating enough load to saturate the system. Second, the data migration process failed to replace existing cached values, which led to failed requests at the application layer. This second factor only became apparent as the existing cached data expired.
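To make the second issue concrete, the sketch below shows how a cache that still holds pre-migration entries can break a post-migration code path. This is a hypothetical illustration in Python; the schema fields, cache structure, and function names are assumptions for the example and are not taken from Twingate's controller code.

```python
import json
import time

# Hypothetical cache: key -> (expiry timestamp, JSON payload). The schema
# version field and permission format are illustrative assumptions.
CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600


def cache_permissions_v1(user_id: str, resources: list[str]) -> None:
    """Pre-migration code path: stores permissions under schema version 1."""
    payload = json.dumps({"schema": 1, "resources": resources})
    CACHE[user_id] = (time.time() + TTL_SECONDS, payload)


def read_permissions_v2(user_id: str) -> list[str]:
    """Post-migration code path: expects schema version 2.

    If the migration rewrites the backing store but does not replace (or
    invalidate) existing cache entries, every read of a stale entry fails
    here until that entry finally expires.
    """
    expiry, payload = CACHE[user_id]
    if time.time() > expiry:
        # Expired entry: fall through to an expensive full re-evaluation,
        # which is what drives the load spike once TTLs start lapsing.
        raise KeyError(user_id)
    entry = json.loads(payload)
    if entry.get("schema") != 2:
        # Stale, pre-migration entry: the request fails at the application layer.
        raise ValueError(f"unexpected cache schema for {user_id}")
    return entry["resources"]
```

A common way to avoid this class of failure, in general, is to replace or version cache entries as part of the migration itself (for example, writing new-schema entries under a new key prefix) so post-migration readers never see pre-migration payloads.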
Failing over to our standby cluster was only effective after the data migration and software changes had been reversed, because the cache was empty when the standby cluster was brought online. Due to the interaction between the software bug, the data migration, and caching, performing the cluster failover earlier in the incident would have replicated the same problem on our standby cluster.
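As a rough illustration of why the order of operations mattered, the back-of-envelope sketch below shows how a cold cache amplifies load when each miss triggers a broad permission re-evaluation. The request rate, user count, and hit ratios are invented for the example and do not reflect Twingate's actual traffic.

```python
def reevaluation_rate(requests_per_sec: float,
                      cache_hit_ratio: float,
                      evaluations_per_miss: int) -> float:
    """Permission evaluations per second triggered by cache misses."""
    return requests_per_sec * (1.0 - cache_hit_ratio) * evaluations_per_miss


# Warm cache with corrected code: misses are rare and each miss re-evaluates
# only the affected user.
print(reevaluation_rate(1_000, 0.99, 1))        # 10 evaluations/sec

# Cold cache with the buggy code, where a miss re-evaluates every user
# (10,000 here, purely illustrative): a standby cluster started in this
# state would saturate just like the primary did.
print(reevaluation_rate(1_000, 0.0, 10_000))    # 10,000,000 evaluations/sec
```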
Corrective actions
During the postmortem investigation, we also noticed that metrics were already available that could have allowed us to detect similar issues before they impacted the entire system. An early warning mechanism based on these metrics could have caught this issue sooner and prevented the faulty code change from reaching our production environment.
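As an example of what such an early warning might look like, the sketch below checks two signals that this incident suggests are useful: cache hit ratio and the rate of full permission re-evaluations. The metric names, thresholds, and function are assumptions for illustration and are not Twingate's actual monitoring configuration.

```python
def should_alert(cache_hit_ratio: float,
                 reevaluations_per_min: float,
                 baseline_reevaluations_per_min: float) -> bool:
    """Fire an early warning before load saturates the system.

    Thresholds are illustrative: a sustained drop in cache effectiveness or a
    large spike in permission re-evaluations relative to baseline could
    surface a problem like this one before it fully impacts customers.
    """
    if cache_hit_ratio < 0.80:
        return True
    if reevaluations_per_min > 10 * baseline_reevaluations_per_min:
        return True
    return False


# Healthy snapshot vs. the pattern seen during an incident like this one.
assert should_alert(0.95, 120, 100) is False
assert should_alert(0.40, 5_000, 100) is True
```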
We have initiated a number of improvements:
Short-term
Medium/Long-term