Summary
On September 21 at 4:14am UTC, Twingate released a new controller version as part of improvements to the authorization engine. This release included both code changes and a data migration. The change caused an unexpectedly large increase in load on the system, which only fully manifested several hours later, once cached data began expiring at approximately 7:54am UTC. At that point, Twingate customers started to see failures when initiating access to resources and when logging in.
Based on early evidence of intermittent responses and increased network latency, our initial suspicion was that the failures were related to infrastructure problems. Increasing backend application capacity and other mitigation efforts were unsuccessful. These results, combined with information from our cloud vendor's support team, led us to shift our focus away from infrastructure and toward the application layer.
Our next step, at approximately 10:00am UTC, was to roll back the recent software changes to the controller, including the data migration performed as part of the update. This rollback, which included a supervised reversal of the data migration, was completed at approximately 10:30am UTC.
After rolling back the software changes and data migration, we observed improvements in the behavior of the system: both network latency and cache hit ratios improved significantly, but did not return to normal operational levels. With the software and data rolled back to a known state, we continued investigating and initiated a failover to our standby cluster to fully rule out any infrastructure issues.
At 11:24am UTC we initiated the failover process to our standby cluster, which completed at 11:27am UTC. At this point the system fully recovered with normal operational metrics.
Root cause
After detailed investigation, we identified two separate issues in the application layer that interacted with each other. First, a code bug caused the system to re-evaluate the permissions of all users at the same time, generating enough load to saturate the system. Second, the data migration process failed to replace existing cached values, which led to failed requests at the application layer. This second factor only became apparent as the existing cached data expired.
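To make the second issue concrete, the sketch below shows how a cache that still holds pre-migration entries can break a post-migration code path. This is a hypothetical illustration in Python; the schema fields, cache structure, and function names are assumptions for the example and are not taken from Twingate's controller code.

```python
import json
import time

# Hypothetical cache: key -> (expiry timestamp, JSON payload). The schema
# version field and permission format are illustrative assumptions.
CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600


def cache_permissions_v1(user_id: str, resources: list[str]) -> None:
    """Pre-migration code path: stores permissions under schema version 1."""
    payload = json.dumps({"schema": 1, "resources": resources})
    CACHE[user_id] = (time.time() + TTL_SECONDS, payload)


def read_permissions_v2(user_id: str) -> list[str]:
    """Post-migration code path: expects schema version 2.

    If the migration rewrites the backing store but does not replace (or
    invalidate) existing cache entries, every read of a stale entry fails
    here until that entry finally expires.
    """
    expiry, payload = CACHE[user_id]
    if time.time() > expiry:
        # Expired entry: fall through to an expensive full re-evaluation,
        # which is what drives the load spike once TTLs start lapsing.
        raise KeyError(user_id)
    entry = json.loads(payload)
    if entry.get("schema") != 2:
        # Stale, pre-migration entry: the request fails at the application layer.
        raise ValueError(f"unexpected cache schema for {user_id}")
    return entry["resources"]
```

A common way to avoid this class of failure, in general, is to replace or version cache entries as part of the migration itself (for example, writing new-schema entries under a new key prefix) so post-migration readers never see pre-migration payloads.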
Failing over to our standby cluster was only effective after the data migration and software changes had been reversed, because the cache was empty when the standby cluster was brought online. Due to the interaction between the software bug, the data migration, and caching, performing the cluster failover earlier in the incident would have replicated the same problem on our standby cluster.
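As a rough illustration of why the order of operations mattered, the back-of-envelope sketch below shows how a cold cache amplifies load when each miss triggers a broad permission re-evaluation. The request rate, user count, and hit ratios are invented for the example and do not reflect Twingate's actual traffic.

```python
def reevaluation_rate(requests_per_sec: float,
                      cache_hit_ratio: float,
                      evaluations_per_miss: int) -> float:
    """Permission evaluations per second triggered by cache misses."""
    return requests_per_sec * (1.0 - cache_hit_ratio) * evaluations_per_miss


# Warm cache with corrected code: misses are rare and each miss re-evaluates
# only the affected user.
print(reevaluation_rate(1_000, 0.99, 1))        # 10 evaluations/sec

# Cold cache with the buggy code, where a miss re-evaluates every user
# (10,000 here, purely illustrative): a standby cluster started in this
# state would saturate just like the primary did.
print(reevaluation_rate(1_000, 0.0, 10_000))    # 10,000,000 evaluations/sec
```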
Corrective actions
During the postmortem investigation, we also noticed that metrics were already available that could have allowed us to detect similar issues before they impacted the entire system. An early warning mechanism based on these metrics could have caught this issue sooner and prevented the faulty code change from reaching our production environment.
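As an example of what such an early warning might look like, the sketch below checks two signals that this incident suggests are useful: cache hit ratio and the rate of full permission re-evaluations. The metric names, thresholds, and function are assumptions for illustration and are not Twingate's actual monitoring configuration.

```python
def should_alert(cache_hit_ratio: float,
                 reevaluations_per_min: float,
                 baseline_reevaluations_per_min: float) -> bool:
    """Fire an early warning before load saturates the system.

    Thresholds are illustrative: a sustained drop in cache effectiveness or a
    large spike in permission re-evaluations relative to baseline could
    surface a problem like this one before it fully impacts customers.
    """
    if cache_hit_ratio < 0.80:
        return True
    if reevaluations_per_min > 10 * baseline_reevaluations_per_min:
        return True
    return False


# Healthy snapshot vs. the pattern seen during an incident like this one.
assert should_alert(0.95, 120, 100) is False
assert should_alert(0.40, 5_000, 100) is True
```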
We have initiated a number of improvements:
Short-term
Medium/Long-term