Components impacted
Relay
Connector
Summary
On this date we had an outage during routine maintenance of our relay infrastructure. The issue started at 4am UTC and was resolved within 2 hours, requiring some customers to restart their connectors in order to re-establish connectivity to our relay infrastructure.
Root cause
In our investigation we determined that the connector received a malformed response from the relay cluster during its maintenance cycle. The malformed response in question contains the address of a particular relay node to which the connector is instructed to connect. This malformed response resulted in the connector retrying access to a non-existing relay node without failing over to another relay cluster.
Corrective actions
After correcting the specific issue that caused the malformed response, we modified both the relay and connector logic so that failover now happens automatically any time that a malformed response is received. We also modified our maintenance procedures to add additional health checks to prevent malformed responses. Finally, we took the opportunity to enhance how the failover logic works to incorporate multiple levels of relay redundancy in the connector's initial configuration that it receives after authentication.