Web app and test running service unavailable
Incident Report for loader.io
On Jan 2, 2023, the loader.io website, API, and test running services became unavailable due to an expired TLS certificate in a backend service that manages service credentials. The certificate had been renewed, but the renewed certificate was not distributed to all of the servers that needed it.
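
As an illustration of how this kind of gap can be caught ahead of time, the sketch below flags servers whose installed certificate is close to expiry, even when a renewed certificate already exists elsewhere. It is a minimal example and not a description of loader.io's tooling: the host names, the 14-day threshold, and the function names are all assumptions made for the example. The `notAfter` strings use the format returned by Python's `ssl.SSLSocket.getpeercert()`.

```python
import socket
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter time."""
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def hosts_expiring_soon(expiries: dict, now: float, threshold_days: float = 14) -> list:
    """Return the hosts whose certificate expires within `threshold_days`.

    `expiries` maps a host name to the notAfter string of the certificate
    that host is actually serving (not the one that was most recently issued).
    """
    return sorted(
        host
        for host, not_after in expiries.items()
        if days_until_expiry(not_after, now) < threshold_days
    )


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate a live server presents.

    With the default context, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is itself a signal
    worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Running a check like this against every server that holds a copy of the certificate, rather than only against the place where it was renewed, is what would surface a server that was missed during distribution.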

When the certificate expired, Loader's orchestration system was unable to read credentials for internal connections to databases and other services. Several services failed as a result, including the web interface and the test running and scheduling jobs. Our team was not notified because of a separate failure of our alerting systems, and the team was out of the office for the New Year's Day holiday, which was observed on the Monday after New Year's Day.

As soon as a team member noticed the outage, we restored service by distributing the renewed certificate to the credential service and restarting the other failed services.

- Test settings, results, and other account information in the loader.io web interface were inaccessible
- Tests that had been scheduled for Jan 2, 2023 did not run at their scheduled time; they would instead have run after service was restored, early on Jan 3, 2023
- Some scheduled tests may have been queued twice in error when service was restored, due to retries and delays in processing the backlog of tests
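
The double scheduling described in the last item is a common side effect of retrying a backlog. One general way to guard against it, shown here only as a sketch and not as loader.io's actual scheduler, is to give each scheduled run an idempotency key and refuse to enqueue the same key twice. The `RunQueue` class, its method names, and the key shape (test ID plus scheduled time) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunQueue:
    """In-memory queue that accepts each (test_id, scheduled_for) pair once."""

    pending: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def enqueue(self, test_id: str, scheduled_for: str) -> bool:
        """Queue a run; return False if this exact run was already queued."""
        key = (test_id, scheduled_for)
        if key in self._seen:
            # A retry or a replayed backlog entry: ignore it.
            return False
        self._seen.add(key)
        self.pending.append(key)
        return True


queue = RunQueue()
queue.enqueue("test-123", "2023-01-02T09:00:00Z")  # first attempt is queued
queue.enqueue("test-123", "2023-01-02T09:00:00Z")  # retry is dropped
```

In a real system the set of seen keys would need to live in durable shared storage, for example behind a unique constraint in a database, so that it survives restarts like the ones performed during this recovery.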

We are reviewing our automation and monitoring systems to ensure that critical systems are better automated and that our team receives alerts promptly.
Posted Jan 02, 2023 - 02:30 EST