The following is the series of events that took place:7:05AM EST - A particular machine begins crashing by exhausting its connection pool to the database7:45AM EST - Our continuous registry push/pull monitor pages the on-call due to timeouts and investigation begins8:25AM EST - The machine exhausting its connection pool is manually terminated by the on-call SRE8:30AM EST - A cascading failure causes all app servers to go out of service causing a full service outage8:35AM EST - Few app servers manage to enter service returning the service to a partial outage9:02AM EST - Full service is restored as the desired number of healthy app servers are now in service

We are continuing our investigation internally to determine the root cause of the single machine that caused a cascading failure.

Posted 18 days ago. Aug 28, 2019 - 11:25 EDT

Monitoring

We are monitoring the service after a recovering from cascading failure.The registry and API should be operating as normal.

Posted 18 days ago. Aug 28, 2019 - 09:07 EDT

Investigating

We are currently investigating a full outage for the registry and API service.