Timeouts in US-EAST-1 region for browser checks

We are experiencing timeouts in the us-east-1 region for browser checks. Extra capacity is being deployed and we will monitor the situation.

Update:

All systems are back to normal for region us-east-1.

Introducing global environment variables

New

You can now use global environment variables to store API tokens, passwords, usernames or any other piece of configuration data you want to share across multiple checks. This was a long-requested feature and it is live right now.

Showing actual values in API check runner

When creating & editing your API checks, we now show the actual value next to the target value of an assertion if we can determine it. This makes debugging a ton quicker. We also show this value on all normally scheduled check results.

Delayed or non-functioning retries

Fix

Over the weekend of 13-14 Oct 2018, retries of failed API checks with the "double check" option enabled were delayed or did not work at all in every region except eu-central-1.

The cause was a configuration change that made the routing of double checks fail or take much longer than necessary.

We did not notice this behavior directly, as the issue was region specific and only for failing checks (a small percentage of Checkly traffic).

To make sure this doesn't happen again, we have added monitoring for this specific case.

Title

Fix

Between 17:30 and 18:10 CET on 09 Oct 2018, the Checkly dashboard and API calls for triggers were slower than normal. One of our infrastructure suppliers, Heroku, reported routing latency in their EU zone.

This outage only affected usage of the Checkly web application, not the running of API or Browser checks as they run on separate cloud infrastructure.

Fix

Summary

Browser checks were less available for more than 24 hours due to a new release that was misconfigured. The issue wasn’t noticed due to oversights in our monitoring infrastructure. All systems are back online and we have added extra tests and monitoring to make sure this never happens again.

1. What happened?

Between Thursday 04 October 15:00 CET and Friday 05 October 18:30 browser checks either did not run or did not report any results.
No false alerts were triggered. Also, the ad hoc browser check runs triggered in the edit and create screens did not work.

2. Why did it happen?

TLDR: We forgot to tweak a configuration parameter.

On 04 October we released a new version of our browser check feature. Part of that release was a new way of handling browser checks on our back end, including a change in how browser checks report their data to the main application.

Browser checks are run in isolated containers for security reasons. They are launched by a launcher container which deals with all the scheduling and communication.

All our tests passed and we ran a shadow deployment for one week. This testing period, however, did not reveal that multiple concurrent runs could trigger a port allocation issue. Each browser check run gets a dedicated control server listening on a port, so these ports need to be unique per box. If they are not, an “address already in use” error is thrown. We had wrongly configured up to five spawned runners to use the same port. How many runners are spawned depends mostly on how busy a region is.
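The port clash described above is easy to reproduce in isolation. The following is a minimal sketch in Python, not our actual runner code: the function names are made up for illustration, and letting the operating system pick a free port (by binding to port 0) is just one possible way to keep ports unique per box.

```python
import errno
import socket


def bind_control_server(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a listening socket. Port 0 asks the OS for any free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        sock.close()
        raise
    sock.listen()
    return sock


def demo_port_clash() -> tuple:
    """Reproduce the clash, then show the OS-assigned alternative."""
    first = bind_control_server()              # OS picks a free port
    taken_port = first.getsockname()[1]

    clashed = False
    try:
        # A second "runner" configured with the same fixed port.
        bind_control_server(port=taken_port)
    except OSError as exc:
        clashed = exc.errno == errno.EADDRINUSE

    second = bind_control_server()             # port 0 again: a different free port
    distinct_ports = len({taken_port, second.getsockname()[1]})

    first.close()
    second.close()
    return clashed, distinct_ports
```

Calling `demo_port_clash()` binds one socket, fails to bind a second one on the same port, and then binds a third on an OS-assigned port. A fixed, non-overlapping port range per runner would work just as well; the point is only that no two listeners on the same box share a port.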

This would not have been an issue had our “restart on death” policy worked correctly. But due to a completely unrelated code issue, crashing processes were not restarted by our nanny process.

This problem was noticed very late, only on 05 October. The reason was that our external monitoring (completely outside the Checkly infrastructure) was not set up to report on instances that had stopped reporting. A “dead” process would have triggered an alert, but a “hung” process did not.
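The difference between a “dead” and a “hung” process can be sketched as follows. This is an illustrative Python example under our own assumptions, not the monitoring we actually run: `RunnerStatus`, `classify_runner` and the five-minute silence threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunnerStatus:
    process_alive: bool    # does the OS still report the process as running?
    last_report_at: float  # timestamp (seconds) of the last result it reported


def classify_runner(status: RunnerStatus, now: float, max_silence: float = 300.0) -> str:
    """Classify a runner as 'dead', 'hung' or 'healthy'.

    A liveness-only check stops after the first test. The second test is
    what catches a process that is still running but no longer reporting.
    """
    if not status.process_alive:
        return "dead"
    if now - status.last_report_at > max_silence:
        return "hung"
    return "healthy"
```

A monitor that only looks at `process_alive` would treat a runner that last reported twenty minutes ago as fine. Checking the age of the last report as well turns that same runner into an alert.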

3. What are we doing about it?

Configuration changes have been made so that processes never compete for the same port again.

Monitoring has been updated to alert on non-responsive browser check runners.

Unit and end-to-end tests have been updated to simulate concurrency and trigger these situations.

Introducing Browser Checks V2

New

As of today, all of Checkly's browser checks run on the second iteration of our browser check site transaction monitoring system.
This upgrade brings the following benefits: