Debugging the issues which caused your downtime

Our support team is sometimes asked general questions such as “Why was my website down?”, “Other uptime monitors that I use didn’t detect my website to be down. Why did HetrixTools alert me?”, and “I just checked my site and it loaded up. Why did I get notified that it is down?”.

These are general questions that people ask when doubting the results of our platform, so we’ll try to go through these questions one by one and provide you with the answers that you need.

So let’s begin:

[ Scenario A ] – “my website uptime monitor is shown as offline, but my server is online and pinging”
– Considering you are monitoring a website on your server, it is very likely that your server is online and working just fine, but your website is NOT loading up. Remember, your website uptime monitor is checking only if the website is loading up, not the server’s health.
– The website could be down or slow for a number of reasons, including but not limited to: high CPU usage (or generally low resource availability), slow scripts, slow database queries, or a crashed or overwhelmed web server.
– In scenarios like the ones described above, you will of course see that your server is pinging and it is online, but that doesn’t change the fact that your website is having issues loading up.
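
The difference between "the server is reachable" and "the website loads" can be sketched in a few lines of Python. This is only an illustration, not how our monitoring nodes are implemented: the function names `server_reachable` and `website_loads` are hypothetical, and a plain TCP connection stands in for ping here, since ICMP ping usually requires elevated privileges.

```python
import http.client
import socket
from urllib.parse import urlsplit


def server_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be opened (server is up)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def website_loads(url: str, timeout: float = 10.0) -> bool:
    """True only if the URL answers with a 2xx/3xx status.

    Note: `timeout` applies to each socket operation, so this is an
    approximation of a total load-time limit, not an exact one.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("GET", target)
        return 200 <= conn.getresponse().status < 400
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections and malformed replies all count as "not loading".
        return False
    finally:
        conn.close()
```

A server that accepts connections but never answers the HTTP request would give `server_reachable(...) == True` and `website_loads(...) == False`, which is exactly the situation described in this scenario.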

[ Scenario B ] – “my website is loading up, but HetrixTools shows my website uptime monitor as being offline”
– In some cases, your website is indeed loading up, but it takes more time to load up than your monitor’s “Timeout” setting.
– For instance, the default “Timeout” setting for all our uptime monitors is 10 seconds (you can edit this in your Uptime Monitor’s settings). This means that if your website loads up in (let’s say) 12 seconds, it will still be marked as offline, because our monitoring nodes only wait for your website to load up for 10 seconds, and then give up, reporting it as being down.
– In such cases, you could:
+ (a): increase the uptime monitor’s “Timeout”, up to the maximum of 15 seconds that our platform supports.
+ (b): investigate why your website takes so long to load up. This could be due to (but is not limited to) a lack of resources on your server, a slow network, packet loss, slow scripts, slow database queries, etc. A proper website shouldn’t even come close to needing 10 seconds to load up, let alone 15 seconds. If your website takes that much time to load up, you or your system administrator should really look into what’s causing the slow load time.
– Another reason for such a scenario might be the fact that your server is having regional network issues. This means that the website may be loading just fine for you, but it appears down for most of the world.
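
The timeout arithmetic from this scenario can be written out as a small Python sketch. The 10-second default and 15-second maximum come from the text above; the function name `monitor_verdict` is hypothetical, and treating a load time exactly equal to the timeout as "offline" is an assumption made for the example.

```python
DEFAULT_TIMEOUT = 10.0  # seconds, default "Timeout" setting
MAX_TIMEOUT = 15.0      # seconds, maximum "Timeout" the platform supports


def monitor_verdict(load_seconds: float, timeout: float = DEFAULT_TIMEOUT) -> str:
    """Return the status a monitor would report for a given load time."""
    if not 0 < timeout <= MAX_TIMEOUT:
        raise ValueError(f"timeout must be in (0, {MAX_TIMEOUT}] seconds")
    # Assumption: the check fails once the timeout is reached.
    return "online" if load_seconds < timeout else "offline (timeout)"


print(monitor_verdict(12))              # offline (timeout)
print(monitor_verdict(12, timeout=15))  # online
print(monitor_verdict(16, timeout=15))  # offline (timeout)
```

Raising the timeout only hides the symptom for a website that loads in 12 seconds; a website that needs 16 seconds would still be reported as offline even at the maximum setting.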

[ Scenario C ] – “why was my website or server down?”
– We provide two very important tools that you can use to debug what was wrong with your website or server, and to see what errors we encountered during the downtime:
+ (a): Network Diagnostics – provides Ping and MTR samples, taken right after our platform finds your Uptime Monitor to be down. These samples are collected from all of the locations that your Uptime Monitor is checked from. You can use this data to debug network issues only. It will not help with non-network-related downtimes (e.g., the network could be just fine, but your website could still be down, as explained in “Scenario A”).
+ (b): Location Fail Log – provides a log of all the errors encountered, from each of the monitoring locations. These errors are collected at the exact time that each monitoring location checks your website or server, so you can see the exact error that our monitoring nodes encountered. Some examples of such errors are:
* [1] if you have a Ping Uptime Monitor, the Location Fail Log will contain failed ping samples (e.g., ping timeout);
* [2] if you have a Website Uptime Monitor, the Location Fail Log may contain timeout errors (if your website didn’t load up in time), keyword missing errors (if you are also monitoring for a specific keyword on the website, and this keyword is not found), wrong HTTP codes (e.g., 503, 404, 403), etc.;
* [3] if you have an SMTP Uptime Monitor, the Location Fail Log may contain the SMTP errors that prevented our monitoring locations from connecting to your SMTP server.
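
To make the Website Uptime Monitor error types above more concrete, here is a minimal Python sketch that sorts a single check result into "timeout", "wrong HTTP code", or "keyword missing". It is an illustration only: the function name, its parameters, and the returned strings are hypothetical and do not reflect the actual format of the Location Fail Log. Accepting any 2xx/3xx status is also an assumption.

```python
from typing import Optional


def classify_website_check(status: Optional[int], body: str, elapsed: float,
                           timeout: float = 10.0,
                           keyword: Optional[str] = None) -> str:
    """Classify one website check; `status` is None when no response arrived."""
    if status is None or elapsed >= timeout:
        return "timeout"
    if not 200 <= status < 400:  # assumption: 2xx/3xx counts as success
        return f"wrong HTTP code: {status}"
    if keyword is not None and keyword not in body:
        return "keyword missing"
    return "ok"


print(classify_website_check(None, "", elapsed=10.0))                        # timeout
print(classify_website_check(503, "", elapsed=0.4))                          # wrong HTTP code: 503
print(classify_website_check(200, "Welcome", 0.4, keyword="Checkout"))       # keyword missing
```

Note that the last example is a page that loads quickly with a 200 status and would look fine in a browser, yet still fails the check because the monitored keyword is absent.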

[ Scenario D ] – “my network diagnostics show no network issues during the downtime, so why was my Uptime Monitor down?”
– Not all downtimes are network related, so the network could have been just fine while the Website, SMTP server, or Service that you’re monitoring still had issues. You should look at the errors collected in your Uptime Monitor’s Location Fail Log to see exactly what went wrong.
– Another possible answer to this question, although a rare one, is that the outage was too short. As we’ve mentioned earlier, the Network Diagnostics are dispatched only after the Uptime Monitor is marked as being down. From the moment our monitoring nodes detect your Uptime Monitor as being down, through it being marked as down in the central database and the Network Diagnostics being dispatched, to the moment the Network Diagnostics finish collecting, there is a gap of around 20-30 seconds. So, if the network outage is very brief, there is a chance that our Network Diagnostics will not catch it in the Ping and MTR samples. In this case, our suggestion is, again, to look at your Location Fail Log, since it contains samples/errors right from the moment the monitoring nodes detect the issues with your Uptime Monitor.
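
The timing gap described above comes down to a simple comparison, sketched here in Python. The function name is hypothetical, and the model is deliberately simplified: it assumes the Ping and MTR samples are taken roughly `gap_s` seconds after detection, using the 20-30 second range from the text.

```python
def diagnostics_likely_capture(remaining_outage_s: float, gap_s: float = 30.0) -> bool:
    """True if the outage is still ongoing when the diagnostics samples are taken.

    remaining_outage_s: how long the outage lasts after it is first detected.
    gap_s: delay between detection and the diagnostics samples (about 20-30 s).
    """
    return remaining_outage_s > gap_s


print(diagnostics_likely_capture(10))            # False: outage ended before sampling
print(diagnostics_likely_capture(120))           # True: outage still in progress
print(diagnostics_likely_capture(25, gap_s=20))  # True only if the gap was at the short end
```

Outages that fall inside the 20-30 second window may or may not show up in the Network Diagnostics, which is why the Location Fail Log is the more reliable source for very short outages.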

[ Scenario E ] – “other uptime monitors that I use didn’t detect my website to be down. Why did HetrixTools alert me?”
– This happens because, unlike most other Uptime Monitoring Services out there, we check from all the monitoring locations simultaneously. This means that we provide more precise results in a shorter time than other services that check from just one location at a time.
– By checking from all of the monitoring locations at the same time, our platform will instantly have all the data needed to conclude if your Uptime Monitor is having issues or not, unlike services that check from just one location at a time. Those services need to then ask further locations to look into your website or server, and double check if the initial location was correct or not. This process takes time and is not always accurate.
– You can read more about comparing these two methods here: Simultaneous Uptime Monitoring Checks.
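
The idea of checking from all locations at once can be sketched in Python with a thread pool. This is a simulation, not our platform's implementation: the location names, the `check_from` stand-in (which only sleeps and returns a preset result), and the majority rule in `is_down` are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_from(location, delay, up):
    """Stand-in for a real check: waits `delay` seconds, then reports `up`."""
    time.sleep(delay)
    return location, up


def check_all_simultaneously(probes):
    """Run every probe at the same time and collect {location: up}."""
    with ThreadPoolExecutor(max_workers=max(1, len(probes))) as pool:
        futures = [pool.submit(check_from, *probe) for probe in probes]
        return dict(future.result() for future in futures)


def is_down(results, threshold=0.5):
    """Assumed rule: down when more than `threshold` of the locations failed."""
    if not results:
        return False
    failed = sum(1 for up in results.values() if not up)
    return failed / len(results) > threshold


# Hypothetical locations: two see the site as down, one sees it as up.
probes = [("new-york", 0.05, False), ("london", 0.05, False), ("singapore", 0.05, True)]
results = check_all_simultaneously(probes)
print(results, is_down(results))
```

Because all probes run concurrently, the total wait is close to the slowest single check, whereas checking one location at a time would take roughly the sum of all the individual checks before a verdict could be reached.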

In almost all of the cases described above, the first thing you should do when trying to debug the issues that caused your downtime is check the Network Diagnostics and the Location Fail Log. These two will contain enough data for you to get to the root of the problem.