Benchmarking Basecamp's uptime against five other web apps

As we announced at the beginning of the month, we’re on a mission to improve our uptime. Inaccessible apps cause a lot of frustration, and users don’t care whether an outage is scheduled or not.

While publishing our own uptimes has been a great step towards getting everyone in the company focused on improving, we also wanted to compare ourselves to others in the industry. So since December 16, we’ve been tracking five other applications through Pingdom to compare and contrast.

The goal is to have the least amount of downtime. Here are the results for the period from December 16 to January 31:

DHH

on 31 Jan 12

GB, we’ve published our entire uptime history for the last twelve months for Basecamp on http://basecamphq.com/uptime. As mentioned in the post linked in the very first sentence, we were down about 6 hours in 2011. We certainly aim to decrease that big time in 2012.

GB

on 31 Jan 12

@DHH

All I’m saying is this blog post is deceiving, given as you said – you already posted your uptimes (downtimes) for all of last year.

When people read a headline that says “Benchmarking Basecamp’s uptime” ... then only see 6 minutes, my immediate reaction was WHAT, NOT TRUE?

Then I had to notice you were only reporting for a 6-week period. Seems a bit like you guys just cherry-picked a period of time when you had below-average downtime (extrapolated 2 hours of downtime vs actual 6 HOURS of yearly downtime).

DHH

on 31 Jan 12

GB, the reason we’re only reporting 6 weeks is because that’s all the data we have for the benchmarked services. If you read the article, you’ll see that we set up the benchmarks on December 16. So while it would be great to compare a whole year’s worth of data, it just wasn’t possible. When it is, we will.

I didn’t think it was a very long article? Your first comment missed the first paragraph and your second comment missed the second paragraph. There’s only two more paragraphs to go, so please take a swing at them :)

Adam

on 31 Jan 12

@GB I think it’s a bit silly to just read a number and make an assumption of the time period it is for.

They made no claims that this was an average month or representative of past or future performance. You made a HUGE jump in extrapolating yearly performance based on a small sample size.

@DHH It’s great when any company embraces more transparency and attempts to improve their service.

FWIW, over at ArgyleSocial.com we managed 8 minutes of downtime (according to pingdom) over that period. Looks like we’re holding up pretty well, though perhaps a bit less load than you guys ;)

DHH

on 31 Jan 12

Adam, that’s awesome! 8 minutes of downtime over 45 days is 99.99% uptime. You have reason to be proud of that.
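That percentage falls out of a one-line calculation; here is a minimal sketch, using the 45-day window and 8-minute figure from the comments above:

```python
def uptime_percent(downtime_minutes, period_days):
    """Uptime as a percentage of the whole monitoring period."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# 8 minutes down over a 45-day window works out to roughly 99.99%.
print(round(uptime_percent(8, 45), 2))
```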

Just signed up for Pingdom and they sent me my password via email in the clear. Great!

Ryan

on 31 Jan 12

As much as I love GitHub, the web frontend going down is usually a minor annoyance, while problems on the git backend can be a major disruption to workflow. I’m guessing these numbers only include the web frontend?

NL

on 31 Jan 12

@Ryan – correct, for Github we’re only monitoring the web site. In each case, our check does something that should exercise the web part of the app in some way that’s similar to how a user would use the app. Our benchmark checks get the same attention towards finding a good check as our own apps do.
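An app-level check of that sort can be sketched like this — a hypothetical illustration, not Pingdom’s or 37signals’ actual configuration; the URL and marker string are made up:

```python
from urllib.request import urlopen
from urllib.error import URLError

def app_check(url, must_contain, timeout=10, fetch=urlopen):
    """Pass only if the page returns HTTP 200 AND contains a string
    rendered by the application itself, so a static error page or a
    bare load-balancer response still counts as 'down'."""
    try:
        resp = fetch(url, timeout=timeout)
        if resp.status != 200:
            return False
        return must_contain in resp.read().decode("utf-8", "replace")
    except (URLError, OSError):
        return False

# Hypothetical usage:
# app_check("https://example.com/projects", "Your projects")
```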

Michael

on 31 Jan 12

This is an awesome way to use Pingdom. I never thought of doing this until now!

Will Jessop

Pingdom’s monitors only check every 60 seconds. If you want real downtime stats, sign up for the Verelo beta at www.verelo.com and get monitoring that checks as often as every 5 seconds.

You can’t pretend your site was only down for 6 minutes if you only check it once every 60 seconds.

Verelo is going to monitor the sites mentioned above for the next 30 days and provide comparative results the next time this blog is updated.
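The granularity point being argued here can be illustrated with a toy simulation — a sketch, not how Pingdom or Verelo actually attribute downtime; it simply counts each failed check as one full interval of downtime:

```python
def detected_downtime(outages, check_interval):
    """Downtime (seconds) a poller would report, counting each failed
    check as one full interval of downtime.
    outages: (start, end) pairs in seconds;
    checks run at t = 0, check_interval, 2*check_interval, ..."""
    horizon = max(end for _, end in outages)
    failed = 0
    t = 0
    while t <= horizon:
        if any(start <= t < end for start, end in outages):
            failed += 1
        t += check_interval
    return failed * check_interval

# A 40-second blip from t=70 to t=110 falls entirely between a
# 60-second poller's checks (t=60 and t=120), but a 5-second
# poller catches it.
blip = [(70, 110)]
print(detected_downtime(blip, 60))  # → 0
print(detected_downtime(blip, 5))   # → 40
```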

Michael

on 31 Jan 12

Andrew, perhaps you could tell us what difference you’re expecting.

David Andersen

on 31 Jan 12

@GB -

Let me help you.

“Then I had to notice you were only reporting for a 6 week period.”

should be

“Then I actually pulled my head out of my *ss, read the literal text, used my brain and realized there’s no problem here.”

David Andersen

on 31 Jan 12

Andrew, how often do you think a given site goes down and then back up within a 60 second window? For it to happen a statistically significant # of times would be odd.

L Roa

on 31 Jan 12

This is great (and hilarious). One of the things I found out when moving to Silicon Valley (after working as an Architect at Bell Labs) was the complete lack of understanding of what it takes to have 5 or 6 nines of uptime by the majority of the designers in the Valley. Of course Marketing would claim a “very high availability and fault tolerance” but when actually measured the results were far, far from the benchmark.

Experience can’t be improvised.

It’ll take some time to get there, but I finally see a new web 2.0 outfit make a serious attempt at it.

It’ll take some time and gaining some experience, but you are on the right path (real world measurements). Good luck and best wishes!
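For reference, “N nines” of availability translates into a yearly downtime budget that a few lines can compute:

```python
def yearly_downtime_minutes(nines):
    """Downtime budget per year for an availability of N nines
    (e.g. 5 nines = 99.999% uptime)."""
    unavailability = 10 ** -nines
    return unavailability * 365 * 24 * 60

for n in range(2, 7):
    print(f"{n} nines: {yearly_downtime_minutes(n):.2f} minutes/year")
# 5 nines allows about 5.26 minutes a year; 6 nines about half a minute.
```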

PingOfDeath

on 31 Jan 12

@Andrew: How can you claim Verelo’s measurement will be accurate if it only samples every 5 seconds?

I wrote my own tool which continuously hammers a server in order to accurately measure uptime. I found out that most servers out there crash all the time… Strangely enough, usually soon after my tool starts sampling, though.

Based on what we’ve seen, it’s actually very common for a site to go up and down inside the 1-minute range. Monitoring every 60 seconds is pretty decent, but if you’re very concerned, or seeing some very unusual downtime events that only happen on and off (as in, not consistently enough within a 1-minute period to trigger an alert), you’re likely to pick them up once you start monitoring below the 1-minute mark.

@PingOfDeath

Good question, and that’s exactly how we feel about it too :-) As far as we know we’re the only provider offering a very reasonably priced 5-second monitor; you’ll find the industry standard is around 60 seconds to 5 minutes. We are considering 1-second checks but are first addressing some required verification tasks to ensure we don’t “DoS” someone’s site by attempting to monitor it.

Your own script to “hammer” it isn’t a bad idea, but you are probably not monitoring from all around the world. Verelo is monitoring from a lot of locations, and we’re constantly adding more to ensure all sides of the equation are taken into consideration, e.g. caching, network connectivity between geographical regions, and distance.

PingOfDeath

on 31 Jan 12

@Andrew: “but you are probably not monitoring from all around the world”

Oh but I am! A Russian friend of mine graciously lent (well leased, actually) me some sort of distributed network of computers he somehow set up around the world (I think he calls that a botnet).
I can tell you the target servers are probed from all around, so to speak.

Brian

on 31 Jan 12

Will 37signals be sticking with Assistly (desk.com) now that Salesforce owns it? Seems like little by little they are moving into your space with these smaller acquisitions.

Matt Carey

on 31 Jan 12

Basecamp is down again. It goes down for a few minutes every day at the moment, which must add up to more than 6 minutes…

In one very rare case we noticed a bad Puppet script at a company we were monitoring was rebooting Apache on all their servers once every 30 minutes (it was missing a clause which would first check if a host file already existed), which was resulting in the load balancer kicking the servers out of service (and there being none in service!)

Appreciate the question. I think the types of issues sub-minute monitors pick up vs. 60 seconds and above are fairly different, and in general we’re just not used to finding them this way. It’s a pretty cheap and easy way to find some very unexpected results; we’ve personally discovered a lot in past systems we’ve worked on through this product.

Sending your password in clear text doesn’t necessarily mean they store it in clear text, though it does make users worry. They might create a random password, send it to you via email, then store only a hashed password for authentication.
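That pattern — generate a random password, email the plaintext once, store only a salted hash — can be sketched as follows. This is an illustration of the general technique, not a claim about how Pingdom actually implements it:

```python
import hashlib
import os
import secrets

def provision_account():
    """Generate a random password; return the plaintext (to be emailed
    once) plus the salt and hash (the only things stored)."""
    password = secrets.token_urlsafe(12)
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return password, salt, digest

def verify(password, salt, digest):
    """Authenticate against the stored salt + hash."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return secrets.compare_digest(candidate, digest)

pw, salt, digest = provision_account()
print(verify(pw, salt, digest))      # → True
print(verify("wrong", salt, digest)) # → False
```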

GeeIWonder

on 01 Feb 12

Interesting stuff, admirable goals. Not all downtime is equal, so some metrics on actual impact would be merited.

The Basecamp numbers are not monitored via Pingdom, correct? I think I’d want to emphasize which numbers were derived how. There’s a granularity issue that should be significant on these sorts of time scales. Just ask your cellphone service provider.

GeeIWonder

on 01 Feb 12

A useful companion number, therefore, would be the # of outage events. This is probably readily available to you.

DHH

on 01 Feb 12

GW, yes, the Basecamp numbers are measured through Pingdom as well. Same methodology as we measure the benchmarks with.