We experienced intermittent downtime, timeouts, and general performance problems over the past 24 hours as a result of a failing disk drive. To prepare for situations like this, all our storage is redundant and we keep hot and cold spares in stock. However, this particular drive failure was problematic because performance degraded gradually as the drive slowly failed. Dell’s hard drive monitoring utilities didn’t detect the drive as failing, and our applications were blocked on I/O as the drive became slower and slower.

We suspected a problem with the drive earlier this week when we began to see this warning in our logs (the time stamps are CDT):

We contacted Dell for warranty support and an RMA, but they insisted that was only a warning and we shouldn’t be concerned. At the time iostat didn’t show anything out of the ordinary. Yesterday we began to notice extremely high I/O utilization on the NFS server with that faulty disk. As a result, the load on our front-end Django servers skyrocketed as they were blocked on I/O.
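For readers who want to watch for the same symptom, here is a minimal sketch of how a per-device utilization figure like the one iostat reports can be derived on Linux from two samples of `/proc/diskstats`. It is an illustration, not the tooling we used: the function names and the 90% threshold are assumptions made for the example. As noted above, utilization can look normal early in a slow failure, so a check like this is not sufficient on its own.

```python
import time


def parse_io_ticks(diskstats_text):
    """Map device name -> cumulative milliseconds the device spent doing I/O."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpectedly short lines
        # fields[2] is the device name; fields[12] is "time spent doing I/Os (ms)"
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization_percent(before, after, interval_ms):
    """Percentage of the interval each device was busy, capped at 100."""
    result = {}
    for device, end_ticks in after.items():
        if device in before:
            busy_ms = end_ticks - before[device]
            result[device] = min(100.0, 100.0 * busy_ms / interval_ms)
    return result


def watch(interval_s=5, threshold=90.0, path="/proc/diskstats"):
    """Take two samples and print devices above the (assumed) threshold."""
    with open(path) as f:
        first = parse_io_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_io_ticks(f.read())
    usage = utilization_percent(first, second, interval_s * 1000)
    for device, pct in sorted(usage.items()):
        if pct >= threshold:
            print(f"{device}: {pct:.1f}% busy")
```

Calling `watch()` on the NFS server would print any device that was busy for at least 90% of a five-second window.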

Atlassian’s internal development teams heavily use Atlassian Bamboo continuous integration build agents running on Amazon EC2 instances, and many of our teams host their repositories on Bitbucket. To alleviate load during this situation we blocked several of our EC2 IP addresses. However, since EC2 IP addresses are dynamic, later in the day those IP addresses were given to other EC2 users. That inadvertently blocked some of our customers’ continuous integration servers and we’re extra sorry for that inconvenience.

We contacted Dell again today and took immediate action after seeing this error in our logs:

We took the failing drive out of service and let the RAID rebuild itself with a hot spare. We also replaced the hot spare with a drive from our inventory. I/O utilization quickly returned to normal levels and our site has been stable since.

18 Comments

I’m glad to see Bitbucket informing users with such a high level of detail for every failure that occurs (last time EC2 went down, now the HDD failed). This shows how much you care for your users :-)

Keep up your great work!

Glad that Atlassian is informing us about these things.

Not to be party-pooper but BitBucket’s performance and responsiveness have taken quite a dive the past few weeks, this isn’t the first incident in the past month that rendered BitBucket (nearly) unusable at times…


You’re right, during the last two weeks the site performance has been below par, which is why we’re being as transparent as possible. We’re accountable to you as a site user, and we want you to know we’re doing everything we can to resolve all issues.