Thursday, October 01, 2009

If you were affected by the ho-hum availability of the FeedBlitz RSS service recently (we're A-OK now, by the way), here's why: Traffic jumped by a factor of 17 over this time last week and it took us too long to get to grips with it, for which I apologize. I hope this post will help explain why that happened, what we did and didn't do, and what we have learned from the experience.

Chronology

Late last Friday (September 25th) we noticed that incoming traffic had - very suddenly - started to clog the RSS service. We dug around a little, found out why, and added a new server. That seemed to handle things nicely and that was that - or so we thought at the time. Over the weekend the service was doing OK - a couple of flags here and there but nothing that seemed to merit urgent attention - and this continued through into Monday. So far, so good, and I was able to talk about the new integrated comment feature instead. Swell!

Come Tuesday morning, though, things weren't so happy. So we spun up yet more servers to hold the fort (let's hear it for Amazon EC2 and cloudy services!), which seemed to settle things down.

Yesterday (Wednesday), however, the traffic was again swamping the virtual server array and it was clear that simply spinning up more of the servers we had been using wasn't cutting it. So, instead, yesterday morning we hauled out a series of much larger servers into (from?) the cloud. At one point we had 7x as many servers running as we did a week ago. Right now, the computing capacity deployed to serve feeds is more like 30x what it was this time last week.

Results

The good news is that the current infrastructure is now holding its own really well - in fact, we're starting to take some of the servers down now that we understand better the nature of this flood and how to handle it. There's plenty of headroom left right now to cope with at least a doubling of traffic should that happen and RSS delivery is currently extremely stable and responsive.

From a numbers perspective, comparing yesterday to a week ago, RSS traffic we served went from just under 15 requests per second to 251 requests per second yesterday - that's the 17-fold increase. That's a huge change in what had otherwise been a predictable service with predictable traffic patterns.

Questions, Questions.

So that's what we did - but where did all this traffic come from? And how come we were so surprised?

Let me take the second question first. After about 6 months running the RSS offering we knew (or thought we knew) how it worked and how it changed. Basically, as we incrementally acquire new customers for the service, the load on the service grows incrementally in parallel. Most of the feeds we see have circulations in the thousands and tens of thousands; there's nothing scary about that at all, and it's well within the ability of last week's infrastructure to handle for the foreseeable future. We had no idea last Friday that this was about to change.

At the end of last week a new client started to use FeedBlitz to serve their feed. Their feed was not being accessed via traditional aggregators, though. Instead, it was being accessed by an automatically updating browser toolbar. With a circulation, apparently, of around 7 million individual users. So when a user with the toolbar fires up their browser they fetch the feed (and because the toolbar itself isn't particularly well-written, they keep on asking for the full feed, despite the headers we send back. Grrr!). Literally one minute things were fine, and the next, overload central.

The toolbar factor is important. By way of background, most bloggers and content publishers have large portions of their audiences consolidated by web aggregators such as Google Reader or email services like FeedBlitz, or reached through well-behaved desktop aggregators that know not to keep hitting up a feed every 10 seconds. So even if a blog network comes to us with an audience of, say, a million RSS subscribers, it's more than likely that a good half of those will be handled by a few large aggregators, and the load from the personal systems might add a million hits or so a day to the system. That's easily handled. RSS aggregators are well behaved too; they understand things like entity tags in HTTP headers which help minimize traffic.
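For the curious, here's a minimal sketch of what "well behaved" means when it comes to entity tags. It is an illustration only, not our production code: the function names (make_etag, serve_feed) and the hash-based tag scheme are assumptions made up for this example. A polite client sends back the ETag it last saw in an If-None-Match header and gets a tiny 304 reply; a client that ignores the headers gets the whole feed every time.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    """Derive a quoted entity tag from the feed content (hypothetical scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def serve_feed(body: bytes, if_none_match: Optional[str] = None):
    """Return (status, headers, payload) for one feed request."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client already has this version: send headers only.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


feed = b"<rss>...imagine a few hundred KB of posts here...</rss>"

# Well-behaved aggregator: remembers the ETag and sends it back each time.
status, headers, payload = serve_feed(feed)
polite_bytes = len(payload)
for _ in range(9):
    status, headers, payload = serve_feed(feed, if_none_match=headers["ETag"])
    polite_bytes += len(payload)

# Badly behaved client: ignores the headers and asks for everything, every time.
rude_bytes = sum(len(serve_feed(feed)[2]) for _ in range(10))

print("polite client bytes:", polite_bytes)
print("rude client bytes:  ", rude_bytes)
```

Over ten requests the polite client downloads the feed once and the rude one downloads it ten times.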

The toolbar added an extra 20 million hits per day to the infrastructure. It's therefore roughly equivalent to the load generated by a blog with a multi-million subscriber RSS feed. Does anyone in the real world have a feed that size, anyway? I don't think even TechCrunch's or Scoble's RSS audiences are that big. So, no, we just didn't expect this kind of load to appear, unannounced, overnight.
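If you want to check the arithmetic, the figures quoted in this post hang together. The sketch below takes the two request rates from above (roughly 15 and 251 requests per second) and assumes, as a simplification, that each rate holds steady for a full 24 hours; real traffic rises and falls through the day, so treat the result as a ballpark.

```python
SECONDS_PER_DAY = 24 * 60 * 60

before_rps = 15   # "just under 15" requests per second a week earlier
after_rps = 251   # requests per second on the busy Wednesday

# Growth factor in request rate.
fold_increase = after_rps / before_rps

# Extra daily hits, assuming the rates are sustained all day (a simplification).
extra_hits_per_day = (after_rps - before_rps) * SECONDS_PER_DAY

print(f"fold increase: {fold_increase:.1f}")          # 16.7
print(f"extra hits per day: {extra_hits_per_day:,}")  # 20,390,400
```

That's the 17-fold jump and the extra 20 million or so hits a day.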

Looking Ahead

So, anyway, hence the performance issues earlier this week. We strive to do our best here but for a while we weren't able to keep up. That's personally disappointing for me and I am certainly sorry for the inconvenience wrought on some of our clients as a result. The good news is that the RSS infrastructure is now not only better able to cope with a similarly sized surge in the future, but we're also better able to ramp it up quickly and aggressively as and when we have to. One key lesson learned is that we scaled too conservatively when the initial hits came in, causing us to have to scramble later on - weekend traffic is always lighter than weekday traffic and we failed to take that into account adequately. With hindsight we should have overscaled at first, then pulled back as the full traffic profile became clearer. We won't be making that mistake again.

Geekery: AWS, EC2, S3 and Memcached

Technically, there were a few surprises as well, as we drilled into how to improve performance. You may not realize this, but when we serve a feed to you - say http://feeds.feedblitz.com/feedblitz - we're not simply sending a static file. The feed is subtly different for each individual user, which is why we need beefy servers to handle the load: FeedBlitz RSS is CPU-limited, not bandwidth-limited. In other words, serving a feed is work for the computer involved; we're not simply yanking the file out of a local cache and declaring victory.

But talking of caches, here's another interesting technical factoid we discovered. Like many services, we use a technology called memcached to help scale. We'd assumed that it would work well in the EC2 cloud because there was no reason to think otherwise. And we were wrong. Our instrumentation in the middle of this episode clearly showed that caching and retrieving data from Amazon's S3 service was consistently an order of magnitude faster for servers under stress than fetching the same data from an EC2-hosted memcached server. My assumption is that there are environmental optimizations within the AWS environment that grease the electronic skids for S3 requests. If you're using memcached inside EC2 my recommendation is that you instrument the code you're using - memcached might not be working as effectively for you as you think it is. For what it's worth, we had been using memcached as a primary cache with S3 as a backstop. Now we just go directly to S3 instead. It's way faster.
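If "instrument the code" sounds abstract, here's one way it could look. This is a minimal sketch, not what we run: LatencyRecorder is a made-up helper, and the two dictionaries are stand-ins for real memcached and S3 clients, so the timings it prints say nothing about either service. The point is only the shape: wrap every cache fetch in a timer, tag it with the backend's name, and compare the distributions under real load.

```python
import statistics
import time
from collections import defaultdict


class LatencyRecorder:
    """Collects wall-clock timings per cache backend (hypothetical helper)."""

    def __init__(self):
        self.samples = defaultdict(list)

    def timed(self, backend, fetch, key):
        """Call fetch(key), recording how long it took under the backend's name."""
        start = time.perf_counter()
        try:
            return fetch(key)
        finally:
            self.samples[backend].append(time.perf_counter() - start)

    def summary(self):
        """Per-backend call count plus median and worst latency in milliseconds."""
        return {
            backend: {
                "count": len(times),
                "median_ms": statistics.median(times) * 1000,
                "max_ms": max(times) * 1000,
            }
            for backend, times in self.samples.items()
        }


# Stand-ins for real cache clients; swap in your own fetch functions.
fake_memcached = {"feed:1": b"cached feed"}
fake_s3 = {"feed:1": b"cached feed"}

recorder = LatencyRecorder()
for _ in range(100):
    recorder.timed("memcached", fake_memcached.get, "feed:1")
    recorder.timed("s3", fake_s3.get, "feed:1")

for backend, stats in recorder.summary().items():
    print(backend, stats)
```

Looking at the median and the worst case side by side matters here, because the difference we saw showed up on servers under stress.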

The Bottom Line

So there's the scoop. Big flood of traffic, didn't scale fast enough, but we're good now.

Post Script

Oh, and by the way: If you're planning on dropping a few million subscribers on us overnight, please do. We can handle it now. Just give us a heads up first, okay? Thanks!

5 Comments:

CONGRATULATIONS on great increase of traffic! The thing is I still did not get any updates of a website which I subscribed to on 30.09.2009 and it was updated by myself twice... (to check the reliability of services I'm offering on my own site). SO PLEASE HELP ME AS SOON AS POSSIBLE!!!! Looking forward to hear from you. Pat patryszkad@wp.pl

I always enjoy learning how other people employ Amazon S3 online storage. I am wondering if you can check out my very own tool CloudBerry Explorer that helps to manage S3 on Windows. It is freeware.