Monthly Archives: May 2012

Recently we’ve been looking over various job queues, and there are a lot of attractive options out there. In particular, certain internal projects have had great success with Resque, and there are myriad others, from Celery to beanstalkd and so on. We wanted something a little new and different, and we wound up with qless, a Redis-based job queue with strong guarantees that jobs don’t get dropped, high performance, stats, job tracking and more. In case you’re thinking about bailing out after the first paragraph, I’ll try to keep you interested with a compendium of selling points (I apologize for the slight shamelessness of it):

Language agnostic: we use it in Ruby and Python, and have stubbed out C++ and Node bindings

Completely client-managed. A Redis instance is all you need

Jobs don’t get dropped

Jobs have priority

Queues can be composed into pipelines, though that’s not required

Jobs keep a history of what’s happened to them

Jobs can be tagged (for search / debugging / tracking)

Jobs can be scheduled (and made recurring)

Jobs can have interdependencies that unlock each other

Qless keeps extensive stats about how long jobs wait and how long they take to run

A bit of context: Those of you who are customers of our campaign crawl know that it has been, and continues to be, worked on. The full spectrum of changes and difficulties we’ve encountered is beyond the scope of this post, but I’d like to call out in particular the queueing system we were using until very recently. At its heart, a job queue is a simple idea: a worker asks for something to do, the job gets handed off and completed, and that’s that. One of the major problems we had is that sometimes jobs would get dropped by a worker and we wouldn’t notice. The other is that scheduling jobs is a pain. One model is to periodically look for all the jobs that should be run and put them into the queue. The problem is that that process can fall over, and for us that means customers missing crawls. With the scene set, we can start the bragging (rather, we humbly present qless).

Qless makes heavy use of a feature new to Redis 2.6, server-side Lua scripting. Transactions are important when making a job queue, and since Lua scripts are executed atomically on the Redis server, it alleviates a lot of concerns about locking, semaphores, etc. The other huge get for us in using Lua scripts is that new language bindings can use the exact same scripts. When it comes time to get serious about bindings for Node or C++ (we’ve done a little poking around to make sure we haven’t backed ourselves into a corner), it’s just a matter of writing a language-specific wrapper that loads the same Lua scripts as any other bindings, and invokes them. Easy-peasy.

Completely client-managed. The Lua scripts that comprise the core library do all the maintenance, too. The very act of popping cleans out expired locks, and checks for dropped jobs, and job completion does some tidying up as well. This way there’s no need for a nanny process — just a Redis instance you can point your workers at.
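To make that concrete, here’s a toy Python sketch of a pop that does its own maintenance. In the real system this logic is a Lua script run atomically on the Redis server via EVAL; every name and data shape below is illustrative, not qless’s actual API:

```python
def pop(state, worker, now, lock_ttl=60):
    """Toy stand-in for qless's server-side pop. In the real system this is
    a Lua script run via Redis EVAL, so it executes atomically and every
    language binding just invokes the same script."""
    # Maintenance happens inside pop itself: reclaim any job whose worker's
    # lease has lapsed, so no separate nanny process is needed.
    for jid, lock in list(state["locks"].items()):
        if lock["expires"] < now:
            del state["locks"][jid]
            state["queue"].insert(0, jid)  # hand it to the next willing worker
    if not state["queue"]:
        return None
    jid = state["queue"].pop(0)
    state["locks"][jid] = {"worker": worker, "expires": now + lock_ttl}
    return jid
```

Because the whole routine runs as one atomic script, there is no window where a job is out of the queue but not yet locked.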

Jobs don’t get dropped. We try our best to write clean code, but it’s easy to make mistakes, workers sometimes fail, and it can be difficult to ensure that jobs don’t disappear. To this end Resque takes a very reasonable stance by forking off a child process for every task that’s going to get done. In this way, the parent process can be trusted to make sure that everything happens as it should: either the job completes or an error is caught, but some appropriate action takes place. Our tack is to use heartbeating. Unlike a lot of job queues that handle a large number of tasks taking maybe a few minutes each, we have some very large jobs (some take about a week) and some very small (a few seconds). Rather than trusting that workers will complete their jobs, they have to check in as they make progress, or qless gives that job to another willing worker. We can (and often do) completely nuke worker boxes, and the jobs get picked up somewhere else with no problem.
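The heartbeat contract can be sketched like this (hypothetical names, not qless’s actual API): a worker holds a lease on its job and must renew it before it lapses, or it loses the job.

```python
def heartbeat(state, jid, worker, now, lock_ttl=60):
    """Toy sketch of heartbeating: renew the worker's lease on a job.
    Returns False if the lease was already lost (lapsed, or taken over),
    in which case the worker should abandon the job."""
    lock = state["locks"].get(jid)
    if lock is None or lock["worker"] != worker or lock["expires"] < now:
        return False  # another worker may already own this job
    lock["expires"] = now + lock_ttl  # lease renewed: keep crunching
    return True
```

A week-long job simply heartbeats every minute or so as it makes progress; a dead box stops heartbeating and its jobs get reclaimed.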

Jobs have priority. Some other systems have support for this, and it was important to us, too. In particular, if a customer writes in with a problem, we want to be able to bump its priority to make sure it rockets through and we can get back to them as soon as possible. Job priorities can be adjusted mid-flight, too.
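A minimal model of priority with mid-flight adjustment, using a heap (illustrative structure only, not how qless stores things in Redis):

```python
import heapq
import itertools

def put(queue, jid, priority, counter):
    """Push a job with a priority. Lower number pops first here;
    the counter keeps equal-priority jobs FIFO."""
    heapq.heappush(queue, (priority, next(counter), jid))

def bump(queue, jid, new_priority):
    """Adjust a waiting job's priority mid-flight, e.g. when a customer
    writes in and we want their job to rocket through the queue."""
    for i, (prio, seq, j) in enumerate(queue):
        if j == jid:
            queue[i] = (new_priority, seq, j)
            heapq.heapify(queue)  # restore the heap invariant
            return

counter = itertools.count()
q = []
put(q, "routine-crawl", 10, counter)
put(q, "customer-complaint", 10, counter)
bump(q, "customer-complaint", 0)  # bumped: now pops ahead of routine work
```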

Queues are pipeline-oriented and jobs know their history. Like many large tasks, ours are broken into stages of a pipeline. As an example, we crawl a customer site, analyze it, aggregate Mozscape data, and then generate prematerialized views. Qless lets us describe a pipeline in a single job class and run it with a single job entity. And the job keeps track of events as it moves along: put in the crawl queue at such and such time, then popped and completed by such-and-such worker and so forth. It’s surprisingly helpful in debugging.
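A toy of that pipeline-plus-history shape (the stage names match our crawl pipeline above; the class itself is illustrative, not qless’s actual job class):

```python
PIPELINE = ["crawl", "analyze", "aggregate", "prematerialize"]

class Job:
    """Toy job that moves through a pipeline and records its own history."""
    def __init__(self, jid):
        self.jid = jid
        self.stage = 0
        self.history = []

    @property
    def queue(self):
        return PIPELINE[self.stage]

    def record(self, event, worker=None):
        self.history.append({"q": self.queue, "what": event, "worker": worker})

    def complete_stage(self, worker):
        self.record("completed", worker)
        if self.stage < len(PIPELINE) - 1:
            self.stage += 1  # same job entity, next queue in the pipeline
            self.record("put")
```

When something goes wrong, reading the history back tells you exactly which queue the job was in, when, and which worker touched it.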

Jobs are tagged and tracked. Not all jobs are created equal, and some are more interesting than others; problem jobs and jobs tied to customer complaints are of particular interest. Qless lets us flag jobs that we want to keep a close eye on (more on this at the very end), and tag them with useful information. Every job has a unique identifier, but a project might have additional ways it wants to look up jobs. When we want to find all the running jobs for a given customer, it’s as easy as looking up that customer’s tag; qless builds an index of these tags to make for efficient lookup.
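The index itself is conceptually just tag-to-job-ids (qless maintains this server-side in Redis; the class below is a dependency-free illustration):

```python
from collections import defaultdict

class TagIndex:
    """Toy tag index: each tag maps to the set of job ids carrying it,
    so lookups by tag don't require scanning every job."""
    def __init__(self):
        self.by_tag = defaultdict(set)

    def tag(self, jid, *tags):
        for t in tags:
            self.by_tag[t].add(jid)

    def jobs(self, tag):
        return sorted(self.by_tag[tag])
```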

Jobs can be scheduled and recurred. This solves a particularly painful problem for us: scheduling crawls. You describe a recurring job much in the same way you would a normal job; it’s bound to a queue, it has data, priority, etc. When a pop request detects that a recurring job should be run, it creates a copy right then and there and returns it as one of the results.
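The pop-time minting can be sketched like this (illustrative names and structures, not qless’s API): no separate scheduler process exists to fall over, because pop itself spawns due occurrences.

```python
def pop_with_recurring(queue, recurring, now):
    """Toy sketch of recurring jobs: when a pop notices a recurring job is
    due, it mints a concrete copy right then and there and advances the
    next-run time, then pops as usual."""
    for rec in recurring:
        while rec["next_run"] <= now:
            queue.append("%s-%d" % (rec["jid"], rec["next_run"]))
            rec["next_run"] += rec["interval"]  # schedule the next occurrence
    return queue.pop(0) if queue else None
```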

Jobs can have dependencies. To borrow a particularly good example from a co-worker, imagine making Thanksgiving dinner. You need to make pies, put the turkey in the oven, and make the gravy; they’re all separate tasks, but some depend on others. You can’t make the gravy until you’ve cooked the turkey, and you can’t cook the turkey until you’ve made the stuffing, and so on. When one job depends on another, completing the prerequisite automatically unlocks the dependent job to be popped.
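In toy form, the unlock on completion looks like this (illustrative shapes, not qless’s API), using the Thanksgiving example:

```python
def complete(jobs, queue, jid):
    """Toy dependency unlock: completing a prerequisite removes it from its
    dependents' waiting sets, and any job left with no unmet dependencies
    becomes poppable."""
    jobs[jid]["state"] = "complete"
    for other, job in jobs.items():
        if jid in job.get("depends", set()):
            job["depends"].discard(jid)
            if not job["depends"] and job["state"] == "waiting":
                job["state"] = "ready"
                queue.append(other)

jobs = {
    "stuffing": {"state": "ready", "depends": set()},
    "turkey":   {"state": "waiting", "depends": {"stuffing"}},
    "gravy":    {"state": "waiting", "depends": {"turkey"}},
}
q = []
complete(jobs, q, "stuffing")  # cooking the turkey is now unblocked
```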

Qless keeps performance stats. We keep summary stats about how long jobs wait in various queues on any given day, and how long they take to run. But more than that, it also provides a histogram of runtime so you can look at the distribution (or just admire your handiwork). This doesn’t have to be the extent of your benchmarking, but it seemed like an intuitive fit that you should be able to have access to these metrics in the same place you’re managing your queues.
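A toy version of the runtime histogram (the bucket boundaries here are arbitrary; qless tracks these per queue, per day, on the server side):

```python
def histogram(runtimes, buckets=(1, 10, 60, 600, 3600)):
    """Count runtimes (in seconds) falling under each bucket boundary,
    with a final overflow bucket for anything beyond the last boundary.
    This is how you can see the distribution at a glance."""
    counts = [0] * (len(buckets) + 1)
    for t in runtimes:
        for i, bound in enumerate(buckets):
            if t < bound:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts
```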

A powerful web interface. Inspired by the great web interface from Resque, and armed with Twitter’s Bootstrap, it was important to us to make a web app for managing these queues. In particular, we wanted to make something that would easily enable our ops and help teams to quickly gain insight about customer or server issues, whether it’s requeueing jobs, or just tracking their progress.

Before I get to the last little flourish, you are to be thanked for making it this far. I’ve done my best to keep this brief, but that clearly hasn’t worked. This has been the product of a lot of thought, time, grief with our last job queue, and effort, and so it is hard not to talk about the details. (If you’re curious, you should check it out on GitHub and then let us know what you think.)

Notifications. Tracking jobs is all well and good (it gives you a short list of the jobs you care about for a quick summary), but we thought we’d take it one step further and get notifications about job progress. We use Campfire pretty heavily internally, and I personally love Growl notifications. Start up a little daemon, and it uses Redis’ PubSub to get notified of any changes to tracked jobs: when they fail, get completed, get popped, pushed, etc. Gone are the days of hitting refresh to check up on trouble tickets!
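The real daemon listens on Redis PubSub channels; here’s a dependency-free toy of the fan-out it performs (hypothetical names, no actual PubSub involved):

```python
class Tracker:
    """Toy stand-in for the notification flow: events for tracked jobs are
    published, and a small daemon fans them out to sinks such as a Campfire
    room or a Growl notifier."""
    def __init__(self):
        self.sinks = []

    def subscribe(self, sink):
        self.sinks.append(sink)

    def publish(self, jid, event):
        for sink in self.sinks:
            sink("%s %s" % (jid, event))

notices = []
tracker = Tracker()
tracker.subscribe(notices.append)  # imagine a Growl or Campfire sink here
tracker.publish("ticket-123", "failed")
```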

In parting, as always, contributions, suggestions, bug reports and so on are always welcome. Happy queueing!

As you may have read, our Rankings service has recently been afflicted with an SSD-related technical malady. Specifically, we were using SSDs as super-fast hard drives in a heavily read/write environment, and many of them failed all at once. The lesson here is that SSDs tolerate a specific, limited number of rewrite cycles and don’t provide warnings that they are near their end of life. In the future, for our distributed applications, we will either use cheap, simple hardware and keep excess capacity, or we will expect SSDs to fail at a predictable point in the future and be ready for it.

First, a bit of background
We use a Riak storage cluster to hold historical rankings data. The data is fed to this cluster from a rankings collector, and the most recent rankings data is pulled into our Rails db for display on the rankings summary page. Historical data requests (whole-account historical rankings exports, and historical keyword rankings) are handled by making a real-time request to this back-end service. Our Keyword Analysis service also uses this backend rankings service.

Riak is a very scalable key-value store, similar to Cassandra, that supports map-reduce operations out of the box. The data that we store in Riak is stored redundantly across 3 drives in the cluster, and each of those drives is backed up on a daily basis.

What Happened
We have been using SSDs in our Riak nodes to facilitate speedy map-reduce operations and data lookups. Between Sunday night and Monday afternoon, 5 drives started to fail. When SSDs fail, they fail differently than normal magnetic drives: rather than developing a few bad blocks, they cease to be able to write data (though reads still succeed). Across Sunday and Monday, these drives began failing their writes, which on Linux ends up hanging the machine. After a restart things appear OK, and in fact reads are OK, but when the writes start again, things hang again. Eventually we isolated the issue to the disks, and we were in the process of replacing the disks and restoring from backup when more drives failed. At this point we had two problems: too many nodes were down for some data to be read successfully, and too many nodes were failing writes for us to store new data. So we stopped the workers writing data, then worked to fix the disk issues so that reads could succeed.
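To see why both reads and writes broke, consider the quorum arithmetic: with N replicas per key and a quorum of R for reads (W for writes), an operation fails once more than N minus the quorum of a key’s replicas are unreachable. The numbers below are Riak’s common defaults, not necessarily our exact settings:

```python
N, R, W = 3, 2, 2  # replicas per key; read and write quorums (typical defaults)

def tolerable_failures(n_replicas, quorum):
    """How many of a key's replicas can be lost before the operation
    (read or write) starts failing for that key."""
    return n_replicas - quorum
```

With these defaults, losing one replica of a key is survivable for both reads and writes; losing two is fatal for both, which matches the two failure modes we hit at once.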

The Fix
With reads restored, we got the workers back up; however, the jobs for Monday, Tuesday and Wednesday were all scrambled together, and monthly reports for many campaigns depend on the Monday jobs. To make the state of things clearer to users, we removed all jobs unrelated to the Monday jobs so that we could prioritize those. We are also looking for ways to accelerate data collection to catch up as fast as possible. Based on prior performance, we expect to be caught up by Sunday night.

How It Happened
Our Riak processing jobs use the SSDs intensely, and the load is very evenly balanced across the nodes. SSDs can only tolerate a limited number of R/W cycles, and we hit that limit on most of our SSDs at the same time, causing a catastrophic failure despite having redundant drives. Sadly, there is no way to query the SSDs for their lifecycle information.
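As a back-of-the-envelope illustration of why evenly-loaded drives age in lockstep, here is a simple endurance estimate. The numbers below are hypothetical, not our actual drives or write load:

```python
def days_to_wearout(capacity_gb, pe_cycles, writes_gb_per_day,
                    write_amplification=1.0):
    """Rough SSD endurance estimate: total writable data is roughly
    capacity times rated program/erase cycles, reduced by write
    amplification, divided by the daily write load. Identical drives
    under a balanced load all hit this wall at about the same time."""
    total_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_writes_gb / writes_gb_per_day

# Hypothetical example: a 160 GB drive rated for ~5000 P/E cycles,
# absorbing ~2 TB of writes per day.
days = days_to_wearout(160, 5000, 2000)
```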

To Prevent This In The Future

Until we can get SSDs that can tell us when they are used up, we cannot use them in an application that rewrites a lot of data. This data is now on spinning disks, and we will work to make sure that these provide sufficient redundancy and performance. We may need to add more processing nodes or pre-compute additional values to improve performance on spinning disks.