Life, love, and computer science

The pssh tool is great. Just great. At SEOmoz we use a number of deployment schemes, but every so often I find myself needing to log into 50 machines and perform some simple one-off command. I’ve written such a line many times:
for i in {01..50}; do ssh -t ec2-user@some-host-$i 'sudo ...'; done

This is fine if the command is quick, but it really sucks if it’s something that takes more than just a few seconds. So in the absence of needing to use sudo (and thus the ‘-t’ flag), pssh makes it easy to run these all in parallel:
pssh -i --host=some-host-{01..50} -l ec2-user 'hostname'

I recently went on a mission to find (and perhaps build) a better dictionary. I’ve been looking at Dr. Askitis’ work on so-called HAT-tries, which are something akin to a burst trie. It all seems reasonable enough, and his experimental results seem very promising. In particular, I was looking for an open source version of this data structure that didn’t preclude commercial use (as Askitis’ version does).

HAT-tries rely on another of Askitis’ data structures, the hash array map. Essentially, it’s a hash table, but instead of using linked lists of nodes to hold the key/value pairs that land in a particular slot, each slot is a single buffer of packed data: the number of items in the buffer, then, for each pair, the length of the string key, the key itself, and the value. The idea is that this arrangement is much more conducive to caching and hardware prefetching, since each slot is a contiguous slab of memory. Contrast this with the conventional approach, in which each slot is a linked list that must be traversed to find the key/value pair being retrieved.

I should note that there are a couple of implementations that I looked at before venturing out to make my own. The design is actually relatively simple: hash a provided key to map it onto one of many internal buckets. If that bucket is empty, allocate enough memory to store 1) an integer counter of the number of pairs in this buffer, 2) the integer length of the provided key, 3) a byte copy of the key itself, and 4) a byte copy of the value. If the bucket already holds data, the buffer is reallocated to make room and the new pair is appended at the end.
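As a sketch of that layout (in Python with `struct`, purely to illustrate the packing; the real implementation is a raw C/C++ buffer, and the field widths here are my own choices):

```python
import struct

# Illustrative slot layout: [pair count][key length][key bytes][value]...,
# all packed into one contiguous slab per slot.

def make_slot(key, value):
    # A fresh slot holding a single key/value pair
    return struct.pack('<II', 1, len(key)) + key + struct.pack('<Q', value)

def append_pair(slot, key, value):
    # Growing a slot is just bumping the count and appending at the end
    count, = struct.unpack_from('<I', slot, 0)
    return (struct.pack('<I', count + 1) + slot[4:] +
            struct.pack('<I', len(key)) + key + struct.pack('<Q', value))

def lookup(slot, key):
    # Scan the slab linearly; every byte touched is contiguous in memory
    count, = struct.unpack_from('<I', slot, 0)
    offset = 4
    for _ in range(count):
        klen, = struct.unpack_from('<I', slot, offset)
        offset += 4
        k = slot[offset:offset + klen]
        offset += klen
        value, = struct.unpack_from('<Q', slot, offset)
        offset += 8
        if k == key:
            return value
    return None
```

The point of the exercise: a lookup walks one flat allocation rather than chasing list pointers scattered across the heap.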

For the hash, I chose my personal favorite, SuperFastHash. I began by having my implementation follow the STL style of accepting a memory allocator, but when I didn’t see the performance I wanted, I switched to `malloc` and `realloc` as prescribed in the paper. Even then, I did not see the performance I was after. Of course, I imagine my implementation could be improved, but I felt it was at least reasonable. I tried a number of alterations, including preallocating more memory than needed in each slot, in hopes that `realloc` would save me. No dice.

My benchmark was focused on speed. Memory was less of a concern for my needs, and so long as it stayed mostly unnoticeable (say, less than a couple hundred megabytes for a million keys), I was happy. I decided to give it a run against `std::map` (mostly to feel better about myself), and then `tr1::unordered_map` (mostly out of hubris). Although my rough implementation doesn’t (yet) include fancy-schmancy features like iterators, it barely edged out `tr1::unordered_map` for a small number of keys (less than 10k). When scaling up, however, the story was less than impressive.

This benchmark was performed using the first k words from Askitis’ `distinct_1` data set, and the strings were loaded into memory before running the tests. These numbers are each the best of 10 consecutive runs (hoping to warm the cache and cast each of these in the best possible light), with each of these containers being a mapping of `std::string` to `size_t`. Each key was associated with the value 1, and when querying, it was verified that the resulting value was still 1. The query was performed on the same input set of keys, and random query was run exactly the same, but after performing `std::random_shuffle` on the vector. It was compiled with `g++-4.2` with flags `-O3 -Wall` (though other optimization levels had almost no impact). I also tried with `clang-2.1` and the results were very similar. I encourage you to run the same bench on your own system and your own compiler version.

Insertion Time Relative to std::tr1::unordered_map

Query Time Relative to std::tr1::unordered_map

Random Query Time Relative to std::tr1::unordered_map

While `tr1::unordered_map` scaled better, for the purposes of the HAT-trie the number of items in any one hash is relatively limited (roughly in the range of 10k). When testing the HAT-trie itself, I think the hash array map has earned at least a chance at a trial. For those curious, my source is available on GitHub.

I have been waiting for an occasion to use dcramer’s taskmaster, a queueing system meant for large, infrequently-invoked (even one-off) tasks. His original blog post brings up the feature that was particularly striking to me: you don’t put jobs into the queue per se; instead, you describe a generator that yields all the jobs that should be put in the queue.
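To illustrate the idea in miniature (this is just the concept, not taskmaster’s actual API; the job shape here is hypothetical):

```python
# Rather than enqueueing jobs one at a time, you write a generator that
# yields every job that should exist, and the queue drains it lazily.
def iter_jobs():
    # Hypothetical workload: one verification job per customer id
    for customer_id in range(100):
        yield {'customer_id': customer_id, 'task': 'verify_crawl'}

# A pool of workers can then consume this stream without the whole job
# list ever being materialized up front.
jobs = iter_jobs()
first = next(jobs)
```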

Occasionally at SEOmoz, we want to perform sanity checks on customer accounts, transition from one backend to another, and so on. In particular, we’ve been moving to a new queueing system, and we wanted to go through every customer and ensure that they had a recent crawl and, further, were definitely in the new system. Unfortunately, much of the data we have to check involves a lookup in Cassandra that can’t easily be turned into a bulk operation. Cassandra itself isn’t necessarily the problem, just the latency between requests. So we spawned off 20 or so workers with taskmaster, each given the details about the customer we needed to verify.

The serial version takes 4-5 hours. It took 15 minutes to get taskmaster installed and grokked, and then the task itself took an hour. Already a win!

Redis 2.6 has support for server-side Lua scripting. Offhand, this may not seem like a big deal, but it offers some surprisingly powerful features. I’ll give a little background on why I’m interested in this in the first place, and then I’ll show why this unassuming feature is so extremely useful for otherwise-impossible atomic operations, as well as for easy language portability and performance.

Case in point: I’ve recently been working on a Redis-based queueing system (heavily inspired by Resque, but with some added twists), and a lot of the functionality I wanted to support would have been prohibitively difficult without Redis’ support for Lua. For example, I want to make sure that jobs submitted to this queueing system don’t simply get dropped on the floor. A worker is given a temporary exclusive lock on a job, and must either complete it or heartbeat it within a certain amount of time. If the worker does neither, it’s presumed to have dropped the job, which can then be given to a new worker.

Now let’s imagine what this locking mechanism would have to look like in order to be correct. First, we’d probably maintain a list of jobs in a queue that have been popped off but not yet completed, sorted by when their locks expire. When a client tries to get a job, it should first check for expired locks, and if it finds any, assume responsibility for those jobs. So this client sees an expired lock and attempts to update the metadata associated with the job to reflect that it now owns it. In the meantime, the original client swoops in and tries to complete the job despite the expired lock: it removes the entry for the lock and updates the job data to reflect its completion. If the second client then updates the job data and inserts a new lock for itself, the system is left in an inconsistent state.

Yes, Redis has a mechanism for this, but it’s only so strong. There’s the `MULTI`, `WATCH` and `EXEC` combo, which allows you to detect the situation when another client has tried to modify a key for which you’re trying to perform an atomic operation and allows you to try the operation again. But for highly contentious keys, you can spend a lot of time backing off and failing. That’s frustrating.

Redis’ Lua support has an interesting guarantee: Lua scripts in Redis are executed atomically. No other commands can be run on the Redis instance while a Lua script is running. With that in place, you are free to not worry in the slightest about these sorts of race conditions, because they just won’t happen. You can access as many keys as you’d like without having to `WATCH` them for changes, and implement as simple or complex a locking mechanism as you’d like.
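As a sketch of what this buys you (the key names, job schema, and redis-py usage here are my own assumptions, not the actual scripts from the system described above), reclaiming expired locks collapses into a single atomic script:

```python
# Reclaim expired job locks in one atomic step. 'locks' is assumed to be
# a sorted set of job ids scored by lock-expiry time, and 'job:<id>' a
# hash of job metadata -- both illustrative, not a real schema.
TAKE_EXPIRED = """
local expired = redis.call('zrangebyscore', KEYS[1], 0, ARGV[1])
for _, jid in ipairs(expired) do
    -- Re-lock each expired job for the new worker. No other client can
    -- run a command until this script finishes, so no one can race us.
    redis.call('zadd', KEYS[1], ARGV[2], jid)
    redis.call('hset', 'job:' .. jid, 'worker', ARGV[3])
end
return expired
"""

def take_expired(client, now, new_expiry, worker_id):
    # client is e.g. a redis.StrictRedis instance; EVAL runs the whole
    # script atomically on the server
    return client.eval(TAKE_EXPIRED, 1, 'locks', now, new_expiry, worker_id)
```

The check-for-expired, take-ownership, and update-metadata steps from the scenario above all happen before any other client can run a single command.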

Another interesting feature that comes out of this is that if you implement your next Redis-based library as a collection of Lua scripts, you can write bindings in other languages in a flash. The only requirement is that the new language be able to read in a file, load the script into Redis, and invoke it. Clients no longer have to mimic arbitrarily complex logic in their own language; they just rely on the Lua scripts shared across all the bindings! This may go without saying, but maintaining bindings can be a bit of a nuisance. One example that jumps to mind is working with Redis from Node.js when a lot of successive commands have to be chained together: it can get extremely messy.

Not only does this laundry list of wonderful features spring out of the Lua support, but it’s also surprisingly performant. Without giving too much away: at SEOmoz, I recently implemented the queueing system I mentioned, supporting scheduled work items, heartbeating, priorities, and statistics collection, in a collection of about 10-12 Lua scripts. Initial benchmarks have hit 4500 job pop/complete transactions per second on a 2011-ish MacBook Pro. At least for our purposes, that’s plenty of room to roam. And let me assure you, these scripts are not always simple, so the fact that Redis maintains good performance in the face of arbitrary scripts speaks volumes about its quality.

I was curious recently about how much of a performance penalty try/except blocks incur in Python. Specifically: 1) does a try/except cost much if no exception is thrown (paying a penalty only when something exceptional happens), and 2) how does it compare to if/else statements where possible? A snippet to answer the first question:
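The original snippet didn’t survive the formatting; here is a stand-in using `timeit` that measures the same thing:

```python
import timeit

# Three cases: a bare statement, a try/except that never fires, and one
# that raises and catches an exception on every iteration.
plain = timeit.timeit('x = 1', number=1000000)
guarded = timeit.timeit('''
try:
    x = 1
except Exception:
    pass
''', number=1000000)
raising = timeit.timeit('''
try:
    raise ValueError('oops')
except ValueError:
    pass
''', number=1000000)

print('plain   : %f' % plain)
print('guarded : %f' % guarded)
print('raising : %f' % raising)
```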

It would appear that while catching a thrown exception is expensive, a try block that never throws is very cheap. I imagine the reason is mostly that throwing an exception actually instantiates an exception object of some kind, which necessarily introduces some overhead. In the absence of that object creation, things can be relatively fast.

Now for the second question. This one came up when I was deciding whether I should try fetching a key from a dictionary and catch an exception when it’s absent, or use the `get` method and check whether the result is `None`.
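A reconstruction of that comparison, timing both the hit and the miss case for each approach:

```python
import timeit

d = {'present': 1}

def with_except(key):
    try:
        return d[key]
    except KeyError:
        return None

def with_get(key):
    return d.get(key)

# Hits are comparable; on misses, get() sidesteps the exception machinery
hit_except  = timeit.timeit(lambda: with_except('present'), number=500000)
hit_get     = timeit.timeit(lambda: with_get('present'), number=500000)
miss_except = timeit.timeit(lambda: with_except('absent'), number=500000)
miss_get    = timeit.timeit(lambda: with_get('absent'), number=500000)
```

The upshot matches the first experiment: the try/except version only pays when the key is actually missing.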

I’m a big fan of python’s logging module. It supports configuration files, multiple handlers (for both writing to the screen while writing to a file, for example), output formatting like crazy, and many other delicious features. One that I’ve only recently encountered is its exception method.

The basic idea of the logging module is that you can get a logger from a factory (that allows multiple pieces of code to easily access the same logical logging entity). From there, you add handlers that output messages to various places (files, screen, sockets, HTTP endpoints, etc.). Every message you log is done at a specific level, and then the configuration of the logger determines whether or not to record messages of a certain severity:

import logging

# Get a logger instance
logger = logging.getLogger('testing')

# Some initialization of handlers here,
# unimportant in this context

# Print out at various levels
logger.warn('Oops! Something happened')
logger.info('Did you know that X?')
logger.debug('Index is : %i' % ...)

What’s great about the module is that it separates your messages from how and where they’re displayed. For debugging, it’s nice to be able to flip a switch and turn on a more verbose mode, or, in production, to tell it to shut up and only log messages that are really critical. What the `exception` method does is not only log a message as an error, but also print a nice backtrace of where the error took place!
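For instance (a minimal sketch; the handler setup is whatever you already have configured):

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('testing')

try:
    1 / 0
except ZeroDivisionError:
    # Logged at ERROR level, with the current traceback appended
    logger.exception('Math is hard')
```

Call it from inside an `except` block and the traceback comes along for free; no manual `traceback.format_exc()` plumbing required.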

It’s a lesson that has now been hammered home repeatedly in my head: never trust callbacks. Just don’t. Go ahead and execute them, but if you trust them to not throw exceptions or errors, you are in for a world of unhappiness.

For me, I first learned this lesson when making use of twisted, writing some convenience classes to help with some of the somewhat odd class structure they have. (Sidebar: twisted is an extremely powerful framework, but their naming schemes are not what they could be.) Twisted makes heavy use of a deferred model where callbacks are executed in separate threads, while mission-critical operations run in the main thread. My convenience classes exposed further callbacks that could be overridden in subclasses, but I made the critical mistake of not executing that code inside of a try/except block.

Twisted has learned this lesson. In fact, their deferred model makes it very hard to throw a real exception: if your callbacks fail, execution takes a different path, calling errback functions. Twisted is so pessimistic about callbacks (rightly so) that even exceptions raised inside errback functions won’t break out of the chain. However, wrapped in my convenience classes were pieces of code that were mission-critical, and not catching exceptions in the callbacks I provided caused me a world of hurt.

That whole experience was enough to make me learn my lesson. Then, a few days ago I encountered it again in a different library, in a different language, in a different project, where I was exposing callbacks for user interface code in JavaScript. The logical / functional chunk of code exposed events that the UI would be interested in, but was too trusting, leading to errors in callbacks skipping over critical parts of the code.

All in all, when exposing callbacks, never trust a callback to not throw an exception. Even if you wrote the callbacks it’s executing (as was the case with both of these instances, at least in the beginning). Callbacks are a courtesy — a chance for code to be notified of an event, but like many courtesies, they can be abused.

Python has a pretty useful feature: named arguments. When you call a function, you can explicitly say that such-and-such value is what you’re providing for a particular argument, and you can even include them in any order:
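The example that originally sat here was lost; this stand-in `hello` function matches the call used further down:

```python
def hello(first, last):
    print('Hello, %s %s' % (first, last))

# All three calls are equivalent; named arguments can come in any order
hello('Dan', 'Lecocq')
hello('Dan', last='Lecocq')
hello(last='Lecocq', first='Dan')
```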

In fact, you can programmatically gain insight into functions with the `inspect` module. But suppose you want to be able to accept an arbitrary number of parameters. For example, for a `printf` equivalent, or where I encountered it: wanting to read a module name from a configuration file, along with the arguments to instantiate it. In that case, you get the module and class as a string, and then a dictionary of the arguments with which to make an instance. Of course, Python always has a way. In this case, `**kwargs`.

This is actually dictionary unpacking, taking all the keys in a dictionary and mapping them to argument names. For example, in the above example, I could say:

hello(**{'last':'Lecocq','first':'Dan'})

Of course, in that case it’s a little verbose, but if you’re getting a dictionary of arguments programmatically, then it’s invaluable. But wait, there’s more! Not only can you use the **dict operator to map a dictionary into parameters, but you can accept arbitrary parameters with it, too!
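A sketch of the receiving side (the `greet` name is just for illustration):

```python
def greet(**kwargs):
    # kwargs is simply a dict of whatever named arguments were passed in
    return sorted(kwargs.items())

# Both invocations look identical from the function's point of view
greet(first='Dan', last='Lecocq')
greet(**{'first': 'Dan', 'last': 'Lecocq'})
```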

Magic! No matter how you invoke the function, it has access to the parameters. You can even split the difference, making some parameters named and some parameters variable. For example, if you wanted to create an instance of a class that you passed a name in for, initialized with the arguments you give it:

def factory(module, cls, **kwargs):
    # The built-in __import__ does just what it sounds like
    m = __import__(module)
    # Now get the class in that module
    c = getattr(m, cls)
    # Now make an instance of it, given the args
    return c(**kwargs)

factory('datetime', 'datetime', year=2011, month=11, day=8)
factory('decimal', 'Decimal', value=7)