The hash package: hashes come to R

Perl has hashes. Python has dictionaries. Why doesn’t R have an equivalent? Hash tables and associative arrays are indispensable tools for the programmer. One of the most common and basic tasks of a programmer is to “look up” or “map” a key to a value. In fact, there are projects whose sole raison d’être is making the hash as fast and as efficient as possible.

R actually has two equivalents, both lacking. The first is R’s named vectors and lists. Elements of vectors and lists can be accessed by name, through the standard R methods:

obj$name
obj['name']
obj[['name']]

Vectors and lists are not backed by an internal hash table, so look-up by name is a linear scan: as they grow large, performance suffers. The impact is tangible even on small lists. For programs doing many look-ups, or look-ups on many objects, this can create a bottleneck.

R’s environments are much closer to Perl hashes and Python dictionaries. An environment is internally a hash table, and look-ups do not appreciably degrade as the number of objects grows. To use an R environment, you create it and assign key-value pairs to it.
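For example, a plain environment can serve as a rudimentary hash table using only base R (a minimal sketch; the key names are made up):

```r
# Create an environment backed by a hash table
e <- new.env(hash = TRUE)

# Assign key-value pairs
assign("alpha", 1, envir = e)
e[["beta"]] <- c(2, 3)        # [[<- assignment also works on environments

# Retrieve values
get("alpha", envir = e)       # 1
e[["beta"]]                   # c(2, 3)

# Test for a key, list all keys, delete a key
exists("alpha", envir = e)    # TRUE
ls(e)                         # "alpha" "beta"
rm("alpha", envir = e)
```

Note that `assign()`, `get()`, `exists()`, and `rm()` are verbose compared to the `$`, `[`, and `[[` accessors used everywhere else in R, which is the usability gap discussed next.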

Usability. In designing the S language, John Chambers put much thought into how the analyst and statistician interact with data. All variables are designed to be vectors, and a standard set of accessors ($, [, [[) was defined to retrieve and set slices, subsets, or elements of the data. The problem is that R environments don’t follow this pattern. And this is where the hash package comes in.

The hash package is designed to give R’s environments the standard R accessor syntax and give programmers a proper hash. The package provides one constructor function, hash(), which takes a variety of arguments and always does the right thing. All of the following work:
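The original code listing did not survive extraction from the page; the following calls are a sketch of the constructor forms the package documents (argument shapes assumed from the package’s help page):

```r
library(hash)

h <- hash()                                   # an empty hash
h <- hash(a = 1, b = 2, c = 3)                # named arguments
h <- hash(c("a", "b", "c"), 1:3)              # parallel key and value vectors
h <- hash(keys = letters[1:3], values = 1:3)  # explicit keys= / values=
h <- hash(list(a = 1, b = 2, c = 3))          # a named list

# And the familiar accessors work on the result:
h$a          # 1
h[["b"]]     # 2
```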

This entry was posted on Sunday, July 26th, 2009 at 11:50 pm and is filed under R. You can follow any responses to this entry through the RSS 2.0 feed.

Much of the problem stems from how R handles environments. It is going to take a bit of thought and some effort this weekend. Until then I would consider a hash inside a hash as unsupported. But I understand that this is a very important use case for both of us.

* * * * *

I am presently testing a fix. If all works well the package will be posted to CRAN soon.

Sorry for the late reply and thanks for the bug report. Indeed this was not behaving as expected. Moments ago, I released hash version 1.10.0 to address this and other minor annoyances. It should be trickling its way through the CRAN mirrors. Here is the result of your test case:

Yes. You can emulate hashes by using data.frames and the merge() function. You can also use named lists. And if all items of the hash are of the same base class, you can even use named vectors.
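Each of these emulations looks roughly like this (a sketch with made-up keys):

```r
# Named vector: all values must share one base class
v <- c(apple = 1, banana = 2)
v[["apple"]]                                   # 1

# Named list: values may be heterogeneous
l <- list(apple = 1, banana = "yellow")
l[["banana"]]                                  # "yellow"

# data.frame + merge(): a two-column look-up table
df <- data.frame(key = c("apple", "banana"), value = c(1, 2))
merge(df, data.frame(key = "banana"))$value    # 2
```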

The problem with each of these emulations is that they do not scale well: look-up is O(n). They are fine when n is small but disastrous when it grows. The hash package solves this by using R’s environments, i.e. real hash tables. In fact, the hash package merely provides a more intuitive interface to them.

Try it. Compare using data frames and merge tables against hash on a million records.
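A rough benchmark along those lines, using only base R so it needs no packages (timings will vary by machine; the key names are arbitrary):

```r
n <- 1e6
keys <- sprintf("k%d", seq_len(n))

# Environment-backed look-up (what the hash package uses internally)
e <- list2env(setNames(as.list(rep(1, n)), keys))
lookups <- sample(keys, 1000)
system.time(for (k in lookups) get(k, envir = e))

# Named-vector look-up: each [[ ]] scans the names linearly
v <- setNames(rep(1, n), keys)
system.time(for (k in lookups) v[[k]])
```

The environment version should stay fast regardless of n, while the named-vector version degrades linearly with n.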

It took me a month to find the time to fix it, but I uploaded version 1.99.3 of the hash package to CRAN earlier today. It was tested on the development branch of R (version 2.11), so it should work great for you. If you have any problems, please let me know.

I’ve found this package quite useful, thank you for creating it. However, I’ve noticed that with a moderately large hash (~5k items) inserting can become very slow due to the check to see if the item is present, which fetches the entire list of keys and searches it. It appears to be a trivial change to work around this by using a tryCatch block to catch the error generated by R if the key is missing:
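The commenter’s code was lost from the page; the idea can be sketched in base R terms. Rather than fetching the full key list to test for membership before each insert, attempt the get() directly and treat the resulting error as "key absent" (a hypothetical helper, illustrated on a plain environment rather than the package’s internals):

```r
e <- new.env(hash = TRUE)
e[["x"]] <- 42

# Return the value for 'key', or 'default' if absent, without
# building and scanning the whole key list first.
get_or_default <- function(env, key, default = NULL) {
  tryCatch(get(key, envir = env, inherits = FALSE),
           error = function(err) default)
}

get_or_default(e, "x")        # 42
get_or_default(e, "missing")  # NULL
```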
