Diving into How Hashes Work in Ruby

Ruby’s Hash is arguably the most useful data type in the language. Utilizing a structure that lends itself well to modelling real-world problems, combined with speedy lookups, it is frequently the backbone of scripts, web pages, and complex business logic. It was even once used internally by Ruby to store all your functions, variables, and constants (although modern Ruby approaches this differently by using branch prediction of hot paths and call-site method caching; that’s another post!) We often take it for granted that it “just works”. Today we’re going to delve into the internals of Hash, seeing exactly why it works, how it is so efficient, and how it is implemented. Onward!

Implementation

First, the runtime allocates a contiguous block of memory for the data structure. Next it hashes the first key:

:google.hash
=> 3592

The hash function is declared in Ruby’s kernel class, so it’s available to any class with that in its inheritance chain. The actual number will vary depending on a number of factors, but what is important is that it remains consistent across a session.

Next, we have to store the value. The value is placed in what is called a ‘bucket’ (or a ‘bin’ if you’re coming from other languages). A Ruby hash will allocate 11 buckets initially, so we need to decide in which bucket to store our hash value. For that, the runtime simply performs the following calculation:

:google.hash % 11
=> 5

This means that the value for :google, which is 801.23, is placed in the 5th bucket of the hash table, with an identifier for its key that is something like:

[:google, 801.23]

It repeats this process for the rest of the key/value pairs. To pull it back out, the interpreter simply repeats the calculation and pulls out the value from the bucket. Simple right? Let’s get into the edge cases.

Collisions

The more astute readers among you may have caught on to the fact that if there are only 11 possible buckets, sooner or later the calculation is going to place two values in the same bucket. What happens then? Well, Ruby’s hash implementation will store the items in the same bin in the form of a linked list. If the identifier (the key) matches the one stored (it checks this with equal?), it returns the value, and if it doesn’t, it simply continues traversing the list.

Of course, this process could be time consuming – what if an item shared a bucket with 1000 others? It could theoretically take a large amount of time to return the desired value, and the whole point of using a hash table is that it is fast. Ruby, of course, has you covered. It does this by keeping a running total of the average number of items in each bin across the whole table. Once this value exceeds 5, Ruby discards the array of 11 bins, creates a new one and it is at this point where Ruby will rehash the values. By doing this, the linked lists in bins are kept short, which decreases overall lookup time.

Lookup time is the key selling point for a hash. For the Big O Notation-inclined amongst you, a hash table will look up values in constant time; that is that as the number of values in the table increases, the lookup time remains the same. In practice, it’s closer to O (log n), but let’s not get too pedantic!

The Importance of a Good Hashing Algorithm

To cement our understanding of the hash implementation, let’s try a brief experiment: What would happen if a custom hashing function returned the same integer each time? Our mental model of the hash suggests that this will append each item to a linked list in the same bin, and the time to look up an item towards the end of that list will be much higher. Let’s try with some code:

First we create a class, GoodHash, which uses Ruby’s native hashing algorithm. Then we loop 17 times (65536 items seems a good place to stop!), exponentiating a base of 2 to the power (ex). We then insert a random value into test_hash using an instance of GoodHash as the key. We look up on that value 1000 times and benchmark the results, before repeating the process with BadHash

Results (readings taken from ‘real’ column):

Hash size

GoodHash

BadHash

1

0.002808

0.003077

2

0.002841

0.003663

4

0.003018

0.004167

8

0.005757

0.004750

16

0.003502

0.006898

32

0.003174

0.011138

64

0.002843

0.019534

128

0.003080

0.041806

256

0.003159

0.086643

512

0.003137

0.147724

1024

0.003179

0.287432

2048

0.002843

0.554049

4096

0.002828

1.101022

8192

0.005380

2.192948

16384

0.003037

4.385111

32768

0.002830

8.739381

65536

0.002830

17.314654

This proves our hypothesis fairly conclusively. As the number of items in the hash grows, the lookup time with the BadHash class grows exponentially, while the GoodHash implementation remains firmly grouped around the same sort of time. The outliers in the GoodHash column also illustrate Ruby rehashing once the buckets get too full.

Interestingly, Ruby 2.0 onwards doesn’t work like this for smaller hashes. If your hash is 6 items or fewer, it is stored as an array list and traversed until the key matches the lookup. While hash tables are quick, by far the biggest cost is calculating the hash value; a process neatly sidestepped with this approach. In the event that the hash grows to 7 items, the array is discarded and the traditional hash is used. This is illustrated in the results table as with the entry at a hash size of 8.

How Does Ruby Allocate Bins?

Since the number of bins don’t increase linearly with the number of items in them, how does Ruby decide how many bins a hash should have? Well, Ruby calculates a list of prime numbers close to powers of 2. The benefit of this is that the number when combined with the hash value is much less likely to share a common factor. If this wasn’t true, then the modulus calculation we covered earlier could yield the same number each time, which would put everything into the same bin, not unlike our experiment prior.

Conclusion

For a dive into the internals of Ruby, this was relatively straightforward. While we’ve covered hashes pretty extensively, it is important to know when and where to use them. By far the largest costs in hashing are the rehashing, hash calculation, and bucketing, so be mindful of these when performance is a factor (the JRuby implementation avoids relying on hashes internally for this exact reason). Generally however, a hash is an excellent tool to reach for, given its intelligent design and careful optimizations. We now know exactly how a hash works, why it is so performant, and never to override the hash function for an object!

David Bush is a Web Developer who travels the world and writes code. His favourite languages are Clojure, Ruby and Python. He enjoys learning new technologies, beer, good food and trying new things.
www.david-bush.co.uk