Friday, 13 February 2015

Understanding how hashing (HashMap and HashSet) works in Java

Background

Hashing is a very important concept in computer programming and a very popular topic for interview questions. Two important data structures related to hashing are HashMap and HashSet, which are exactly what we are going to look at in this post.

Overview

If you come from a computer science background, you will know that HashMap is a data structure that stores key-value pairs, whereas a HashSet stores unique elements. A HashMap has buckets, and each entry added to it falls into one of the buckets depending on the hash value of its key. You must also have heard that adding and retrieving objects in a HashMap takes O(1) time on average.

But there are still open-ended questions, like:

What happens when two objects added to a HashMap have the same hash (code) value? This situation is typically known as a collision.

If the above is handled, how do get and put work in HashMap? ... and so on.

We will address them now.

Understanding how HashMap works

You can visualize a HashMap as follows:

You have an array, and each array position is essentially a linked list. As you know, you don't have to specify the size of a HashMap; it grows dynamically, like ArrayList. The main data structure is essentially an array.

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

When you create a HashMap you can either provide an initial capacity or, if you don't, a default value is used.

/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 16;

When the table is created, it is created with either this default initial capacity or the capacity you provide in the constructor.

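The next important thing is when our table should be resized to meet the dynamically changing data size of the HashMap. The answer depends on a threshold that is determined by the load factor: the threshold is the capacity multiplied by the load factor (0.75 by default), and the table is resized once the number of entries goes past it.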
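The resize rule can be sketched roughly as below. The class and method names are made up for illustration and are not part of java.util.HashMap; the only values taken from the JDK are the default capacity of 16 and the default load factor of 0.75.

```java
// Illustrative only: ResizeThreshold and its methods are hypothetical names,
// not part of java.util.HashMap.
class ResizeThreshold {
    static final int DEFAULT_INITIAL_CAPACITY = 16;
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    // Number of entries the table can hold before it should grow.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // True once the entry count has gone past the threshold.
    static boolean shouldResize(int size, int capacity, float loadFactor) {
        return size > threshold(capacity, loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(threshold(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR)); // 12
        System.out.println(shouldResize(13, 16, 0.75f)); // true
    }
}
```

So with the defaults, a table of 16 buckets is expected to grow around the time the 13th entry is added.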

How is data with keys generating the same hash code stored in HashMap?

As mentioned earlier, each index holds a reference to an object of type Entry, which is a node in a linked list: it has a next pointer. When an entry is added to the hash map, the hash code of its key is computed, and that determines the index at which it should be put. If that index already has an entry, the new entry is added at the start of the linked list and the existing linked list becomes its next. (This is the behaviour of the pre-Java 8 implementation quoted here; the Java 8 implementation appends new entries to the end of the list instead.)
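To make the chaining idea concrete, here is a deliberately tiny sketch of an array of linked lists with head insertion. It is not the JDK code: TinyChainedMap, its Entry class and indexFor are hypothetical names, the table never resizes, and the index computation is a simplification. A get method is included only so the sketch can be exercised.

```java
import java.util.Objects;

// A deliberately tiny chained hash map; NOT the JDK implementation.
// TinyChainedMap, Entry and indexFor are hypothetical names for illustration.
class TinyChainedMap<K, V> {
    static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next; // next entry in the same bucket

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    // Maps a key to a bucket index; a null key goes to bucket 0.
    private int indexFor(K key) {
        return key == null ? 0 : (key.hashCode() & 0x7fffffff) % table.length;
    }

    void put(K key, V value) {
        int i = indexFor(key);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                e.value = value; // same key already present: overwrite
                return;
            }
        }
        // New key: insert at the head, the old chain becomes its next.
        table[i] = new Entry<>(key, value, table[i]);
    }

    V get(K key) {
        for (Entry<K, V> e = table[indexFor(key)]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

Putting the keys "Aa" and "BB", which share a hash code, places both entries in the same bucket, with the later one at the head of the chain.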

Also note how null is handled: yes, null is an acceptable key in HashMap. A null key is treated as having hash 0, so it lands in the first bucket.
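A quick way to see the null key in action with the real java.util.HashMap (NullKeyDemo is just a throwaway class name for this example):

```java
import java.util.HashMap;
import java.util.Map;

// NullKeyDemo is a hypothetical class name used only for this example.
class NullKeyDemo {
    static Map<String, String> build() {
        Map<String, String> map = new HashMap<>();
        map.put(null, "value for null key"); // allowed: HashMap permits one null key
        map.put("a", "value for a");
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = build();
        System.out.println(map.containsKey(null));
        System.out.println(map.get(null));
    }
}
```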

So how is data retrieved if two keys generate the same hash code?

The same hash code makes the search for both keys land on the same index in the table. From there, each Entry object in the list is iterated over and its key is compared with the search key using equals(). Yes, both key and value are stored in the Node/Entry object! On a successful match, the corresponding value is returned.
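Here is a small check with the real java.util.HashMap. The strings "Aa" and "BB" are a well-known pair with identical hash codes, so they are guaranteed to collide, yet each key still returns its own value because the keys are compared with equals(). CollidingKeysDemo is a made-up class name.

```java
import java.util.HashMap;
import java.util.Map;

// CollidingKeysDemo is a hypothetical class name used only for this example.
class CollidingKeysDemo {
    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2); // same hash code as "Aa", so same bucket
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        Map<String, Integer> map = build();
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
    }
}
```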

A good hash function

A hash function is a method that computes, from a key, the position at which its data is stored. It should obviously return a value between 0 and n-1 for an array of length n used to store the data.

A good hash function takes minimal computation time and distributes keys evenly across the array.

For an array of size n with m elements inserted, its load factor is

α = m/n

A hash function can be thought of as having two parts:

Hash code map (Key -> Integer)

Compression map (Integer -> [0, n-1])

A simple hash function (compression map) for an array of size n would be

h(k) = k mod n, where k is the key and n is the size of the array.
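In Java this compression map might be sketched as follows. One practical detail: hashCode() can be negative and the % operator keeps the sign of the dividend, so the sketch uses Math.floorMod to stay inside [0, n-1]. ModCompression and compress are hypothetical names.

```java
// ModCompression is a hypothetical helper class for illustration.
class ModCompression {
    // Compression map: integer hash code -> bucket index in [0, n-1].
    // Math.floorMod is used because % can return a negative result in Java.
    static int compress(int hashCode, int n) {
        return Math.floorMod(hashCode, n);
    }

    public static void main(String[] args) {
        System.out.println(compress(17, 5)); // 2
        System.out.println(compress(-3, 5)); // 2, whereas -3 % 5 would be -3
    }
}
```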

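NOTE: With h(k) = k mod n, take care that n is not a power of 2. If n = 2^p, you are only using the lowest p bits of the key to compute the hash, which is not a good method. (java.util.HashMap does use power-of-two table sizes, and compensates by first mixing the higher bits of the hash code into the lower ones.)

So choose n to be a prime, sized close to the number of elements m you expect to store.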
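The effect of a power-of-two table size can be seen with a few keys that differ only in their higher bits. The sketch below counts how many distinct buckets a set of keys lands in under plain modulo compression; TableSizeDemo and distinctBuckets are hypothetical names, and the key values are chosen purely to illustrate the point.

```java
import java.util.HashSet;
import java.util.Set;

// TableSizeDemo is a hypothetical class used only to illustrate the note above.
class TableSizeDemo {
    // Counts how many different buckets the keys fall into for table size n.
    static int distinctBuckets(int[] keys, int n) {
        Set<Integer> buckets = new HashSet<>();
        for (int k : keys) {
            buckets.add(Math.floorMod(k, n));
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        int[] keys = {16, 32, 48, 64}; // identical lowest 4 bits
        System.out.println(distinctBuckets(keys, 16)); // 1: every key lands in bucket 0
        System.out.println(distinctBuckets(keys, 13)); // 4: the prime size spreads them out
    }
}
```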

Another compression map can be:

h(k) = floor(n * (k * A mod 1)), where k is the key, n is the size of the array and A is a constant between 0 and 1, i.e. 0 < A < 1. Here "k * A mod 1" means the fractional part of k * A.
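A minimal sketch of this multiplication-style compression map, assuming the intended formula is h(k) = floor(n * (k * A mod 1)) with 0 < A < 1 and a non-negative key k. The constant A = (sqrt(5) - 1) / 2 is a commonly suggested choice, not something this post prescribes, and MultiplicationHash is a hypothetical name.

```java
// MultiplicationHash is a hypothetical helper class for illustration.
class MultiplicationHash {
    // A commonly suggested constant; any value with 0 < A < 1 fits the formula.
    static final double A = (Math.sqrt(5) - 1) / 2;

    // h(k) = floor(n * (k * A mod 1)) for a non-negative key k.
    static int hash(int k, int n) {
        double fractional = (k * A) % 1.0; // fractional part of k * A
        return (int) Math.floor(n * fractional);
    }

    public static void main(String[] args) {
        for (int k = 0; k < 5; k++) {
            System.out.println(k + " -> " + hash(k, 16));
        }
    }
}
```

Unlike the modulo method, the result here does not depend on n being prime, so a power-of-two table size is fine.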

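NOTE: For Strings, avoid simply adding the ASCII values of the characters, as that is not a good hash function. Many different words (all anagrams, for example) produce the same sum and map to the same bucket, resulting in more collisions. Use a polynomial function instead. So if the integer values of your characters are c0, c1, c2, use a polynomial like c0*x^2 + c1*x + c2 for some constant x; Java's String.hashCode() uses this form with x = 31.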
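The difference between the additive and polynomial approaches can be sketched as follows. StringHashes and its methods are hypothetical names; the polynomial is evaluated with Horner's rule, and with x = 31 it matches what String.hashCode() computes.

```java
// StringHashes is a hypothetical helper class for illustration.
class StringHashes {
    // Naive additive hash: anagrams always collide.
    static int additive(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Polynomial hash evaluated with Horner's rule:
    // c0*x^(len-1) + c1*x^(len-2) + ... + c(len-1)
    static int polynomial(String s, int x) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = x * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(additive("abc") == additive("cba"));             // true
        System.out.println(polynomial("abc", 31) == polynomial("cba", 31)); // false
        System.out.println(polynomial("abc", 31) == "abc".hashCode());      // true
    }
}
```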

NOTE: In Java, if hashing is involved (let's say you are using a HashMap or a HashSet), make sure you override the equals() and hashCode() methods to suit your requirements, and keep them consistent: objects that are equal must return the same hash code. As much as is reasonably practical, the hashCode() method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer.)
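As an example of overriding both methods together, here is a small value class that can safely be used as a HashMap key. GridPoint is a hypothetical class invented for this sketch; Objects.hash is one convenient way to combine fields, not the only one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// GridPoint is a hypothetical value class used only for this example.
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GridPoint)) return false;
        GridPoint other = (GridPoint) o;
        return x == other.x && y == other.y;
    }

    // Equal points must produce equal hash codes, so hash the same fields.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        Map<GridPoint, String> labels = new HashMap<>();
        labels.put(new GridPoint(1, 2), "start");
        // A different but equal instance finds the entry.
        System.out.println(labels.get(new GridPoint(1, 2))); // start
    }
}
```

Without the overrides, the second GridPoint(1, 2) would be a different key as far as the map is concerned and the lookup would return null.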

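HashMap changes in Java 8

The performance has been improved by using balanced trees instead of linked lists under specific circumstances: once the chain in a single bucket grows past a small threshold (TREEIFY_THRESHOLD, which is 8 in the JDK 8 HashMap source), it is converted into a tree. It has only been implemented in the classes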

java.util.HashMap,

java.util.LinkedHashMap and

java.util.concurrent.ConcurrentHashMap.

This will improve the worst case performance from O(n) to O(log n).

Lastly, let's look at HashSet.

Understanding HashSet

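Well, if you are still guessing the data structure behind HashSet, this may come as a surprise: a HashSet is backed by a HashMap internally. Each element is stored as a key, with a shared dummy object as the value.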
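The idea of a set built on top of a map can be sketched like this. It is a simplified illustration rather than the JDK source: TinyHashSet is a hypothetical name, and only a few operations are shown.

```java
import java.util.HashMap;

// TinyHashSet is a hypothetical, simplified sketch; it is not java.util.HashSet.
// Elements are stored as map keys, and every key shares one dummy value.
class TinyHashSet<E> {
    private static final Object PRESENT = new Object();
    private final HashMap<E, Object> map = new HashMap<>();

    // Returns true if the element was not already in the set.
    boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

    boolean contains(Object o) {
        return map.containsKey(o);
    }

    // Returns true if the element was present and has been removed.
    boolean remove(Object o) {
        return map.remove(o) == PRESENT;
    }

    int size() {
        return map.size();
    }
}
```

Uniqueness comes for free: adding an element twice simply overwrites the same key, so the set never holds duplicates.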

Understanding LinkedHashMap

LinkedHashMap also stores the head and tail of a doubly linked list that runs through all of its entries, and that's how it maintains the insertion order. So even though put follows hashing for storage, the iteration order is maintained using the doubly linked list.

When iterating, it goes from head to tail, thereby providing the same order as insertion.

NOTE: The before and after pointers are in addition to the next pointer, which is inherited from the HashMap.Node class. So next points to the next node in the same bucket (the collision scenario), thereby preserving O(1) lookups, while the before and after pointers guarantee insertion-order iteration.
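A short check of the insertion-order guarantee, using the real java.util.LinkedHashMap (InsertionOrderDemo is a throwaway class name):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// InsertionOrderDemo is a hypothetical class name used only for this example.
class InsertionOrderDemo {
    static List<String> keysInIterationOrder() {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("banana", 1);
        map.put("apple", 2);
        map.put("cherry", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // Keys come back in the order they were inserted.
        System.out.println(keysInIterationOrder()); // [banana, apple, cherry]
    }
}
```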

Understanding TreeMap

TreeMap, as you already know, stores its data in sorted order of the keys. If the keys implement the Comparable interface, it stores the data in that natural order; alternatively, you can pass a custom Comparator to the TreeMap and it will use that to sort and store the data.

/**
 * The comparator used to maintain order in this tree map, or
 * null if it uses the natural ordering of its keys.
 *
 * @serial
 */
private final Comparator<? super K> comparator;
private transient Entry<K,V> root;
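Both orderings can be seen with a small example using the real java.util.TreeMap. TreeMapOrderDemo and its helper methods are made-up names; the reverse-order comparator is just one possible custom comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// TreeMapOrderDemo is a hypothetical class name used only for this example.
class TreeMapOrderDemo {
    private static void fill(Map<String, Integer> map) {
        map.put("banana", 1);
        map.put("apple", 2);
        map.put("cherry", 3);
    }

    // No comparator: keys are sorted by their natural (Comparable) order.
    static List<String> naturalOrder() {
        Map<String, Integer> map = new TreeMap<>();
        fill(map);
        return new ArrayList<>(map.keySet());
    }

    // Custom comparator: here, simply the reverse of the natural order.
    static List<String> reverseOrder() {
        Map<String, Integer> map = new TreeMap<>(Comparator.<String>reverseOrder());
        fill(map);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(naturalOrder()); // [apple, banana, cherry]
        System.out.println(reverseOrder()); // [cherry, banana, apple]
    }
}
```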