
Note: Based on the response to my technical blog, I am posting this article here so that it is visible to a wider audience.

java.util.HashMap.java

/**
* The maximum capacity, used if a higher value is implicitly specified
* by either of the constructors with arguments.
* MUST be a power of two <= 1<<30.
*/
static final int MAXIMUM_CAPACITY = 1 << 30;

This is the maximum capacity to which the hashmap can expand: 2^30 = 1,073,741,824.

java.util.HashMap.java

/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 16;
/**
* The load factor used when none specified in constructor.
*/
static final float DEFAULT_LOAD_FACTOR = 0.75f;

This says the default size of the backing array is 16 (always a power of two; we will soon see why). The load factor means that whenever the number of entries reaches 75% of the current capacity (12 entries for the default capacity of 16), the hashmap doubles its capacity and rehashes the existing elements into the new array.

Hence, to avoid rehashing the data structure as it grows, it is best practice to explicitly give the expected size of the hashmap while creating it.
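For example, a minimal sketch of pre-sizing (expectedEntries and the capacity formula here are illustrative, not from the original post):

import java.util.HashMap;
import java.util.Map;

public class PresizedMapDemo {
    public static void main(String[] args) {
        int expectedEntries = 1000; // hypothetical expected number of entries

        // Pick an initial capacity so that expectedEntries stays below
        // capacity * loadFactor, meaning no resize/rehash while filling.
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;

        Map<String, Integer> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < expectedEntries; i++) {
            map.put("key-" + i, i); // no rehashing happens during this loop
        }
    }
}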

Do you foresee any problem with this resizing of the hashmap in Java? Since Java is multi-threaded, it is quite possible that more than one thread is using the same hashmap, and both threads may realize the need to resize it at the same time, which leads to a race condition.

What is the race condition with respect to hashmaps? When two or more threads see the need to resize the same hashmap, they might end up adding the elements of the old bucket to the new bucket simultaneously, which can lead to infinite loops. FYI: in case of collision, i.e., when different keys have the same hashcode, a singly linked list is used internally to store the elements, and every new element is stored at the head of the linked list to avoid traversing to the tail. Because of this, at resize time the entire sequence of objects in the linked list gets reversed, and during that reversal there is a chance of creating an infinite loop.
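To see where that reversal comes from, here is a sketch of the resize transfer loop, paraphrased from the pre-Java-8 OpenJDK HashMap source (Entry is the inner class from my previous post):

// Moves every entry from the old table into newTable. Each element is
// inserted at the HEAD of its new bucket, so every chain comes out reversed;
// two threads doing this concurrently can link entries into a cycle.
void transfer(Entry[] newTable) {
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i];  // head insertion into the new bucket
                newTable[i] = e;
                e = next;
            } while (e != null);
        }
    }
}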

When we put an element into the hashmap, it:
1. re-generates the hashcode using the hash(int h) method, passing the user-defined hashcode as an argument
2. generates the index based on the re-generated hashcode and the length of the data structure
3. if the key exists, overrides the element; otherwise creates a new entry in the hashmap at the index generated in STEP-2

Step 3 is straightforward, but steps 1 and 2 need a deeper understanding. Let us dive into the internals of these methods…

Note: These two methods are very important for understanding the internal working of HashMap in OpenJDK.
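For reference, here are the two methods (lightly abridged) from the pre-Java-8 OpenJDK java.util.HashMap source:

/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions. This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits.
 */
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length - 1);
}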

Here, ‘h’ is the hashcode (an int, so 32 bits) and ‘length’ is the table length, which is DEFAULT_INITIAL_CAPACITY by default (also a 32-bit int).

The comment from the above source code says: “Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions. This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.” What does this mean?

It means that if the algorithm we wrote for hashcode generation does not distribute/mix the lower bits evenly, it will lead to more collisions. For example, suppose our hashcode logic is “empId*deptId”. If deptId is even, it will always generate even hashcodes, because any number multiplied by an even number is even. If we depend directly on these hashcodes to compute the index and store our objects into the hashmap, then (as the sketch below shows):
1. the odd positions in the hashmap are always empty
2. because of #1, only the even positions are usable, which doubles the number of collisions
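A minimal sketch to see this concretely (empId and deptId are the hypothetical fields from the example above, and 16 is the default table length):

public class EvenHashCodeDemo {
    public static void main(String[] args) {
        int deptId = 4; // any even department id triggers the problem

        for (int empId = 1; empId <= 8; empId++) {
            int h = empId * deptId; // the naive hashcode from the example
            // index computation the way HashMap does it: h & (length - 1)
            System.out.println("hashCode=" + h + " -> index " + (h & 15));
        }
        // Every printed index is even; the odd buckets are never used.
    }
}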

For example,

I am considering some hashcodes our code might generate. They are perfectly valid, since they are all different, but we will soon see why they are useless as-is:
1111110111011010101101010111110
1100110111011010101011010111110
1100000111011010101110010111110
Passing these sequences directly (without the hash function) to the indexFor method, we AND each hashcode with length - 1, which is always a string of 1s because the length is a power of two.
Since we are using the default length of 16, the binary representation of 16 - 1 is 1111.
This is what happens inside the indexFor method:
1111110111011010101101010111110 & 0000000000000000000000000001111 = 1110
1100110111011010101011010111110 & 0000000000000000000000000001111 = 1110
1100000111011010101110010111110 & 0000000000000000000000000001111 = 1110

What is a bucket, and what is the maximum number of buckets in a hashmap? A bucket is an instance of the linked list (the Entry inner class in my previous post), and we can have at most as many buckets as the length of the hashmap; for example, a hashmap of length 8 can have a maximum of 8 buckets, each an instance of a linked list.

From this we understand that all the objects with these different hashcodes get the same index, which means they all go into the same bucket. That is a BIG FAIL, as it leads to O(n) lookup complexity (a linear scan of the bucket's linked list) instead of O(1).

This is exactly what the comment from the source code means by “…otherwise encounter collisions for hashCodes that do not differ in lower bits.”

Notice the sequence 0-15 (0000, 0001, 0010, …, 1110, 1111): 16 (2^4) is the default size of the hash table.

Notice that in a hashmap with a power-of-two length of 16 (2^4), only the last four bits matter in allocating a bucket; it is these 4 lower-bit variations that play the prominent role in identifying the right bucket.

Keeping the above sequence in mind, we re-generate the hashcode via hash(int h), passing the existing hashcode. This makes sure there is enough variation in the lower bits of the hashcode before it is passed to the indexFor() method, which uses the lower bits to identify the bucket and ignores the higher bits. For example, take the same hashcode sequences from the example above.
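Here is a runnable sketch of that, reusing the pre-Java-8 hash() and indexFor() methods shown earlier on the three sample hashcodes:

public class SupplementalHashDemo {
    // Supplemental hash from pre-Java-8 java.util.HashMap.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Index computation from pre-Java-8 java.util.HashMap.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int[] hashCodes = {
            0b1111110111011010101101010111110,
            0b1100110111011010101011010111110,
            0b1100000111011010101110010111110
        };
        for (int h : hashCodes) {
            // Without the supplemental hash all three map to index 14;
            // with it, the lower bits vary and the indexes differ.
            System.out.println("raw index = " + indexFor(h, 16)
                    + ", index after hash(h) = " + indexFor(hash(h), 16));
        }
    }
}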

So it is clear that, because of the re-generated hashcode, the lower bits are well distributed/mixed, leading to different indexes and hence different buckets, avoiding collisions.

Why these magic numbers 20, 12, 7, and 4? It is explained in the book “The Art of Computer Programming” by Donald Knuth. Here we are XORing the most significant bits of the number into the least significant bits (shifting by 20, 12, 7, and 4). The main purpose of this operation is to make the hashcode differences visible in the least significant bits, so that the hashmap elements can be distributed evenly across the buckets.

Going back to the previous steps:
1. re-generates the hashcode using the hash(int h) method, passing the user-defined hashcode as an argument
2. generates the index based on the re-generated hashcode and the length of the data structure
3. if the key exists, overrides the element; otherwise creates a new entry in the hashmap at the index generated in STEP-2

Steps 1 and 2 should be clear by now.

Step 3: What happens when two different keys have the same hashcode?
1. If the keys are equal, i.e., the to-be-inserted key and an already-inserted key have the same hashcode and the keys themselves are the same (by reference or via the equals() method), then the previous key-value pair is overridden with the current key-value pair.
2. If the keys are not equal, the key-value pair is stored in the same bucket as the existing keys.
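A small runnable sketch of both cases, using a hypothetical CollidingKey class whose instances all share one hashcode:

import java.util.HashMap;
import java.util.Map;

// Hypothetical key for illustration: every instance returns the same
// hashcode, so all entries land in the same bucket.
final class CollidingKey {
    private final String name;

    CollidingKey(String name) { this.name = name; }

    @Override
    public int hashCode() { return 42; } // identical hashcode for all keys

    @Override
    public boolean equals(Object o) {
        return o instanceof CollidingKey
                && ((CollidingKey) o).name.equals(this.name);
    }

    public static void main(String[] args) {
        Map<CollidingKey, String> map = new HashMap<>();
        map.put(new CollidingKey("a"), "first");
        map.put(new CollidingKey("a"), "override"); // case 1: equal keys, value replaced
        map.put(new CollidingKey("b"), "chained");  // case 2: same bucket, new entry
        System.out.println(map.size()); // prints 2
    }
}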

When does a collision happen in a hashmap? In case 2 of the question above.

How do you retrieve the value object when two keys with the same hashcode are stored in the hashmap? Using the hashcode we go to the right bucket, and using equals() we find the right element in the bucket and return it.

How are different keys with the same hashcode stored in a hashmap? The usual answer is “in a bucket”, but technically they are all stored in a single linked list. The small difference is that a new element is inserted at the head of the linked list instead of the tail, to avoid tail traversal.

Retrieval (get()) follows similar steps:
1. re-generates the hashcode using the hash(int h) method, passing the user-defined hashcode as an argument
2. generates the index based on the re-generated hashcode and the length of the data structure
3. points to the right bucket, i.e., table[i], and traverses the linked list, which is built from the Entry inner class
4. when the keys are equal and their hashcodes are equal, returns the value mapped to that key
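Putting it together, here is get(), paraphrased from the pre-Java-8 OpenJDK HashMap source:

// Find the bucket via the supplemental hash, then walk the Entry chain
// comparing first the hash and then the key itself.
public V get(Object key) {
    if (key == null)
        return getForNullKey(); // null keys live in bucket 0
    int hash = hash(key.hashCode());                       // step 1
    int i = indexFor(hash, table.length);                  // step 2
    for (Entry<K,V> e = table[i]; e != null; e = e.next) { // step 3
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;                                // step 4
    }
    return null;
}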
