Saturday 22 March 2014

Implementation of Java HashMap and significance of hashcode

Why hashcode() should be implemented along with equals() 


We all know that hashcode() method needs to be implemented when you are implementing equals() method. The significance of hashcode() implementation comes into picture when we store our object in a collection that uses hashing algorithm.
For a collection that uses hashing algorithm, the objects are stored in the array at an index that is calculated based on the hashing algorithm. 

The structure of an element in a HashMap is as follows:

static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        final int hash;
}

Handling Hash Collision

The HashMap is an array of the above element which is a linear linked list before Java 8. It is possible that multiple elements of the HashMap results in having the same resulting index based on the hashcode and the algorithm to calculate the index for storing the element.
If the HashMap index already have an element, then the new element is stored as the above mentioned linked list as the next element.
While we do a contains operation in a HashMap, based on the hashcode, the index in which the element is stored is calculated and using equals, the key value of the elements are compared to figure out if the element exists in HashMap.
The worst case search time is O(n) for linked list and Java 8 has done changes for performance optimizations.

The linked list in the HashMap array is replaced with an Array/TreeMap combination in Java 8.
So, worst case search time with a tree structure  for hash collision comes down to O(logn).
The hash collision data is stored in an array  till the trreify size - by default 8. After that data is maintained in a TreeMap data structure for faster search. TreeMap internally uses a red-black tree structure which is a self balancing tree.

Hash code and equals

If hashcode is not implemented along with equals, the HashMap could return wrong results due to the searching logic discussed above. The default implementation of hashcode would give different hashcode values for two object instances even when they satisfies equals method. This means that the index calculated based on hashcode for an object instance which satisfies equals condition could be different. This would result checking equals on objects stored in a wrong index and the HashMap search would fail.

Resizing of a hashmap


Another important thing to know about the implementation of HashMap is how the resizing has been handled. Based on the logic with which HashMap is storing the data, even without any resizing it is possible to store all the elements. But that means all the operations would eventually becomes slow. So, we need to resize HashMap at certain point in time so that the index calculation base on hashcode will give a distributed list. The more the indexes, the possibility of index getting distributed is more and the HashMap would have better performance but we need to utilize more memory. So, striking a balance between space complexity and time complexity, the optimum solution to resize HashMap is when the HashMap is filled 0.75 times it's actual size.

Load factor: 

The above said value is known as the load factor.  Load factor is used to calculate the arraysize of HashMap at which resizing should happen.0.75 is the default value of the load factor and can be modified by the user.

Capacity:

The total size of the HashMap array is known as capacity.

Threshold:

threshold is capacity * load factor. Threshold is the size of HashMap on reaching which the resizing of HashMap has to happen. 

HashMap always maintain size as power of 2 with the default initial size as 16. If you give an initial capacity which is not a power of 2, then the next highest power of 2 will be used as initial size. This is because the internal algorithms used in HashMap works best with power of 2.


No comments:

Post a Comment