Why we need hash function

Consider this situation: there is an array of objects, each object has a unique identifier, also called key. The most common way to access the array is to find the object which's key equals the given key.

If the objects are randomly stored in the array, you have to scan the whole array and compare each object's key to the given parameter. If the array is sorted by the key, you can use binary search to located the object, it takes O(lgN) time.

How to located the object even faster? The answer is hash, it can achieve O(1) time.

Principles

The principles behind hash function is very simple: using an algorithm to calculate the object's key, and get a hash value, this hash value also called bucket number. A bucket contains all the objects that its hash value equals to the bucket number. In most implementations the bucket number is the index of an array, the objects in one bucket stored as linked list. The array called hash table.

The simplest way to map a hash value to an array index is mod calculation: index = hash % tablesize. This is used by many implementations, like the Java HashMap.

The worst case is all the keys have the same hash value, all the objects will be stored in the same bucket, and the performance degrade to random sequence array. The best case is every object can map to different bucket, every bucket has only one object.

Hash functions are similar to random number generators in many ways, the randomness is the ultimate goal of a good hash function.The best case, or truly randomness is idealized and unrealistic , most function stay in between the best and worst.

Here are some illustrations:

 

Hash in JAVA

The root class Object has a method Object.hashCode(), will return the hash code of the object, so the hash concept applies for all Java objects of all time. For example the String object:

 
    public int hashCode() {
        int h = hash;
        if (h == 0 && count > 0) {
            int off = offset;
            char val[] = value;
            int len = count;
 
            for (int i = 0; i < len; i++) {
                h = 31*h + val[off++];
            }
            hash = h;
        }
        return h;
    }
 

Here is how HashMap calculate the array index:

 
 
static int indexFor(int h, int length) {
        return h & (length-1);
    }
 
 

The length is the size of the hash table, and the value usually is 2^n so the h & (length-1) is the same as h % length.

Hash in Memcache

Cache system like Memcache introduced a new type of hashing, in this system , object might distributed on multiple machine, so we need an algorithm to map the object to a specific machine. We can think every machine is a bucket, all the machine is a hashtable. The classic way, that the mod calculation can not work very well, because the machine may add or remove dynamically which cause bucket number and the responded machine change drastically.

Memcache using the consisting hash to resolve the problem, the hash should be implemented in the client side.

The range of key can be treat as a circle, for example the key length is 16 bits , the range is 0 to 65535. We can distribute the servers on the circle:

Each key will search along the round to find the machine that will store this object. If machine 0 was down , only the key reside between machine 1 and machine 0 were affected. The cache system will work as well as usually when the size of machine change dynamically.

Hash function and NoSQL

The NoSQL movement today is actually an application of hash function. All NoSQL databases are Key Value based, they can take the advantage of the hash function and gain performance.

In many scenarios , hash function is the key factor to improve performance of application. Because hash algorith is the only algorithm finishes in O(1) time.

Reference

The Art of Hashing

8.3 Hash Tables

memcache internals

Consistent Hashing in memcache-client