C++ Coding Example, Using Hash Algorithm to Remove Duplicate Number by O(n)



Hashing is an important technique in Computer Science. There are many cases that you can use Hash to speed up your algorithms (e.g. by reducing from O(n^2) to O(n) complexity).

Simply put, a hash table is a relatively large (considered bigger than the number of total elements) key-value pair array where the key is the index of the array. We use hash function to compute the hash key and put an element into the hash table. The two characteristics of hash functions are: (1) if two elements have the same value, their hash keys are the same. (2) if two elements are different, their hash keys are likely to be different. The first characteristic ensure that we know where to get the value after we put it into the hash table. The second characteristic is also important so that the collision is less likely to happen. If collision happens frequently, then the performance will be degraded and a list may be even faster. A collision happens when two different elements have the same hash key, and then we may e.g. have to change the key by shifting one position to the right.

Suppose, we have a very big array of 64-bit (type long) integer numbers and we would like to find out how many duplicate numbers we need to remove from the array so that each number only occurs once. Also the new array has to be printed (order doesn’t matter). It is easy to think of a O(n^2) naive solution or maybe O(nlogn) solution (e.g. sorting the array). However, when the total number n is large, then the running time grows significantly.

So, let’s use the hash to do this in a more efficient manner by sacrificing the space complexity. Let’s assume the hash table contains values of 32-bit unsigned integer.

1
2
3
4
typedef unsigned int DWORD;
const int HASHSIZE = 1048576;  // 20MB
const int HASH = HASHSIZE - 1;
DWORD hasharr[HASHSIZE];
typedef unsigned int DWORD;
const int HASHSIZE = 1048576;  // 20MB
const int HASH = HASHSIZE - 1;
DWORD hasharr[HASHSIZE];

The size of the hash table needs to be a lot bigger than the data size. In this case, for demonstration purpose, we define the following:

1
2
const int MAXN = 20;
long data[MAXN];
const int MAXN = 20;
long data[MAXN];

A simple and quick hash key function for a 64-bit integers can be (adding each 32-bit integers):

1
2
3
4
5
DWORD hashKey(long x) {
    int p1 = (int)x;              // lower 32 bits
    int p2 = (int)(x >> 32);      // upper 32 bits
    return p1 + p2;
}
DWORD hashKey(long x) {
    int p1 = (int)x;              // lower 32 bits
    int p2 = (int)(x >> 32);      // upper 32 bits
    return p1 + p2;
}

Now, if we initialise the data by random numbers that give some duplicates, like this:

1
2
    srand (time(NULL));
    for (int i = 0; i < MAXN; i ++) data[i] = rand() % MAXN;
    srand (time(NULL));
    for (int i = 0; i < MAXN; i ++) data[i] = rand() % MAXN;

Now, pick a unique identifier that means uninitialized (first time) for hash table, in this case, the data numbers can't be the value of HASHSIZE, so we can set up the hash table by using this value.

1
    for (int i = 0; i < HASHSIZE; i ++) hasharr[i] = HASHSIZE;
    for (int i = 0; i < HASHSIZE; i ++) hasharr[i] = HASHSIZE;

And, the following O(n) algorithm will try each number and check if it appears in the table first (almost O(1) lookup).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
    int count = 0;
    for (int i = 0; i < MAXN; i ++) {
        DWORD key = hashKey(data[i]); // unbound hash key
        key &= HASH;  // bound it to hash index
        for (;;) {
                if (hasharr[key] == HASHSIZE) { // slot is available
                    hasharr[key] = i;  // means the number appears first time
                    break;
                }
                else if (data[hasharr[key]] == data[i]) { // same number
                    count ++;
                    break;
                } else {
                    key = (key + 1) & HASH; // different number, collision happens
                }
        }
    }
    int count = 0;
    for (int i = 0; i < MAXN; i ++) {
        DWORD key = hashKey(data[i]); // unbound hash key
        key &= HASH;  // bound it to hash index
        for (;;) {
                if (hasharr[key] == HASHSIZE) { // slot is available
                    hasharr[key] = i;  // means the number appears first time
                    break;
                }
                else if (data[hasharr[key]] == data[i]) { // same number
                    count ++;
                    break;
                } else {
                    key = (key + 1) & HASH; // different number, collision happens
                }
        }
    }

Notice that the hash function returns the unbound key and we need to make it bounded to the range of table. There is a ugly inner 'endless' for loop however, assume the hash function is good enough (collision less likely to happen), then it is in fact in constant running time complexity.

We put the most likely condition check at first [check this post], and the less likely condition (collision) at last.

If key collision occurs (different numbers share same hash key), then we try next slot on the right until it is not used. The bit 'and' operation will make the key to the zero if it reaches the maximum (the wheels go round and round).

The next piece of code loops the hash table and print out the unique numbers.

1
2
3
4
5
6
    cout << count << " numbers removed." << endl;
    for (int i = 0; i < HASHSIZE; i ++) {
        if (hasharr[i] != HASHSIZE) {
            cout << data[hasharr[i]] << " ";
        }
    }
    cout << count << " numbers removed." << endl;
    for (int i = 0; i < HASHSIZE; i ++) {
        if (hasharr[i] != HASHSIZE) {
            cout << data[hasharr[i]] << " ";
        }
    }

The overall runtime complexity is O(n + H) where n is the data size and the H is the size of the hash table. A significant performance speedup can be observed if the n goes relatively large. The additional space O(H) is required.

hash-example C++ Coding Example, Using Hash Algorithm to Remove Duplicate Number by O(n) array C/C++ general hash implementation integer

hash example c++

–EOF (Coding For Speed) —

GD Star Rating
loading...
907 words Last Post: How Many One's Between Number 1 to N ?
Next Post: Simple C++ Coding Exercise - Use Lookup Table to Eliminate the Condition Checks

The Permanent URL is: C++ Coding Example, Using Hash Algorithm to Remove Duplicate Number by O(n) (AMP Version)

One Response

Leave a Reply