Hash tables

Hi there, in this article I want to talk about hash tables and implement it with c++, let's get started.

Hash table is data structure that implements an associative array. Hash tables use hash function to compute an index, which makes them really fast in finding elements. Because of this they're widely used in comuter software.

Hash table

I'll define HashTable class to implement data structure. Here you can see definition of the main functions of class.

#include <iostream>
#include <list>
#include <string> 

class HashTable{
private: 
    static const int hash_groups = 1000;
    static const int lower_limit = 1;
    static const int upper_limit = 100000;
     
    std::list<std::pair<int, std::string>> table[hash_groups];
public:
    void insert_item(const int key, const std::string& value);
    void remove_item(const int key);
    bool is_empty() const;
    const int hash_function(const int key);
    void print();

Hash functions

Now let's talk about hash functions, it's the most interesting part of hash tables. There're some bases for creating "right" hash function:

It should be efficiently computable, let's say in constant time and usually with using arithmetic operations
it should produce few collisions. Collision is situation, when hash function generates the same output for different inputs. To put in simply: h(x) = h(y), even though x != y.

Exist a lot of different hash functions.

Example of hash function (division hashing):

const int HashTable::hash_function(const int key){        
    // distribute numbers by last value of number        
    // 101 => first group; 607 => seventh group        
    return key % hash_groups;    
}

It satisfies first criteria of efficiency, but consecutive keys are mapped to consecutive entries, and this is does not do a good job of breaking up clusters.

Here are 3 more commonly used hash functions:

1) Multiplicative hash function:

h(x) = (ax) mod m:

const int HashTable::hash_function(const int key){
    int a = lower_limit + rand() % upper_limit;
    
    return (a * key) % hash_groups;
}

2) Linear hash function:

h(x) = (ax + b) mod m:

const int HashTable::hash_function(const int key){
    int a = lower_limit + rand() % upper_limit;
    int b = lower_limit + rand() % upper_limit;
    
    return (a * x + b) % hash_groups;
}

3) Polynomial hash function:

Polynomial hash function is extend of the Linear hash function. it gives us opportunity to handle a sequences of objects, such as strings or multi-dimensional coordinates.

h(x₀, . . . , xₙ) = (∑(k-1, i=0) cᵢ(p^i)) mod m:

A useful algorithm for computing polynomials is called Horner’s rule.

const int HashTable::hash_function(const std::string key){
    int p = lower_limit + rand() % upper_limit;
    int hash_value = 0;
    for(int i = key.length() - 1; i >= 0; i--){
        hash_value = (p * hash_value + static_cast<int>(key[i])) % hash_groups;
    }
    
    return hash_value;
}

Universal hashing

The main problem of listed hash functions, that it's easy to find elements which hash function's will have the same result. But at the same time with random elements we are able to talk about some kind of math prediction. Actually for random data it'll be working just fine, the average time of finding element'll be O(1). But the world is cruel and random data is a really rare thing. That's why hash function should be designed for each data in it's own way. Here comes the idea about "universal hashing".

A family of hash functions H = {hᵢ}, hᵢ: X->{0,1...M-1} is called universal if
Ɐx₁,y₁ є X, x₁ != y₁
P(hᵢ(x₁) = hᵢ(y₁)) <= 1/M. In words: A family of hash-functions is called universal if for two random not equal elements the probability of choosing hash-function gives collision not bigger, than 1/M. With universal hashing there can't be any type of data, which could make hash table works slow. It's just impossible because we choose hash function in a random way.

Let's have a look at the code for other functions.

Firstly, insert_item. I think there is nothing complicated about it. We iterate over list until find the right position for value variable based on hash value.

void HashTable::insert_item(const int key, const std::string value){
    //choose hash function depends on your needs 
    int hash_value = hash_function(key); 
    
    auto& cell = table[hash_value]; 
    auto begin_itr = std::begin(cell); 
    bool key_exists = false;
     
    for (; begin_itr != std::end(cell); begin_itr++){ 
        if (begin_itr->first == key){ 
            begin_itr->second = value; 
            key_exists = true; 
            std::cout << "Key exists, values was replaced" << std::endl; 
            break; 
         } 
     }
      
     if (!key_exists) 
         std::cout << "Key was inserted" << std::endl; 
         cell.emplace_back(key, value); 
}

Almost the same for remove_item. Iterate over list, until find key value.

void HashTable::remove_item(const int key){ 
    int hash_value = hash_function(key); 
    
    auto& cell = table[hash_value]; 
    auto begin_itr = std::begin(cell); 
    bool key_exists = false; 
    
    for (; begin_itr != std::end(cell); begin_itr++){ 
        if (begin_itr->first == key){ 
            begin_itr = cell.erase(begin_itr); 
            key_exists = true; 
            std::cout << "Key exists, values was erased" << std::endl; 
            break; 
         } 
     } 
     if (!key_exists) 
         std::cout << "Key doesn't exist" << std::endl; 
}

Typical and evident is_empty function:

bool HashTable::is_empty() const{ 
    int sum = 0; 
    for (int i = 0; i < hash_groups; i++){ 
        sum += table[i].size(); 
        if (sum > 0) 
            return false; 
     }
     
     return true; 
}

Function print also is evident.

void HashTable::print(){ 
    for (int i = 0; i < hash_groups; i++){ 
        if (table[i].size() == 0) 
            continue;
             
        auto begin_itr = table[i].begin(); 
        std::cout << i << " cell" << std::endl;
         
        for (;begin_itr != table[i].end(); begin_itr++){ 
            std::cout << "Key: " << begin_itr->first << " Value: " << begin_itr->second << std::endl; 
        } 
        std::cout << "- - -" << std::endl; 
     } 
}

We did a good work for now, but hash table still doesn't have any mechanism to resolve collisions and it's going to be our next topic for the next article.

Full code: https://github.com/rastr-0/Simple_algorithms/blob/master/hash_table/HashTable.cpp

see ya soon

@rastr

References: