Translate

Chủ Nhật, 4 tháng 12, 2016

Hash function

Created by Thang Vu

Problem discription.

Example: We have 100 different strings. We store them in an array (each element of the array represents a string). When we want to search a string X, the simplest way is consider each element of the array then compare with X. However, this procedure is not effective in processing time, especially, when the number of the array elements is huge. To solve this problem, a concept hash map is introduced. 

Definition. 

A hash map is a data structure used to implement an associative array, a structure that can map keys and values. A hash map uses a hash function to compute an index into an array of buckets, from which the desired value can be found.
Fig. 1. An example of hash map
In Fig. 1, a hash function transforms key (name) into index of bucket (contains telephone number).

Basic operations

Insert an entry
Search an entry
Delete an entry

Advantages

The main advantage of ash tables over the table data structures is speed. This advantage is more apparent when the number of entries is large. 

Disadvantages

The cost of a good hash function can be significantly higher than inner loop of the lookup algorithm for a sequential list or search tree. So hash maps are not effective when the number of entries is very small. 

A simple example

We need to create contact list (hash table) with name (key) and telephone number (value). 
First, the hash table is empty, we need to import contacts to hash table. The role of the hash function is mapping the name with the index of the hash table. For instance, the hash function returns the value that based on the first letter in the name. (Alexander -> 0, Bob -> 1, Dennis -> 3). Then we set the slot of index (0, 1, 3) with the respectively phone number. 
After creating (by insertion) the hash table, we can search for the phone number with a given key. For example, we want to search phone number of Alexander, hash function returns 0. We can know exactly the phone number is corresponding to index 0. 

Choosing a good hash function

A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve. 
A basic requirement is that the function should avoid collisions. A collision means there are more than one key have the same bucket index (Fig 2)
Fig. 2. Collision in a Hash map
A basic requirement is that the function should provide a uniform distribution af hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests. 

Collision resolutions. 

Hash collision are practically unavoidable when hashing a random subset of a large set of possible keys. 

Separate chaining. 

In the separate chaining method, each bucket has some sort of list of entries with the same index (Fig. 2). The time for hash table operations is the time to find the bucket plus the time for list operation. 

Separate chaining with linked lists. 
Separate chaining with linked lists is popular because they require only basic data structures with simple algorithms. The cost of a table operation is that of the scanning the entries of the selected bucket for the desired key. The disadvantages of separate chaining with linked lists is the overhead of the next pointer in each entry record can be significant, and the traversing a linked list has poor performance. 

Separate chaining with other data structure
Instead of a list, we can use any other data structure that supports the required operations. For example, by using a self-balancing tree, the theoretical worst-case time of common hash table operations can be brought down to O(log n) rather than O(n). However, this approach is only worth when each slot has many entries. 
The variant called array hash table uses a dynamic array to store all the entries that hash to the same slot. 
An elaboration on this approach is using a second hash map for each bucket. This variant guarantees constant worst-case lookup time. 

Open addressing.

In open addressing strategy, all entry records are stored in the bucket array itself. With this method, a hash collision is resolved by probing through alternate locations in the array until either the target record is found or an unused array slot is found, which indicates that there is no such key in the table. 
Fig. 3. An example of open addressing technique

Implement a simple hash map in C programming language. 

A hash map will be the collection of entries.
?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
 
#define CHAR_SIZE 100
#define SIZE      100
 
struct entry_s {
 char *key;
 char *value;
 struct entry_s *next;
};
 
typedef struct entry_s entry_t;
 
struct hash_table_s {
 int size;
 struct entry_s **table;
};
 
typedef struct hash_table_s hash_table_t;
 
/* allocate memory for hash table */
hash_table_t *create_ht(int size) {
 hash_table_t *hash_table = NULL;
 if (size < 1) return NULL;
 
 /* Allocate the table itself */
 if ((hash_table = malloc(sizeof(hash_table_t))) == NULL) {
  return NULL;
 }
 
 /* Allocate pointers to the head nodes */
 if ((hash_table-> table = malloc(sizeof(entry_t *)*size)) == NULL) {
  return NULL;
 }
 
 int i;
 for (i = 0; i < size; i++) {
  hash_table->table[i] = NULL;
 }
 
 hash_table -> size = size;
 return hash_table;
}
 
/* Hash a string for a particular hash table */
int hash(hash_table_t *hash_table, char *key) {
 unsigned long int hash_val = 0;
 int i = 0;
 for (i = 0; i < strlen(key); i++) {
  hash_val += key[i];
 }
 return hash_val % hash_table->size;
}
 
/* create a new entry for a given key and value */
entry_t *create_entry(char *key, char *value) {
 entry_t *new_entry;
 if ((new_entry = malloc(sizeof(entry_t))) == NULL) {
  return NULL;
 }
 if ((new_entry -> key = malloc(CHAR_SIZE)) == NULL) {
  return NULL;
 }
 if ((new_entry -> value = malloc(CHAR_SIZE)) == NULL) {
  return NULL;
 }
 strcpy(new_entry -> key, key);
 strcpy(new_entry -> value, value);
 
 new_entry -> next = NULL;
 return new_entry;
}
 
/* Insert a entry into a hash table */
void put_entry( hash_table_t *hash_table, char *key, char *value) {
 int index;
 entry_t *new_entry;
 
 new_entry = create_entry(key, value);
 index = hash(hash_table, key);
 //insert entry
 if (hash_table -> table[index] == NULL) {
  hash_table -> table[index] = new_entry;
 } else {
  entry_t *temp_entry = hash_table ->table[index] -> next;
  new_entry -> next = temp_entry;
  hash_table -> table[index] = new_entry;
 }
}
 
/* get value from a key */
char *get_value(hash_table_t *hash_table, char *key) {
 int index;
 entry_t *entry;
 index = hash(hash_table, key);
 entry = hash_table->table[index];
 while (entry != NULL) {
  if (strcmp(entry->key, key) == 0) {
   return entry->value;
  }
  entry = entry->next;
 }
 return NULL;
}
 
int main() {
 hash_table_t *hash_table = create_ht(SIZE);
 put_entry(hash_table, "key1", "apple");
 put_entry(hash_table, "key2", "orange");
 put_entry(hash_table, "key3", "tomato");
 
 printf("%s \n", get_value(hash_table, "key1"));
 printf("%s \n", get_value(hash_table, "key2"));
 printf("%s \n", get_value(hash_table, "key3"));
 return 0;
}
Download full code here

References

Không có nhận xét nào:

Đăng nhận xét