Chapter 9: Hashing • Basic concepts • Hash functions • Collision resolution • Open addressing • Linked list resolution • Bucket hashing
1
Basic Concepts • Sequential search: O(n) • Binary search: O(log2n)
Requiring several key comparisons before the target is
found
2
Basic Concepts • Search complexity: Size
Binary
Sequential (Average)
Sequential (Worst Case)
16
4
8
16
50
6
25
50
256
8
128
256
1,000
10
500
1,000
10,000
14
5,000
10,000
100,000
17
50,000
100,000
1,000,000
20
500,000
1,000,000
3
Basic Concepts • Is there a search algorithm whose complexity is O(1)?
4
Basic Concepts • Is there a search algorithm whose complexity is O(1)? YES.
5
Basic Concepts memory addresses
keys
hashing
Each key has only one address
6
Basic Concepts Key
Vu Nguyen 102002 John Adams 107095 Sarah Trapp 111060
Address
001
Harry Lee
002
Sarah Trapp
005
Vu Nguyen
007
Ray Black
100
John Adams
005
Hash Function
100 002
7
Basic Concepts • Home address: address produced by a hash function. • Prime area: memory that contains all the home addresses.
8
Basic Concepts • Synonyms: a set of keys that hash to the same location. • Collision: the location of the data to be inserted is already occupied by the synonym data.
9
Basic Concepts • Ideal hashing: – No location collision – Compact address space
10
Basic Concepts Insert A, B, C hash(A) = 9 hash(B) = 9 hash(C) = 17 A [1]
[5]
[9]
[17]
11
Basic Concepts Insert A, B, C hash(A) = 9 hash(B) = 9
B and A collide at 9
hash(C) = 17
[1]
[5]
A
B
[9]
[17]
Collision Resolution
12
Basic Concepts Insert A, B, C hash(A) = 9 hash(B) = 9
B and A collide at 9
hash(C) = 17
[1]
C and B collide at 17
C
A
B
[5]
[9]
[17]
Collision Resolution 13
Basic Concepts Searh for B hash(A) = 9 hash(B) = 9 hash(C) = 17
[1]
C
A
B
[5]
[9]
[17]
Probing 14
Hash Functions • Direct hashing • Modulo division • Digit extraction • Mid-square • Folding • Rotation • Pseudo-random 15
Direct Hashing • The address is the key itself: hash(Key) = Key
16
Direct Hashing • Advantage: there is no collision. • Disadvantage: the address space (storage size) is as large as the key space
17
Modulo Division Address = Key MOD listSize + 1 • Fewer collisions if listSize is a prime number • Example: Numbering system to handle 1,000,000 employees Data space to store up to 300 employees hash(121267) = 121267 MOD 307 + 1 = 2 + 1 = 3
18
Digit Extraction Address = selected digits from Key
• Example: 379452 121267 378845 160252 045128
→ → → → →
394 112 388 102 051
19
Mid-square Address = middle digits of Key2 • Example: 9452 * 9452 = 89340304 → 3403
20
Mid-square • Disadvantage: the size of the Key2 is too large • Variations: use only a portion of the key 379452: 379 * 379 = 143641 → 364 121267: 121 * 121 = 014641 → 464 045128: 045 * 045 = 002025 → 202
21
Folding • The key is divided into parts whose size matches the address size Key = 123|456|789 fold shift 123 + 456 + 789 = 1368 ⇒ 368
22
Folding • The key is divided into parts whose size matches the address size Key = 123|456|789 fold shift
fold boundary
123 + 456 + 789 = 1368 ⇒ 368
321 + 456 + 987 = 1764 ⇒ 764
23
Rotation • Hashing keys that are identical except for the last character may create synonyms. • The key is rotated before hashing. original key 600101 600102 600103 600104 600105
rotated key 160010 260010 360010 460010 560010
24
Rotation • Used in combination with fold shift original key 600101 600102 600103 600104 600105
→ → → → →
62 63 64 65 66
rotated key 160010 260010 360010 460010 560010
→ → → → →
26 36 46 56 66
Spreading the data more evenly across the address space 25
Pseudorandom Key
Pseudorandom Number Generator
Rando m Number
Modulo Division
Address
y = ax + c For maximum efficiency, a and c should be prime numbers
26
Pseudorandom • Example: Key = 121267
a = 17
c=7
listSize = 307
Address = ((17*121267 + 7) MOD 307 + 1 = (2061539 + 7) MOD 307 + 1 = 2061546 MOD 307 + 1 = 41 + 1 = 42
27
Collision Resolution • Except for the direct hashing, none of the others are one-to-one mapping ⇒ Requiring collision resolution methods
• Each collision resolution method can be used independently with each hash function
28
Collision Resolution • A rule of thumb: a hashed list should not be allowed to become more than 75% full. Load factor:
α = (k/n) x 100
n = list size k = number of filled elements
29
Collision Resolution • As data are added and collisions are resolved, hashing tends to cause data to group within the list ⇒ Clustering: data are unevenly distributed across the list
• High degree of clustering increases the number of probes to locate an element ⇒ Minimize clustering 30
Collision Resolution • Primary clustering: data become clustered around a home address. Insert A9, B9, C9, D11, E12 A B C D E [1]
[9] [10] [11] [12] [13]
31
Collision Resolution • Secondary clustering: data become grouped along a collision path throughout a list. Insert A9, B9, C9, D11, E12, F9 A B D E [1]
C
[9] [10] [11] [12] [13] [14]
F [23]
32
Open Addressing • When a collision occurs, an unoccupied element is searched for placing the new element in. • There are different methods:
Linear probe Quadratic probe Double hashing Key offset
33
Linear Probe • When a home address is occupied, go to the next address (the current address + 1). • Alternatives: add 1, subtract 2, add 3, subtract 4, ...
34
Linear Probe
Harry Eagle 166702
Hash
001
Mary Dodd
(379452)
002
Sarah Trapp
(070918)
003
Bryan Devaux (121267)
008
John Carver
(378845)
306
Tuan Ngo
(160252)
307
Shouli Feldman (045128)
002
35
Linear Probe
Harry Eagle 166702
Hash
001
Mary Dodd
(379452)
002
Sarah Trapp
(070918)
003
Bryan Devaux (121267)
004
Harry Eagle
(166702)
008
John Carver
(378845)
306
Tuan Ngo
(160252)
307
Shouli Feldman (045128)
002
36
Linear Probe • Advantages:
quite simple to implement data tend to remain near their home address (significant for disk addresses)
• Disadvantages:
produces primary clustering
37
Quadratic Probe • The address increment is the collision probe number squared (12, 22, 32, ...). • Modulo is used not to run off the end of the address list.
38
Quadratic Probe • Disadvantage: time required to square the probe number. (n + 1)2 = n2 + 2n + 1
Increment Factor
39
Quadratic Probe Probe Collision Probe2 and New Number Location Increment Address
Increment Factor
Next Increment
1
1
12 = 1
02
1
1
2
2
22 = 4
06
3
4
3
6
32 = 9
15
5
9
4
15
42 = 16
31
7
16
5
31
52 = 25
56
9
25
6
56
62 = 36
92
11
36
7
92
72 = 49
41
13
49
40
Quadratic Probe • Limitation: it is not possible to generate a new address for every element in the list. (address + n2) MOD listSize + 1 ⇒ Use a list size that is a prime number (more elements reachable)
41
Pseudorandom Probe • Generate a pseudorandom number as a new address from the collision address. newAddress = (a x collisionAddress + c) MOD listSize + 1 a should be a large prime number
42
Pseudorandom Probe • Significant limitation (of linear and quadratic probes as well): there is only one collision path for all keys. ⇒ produces secondary clustering
43
Secondary Clustering hash(A) = hash(B) = hash(C) = 9 hash(D) = 11
[1]
A
C
B
[9]
[11]
[13]
D [15]
Linear probes Search/find location for D
44
Key Offset • Generate a new address as a function of the collision address and the key. offset = [key / listSize] newAddress = (offset + collisionAddress) MOD listSize + 1
45
Key Offset Key
Key Offset
Probe 1
Probe 2
166702
Home Address 2
543
239
169
572556
2
1,865
026
050
067234
2
219
222
135
offset = [166702 / 307 = 543] address = (543 + 02) MOD 307 + 1 = 239 address = (543 + 239) MOD 307 + 1 = 169 46
Linked List Resolution • Major disadvantage of Open Addressing: each collision resolution increases the probability for future collisions. ⇒ use linked lists to store synonyms
47
Linked List Resolution 001
Mary Dodd
(379452)
002
Sarah Trapp
(070918)
003
Bryan Devaux (121267)
Harry Eagle
(166702)
Chris Walljasper (572556)
008
John Carver
(378845)
306
Tuan Ngo
(160252)
307
Shouli Feldman (045128)
overflow area prime area
48
Bucket Hashing • Hashing data to buckets that can hold multiple pieces of data. • Each bucket has an address and collisions are postponed until the bucket is full.
49
Bucket Hashing Mary Dodd
(379452)
Sarah Trapp
(070918)
Harry Eagle
(166702)
Ann Georgis
(367173)
001
002
Bryan Devaux (121267) 003
Chris Walljasper(572556)
linear probe
Shouli Feldman (045128) 307
50