Bilgisayar Mühendisliği Bölümü
DATA STRUCTURES AND ALGORITHMS Lecture Notes 11 Sets and Maps Spring 2008
GIT – Computer Engineering Department
Chapter Outline The Set and Map containers and how to use them Hash coding and its use in efficient search & retrieval Two forms of hash tables: – Open addressing – Chaining – Their relative benefits and performance tradeoffs
Implementing both hash table forms Introduction to implementation of Maps and Sets Applying Maps and Sets to previous problems GIT – Computer Engineering Department
2
Standard Library Containers
GIT – Computer Engineering Department
3
Associative Containers
GIT – Computer Engineering Department
4
The Set Abstraction A set is a collection containing no duplicate elements Operations on sets include: – Testing for membership – Adding elements – Removing elements – Union – Intersection – Difference – Subset GIT – Computer Engineering Department
5
Set Implementation set is a template class in standard library with template parameters – Key_Type : type of the items – Compare : function class to order items
Comparison is not required for set abstraction – Compare has default value
GIT – Computer Engineering Department
6
The set class Function
Behavior
Construct a set from the sequence of objects represented by the iterator range first .. last. iterator begin() Returns an iterator to the first item in the const iterator begin() const set. iterator end() Returns an iterator to one past the last item const iterator end() const in the set. bool empty() Returns true if the set is empty. size_t size() Returns the number of item in the set. pair
insert(const Inserts an item into the set. If the item is not Key_Type& item) in the set, the iterator will reference the inserted item and the bool will be true. If the item is already in the set the iterator will reference the item currently in the set and the bool will be false. template void Inserts the items from the range first .. insert(II first, II last) last into the set. Duplicates are ignored. void erase(const Key_Type& item) Removes the item from the set. iterator find(const Key_Type& Returns an iterator to the item in the set. If template set(II first, II last)
GIT – Computer Engineering Department
7
Implementing the Union Operator /** Construct the union of two sets. */ template std::set operator+( const std::set& left, const std::set& right){ typename std::set result(left); result.insert(right.begin(), right.end()); return result; }
GIT – Computer Engineering Department
8
Implementing the Set Difference /** Construct the difference of two sets. */ template std::set operator-( const std::set& left, const std::set& right) { typename std::set result(left); for (typename std::set::const_iterator itr = right.begin(); itr != right.end(); ++itr) result.erase(*itr); return result; }
GIT – Computer Engineering Department
9
Implementing Intersection /** Construct the intersection of two sets */ template std::set operator*(const std::set& left, const std::set& right) { return left - (left - right); }
GIT – Computer Engineering Department
10
Overloaded ostream extraction operator // Overloading the ostream insertion operator template std::ostream& operator<<(std::ostream& out, const std::set& a_set) { out << "{"; bool first = true; for (typename std::set::const_iterator itr = a_set.begin(); itr != a_set.end(); ++itr) { if (first) out << *itr; else out << ", " << *itr; first = false; } return out << "}"; } GIT – Computer Engineering Department
11
Set Functions in Function
Behavior
template OI set_difference(II1 first1, II1 last1, II2 first2, II2 last2, OI result, Compare less)
Forms the set difference of the elements from the range first1 .. last1 and first2 .. last2. The result is stored in the sequence beginning with result. The end of the result sequence is returned. Forms the set intersection of the elements from the range first1 .. last1 and first2 .. last2. The result is stored in the sequence beginning with result. The end of the result sequence is returned. Forms the set union of the elements from the range first1 .. last1 and first2 .. last2. The result is stored in the sequence beginning with result. The end of the result sequence is returned.
template OI set_intersection(II1 first1, II1 last1, II2 first2, II2 last2, OI result, Compare less) template OI set_union(II1 first1, II1 last1, II2 first2, II2 last2, OI result, Compare less)
GIT – Computer Engineering Department
12
Set Example int main() { set<string> set<string> set<string> set<string> set<string>
set1; set2; set_u; set_d; set_i;
string data1[] = {"Apples", "Oranges", "Pineapples"}; string data2[] = {"Peaches", "Apples", "Grapes"}; set1.insert(data1, set2.insert(data2, cout << "set1 is " cout << "set2 is "
data1+3); data2+3); << set1 << endl; << set2 << endl;
set_union(set1.begin(), set1.end(), set2.begin(), set2.end(), inserter(set_u, set_u.begin())); cout << "set1 + set2 is " << set_u << endl;
GIT – Computer Engineering Department
13
Set Example (2) set_difference(set1.begin(), set1.end(), set2.begin(), set2.end(), inserter(set_d, set_d. begin())); cout << "set1 - set2 is " << set_d << endl; set_intersection(set1.begin(), set1.end(), set2.begin(), set2.end(), inserter(set_i, set_i. begin())); cout << "set1 * set2 is " << set_i << endl; bool is_member = (set1.find(string("Apples")) != set1.end()); cout << "\"Apples\" is an element of set1 is " << boolalpha << is_member << endl; return 0; }
GIT – Computer Engineering Department
14
vector vs. set Sets allow no duplicates Sets do not have positions, so no operator[] Set iterator can produces elements in sorted order How to implement a set???
GIT – Computer Engineering Department
15
The multiset The multiset is the same as the set except that duplicate entries are allowed. The insert always inserts a new item. The erase removes all occurrences of an item. The functions lower_bound and upper_bound may be used to select a range from a set. – In a multiset these functions can also return all occurrences of a given value
How to implement a multiset?? GIT – Computer Engineering Department
16
Example of a multiset int count_occurences(const multiset<string>& words_set, const string& target) { multiset<string>::const_iterator first_itr = words_set.lower_bound(target); cout << "*first_itr == " << (first_itr != words_set.end() ? *first_itr : "end()") << endl; multiset<string>::const_iterator last_itr = words_set.upper_bound(target); cout << "*last_itr == " << (last_itr != words_set.end() ? *last_itr : "end()") << endl; int count = 0; for (multiset<string>::const_iterator itr = first_itr; itr != last_itr; ++itr) ++count; return count; } GIT – Computer Engineering Department
17
The std::pair The std::pair is defined in It is a struct (data members are public) that contains two data items: first and second. There is a template function make_pair that can be used to construct pair objects.
GIT – Computer Engineering Department
18
Definition of pair template struct pair { Type1 first; Type2 second; pair(const Type1& x, const Type2& y): first(x), second(y) {} pair() : first(Type1()), second(Type2()) {} template pair(const pair& other) { first = other.first; second = other.second; } }; template pair make_pair(const Type1& x, const Type2& y) {return pair(x, y);}
GIT – Computer Engineering Department
19
Maps and the Map Interface Map is related to Set: it is a set of ordered pairs Ordered pair: (key, value) – In a given map, there are no duplicate keys – Values may appear more than once
Can think of key as “mapping to” a particular value Maps support efficient organization of information in tables Mathematically, these maps are: – Many-to-one (not necessarily one-to-one) – Onto (every value in the map has a key)
GIT – Computer Engineering Department
20
Map Picture
GIT – Computer Engineering Department
21
The map functions Template parameters – Key_Type: The type of the keys – Value_Type: The type of the values – Compare: The function class that compares the
keys.
All functions defined for the set are defined for the map, taking a pair – pair
If set is defined map can be defined easily Except the index operator: the index operator (operator[]) is also defined for the map GIT – Computer Engineering Department
22
The map operator[] The map can be used like an array, except that the Key_Type is the index. – Example: map<string, string> a_map; a_map["J"] = "Jane"; a_map["B"] = "Bill"; a_map["S"] = "Sam"; a_map["B1"] = "Bob"; a_map["B2"] = "Bill";
Note: – If a mapping exists, assignment will replace it. – If a mapping does not exist, a reference will create one with a default value GIT – Computer Engineering Department
23
EX: Using a map to build an index Index the words in a text with their line numbers – Use a map (map<string, list>) – Each word (string) is a key – List of line numbers (list) is a value
GIT – Computer Engineering Department
24
Using a map to build an index typedef map<string, list > map_type; void build_index(istream& in, map_type& index) { string next_line; // Each data line int line_num = 0; // Line number // Keep reading lines until done while (getline(in, next_line)) { line_num++; String_Tokenizer tokenizer(next_line, " ,.:-!?/%\'\""); // Insert each token in the index while (tokenizer.has_more_tokens()) { string word = tokenizer.next_token(); to_lower(word); index[word].push_back(line_num); } } }
GIT – Computer Engineering Department
25
Maps How to implement Maps? – What is the efficiency?
GIT – Computer Engineering Department
26
Chapter Outline The Set and Map containers and how to use them Hash coding and its use in efficient search & retrieval Two forms of hash tables: – Open addressing – Chaining – Their relative benefits and performance tradeoffs
Implementing both hash table forms Introduction to implementation of Maps and Sets Applying Maps and Sets to previous problems GIT – Computer Engineering Department
27
Hash Tables Goal: access item given its key (not its position) Therefore, want to locate it directly from the key In other words, we wish to avoid much searching Hash tables provide this capability – Constant time in the average case! – Linear time in the worst case
O(1) O(n)
Searching an array: O(n) Searching BST: O(log n)
GIT – Computer Engineering Department
28
Hash Codes Suppose we have a table of size N A hash code is a number in the range 0 to N-1 We compute the hash code from the key – a “default position” when inserting – a “position hint” when looking up
A hash function is a way of computing a hash code Desire: The set of keys should spread evenly over the N values When two keys have the same hash code: collision GIT – Computer Engineering Department
29
EX: Character Frequencies in a Text To build a Huffman tree, we need the number of occurrences of each character Hash key: character Hash value: character frequency Assume text includes ASCII characters – There are 128 characters in ASCII
Hash table size is 128 – Character ASCII value is the index in the table • HashTable[65] is frequency of character A
Hash function: returns the character value What if the characters are Unicode GIT – Computer Engineering Department
30
EX: Character Frequencies in a Text There are 216 possible characters, but ... – Maybe only 100 or so occur in a text
Choosing table size of 216 results in waste of space Approach: hash characters to range 0-199 – That is, use a hash table of size 200
A possible hash function for this example: int hash = uni_char % 200;
Collisions are certainly possible (see later) – Mapping 216 possible characters to 200 indices
GIT – Computer Engineering Department
31
EX: An ideal hash table Each key is mapped to a different index ! – Not always possible • many keys, finite indices
Even distribution Considerations : Devising a hash function Decide on table size Decide what to do when collision
GIT – Computer Engineering Department
32
Devising Hash Functions Simple functions often produce many collisions – ... but complex functions may not be good either!
It is often an empirical process If keys are integers – hash function : key % table.size() • Ex: TableSize = 10 Keys = 120, 330, 1000
– TableSize should be prime • Provides even distribution GIT – Computer Engineering Department
33
Devising Hash Functions For strings: Adding letter values in a string – same hash for strings with same letters in different order – If TableSize is large and number of characters is small • TableSize = 10000 & number of characters in a key = 8 127*8=1016 < 10000
Use all characters ∑ 32i Key [KeySize -i -1 ] – Early characters does not count • Use only some number of characters • Use characters in odd spaces
Use first three characters 729*key[2] + 27*key[1] + key[0] – If the keys are not random some part of the table is not used
GIT – Computer Engineering Department
34
Devising Hash Functions For strings: Better approach: size_t hash = 0; for (size_t i = 0; i < s.size(); ++i) hash = hash * 31 + s[i]; hash = has % table.size()
This is the hash function used by Java in its String class
GIT – Computer Engineering Department
35
Devising Hash Functions The String hash is good in that: – Every letter affects the value – The order of the letters affects the value – Hash function distributes the hash values well over the integers • Probability of two strings having the same hash value is quite small • The hash function can be assumed to produce random hash values
The hash value may not be unique – Too many possible strings
Collision is still possible GIT – Computer Engineering Department
36
Devising Hash Functions Guidelines for good hash functions: – Spread values evenly: as if “random” – Cheap to compute
Generally, number of possible values much greater than table size GIT – Computer Engineering Department
37
Collision Problem Will consider two ways to organize hash tables – Open addressing – Chaining
GIT – Computer Engineering Department
38
Open Addressing
Hashed items are in a single array Hash code gives position “hint” Handle collisions by checking multiple positions Each check is called a probe of the table If collision try an alternate cell h0(x), h1(x), h2(x), … hi(x) = (hash(x) + F(i)) % table.size() F(0) = 0
GIT – Computer Engineering Department
39
Linear Probing
Probe by incrementing the index –
F(i) = i
If “fall off end”, wrap around to the beginning –
Take care, not to cycle forever!
1. Compute index as hash_fcn() % table.size() 2. if table[index] == NULL, item is not in the table 3. if table[index] matches item, found item (done) 4. Increment index circularly and go to 2 Why must we probe repeatedly? – –
hashCode may produce collisions remainder by table.size may produce collisions GIT – Computer Engineering Department
40
Search Termination Ways to obtain proper termination – Stop when you come back to your starting point – Stop after probing N slots, where N is table size – Stop when you reach the bottom the second time – Ensure table never full • Reallocate when occupancy exceeds threshold
GIT – Computer Engineering Department
41
Hash Table Example
Table of strings, initial size 5 Add “Tom”, hash 84274 4 Add “Dick”, hash 2129869 4 Add “Harry”, hash 69496448 3 Add “Sam”, hash 82879 4 Add “Pete”, hash 2484038 3
Slot 4 Slot 0 (wraps) Slot 3 Slot 1 (wraps) Slot 2 (wraps)
Note: many lookups will probe a lot! Size 11 gives these slots: 3, 5, 10, 56, 7
GIT – Computer Engineering Department
42
Hash Table Example Insert keys {89, 18, 49, 58, 69} When 49 is inserted collision occurs – Put into the next available index 0 58 collides with 18, 89, and 49 – Inserted at index 1
GIT – Computer Engineering Department
43
Hash Table Considerations Cannot traverse a hash table – Order of stored values is arbitrary – Can use an iterator to produce in arbitrary order
When item is deleted, cannot just set its entry to null – Doing so would break probing – Must store a “dummy value” instead – Deleted items waste space and reduce efficiency • Need to go through dummy values during search and even during insertion
– Operations take O(n) time in the worst case
Higher occupancy causes more collisions GIT – Computer Engineering Department
44
Reducing Collisions By Growing Choose a new larger size, e.g., doubling Similar to reallocating a vector – But, elements can move around in reinsertion • Simple copy is not possible
(Re)insert non-deleted items into new array Install the new array and drop the old Rehashing distributes items at least as well Deleted items (dummy values) are not inserted GIT – Computer Engineering Department
45
Rehashing Example If 23 is inserted, the table is over 70 percent full.
A new table is created 17 is the first prime twice as large as the old one; so Hnew (X) = X mod 17 GIT – Computer Engineering Department
46
Rehashing Rehashing is an expensive operation – Running time is O(N)
Rehashing frees the programmer from worrying about table size Amortized Analysis: Average over N operations – Operations take: O(1) time
GIT – Computer Engineering Department
47
Quadratic Probing Linear probing – Tends to form long clusters of keys in the table – This causes longer search chains
Ex:
GIT – Computer Engineering Department
48
Quadratic Probing Quadratic probing can reduce the effect of primary clustering Index increments form a quadratic series – F(i) = i2
Hash values: s, s+12, s+22, s+32, s+42, etc. – (all % TableSize)
Direct calculation involves multiply, add, remainder – Incremental calculation better Initially: size_t index = ... 1st probe slot ... int k = -1; At each iteration: k += 2; index = (index + k) % table.size(); GIT – Computer Engineering Department
49
Quadratic Probing When 49 collides with 89, next position attemped is one cell away 58 collides at position 8. The cell one away is tried, another collision occurs. It is inserted into the cell 22=4 away
GIT – Computer Engineering Department
50
Quadratic Probing Probe sequence may not produce all table slots – A loop around full cells may happen – Hash table not full but empty space not found
Theorem : If the table size is prime and more than half of the table is empty new element can always be inserted. Problem : Secondary clustering!... GIT – Computer Engineering Department
51
Double Hashing Double Hashing solves the clustering problem Use second hash function F(i) = i * hash2(x) Poor example : hash2(x) = x mod 9 hash1(x) = x mod 10 TableSize = 10 If x = 99 what happens ? hash2(x) ≠ 0 for any x GIT – Computer Engineering Department
52
Double Hashing Good choice : hash2(x) = R – (x mod R) R is a prime and < TableSize
GIT – Computer Engineering Department
53
Double Hashing
hash2(x) = 7 – (X mod 7) GIT – Computer Engineering Department
54
Chaining Alternative to open addressing Each table slot references a linked list – List contains all items that hash to that slot – The linked list is often called a bucket – So sometimes called bucket hashing
Examines only items with same hash code – As opposed to open addressing (search chains may overlap)
Insertion about as complex Deletion is simpler Linked list can become long rehash GIT – Computer Engineering Department
55
Chaining Picture
Two items hashed to bucket 3 Three items hashed to bucket 4
GIT – Computer Engineering Department
56
Performance of Hash Tables Load factor = # filled cells / table size – Between 0 and 1
Load factor has greatest effect on performance Lower load factor better performance – Reduce collisions in sparsely populated tables
GIT – Computer Engineering Department
57
Performance of Hash Tables For open addressing, linear probing – Knuth gives expected # probes
1 1 I & U ⇒ 1 + 2 (1 − λ )2 1 1 S ⇒ 1 + 2 (1 − λ ) – As approaches 1
ג
• Number of probes increases • insertions might fail
– Rehashing with larger TableSize • if > ג0.5 • if insertion fails GIT – Computer Engineering Department
58
Performance of Hash Tables For chaining – גis the avarage length of a list – Successful Find ג/2 comparisons + time to evaluate hash function – Unsuccessful Find & Insert גcomparisons + time to evaluate hash function
Good choice ~ ג1 Here גcan be greater than 1 Disadvantage of separate chaining is allocate/deallocate memory ! GIT – Computer Engineering Department
59
Performance of Hash Tables
L 0 0.25 0.5 0.75 0.83 0.9 0.95
GIT – Computer Engineering Department
Number of Probes Linear Probing
Chaining
1.00 1.17 1.50 2.50 3.38 5.50 10.50
1.00 1.13 1.25 1.38 1.43 1.45 1.48
60
Random Collision Resolution Performance of double hashing is better than linear probing Assume: Random collision resolution – Probes are independent – No clustering problem Unsuccessful search and Insert – Number of probes until an empty cell is found (1- = )גfraction of cells that are empty 1 / (1- = )גexpected number of probes until an empty cell Successful search P(X)=Number of probes when the element X is inserted 1/N∑ P(X) approximately
1
λ
λ ∫0
1 1 1 dx = ln 1− x λ 1− λ
GIT – Computer Engineering Department
61
Performance of Hash Tables (3) Hash table: – Insert: average O(1) – Search: average O(1)
Sorted array: – Insert: average O(n) – Search: average O(log n)
Binary Search Tree: – Insert: average O(log n) – Search: average O(log n)
But balanced trees can guarantee O(log n) GIT – Computer Engineering Department
62
Implementing Hash Tables Class hash_map: used for both implementations typedef std::pair Entry_Type;
File Hash_Table_Open.h: implements open addressing Class Hash_Table_Chain: implements chaining Further implementation concerns
GIT – Computer Engineering Department
63
Class hash_map Function
Behavior
iterator begin() const_iterator begin() const iterator end() const_iterator end() const bool empty() size_t size() pair insert(const Entry_Type& entry>
Returns an iterator to the first entry in the map. Returns an iterator to one past the last entry in the map. Returns true if the map is empty. Returns the number of entries in the map. Inserts an entry into the map. If the key is not in the map, the returned iterator references the inserted entry, and the bool will be true. If the key is currently in the map, the returned iterator references the existing entry, and the bool will be false. Removes the item from the map. Returns an iterator that references the item in the map. If not present, end() is returned. Returns a reference to the value mapped to the key. If the key is not currently in the map, a default value is inserted, and a reference to the inserted item is returned.
void erase(const Key_Type& key) iterator find(const Key_Type& key) Value_Type& operator[](const Key_Type& key)
GIT – Computer Engineering Department
64
Hash_Table_Open Data Fields and Private Functions Data Field
Attribute
hash hash_fcn
The function object that will compute the hash function for the Key_Type. The number of keys in the table, excluding deleted keys. The number of deleted keys. The vector to hold the hash table. The maximum load factor. The initial capacity
size_t num_keys size_t num_deletes std::vector<Entry_Type*> the_table const double LOAD_THRESHOLD static const size_t INITIAL_CAPACITY const Entry_Type dummy const Entry_Type* DELETED
A dummy entry to represent deleted keys A pointer to the dummy entry.
Function
Behavior
size_t locate(const Key_Type& key)
Returns the index of the specified key if present in the table; otherwise; returns the index of the first free slot. Doubles the capacity of the table and permanently removes deleted items.
void rehash()
GIT – Computer Engineering Department
65
The hash template class The hash template class is a function class that implements the hash function for its template parameter. It is defined as follows: template struct hash { size_t operator()(const Key_Type&); };
We will describe some implementations later
GIT – Computer Engineering Department
66
The locate function size_t locate(const Key_Type& key) { size_t index = hash_fcn(key) % the_table.size (); while (the_table[index] != NULL && (the_table[index] == DELETED || the_table[index]->first != key)) index = (index + 1) % the_table.size (); return index; }
GIT – Computer Engineering Department
67
The insert function std::pair insert(const Entry_Type& entry) { double load_factor = double(num_keys + num_deletes) / the_table.size(); if (load_factor > LOAD_THRESHOLD){ rehash(); // Double the size of the table. } // Find the position in the table. size_t index = locate(entry.first); // See whether it is empty. if (the_table[index] == NULL) { // Create a new entry. the_table[index] = new Entry_Type(entry); num_keys++; return std::make_pair(iterator(this, index), true); } else { // Item is already in the table. return std::make_pair(iterator(this, index), false); } }
GIT – Computer Engineering Department
68
The index operator Value_Type& operator[](const Key_Type& key) { // Try to insert a dummy item. std::pair ret = insert(Entry_Type(key, Value_Type())); // Return a reference to the value found or inserted. return ret.first->second; }
GIT – Computer Engineering Department
69
The rehash function void rehash() { // Create a new table whose size is double the current table. std::vector<Entry_Type*> other_table(the_table.size() * 2, NULL); // Swap this table with the current table. the_table.swap(other_table); // Reinsert all items from old table to new. num_deletes = 0; for (size_t i = 0; i < other_table.size(); i++) { if ((other_table[i] != NULL) && (other_table[i] != DELETED)) { size_t index = locate(other_table[i]->first); the_table[index] = other_table[i]; } } }
GIT – Computer Engineering Department
70
The erase function Algorithm for erase 1. Find the first table element that is empty or that contains the key. 2. if an empty element is found 3. done 4. else 5. Delete the Entry_Type object pointed to. 6. Set the pointer in this entry to DELETED 7. Increment num_deletes 8. Decrement num_keys Implementation is left as an exercise GIT – Computer Engineering Department
71
Copy Constructor, etc. Because the vector<Entry_Type*> contains pointers to dynamically created objects, we must define the copy constructor, assignment operator, and destructor. hash_map(const hash_map& other) : hash_fcn(hash()), num_keys(0), the_table(other.the_table.size(), NULL), LOAD_THRESHOLD(0.75), num_deletes(0) { for (size_t i = 0; i < other.the_table.size(); i++) { if (other.the_table[i] != NULL && other.the_table[i] != DELETED) insert(Entry_Type(other.the_table[i]->first, other.the_table[i]->second)); } }
GIT – Computer Engineering Department
72
Hash_Table_Chain.h Data Field
Attribute
hash hash_fcn size_t num_keys std::vector<std::list<Entry_Type> > the_buckets static const size_t INITIAL_CAPACITY static double LOAD_THRESHOLD
The hash function object. The number of keys A vector of lists containing the items
GIT – Computer Engineering Department
The initial capacity The maximum load threshold before rehashing.
73
The insert function std::pair insert(const Entry_Type& entry) { // Check for the need to rehash. double load_factor = double(num_keys) / the_buckets.size(); if (load_factor > LOAD_THRESHOLD){rehash();} // Find the position in the table. size_t index = hash_fcn(entry.first) % the_buckets.size(); // Search for the key. typename std::list<Entry_Type>::iterator pos = the_buckets[index].begin(); while (pos != the_buckets[index].end() && pos->first != entry.first) ++pos; if (pos == the_buckets[index].end()) { // Not in table the_buckets[index].push_back(Entry_Type(entry)); num_keys++; return std::make_pair(iterator(this, index, --(the_buckets[index].end())), true); } else { // Already there return std::make_pair(iterator(this, index, pos), false); } } GIT – Computer Engineering Department
74
Copy Constructor, etc. Since Hash_Table_Chain.h uses a std::vector<std::list<Entry_Type> > to hold the entries: – The vector and list both define the copy constructor, etc. to make a deep copy and to clean up any dynamically allocated objects when destroyed.
Therefore, no special implementation is required for Hash_Table_Chain.h
GIT – Computer Engineering Department
75
Implementation Considerations The hash template class defines the hash function, but provides no implementation. The user of the hash_map (either Hash_Table_Open.h or Hash_Table_Chain.h) must define a specialization for the Key_Type. Example for string: template<> struct hash<std::string> { size_t operator()(const std::string& s) { size_t result = 0; for (size_t i = 0; i < s.length(); i++) { result = result * 31 + s[i]; } return result; } };
GIT – Computer Engineering Department
76
Specialization for int One possibility for int is to merely return the integer value, cast to size_t. – This does not produce a “random” distribution of keys.
An alternative is to multiply by some large prime and take the result modulo the computer’s word size. template<> struct hash { size_t operator()(int i) { return size_t(4262999287U * i); } };
GIT – Computer Engineering Department
77
Defining Your Own hash function The hash_map class implementations use the hash function to locate the initial search position. It then they use the objects equality operator (operator ==) to determine if there is a match. Therefore, your hash function should obey the following constraint: – If obj1 == obj2 then hash()(obj1) == hash()(obj2)
GIT – Computer Engineering Department
78
Applying Maps: Phone Directory string Phone_Directory::add_or_change_entry( const string& name, const string& number) { string old_number = the_directory[name]; the_directory[name] = number; modified = true; return old_number; } string Phone_Directory::lookup_entry(const string& name) const { const_iterator itr = the_directory.find(name); if (itr != the_directory.end()) return itr->second; else return ""; } GIT – Computer Engineering Department
79
Applying Maps: Phone Directory (2) string Phone_Directory::remove_entry(const string& name) { string old_number = the_directory[name]; the_directory.erase(name); modified = old_number != string(); return old_number; }
GIT – Computer Engineering Department
80
Applying Maps: Huffman Coding Algorithm for build_frequency_table 1. while there are more characters in the input file 2. Read a character 3. Increment the entry in the map associated with this character 4. for each entry in the map 5. 6.
Store its data as the weight-symbol pair in the vector Return the vector
GIT – Computer Engineering Department
81
The build_frequency_table function vector Huffman_Tree::build_frequency_table(istream& in) { map frequencies; char c; while (in.get(c)) { frequencies[c]++; } vector result; for (map::iterator itr = frequencies.begin(); itr != frequencies.end(); ++itr) { result.push_back(Huff_Data(itr->second, itr->first)); } return result; }
GIT – Computer Engineering Department
82
build_code
Additional data field: std::map code_map;
Starter function void build_code() { code_map.clear(); build_code(huff_tree, Bit_String()); }
Recursive function void Huffman_Tree::build_code(const Binary_Tree& tree, const Bit_String& code) { if (tree.is_leaf()) { Huff_Data datum = tree.get_data(); code_map[datum.symbol] = code; } else { Bit_String left_code(code); left_code.append(false); build_code(tree.get_left_subtree(), left_code); Bit_String right_code(code); right_code.append(true); build_code(tree.get_right_subtree(), right_code); } }
GIT – Computer Engineering Department
83
encode void Huffman_Tree::encode(std::istream& in, std::ostream& out) { Bit_String result; char next_char; while (in.get(next_char)) { result += code_map[next_char]; } result.write(out); }
GIT – Computer Engineering Department
84