Modification of System Cache for Effective Implementation of Bloom Filters

Gurudatha M. (M050224CS)
Guided by: Dr. Priya Chandran
April 26, 2006

Department of Computer Engineering
National Institute of Technology, Calicut
Kerala, 673601
ACKNOWLEDGEMENT

I have been fortunate to have Dr. Priya Chandran, Department of Computer Engineering, as my guide, whose timely guidance, advice and inspiration helped me in the preparation of this mini-project. I express my sincere gratitude to her for having guided me through this work. I am thankful to Dr. M. P. Sebastian, Head of the Department of Computer Engineering, for his encouragement and for giving me this opportunity. I also thank all my friends who helped me throughout this work.

Gurudatha M. (M050224CS)
CERTIFICATE

This is to certify that the report entitled "Modification of System Cache for Effective Implementation of Bloom Filters" is a bona fide record of the mini-project presented by Mr. Gurudatha M. (M050224CS) under our supervision and guidance. The report has been submitted to the Department of Computer Engineering of the National Institute of Technology, Calicut, in partial fulfilment of the requirements for the degree of Master of Technology in Computer Science and Engineering.
Dr. M. P. Sebastian Head of Department Dept. of Computer Engg. NIT Calicut.
Dr. Priya Chandran Assistant Professor Dept. of Computer Engg. NIT Calicut.
Contents

1 Abstract
2 Introduction
3 Spectral Bloom Filters
  3.1 Minimal Increase
  3.2 Recurring Minimum
  3.3 Comparison
4 DineroIV
5 Modifications
6 Conclusion
7 References
1 Abstract
The Spectral Bloom Filter (SBF) is a space-efficient data structure used to test, with a small error probability, whether an element belongs to a given set and to estimate its frequency. Two optimizations of this randomized structure, Recurring Minimum and Minimal Increase, are described, and the two are compared. DineroIV, a trace-driven cache simulator tool, and its uses are then discussed. However fast the filtering algorithm itself is, Bloom filters are effective only if the underlying data can be checked in a small amount of time; for Bloom filters, as for all other search techniques, the best way to be fast is to keep as much of the data as possible in the cache. Finally, the modifications to be made to the DineroIV code so that Spectral Bloom Filters can be implemented effectively are presented.
2 Introduction
Data mining is the process of selecting the required data from a huge amount of data. In most applications nowadays it has become increasingly important to search the data in as little time as possible. This led to the idea of hash functions, with an average search time per element that is constant, i.e. O(1). The need grew even beyond hash functions. Spectral Bloom Filters are data structures with the ability to compactly represent a set and to effectively filter out any element that does not belong to the set, with a small error probability. With this data structure it is easy to search for an element within a set, at the cost of a small false-positive error. Two query methods are used with Spectral Bloom Filters. In the Minimum Selection method, the minimum value among an element's counters is returned. In the Recurring Minimum method, elements with a single minimum are kept in a secondary SBF, which is consulted when no recurring minimum is found, and the result is then returned. DineroIV is a cache simulator tool: given an input trace, it replays all the memory requests, counts the number of misses, hits and memory references in the trace, and outputs the results. This report explains the modifications to be made to the Dinero tool so that the Spectral Bloom Filter can be implemented efficiently. By suitably modifying the tool's parameters, such as the number of cache levels and the number of types of memory in each cache, a use-specific system can be developed.
3 Spectral Bloom Filters
The Spectral Bloom Filter (SBF) is a data structure that represents a set S of elements; it replaces the bit vector V of the standard Bloom filter with a vector C of m counters. The counters in C roughly represent the multiplicities of items in S; all the counters in C are initially set to 0. In the basic implementation, inserting an item s increases the counters $C_{h_1(s)}, C_{h_2(s)}, \ldots, C_{h_k(s)}$ by 1. The SBF thus stores the frequency of each item, and it also allows deletions, by decreasing the same counters. Consequently, updates are also possible (by performing a delete followed by an insert).
Let S be a multi-set of keys taken from a universe U. For $x \in U$, let $f_x$ be the frequency of x in S. Let $v_x = \{C_{h_1(x)}, C_{h_2(x)}, \ldots, C_{h_k(x)}\}$ be the sequence of values stored in the k counters representing x, and let $\hat{v}_x = \{\hat{v}_{x_1}, \hat{v}_{x_2}, \ldots, \hat{v}_{x_k}\}$ be a sequence consisting of the same items as $v_x$, sorted in non-decreasing order; i.e. $m_x = \hat{v}_{x_1}$ is the minimal value observed in those k counters. To add a new item $x \in U$ to the SBF, the counters $C_{h_1(x)}, C_{h_2(x)}, \ldots, C_{h_k(x)}$ are increased by 1. The Spectral Bloom Filter for a multi-set S can be computed by repeatedly inserting all the items of S. The same logic applies to streaming data: as the data flows, it is hashed into the SBF by a series of insertions.

Querying the SBF: A basic query for the SBF on an item $x \in U$ returns an estimate of $f_x$. The SBF error, denoted $E_{SBF}$, is the probability that for an arbitrary element z (not necessarily a member of S), $\hat{f}_z \neq f_z$. The basic estimator, denoted the Minimum Selection (MS) estimator, is $\hat{f}_x = m_x$.

Deletions and sliding-window maintenance: Deleting an item $x \in U$ from the SBF is achieved simply by reversing the actions taken for inserting x, namely decreasing the counters $C_{h_1(x)}, C_{h_2(x)}, \ldots, C_{h_k(x)}$ by 1. In sliding-window scenarios, when the data within the current window is available (as is the case in data-warehouse applications), the sliding window can be maintained simply by performing deletions of the out-of-date data.

The optimizations described next are used in the implementation to improve query performance, in particular when a threshold T greater than one is used, i.e. when only items with $f_x > T$ are of interest.
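As a concrete illustration of the basic scheme, the following C sketch implements insertion, the Minimum Selection query, and deletion over an array of counters. It is a minimal sketch under stated assumptions, not the implementation from the paper: the table size M, the number of hash functions K, and the simple multiplicative hash are all illustrative choices.

#include <stdint.h>
#include <stdio.h>

#define M 1024          /* number of counters (illustrative choice)   */
#define K 4             /* number of hash functions (illustrative)    */

static uint32_t counter[M];    /* the counter vector C, initially 0   */

/* Illustrative hash family: one multiplicative hash per index i. */
static uint32_t hash(uint32_t x, int i)
{
    return (x * 2654435761u + 40503u * (uint32_t)(i + 1)) % M;
}

/* Basic insert: increase all k counters of x by 1. */
static void sbf_insert(uint32_t x)
{
    for (int i = 0; i < K; i++)
        counter[hash(x, i)]++;
}

/* Minimum Selection query: estimate f_x as the minimal counter m_x. */
static uint32_t sbf_query(uint32_t x)
{
    uint32_t m = counter[hash(x, 0)];
    for (int i = 1; i < K; i++) {
        uint32_t c = counter[hash(x, i)];
        if (c < m)
            m = c;
    }
    return m;
}

/* Delete: reverse the insert by decreasing the same counters. */
static void sbf_delete(uint32_t x)
{
    for (int i = 0; i < K; i++)
        counter[hash(x, i)]--;
}

int main(void)
{
    sbf_insert(7); sbf_insert(7); sbf_insert(42);
    printf("f(7) ~ %u, f(42) ~ %u\n", sbf_query(7), sbf_query(42));
    sbf_delete(7);
    printf("after one delete, f(7) ~ %u\n", sbf_query(7));
    return 0;
}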
3.1 Minimal Increase
The minimal counter holds better information about a hashed item than the other counters, since fewer extra items have been hashed onto it. The Minimal Increase method takes this observation into account. When inserting an element, only its minimal counter(s) are increased; the other counters are not increased until the minimal counters catch up with them. The method works as follows:

Minimal Increase: When performing an insert of an item x, increase only the counters that equal $m_x$ (its minimal value). When performing a lookup query, return $m_x$. For the insertion of r occurrences of x, this method can be executed iteratively, or instead the smallest counter(s) can be increased by r and every other counter set to the maximum of its old value and $m_x + r$.

Minimal Increase and deletions: This method has a big drawback, in that it does not support deletions, since only the minimal counter is incremented during insertions. In fact, when deletions are allowed, the Minimal Increase algorithm introduces a new kind of error - false-negative errors. This result is salient in the experiments dealing with deletions and sliding-window approaches, where the Minimal Increase method becomes unattractive because of its poor performance, caused mostly by false-negative errors.
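The Minimal Increase insert step can be sketched in C as below, reusing the counter[], hash(), and sbf_query() helpers from the previous sketch (again an illustrative assumption, not the paper's code):

/* Minimal Increase insert: only the counters currently equal to m_x
 * (the minimal value of x's counters) are incremented. */
static void sbf_insert_mi(uint32_t x)
{
    uint32_t m = sbf_query(x);          /* m_x: current minimal value */
    for (int i = 0; i < K; i++) {
        uint32_t j = hash(x, i);
        if (counter[j] == m)            /* increase only minimal counters */
            counter[j]++;
    }
}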
3.2 Recurring Minimum
This method relies on the observation that most elements in the set S have a recurring minimum (their minimal value appears in more than one of their counters), and that the number of elements with a single minimum is small. Elements with a single minimum have a high probability of being in error and are therefore treated as separate cases. The method uses two filters: a primary SBF holding the elements with a recurring minimum, and a secondary SBF holding the elements with a single minimum. To search for a particular element in the set, first compute its hash functions and inspect its counters; if a recurring minimum is found, report it. Otherwise, if there is only a single minimum, search for the element in the secondary SBF.
The algorithm: When adding an item x, increase the counters of x in the primary SBF. Then check whether x has a recurring minimum. If so, continue normally. Otherwise (if x has a single minimum), look for x in the secondary SBF. If found, increase its counters there; otherwise add x to the secondary SBF, with an initial value that equals its minimal value from the primary SBF. When performing a lookup for x, check whether x has a recurring minimum in the primary SBF. If so, return the minimum. Otherwise, perform a lookup for x in the secondary SBF. If the returned value is greater than 0, return it; otherwise, return the minimum from the primary SBF.

Deletions and sliding-window maintenance: Deleting x when using Recurring Minimum essentially reverses the increase operation: first decrease its counters in the primary SBF; then, if it has a single minimum (or if it exists in the secondary SBF), decrease its counters in the secondary SBF, unless at least one of them is 0. Since insertions are performed in both the primary and the secondary SBF, no false-negative situations can arise when deleting items. A sliding window is easily implemented as a series of deletions, assuming that the out-of-scope data is available.
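A sketch of the Recurring Minimum insert and lookup in C follows. Here primary[] and secondary[] are two counter arrays, and has_recurring_minimum() counts how often the minimal value occurs among an item's k counters; all names, sizes and the hash are illustrative assumptions, not code from the paper or from Dinero IV.

#include <stdint.h>

#define M 1024
#define K 4

static uint32_t primary[M], secondary[M];

static uint32_t hash(uint32_t x, int i)     /* illustrative hash family */
{
    return (x * 2654435761u + 40503u * (uint32_t)(i + 1)) % M;
}

static uint32_t min_of(const uint32_t *c, uint32_t x)
{
    uint32_t m = c[hash(x, 0)];
    for (int i = 1; i < K; i++)
        if (c[hash(x, i)] < m)
            m = c[hash(x, i)];
    return m;
}

static int has_recurring_minimum(const uint32_t *c, uint32_t x)
{
    uint32_t m = min_of(c, x);
    int hits = 0;
    for (int i = 0; i < K; i++)
        if (c[hash(x, i)] == m)
            hits++;
    return hits > 1;                         /* minimum occurs more than once */
}

static void rm_insert(uint32_t x)
{
    for (int i = 0; i < K; i++)
        primary[hash(x, i)]++;               /* always insert into primary */
    if (has_recurring_minimum(primary, x))
        return;                              /* recurring minimum: done */
    if (min_of(secondary, x) > 0) {          /* x already in secondary SBF */
        for (int i = 0; i < K; i++)
            secondary[hash(x, i)]++;
    } else {                                 /* add x with primary's minimum */
        uint32_t m = min_of(primary, x);
        for (int i = 0; i < K; i++)
            secondary[hash(x, i)] += m;
    }
}

static uint32_t rm_lookup(uint32_t x)
{
    if (has_recurring_minimum(primary, x))
        return min_of(primary, x);
    uint32_t s = min_of(secondary, x);
    return s > 0 ? s : min_of(primary, x);
}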
3.3 Comparison
Error rates: The MS estimator provides the same error rates as the original Bloom filter. The Minimal Increase method performs best of the three. The RM algorithm is not as good as MI, but is consistently better than MS.

Memory overhead: The RM algorithm performs better than the MS algorithm, but the requirement of maintaining a second SBF gives RM a heavier memory footprint. MI is the best of the three, since the number of updates and the amount of data kept in memory are much smaller than for the other two.

Complexity: The RM algorithm is the most complex method, because of the hashing into two SBFs, which naturally increases the complexity. The MS method is the simplest.

Deletions: Both the MS and RM methods support deletions and sliding windows. The MI algorithm does not, and may produce false-negative errors if used with them.
4 DineroIV
Dinero IV is a uniprocessor, trace-driven cache simulator tool. A trace is the sequence of memory references made during the execution of a program. The tool's main job is to simulate the memory requests and cache misses, keeping all the information regarding the execution of a particular program. The input to DineroIV is a trace of a program's execution, by default in the .din format. The simulator then models the behaviour of the cache; internally it maintains the cached blocks in the form of a stack, which contains the memory blocks requested by the trace, and each stack entry can be hashed to a number of blocks. The tool takes the cache sizes, associativity, replacement policies, and the trace as input. It first checks the validity of the arguments with respect to the cache levels, then initializes the stack according to the arguments. After initializing the stack and setting the replacement policies, it reads the trace records one by one, updating the respective statistics, and finally summarizes them in its output. DineroIV supports a hierarchy of up to five cache levels, where each level can be divided into three parts - instruction, data and unified - and for each it maintains the number of blocks and sub-blocks, the associativity, and so on.

Dinero IV keeps track of the size of each access, along with its address and access type. When reading the din format, Dinero IV forces all accesses to be 4-byte aligned and assumes a size of 4. Hence the same trace read by Dinero IV in din format may produce different results than if it is read directly in a richer format, e.g. extended din or one of the pixie formats: the richer formats carry correct size information, while Dinero IV must guess the size in the din format (always 4 bytes). References can cross block or sub-block boundaries. A multi-block reference is handled by processing the first block and deferring the rest as a separate reference; the number of times this is done is reported in the d4cache multiblock field.
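The classic din format used above consists of one record per line, "<access type> <hex address>", where type 0 is a data read, 1 a data write, and 2 an instruction fetch. The C sketch below writes a tiny four-record trace; the file name and addresses are arbitrary choices for illustration. The resulting file can then be piped into the simulator on standard input (the exact cache-configuration option names should be checked against the Dinero IV documentation).

#include <stdio.h>

/* Writes a tiny trace in the classic din format: "<type> <hex address>",
 * where 0 = data read, 1 = data write, 2 = instruction fetch.
 * The file name trace.din and the addresses are arbitrary. */
int main(void)
{
    FILE *f = fopen("trace.din", "w");
    if (!f)
        return 1;
    fprintf(f, "2 0\n");       /* instruction fetch at 0x0       */
    fprintf(f, "0 1000\n");    /* data read at 0x1000            */
    fprintf(f, "1 1008\n");    /* data write at 0x1008           */
    fprintf(f, "0 1000\n");    /* repeated read: likely a hit    */
    fclose(f);
    return 0;
}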
5 Modifications
For the effective utilization of Bloom filters, the data must be checked in a small amount of time, however fast the algorithm itself is. For Bloom filters, as for all other search techniques, the best way to be fast is to keep as much of the data as possible in the cache. The necessary modifications are as follows (see the sketch after this list):

• The stack that DineroIV uses must be divided into two parts: one for ordinary cache purposes and another for storing the contents of the set S in which elements are searched, so that most of the set S remains in memory rather than being brought in every time.

• Some of the arguments need to be given statically rather than dynamically: the number of cache levels must be set to 2, the sizes must also be fixed, and the cache-initialization function must be modified accordingly.

• Allocation policies need to be modified, with suitable default options such as "prefetch always" for the second cache containing the set S; modifications are needed in the d4rep_* and d4prefetch_* functions.

• The d4_find() and d4findnth() functions, which help in finding a node in the stack, need to be modified to support this searching.

• The code of the d4ref() function must be modified according to the way an element is searched for in the set S, whether by Minimal Increase or Recurring Minimum, allowing both random search and varying sizes such that the hash time is O(1).
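The fragment below is purely a hypothetical sketch of the kind of change the last item intends, not actual Dinero IV code: it shows a reference-handling wrapper that routes accesses falling in the region holding the set S through an SBF lookup (using either estimator) before falling through to the normal cache simulation. The names sbf_region_start, sbf_region_end, sbf_lookup_mi, sbf_lookup_rm and simulate_cache_ref are all invented for illustration and do not appear in the Dinero IV sources.

#include <stdint.h>

/* Hypothetical sketch only: none of these names come from Dinero IV. */
extern uint32_t sbf_region_start, sbf_region_end;   /* assumed S region   */
extern uint32_t sbf_lookup_mi(uint32_t addr);       /* Minimal Increase   */
extern uint32_t sbf_lookup_rm(uint32_t addr);       /* Recurring Minimum  */
extern void     simulate_cache_ref(uint32_t addr);  /* normal cache path  */

enum sbf_mode { SBF_MI, SBF_RM };

void handle_ref(uint32_t addr, enum sbf_mode mode)
{
    if (addr >= sbf_region_start && addr < sbf_region_end) {
        /* Membership/frequency check against the set S kept cached. */
        uint32_t f = (mode == SBF_MI) ? sbf_lookup_mi(addr)
                                      : sbf_lookup_rm(addr);
        (void)f;   /* the estimate would drive the search logic here */
    }
    simulate_cache_ref(addr);   /* fall through to ordinary simulation */
}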
6 Conclusion
Spectral Bloom Filters can be used more effectively than many other searching methods, with a small error rate, and can be implemented effectively with the available resources. DineroIV can be efficiently modified to incorporate the required static or dynamic policies. Searching for an element of a set - that is, data mining - can be done efficiently by combining the two methods, namely the Bloom filters and the modified cache that holds both the set and the ordinary cache contents, making the fetching of elements easy. Hence, with the specified modifications, there is a good chance of reducing the search time for an element.
7 References
[1] Saar Cohen, Yossi Matias, "Spectral Bloom Filters", Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, 2003.
[2] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13(7):422-426, 1970.
[3] Mark D. Hill, Dinero IV, http://www.cs.wisc.edu/~markhill/DineroIV/