Proceeding of the 3rd International Conference on Informatics and Technology, 2009 INFLUENCED FACTORS ON COMPUTATION AMONG QUICK SEARCH, TWO-WAY AND KARP-RABIN ALGORITHMS Atheer Akram Abdulrazzaq, Nur’Aini Abdul Rashid, Hazrina Binti Yusof Hamdani, Rana M. Ghadban, Ammar Waysi Mahmood 1
School of Computer Science, Unversiti Sains Malaysia, 11800, Penang, Malaysia
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
Abstract String matching problem has become highly important in the last two decades due to the advancement in technology. It has also become necessary to solve this problem because of its application in many fields. One of the main factors associated with this problem is the consumed time. The consumed time is influenced by type of algorithm, data type, data size and length of pattern used. This study showed lowest consumed time after using Quick search algorithm than the other two algorithms studied which are Two way and Karp-Rabin. Consumed time was being lowest in DNA and Source code when 50 MB and 200 MB, respectively. Keyword: Consumed time, exact string matching algorithm, data type, data size and length of pattern Introduction
Consumed time of exact string matching is the computing time during the execution of the algorithm. String matching is a searching operation carried out to check the optimal alignment by comparing the two finite-length strings [1]. It has been considered as one of the most important problems related issue. This is because of the advancement in technology in the last two decades [2]. It has become necessary to solve string matching problem because of its application in many fields such as in the field of the computation biology [3]. In this field, the biological molecules are approximated as sequences nucleotides or amino acids. In the literature research, string matching is use in the inclusion of memorized data and dictionaries [4]. String matching is also use in web search engine, artificial intelligence, retrieval of information [2] and intrusion detection system [5]. In all the applications, the consumed time grows as data grows, therefore, it is better to use an efficient algorithm in order to complete the matching with higher speed [6]. In this paper, we studied three exact string matching algorithm which are Quick search, Karp-Rabin and Two way algorithms. Two way algorithm [Crochemore-Perrin] is used to improve Brute force algorithm technique. The algorithm depends on the matching by specific order [4] which divides the pattern into two parts by factorization [7]. Karp-Rabin algorithm [Karp-Rabin] depends on the hashing function in matching. The behavior of this algorithm depended on left to right order [4]. It measures the content of window and compares it with the content of the pattern. This process showed high efficiency than matching of characters [8]. The purpose of using hash function is to avoid the quadratic number during comparisons in order to obtain fast matching process [9]. This is because the usage of integer numbers will reduce the computational time and simplify the procedure. The main properties of this algorithm are efficient computation, high discrimination and hash which is characterized by easy way in matching [8]. Quick Search algorithm [Sunday] is similar to Boyer-Moore algorithm and Horspool algorithm which used the bad character shift [10]. The behavior of this algorithm depended on left to right order [4]. It is characterized by higher efficiency, easy implemented and capability to apply on short and large pattern [11]. Methodology The general framework of this study is done in two phases as shown in Fig. 2. First phase is to implement three exact string matching algorithms, while the second phase is to compare these algorithms based on the data type, the length of pattern and size of data used. The computation time of each algorithm is measured in the first phase. The involved algorithms were; Two way algorithm (from Reverse Factor family), Quick Search algorithm (from Boyer Moore family) and Karp-Rabin algorithm (from Hashing approach). The techniques of these algorithms were depended on the preprocessing and searching phase of each algorithm, these techniques were depicted in the Fig 1.
©Informatics '09, UM 2009
RDT4 - 100
Proceeding of the 3rd International Conference on Informatics and Technology, 2009
Hash computing for text window
Choosing the critical factorization |L| = {max | l | , | l | } Possibilities |u| < period (x), |u| minimal
Quick search bad character (qsBc) table
Comparison of characters in pattern suffix and prefix with text
Number of attempts and number of characters comparison (a) Two way algorithm
Searching phase
Rehash computing
Number of attempts and number of characters comparison
(b) Karp-Rabin algorithm
Number of attempts and number of characters comparison
Preprocessing phase
x [i] = x [i + p]
Searching phase
Pattern period determination
Hash computing of pattern
Preprocessing phase
Г (w, k) = {z | z on the k in w} ∂ (w,k) = min{|z| | z ∈ Г (w,k)}
Searching phase
Determination of local period (Internal and external choices )
Preprocessing phase
Division
(c) Quick search algorithm
Fig. 1 (a), (b) and (c) preprocessing and searching phases for the three algorithms (Two way, Karp-Rabin and Quick search) In the second phase, four types of data were downloaded from Benchmark Standard. They are DNA sequence, Protein sequence, English text and Source program code. Different lengths of pattern were used in this study in order to demonstrate the effect of pattern length on the consumed time for the three algorithms. This selection was performed depending on previous study in which the length of pattern was chosen randomly [12]. In this study, the pattern lengths ranged from 5 to 95 characters and the increment was done by 10 characters i.e. 10 readings were
©Informatics '09, UM 2009
RDT4 - 101
Proceeding of the 3rd International Conference on Informatics and Technology, 2009 applied. For data sizes, the common sizes used depending on Benchmark Standard were 50 MB and 200 MB. The framework of this study depicted in Fig 2.
Databases
Two way
Phase 1
Quick Search
Karp- Rabin
Computation of consumed time
Comparison
Data type
Data size
Phase 2
Pattern length
Fig. 2: Research Methodology
Two way algorithm has two phases; first preprocessing phase, this phase depends on the factorization method. Factorization method has several steps and extends to the critical factorization. These steps are division, characters local periods determination, pattern period determination and choosing the critical factorization. Division is needed to divide the pattern into two substrings. Suppose the pattern is denoted by (x) and both the prefix and suffix of x are called u and v respectively. The word that need to be search in the pattern is denoted by (w), where the w ∈ x. Determination of local periods need to find the local periods. The local period is defined as the length of smallest central repetition. The central repetition is denoted by (z) and the position at this repetition is denoted by (k). First, find the choices or the possibilities of central repetition that occur on k positions. These choices are either being internal or external and on the right or left side. In Pattern period determination, the period of pattern was determined by choosing the smallest period. Choosing the critical factorization in this step, the critical factorization for the pattern was determined and this occurred after obtaining the all local periods and pattern period. This step has respective choices until it reaches the critical factorization position that will cut the pattern into two parts. First determine and then choose the higher local period. If there is only one local period as higher value then it will choose directly. Second, the arrangements of the characters have to be noticed alphabetically i.e. have to check the direction and choose which direction it have to follow. This direction depends on the forward and inverse directions. The forward direction means the characters are arranged alphabetically. The inverse means the characters are arranged alphabetically from the inverse direction. Third is the computation of the maximal suffix which is the higher length of all chosen characters that arranged lexicographically. Second searching phase, this phase included two choices in order to perform the comparisons during matching operation. First, the comparisons will start with pattern suffix (v) which can compare between text and pattern characters. The movement in v will start from left to right. If there is mismatch then the pattern will shift toward right by
©Informatics '09, UM 2009
RDT4 - 102
Proceeding of the 3rd International Conference on Informatics and Technology, 2009 the length of character position (k). If there is matching then the characters will continue comparisons until it complete all v characters and if they are completed then will change to the second choice. Second choice, the comparisons will start in prefix (u) characters. The movement here will occur from right to left. If there is mismatch then the pattern will shift toward right by the pattern period (p). But if matching occurs then the comparisons will continue until it complete all u characters. If all u characters are matching, then the pattern will shift by pattern period also [7]. Karp-Rabin algorithm has two phases, first preprocessing phase, this phase depends on the hash property. The function of hash is to convert the characters into integer values which can deal with in mathematical operations. Supposing the word (w) is found in both pattern (x) and text (y), and the length of window (during comparisons) is denoted by m. In preprocessing the hashing, it is easy to calculate for all pattern characters. Second Searching phase, in this phase, first the comparison will occur between the hash number of text window and the hash number of pattern. The movement of pattern will be start from left to right. If there is mismatch then the window will shift toward right by one character. Rehash property require to compute the new window hash after addition of new character on right side and remove the hash value of the first left character in the previous step. The rehash property will continue to occur after every shifting process. Supposing the first character was left from the previous step is denoted by (a) and the new character on the right side of window is denoted by (b). Second, if the hash number of pattern is equal to window hash number, it will compare all character (one by one) of pattern with the window characters. When the matching occurs for all characters, the shifting of pattern will move by one character only [9]. Quick search algorithm has two phases, first preprocessing phase, this phase depends on the quick search bad character (qsBc) table. This table represents the minimal resulted values from the difference between the position (i) of the character (c) and the pattern length (m-1). Where, m refers to the position of character that comes after the window during text matching. When a character is not presented in pattern, then the value must be m in qsBc table. Second searching phase, in this phase, the matching process will occur between the pattern and text characters. The movement of matching starts from left to right. If there is mismatch, then the shifting of the pattern depends on m value in qsBc table. But if matching occurs then the comparisons will continue to another character. If all characters are matching, then the shifting will depend on m also [10]. Implementation The experiment was run on our local machine (HP Intel(R) Core(TM) 2 CPU (1.66 GHz), 2.49 GB RAM, Windows XP service pack-2 32-bit Operating System, the compiler used was Intel C++. Results We execute the program using the experiments by using type of data 5 times for every data type at different times and then takes the average.
©Informatics '09, UM 2009
RDT4 - 103
3 30
Consumed time (sec.)
Consumed time (sec.)
Proceeding of the 3rd International Conference on Informatics and Technology, 2009
2.5 25 2 20 1.5 15 1 10 0.55 0 0 Tw o Way algorithm
Karp-Rabin algorithm
440 3.5 35 330 2.5 25 220 1.5 15 110 0.55 0 0
Quick search algorithm
Pattern Length
95 85 75 65 Tw o Way algorithm
Karp-Rabin algorithm
Quick search algorithm
55 45
(a) DNA sequence
(b) Protein sequence
35
Consumed time (sec.)
Consumed time (sec.)
25 440 3.5 35 330 2.5 25 220 1.5 15 110 0.55 0 0 Tw o Way algorithm
Karp-Rabin algorithm
(c) English text
Quick search algorithm
440 35 3.5 330 2.5 25 220 1.5 15 110 0.5 5 00
15 5
Tw o Way algorithm
Karp-Rabin algorithm
Quick search algorithm
(d) Source code
Fig. 3 (a), (b), (c) and (d) Evaluations for the consumed time when 50 MB types of data is used As shown in Fig. 3, the Quick search algorithm has showed the lowest consumed time followed by Karp-Rabin algorithm and then Two way algorithm when 50 MB size is used. The consumed time found in DNA sequence was lowest compared to other types of data used. Naturally, the consumed time was increases when the length of pattern increases i.e. 95 characters of pattern length has consume time higher than others. According to the results, Quick search algorithm had lowest consumed time when DNA used in 50 MB size. Quick search was given lowest consumed time in DNA sequence data type when minimum pattern length (5) character is used. The maximum pattern length (95) character was given lowest consumed time for quick search algorithm in DNA sequence. KarpRabin algorithm had lowest consumed time in DNA sequence when the minimum and maximum pattern length use. Two way algorithm had lowest consumed time in DNA sequence when minimum and maximum pattern length use.
©Informatics '09, UM 2009
RDT4 - 104
Proceeding of the 3rd International Conference on Informatics and Technology, 2009
12 120 Consumed time (sec.)
Consumed time (sec.)
20 200 15 150 10 100 5 50 0 0 Tw o Way algorithm
Karp-Rabin algorithm
Pattern Length
10 100 880 660
95
440
85
220
75
00
Quick search algorithm
65 Tw o Way algorithm
Karp-Rabin algorithm
Quick search algorithm
55 45
(a) DNA sequence
35
(b) Protein sequence
990 880 770 660 550 440 330 220 110 00
Consumed time (sec.)
Consumed time (sec.)
25
Tw o Way algorithm
Karp-Rabin algorithm
(c) English text
Quick search algorithm
15
90 9 80 8 70 7 60 6 50 5 40 4 30 3 20 2 10 1 00
5
Tw o Way algorithm
Karp-Rabin algorithm
Quick search algorithm
(d) Source code
Fig. 4 (a), (b), (c) and (d) Evaluations for the consumed time when 200 MB types of data is used
The consumed time showed lowest in Quick search algorithm than other two algorithms when 200 MB data size used. Also, it was found that the consumed time was lowest when Source code used when compared with other data types. The consumed time was increased when length of pattern increased too. However, the quick search was showed lowest consumed time at Source code at 200 data size (as shown in Fig 4). Quick search algorithm was given lowest consumed time in Source code when the minimum pattern length (5) and maximum pattern length (95) use. Karp-Rabin was given lowest consumed time in English text for both maximum (95) and minimum (5) pattern length use. Two way algorithm had lowest consumed time in Source code data type when use the pattern length (5) and (95). The quick search algorithm showed lowest time than others. This is because the number of attempts was lower if compared with Karp-Rabin and Two way algorithms depending on previous studies. The consumed time was effected by the data size because they are differ in the types of characters included in each data type i.e. the consumed was showed lowest in DNA when use the pattern length 50 MB while when increased the types of characters then the Source code showed lowest consumed time in 200 MB. Conclusion The consumed time is changeable finding and it influenced by several factors. These factors are type of algorithm, data type, data size and length of pattern. The Quick search is given lowest consumed time for different data types, data sizes and pattern lengths. This change majorly depends on the technique of the algorithm used.
©Informatics '09, UM 2009
RDT4 - 105
Proceeding of the 3rd International Conference on Informatics and Technology, 2009 Acknowledgment This research is partly supported by USM short term grant 304/pkomp/ 639037 References [1] J. Gregor and R.S. Harris "String matching with left-to-right networks". Pattern Recognition Letter, Vol 16, 1995, pp. 213-218. [2] R.S.Viswanadha, B.A. Vinaya and M.Mrudula. "Backend Engine for Parallel String Matching Using Boolean Matrix". International Symposium on Parallel Computing in Electrical Engineering, 2006, pp.281-283. [3] L.Zheng., C.Xin., J.Borneman. and J. Tao (). “A fast algorithm for approximate string matching on gene th sequences,” in Symposium. 16 Annu. Combinatorial Pattern Matching. Springer-Verlag, Vol 3537,2005, pp.7990. [4] A. A. Hassan. "Mixed Heuristic Algorithm for Intelligent String Matching for Information Retrieval". Proceedings of the Sixth International Conference on Computational Intelligence and Multimedia Applications. IEEE, 2005, pp. 11-16. [5] K. Bi, N. Gu., K. Tu, X. Liu. and G. Liu. "A Practical Distributed String Matching Algorithm Architecture and Implementation". Proceeding of world academy of science, engineering and technology, Vol 10, 2005, pp.13076884. [6] M. Miyzaki., A. Shinohara. and M. Takeda. "An improved pattern matching algorithm for string in terms of straight line programs". Journal of Discrete Algorithms, Vol 1:1, 1997, pp.187-204. [7] M. Crochemore. and D. Perrin. "Two way string matching". Journal of the Association for Computing Machinery, vol 38, 1991, pp. 65I -675. [8] C.Charras. and T. Lecroq. "Handbook of Exact String_Matching Algorithms". King's College Publications, France, 2004. [9] R.M Karp. and M.O.Rabin. "Efficient randomized pattern-matching algorithms", IBM Journal of Research and Development, vol 31, 1987, pp. 249–260. [10] D.M. Sunday. "A very fast substring search algorithm". Communications of the Association for Computing Machinery, vol 33, 1990, pp. 132–142. [11] A.A. Hnaif, M. Alhalaiqah, O. Abouabdalla, S. Ramadass. and M. K. Mohammed.. "Parallel Quick Search Algorithm to Speed Packet Payload Filtering in NIDS", Paper accepted in Journal of Engineering Science and Technology (JESTEC), 2008. [12] P.D. Michailidis. and K.G Margaritis. "On-line String Matching Algorithms: Survey and Experimental Results". International Journal of Computer and Mathematic, Vol 76, 2000, pp. 411-434.
©Informatics '09, UM 2009
RDT4 - 106