Proceeding Of The 3rd International Conference On Informatics And Technology,

  • June 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Proceeding Of The 3rd International Conference On Informatics And Technology, as PDF for free.

More details

  • Words: 3,247
  • Pages: 7
Proceeding of the 3rd International Conference on Informatics and Technology, 2009 BM-KMP HYBRID ALGORITHM FOR EXACT AND SUBSEQUENCE STRING MATCHING 1

2

3

Ammar Waysi Mahmood , Nur'Aini binti Abdul Rashid , Atheer A. Abdul Rozaq UniversitySains Malaysia, 11800, Pulau Penang, Malaysia: [email protected] School of computer Science 2 UniversitySains Malaysia, 11800, Pulau Penang, Malaysia: [email protected] School of computer Science 3 UniversitySains Malaysia, 11800, Pulau Penang, Malaysia: [email protected] School of computer Science 1

ABSTRACT This study focuses in hybridizing two well-known exact string matching algorithms which are Boyer-Moore and Knuth-Morris-Pratt string matching algorithms. The hybrid algorithm employs main ideas of the two phases of Boyer-Moore and Knuth-Morris-Pratt algorithms. The hybrid algorithm employs good prefix idea from Knuth-MorrisPratt algorithm and bad character shifting from Boyer-Moore algorithm. Then the proposed hybrid algorithm was adapted for subsequence matching with some modifications in preprocessing phase and search phase. Both the hybrid and enhanced algorithms were tested using four types of alphabet which are binary alphabets, DNA alphabet, protein alphabet and English alphabet. The results of these algorithms show better results compared to Boyer-Moore and Knuth-Morris-Pratt algorithms in terms of time and the number of characters compared in different sizes of alphabet. The two algorithms also showed better results than Knuth-Morris-Pratt algorithm in all types of alphabet, and better result than Boyer-Moore algorithm in binary, DNA and protein alphabet. However, the algorithms performed worse than Boyer-Moore algorithm in English alphabet. Keywords: Boyer Moore, BM, Knuth-Morris-Pratt, KMP, Exact String matching, subsequence matching, Hybrid algorithm 1.0 INTRODUCTION String matching is the process of finding patterns in a text is the oldest and often used operations in text processing, which is one of the fields in computer science. It is used in all word processing systems in the form of finding and replacing texts. This operation is becoming more and more complex due to the drastic increase in the size of text. The current interest in the applications of string matching algorithms in computational biology and information retrieval keeps the research of this field alive and active. Although the research started a long way back, there are still a lot of interests in the research especially towards creating simpler ideas that work better in practice. In this research we proposed a new algorithm which is the hybrid of two existing string matching algorithms. The existing algorithms are Boyer-Moore and Knuth Morris-Pratt algorithms. Both algorithms are different in nature. The proposed hybrid algorithm will take advantage of the two existing algorithm and is adjusted to suit a new set of data which is biological sub-sequences. 2.0 PREVIOUS STUDIES In this section we will take about previous studies of hybrid algorithms for exact string matching and algorithms for subsequence matching. 2.1 Hybrid Algorithms for Exact String Matching Exact string matching algorithms have different behavior and different speed, and the speed of these algorithms changed depending on the size of alphabet. Therefore, it is useful to combine two (or more) of these algorithm to get the advantage of both of them in increasing the speed of searching to reduce the time needed for matching strings and making the new algorithm compatible for any size of alphabet as much as possible. [1] 2.1.1 Faster Hybrid Algorithm Boyer-Moore algorithm is one of the fastest and most important algorithms for exact string matching. Therefore, it is normal to see some studies trying to hybrid this algorithm with other algorithms to get a new and fast algorithm. The hybrid Boyer-Moore algorithms with Quick search algorithm for exact string matching is done by utilizing the continuous skip over the text, this happened by large shift on text. This algorithm, like pure Boyer-Moore algorithm, has two phases: preprocessing phase and searching phase. It is based on bad character and good suffix by trying to get the benefit from both of them and comparison of the characters between text and pattern in this algorithm

©Informatics '09, UM 2009

RDT4 - 81

Proceeding of the 3rd International Conference on Informatics and Technology, 2009 during search phase done from right to left. The results of this algorithm are faster than other algorithms like BoyerMoore algorithm, Quick search algorithm, especially if the pattern length is small. [2] 2.1.2 Berry-Ravindran Fast Search hybrid algorithm The BRFS hybrid algorithm has been presented to combining between two searching algorithms they are Fast Search algorithm and Berry-Ravindran algorithm to improve the performance by increasing the distance of the shifting. This algorithm contained two phases: preprocessing and search phases. In preprocessing phase there are two functions, the first is good suffix for Boyer-Moore algorithm and the second bad character heuristics for Berry-Ravindran algorithm. The search phase happened from right to left, this algorithm has high performance and faster than any algorithms in small alphabets and long patterns, which exactly a biological sequence properties. [4] 2.1.3 TVSBS Hybrid Algorithm The TVSBS hybrid algorithm done by combining Berry-Ravindran algorithm and SSABS algorithm, this algorithm using Berry-Ravindran algorithm bad character shifting to reduce character comparisons. It is designed for nucleotide and amino acid sequences. The preprocessing phase calculates Berry-Ravindran bad character for all alphabets not only for one character. [5] 2.1.4 Franek - Jennings – Smyth Hybrid Algorithm Knuth–Morris-Pratt algorithm is also an important algorithm in the exact string matching. This algorithm is combined with Sunday algorithm to produce new hybrid. Where is the comparison of the characters between text and pattern in this algorithm done from left to right. It depends on partial matching, if there is a partial matching between pattern and text the Knuth–Morris-Pratt algorithm will be used for determine the comparison for the next step, otherwise, it is also has preprocessing and search phases the shifting of a Sunday algorithm will be used to determined the next position of the comparison. The algorithm produced better result (about 5-10% better), compared to the other exact string matching algorithms especially with alphabet size above twenty characters. [6] 2.2 Subsequences Matching In previous studies many different methods and algorithms were tried to deal with subsequence problem using different methods for different purposes. 2.2.1 Subsequence Automaton Algorithm The subsequence automaton algorithm is deterministic finite automaton has been used for constricting subsequences in multi texts, which can accept all subsequence of a set of texts. This method can be used as preprocess to a given set of texts with any query and return the number of the texts which contains the query for subsequence. [7] 2.2.2 Aligned Subsequence Matching Algorithm In novel algorithm, called the aligned subsequence, matching is based on segmentation of real numbers sequence running in linear time, after the segmentation a calculation of the distance of the similarity will be done, this similarity can be calculate using special function. They also presented an indexing method based on suffix tree to speed up the linearity for this algorithm; this algorithm designed for retrieval of similar subsequence. The performance of the algorithm is 6.5 times faster compared to results of other algorithms from UC and KDD archive. [8] 2.2.3 Subsequence Matching for Communications Subsequence matching, as we have mentioned is used in nature language texts and bioinformatics sequence, the additional use of it is in communication. In a study by Mercier et al used as error-correction of insertions and deletions for a packets to and by present a framework of finding the number of subsequences when deleting any number of symbols from the string, this framework can apply only to no binary strings alphabet with technique called a sandwich technique.[9]

©Informatics '09, UM 2009

RDT4 - 82

Proceeding of the 3rd International Conference on Informatics and Technology, 2009

2.7.4 WASP Algorithm WASP (Windows- Accumulated Subsequence matching Problem) algorithm depends on windows accumulated to matching subsequence in long text, the window is the subsequence and this window sliding over the text to discover all possible sequence that may the window’s pattern be the subsequence of it within the size of the window by generalizing KMP algorithm to be used in subsequence matching. [10] 3.0 METHODOLOGY In our methodology, we will get the benefit from a bad character shift in BM algorithm shift and prefix shift from KMP algorithm. As already known, both of the algorithms have two phases; the preprocessing phase and the search phase; therefore, our algorithm too has two phases. 3.1.1 Preprocessing Phase for Hybrid Exact String Matching Algorithm The preprocessing phase of KMP algorithm only produces a single table; this table contains the information that KMP algorithm uses for prefixes of the pattern, our algorithm uses same table as one of the two tables our algorithm needs to. This table was created by a preprocessing function in the KMP algorithm which is usually called “create next”. The preprocessing phase of BM algorithm produces two tables. The first table is for a bad character and the second table is for a good suffix. We will use the idea of the bad character in our algorithm. The bad character table contains the value of the shifting to get the right most position for each character in the pattern. For other characters that do not belong to the pattern, the shifting value will be maximized and will be assigned to the value “m”, where “m” is the size of the pattern. Our proposed algorithm uses a bad character table not only for the right most characters as in BM, but a bad character table for each position in the pattern. The following table shows this idea for the pattern “character”: Table1: Bad character table for all positions in the pattern 0 1 2 3 4 5 6 7 8

c 0 1 2 3 4 0 1 2 3

h 1 0 1 3 3 4 5 6 7

a 1 2 0 2 0 1 2 3 4

r 1 2 3 0 1 2 3 4 0

t 1 2 3 0 5 6 0 1 2

e 1 2 3 4 5 6 7 0 1

Others 1 2 3 4 5 6 7 8 9

3.1.2 Searching Phase for Hybrid Exact String Matching Algorithm In our algorithm the first comparison of the characters of the pattern and text begin from the right most character (the character has index m-1). To get benefit from maximum shifting as much as possible which equal m. This comparison from right most character is done only once after each shifting. The second comparison is done in the first element on the left (the character has index 0), then it will continue until the element has index m-2. This type of comparison takes advantage from the prefix of KMP table and bad character shifting from bad character table. The Fig.1 shows the arrangement of the character comparison:

comparison

2

3

4

Index

0

1

2

…………

1

………… m-1

Fig.1: The character comparison arrangements

©Informatics '09, UM 2009

RDT4 - 83

Proceeding of the 3rd International Conference on Informatics and Technology, 2009 If the mismatching occurs in the first comparison (the right most character) then the next shift will depend on the value in the bad character table. Otherwise it will depend on the bad table or KMP table, depending on which table has a bigger shifting. To calculate the shifting in KMP algorithm, we will use the KMP prefix table. 3.2 HYBRID BM-KMP ALGORITHM FOR SUBSEQUENCE MATCHING We extended the hybrid of BM-KMP for subsequence matching, to match all possible subsequences of the pattern in the text. The modification is done at both the preprocessing and searching phase. 3.2.1 Preprocessing phase for subsequence string matching algorithm The preprocessing phase for exploring subsequences of the pattern in the text contains a table. It is the same idea of Table1 (bad character table) but, with a single difference. Table1 carries the values of that shift to right, and the new table has the same idea of shifting, but, with left direction, for example the table for the pattern “character” shows is shown below: Table 2: The bad character shifting for left direction

0 1 2 3 4 5 6 7 8

c 0 -1 3 2 1 0 -1 -1 -1

h 1 0 -1 -1 -1 -1 -1 -1 -1

a 2 1 0 1 0 -1 -1 -1 -1

r 3 2 1 0 4 3 2 1 -1

t 6 5 4 3 2 1 0 -1 -1

e 7 6 5 4 3 2 1 0 -1

The table is used to determine character was deleted in the subsequence from the original string and will be used in the search phase. 3.2.2 Searching Phase for Subsequence String Matching Algorithm Although in the search phase of exploring subsequences has the same idea with sliding window, there are some modifications in the search phase that makes the algorithm convenient to exploring subsequence, these changes can be summarized as following: • If there is a match in the prefix, then a mismatch will occur and the character which causes the mismatch is not in the pattern, then, we will put the symbol “-“ instead this character and complete the matching. • If there is a match at the prefix and then a mismatch occurs and the character which causes the mismatch is at the right of mismatch position, then there is a shifting to the left of the pattern suffix that depends on the value of bad character table for left shifting. This shifting is done only if there is no another matching between the characters from mismatch positions to the character will be shifted, otherwise this mismatch replace the character with symbol “-“. 4.0 RESULT AND EVALUATION The algorithm was tested in four types of alphabet and fourteen different pattern lengths. There are two types of result collected. The first type is the number of characters comparisons for all the four types of alphabet and the second is the execution time for all the four types of the alphabet. We have implemented the program using a personal computer with 2.8GHz Intel processor, and 512 MB of RAM. The operating system used in this experiment is Microsoft windows XP service pack2, with Microsoft Visual C++ compiler. 4.1RESULTS OF CHARACTER COMPARISON The results of character comparison are collected by comparing characters in the pattern and in the text. The size of text used for all type of alphabet is 1MB. We used English, protein, DNA and Binary alphabet. As shows:

©Informatics '09, UM 2009

RDT4 - 84

Proceeding of the 3rd International Conference on Informatics and Technology, 2009

Fig2: Number of the character comparisons in English alphabet

Fig3: Number of the character comparisons in protein alphabet

Fig4: Number of the character comparisons in DNA alphabet

Fig5: Number of the character comparisons in Binary alphabet

4.2 RESULT OF EXECUTION TIME EVALUATION We compare the BM, KMP, BM-KMP hybrid and Subsequence algorithms in terms of the execution time. The text size used in this evaluation is 10MB for all four types of alphabet. The range of pattern length “m” used is the same the previous experiments. The time is measured using milliseconds.

Fig 6: Execution time for string matching in English alphabet

©Informatics '09, UM 2009

Fig 7: Execution time for string matching in protein alphabet

RDT4 - 85

Proceeding of the 3rd International Conference on Informatics and Technology, 2009

Fig 8: Execution time for string matching in DNA alphabet

Fig 8: Execution time for string matching in Binary alphabet

The BM-KMP hybrid behaves similarly to the to BM algorithm when tested with English alphabet. The results become much better and more efficient than BM algorithm when we apply the BM-KMP hybrid algorithms to small size of alphabet. The reason is BM-KMP hybrid algorithm has good prefix function which reduces the number of comparison when the pattern and the text have good prefix. The possibility of finding the good prefix increases in small size of alphabet, while the bad character shifting function in BM algorithm is not very useful in small size of alphabet. The execution of subsequence matching algorithm is a little bit more than KMP algorithm. This happened because in subsequence matching the algorithm should pass over all the text without any shifting to match all possible subsequence of the pattern in the text. 5.0 Conclusion In this study we integrate two well-known exact string matching algorithms which are Boyer-Moore and KnuthMorris-Pratt. We employed the good prefix shifting rules from KMP algorithm and bad character shifting rules from BM algorithm. The enhanced algorithm has two phases, preprocessing phase and search phase. In preprocessing phase, the algorithm generates two types of tables. The first table is the same table as in KMP algorithm for good suffix, and the second table is similar to the table of bad character shifting with one difference where instead of keeping one position, the new table keeps all position of pattern. During search phase, when a character mismatch if found the shifting will be done according to the good prefix table or bad character shifting depending on which one gives maximum shifting. We modify the preprocessing and the search phase of the hybrid algorithm for subsequence matching. In preprocessing phase, we change the bad character table, and during search phase the algorithm will check all the character in the sliding window. The hybrid algorithm for exact string matching shows better result than BM and KMP algorithms, especially when we have small size of alphabet. The algorithm shows a significance difference when compared to BM algorithm. The hybrid algorithm for exact string matching shows very big difference in results when compared to KMP algorithm. The subsequence matching algorithm behaves like KMP algorithm but executes much longer and has larger number of comparison compared to KMP algorithm. REFERENCES MICHAILIDIS, P. D. & MARGARITIS, K. G. (2001), "On-line String Matching Algorithms: Survey and [1] Experimental Results", International Journal of Computer Mathematics, Vol. 76. No 4, pp. 411-434. [2] KESONG, H., YONGCHENG, W. & GUILIN C. (2000), "Research on a Faster Algorithm for Pattern Matching", Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages, Hong Kong, China, pp. 119-124. [3] HUANG, Y., PING, L., PAN X. & CAI G. (2008), "A Fast Exact Pattern Matching Algorithm for Biological Sequences", BioMedical Engineering and Informatics, 2008. BMEI 2008. International Conference on, Vol. 1, pp. 8-12. TATHOO, R. VIRMANI, A., LAKSHMI, S. S., BALAKRISHNAN, N. & SEKAR, K. (2006), "TVSBS: A Fast [4] Exact Pattern Matching Algorithm for Biological Sequences", Journal of Indian Academy of Sciences, Vol. 91, No 1, pp. 47-53. [5] FRANEK F., JENNINGS C. J. & SMYTH, W. F. (2007), "A Simple Fast Hybrid Pattern-Matching Algorithm", J. of Discrete Algorithms, Vol. 5, No. 4, pp. 682-695.

©Informatics '09, UM 2009

RDT4 - 86

Proceeding of the 3rd International Conference on Informatics and Technology, 2009 [6]

[7]

[8]

[9]

HOSHINO, H., SHINOHARA, A., TAKEDA, M. & ARIKAWA, S. (2000), "Online Construction of Subsequence Automata for Multiple Texts", in String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium on, pp. 146-152. PARK, S., LEE, D. & CHU, W. W. (1999), "Fast Retrieval of Similar Subsequences in Long Sequence Databases", Proceedings of Workshop on Knowledge and Data Engineering Exchange, pp. 60-67. MERCIER, H., KHABBAZIAN, M. & BHARGAVA, V. K. (2008), "On the Number of Subsequences When Deleting Symbols From a String", IEEE Transactions on Information Theory, Vol. 54, No. 7, pp. 32793285. BOASSON L., PATRICK, C., GUESSARIAN, I. & MATIYASVICH Y. (2001), "Window-Accumulated Subsequence Matching Problem is Linear", Annals of Pure and Applied Logic. Elsevier, Vol. 113. No 1, pp. 59-80.

©Informatics '09, UM 2009

RDT4 - 87

Related Documents