String Processing Engr. Tazeen Muzammil
Basic Terminologies • Each programming language contains a character set that is used to communicate with the computer. The character set usually includes the following: • Alphabet: A, B, C, D, ..., Z • Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 • Special characters: +, -, /, *, ^, &, %, = etc.
• A finite sequence of 0 or more characters is called a string. • The number of characters in a string is called its length. • The string with zero characters is called the empty string or null string.
Storing Strings Strings are stored in three types of structures: 1. Fixed-length structures 2. Variable-length structures 3. Linked structures
Fixed-Length Storage • Record-Oriented
– In fixed-length storage each line of print is viewed as a record, where all records have the same length, i.e. each record accommodates the same number of characters. Assume our record has length 80 unless otherwise stated.
• Suppose the input consists of a program. Using a record-oriented, fixed length storage medium, the input data will appear in memory as shown in figure, where we assume that 200 is the address of the first character of the program.
Program
C     PROGRAM PRINTING TWO INTEGERS IN INCREASING ORDER
      READ *,J,K
      IF(J.LE.K) THEN
        PRINT *,J,K
      ELSE
        PRINT *,K,J
      ENDIF
      STOP
      END
[Figure: Records stored sequentially in the computer. Each line of the program occupies one 80-character record; the first record (the comment line) begins at address 200, the IF record at 360, and the END record at 840.]
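The addressing in record-oriented storage can be sketched in C++ as follows. This is only an illustration, assuming the record length of 80 and the starting address of 200 used on the slide; the names `make_record`, `store_records` and `record_address` are hypothetical helpers, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed values taken from the slide: 80-character records, first address 200.
constexpr std::size_t kRecordLength = 80;
constexpr std::size_t kBaseAddress = 200;

// Pad a line with blanks so that it fills exactly one record.
std::string make_record(const std::string& line) {
    assert(line.size() <= kRecordLength);  // a longer line does not fit in one record
    std::string record = line;
    record.resize(kRecordLength, ' ');
    return record;
}

// Store all records back to back, one after another.
std::string store_records(const std::vector<std::string>& lines) {
    std::string memory;
    for (const std::string& line : lines) {
        memory += make_record(line);
    }
    return memory;
}

// Address of the first character of record k (k = 1, 2, 3, ...).
std::size_t record_address(std::size_t k) {
    assert(k >= 1);
    return kBaseAddress + (k - 1) * kRecordLength;
}
```

Because every record has the same length, the address of record k is a single multiplication and addition, which is why access to any given record is easy. The padding in `make_record` is also where the wasted blank space comes from.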
Advantages • Advantages – The ease of accessing data from any given record – The ease of updating data in any given record (as long as the length of the new data does not exceed the record length)
• Disadvantages – Time is wasted reading an entire record if most of the storage consists of blank spaces. – Certain records may require more space than is available. – When the correction consists of more or fewer characters than the original text, changing a misspelled word requires the entire record to be changed.
Variable-Length Storage with Fixed Maximum
• Although strings may be stored in fixed-length memory locations as above, there are advantages in knowing the actual length of each string; one does not have to read the entire record when the string occupies only the beginning part of the memory location.
• The storage of variable-length strings in memory cells with fixed lengths can be done in two general ways:
1. One can use a marker, such as two dollar signs ($$), to signal the end of the string.
2. One can list the length of the string as an additional item in the pointer array.
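Both ways of recording the actual length can be sketched in C++. The cell length of 80 is carried over from the earlier slides as an assumption, and `store_with_marker`, `read_with_marker`, `PointerEntry` and `read_with_length` are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr std::size_t kCellLength = 80;  // assumed fixed maximum length of a cell

// Way 1: append the end marker "$$" and pad the rest of the cell with blanks.
std::string store_with_marker(const std::string& s) {
    assert(s.size() + 2 <= kCellLength);  // string plus marker must fit in the cell
    std::string cell = s + "$$";
    cell.resize(kCellLength, ' ');
    return cell;
}

// Read only up to the marker (assumes the text itself never contains "$$").
std::string read_with_marker(const std::string& cell) {
    return cell.substr(0, cell.find("$$"));
}

// Way 2: keep the actual length next to the address in the pointer array.
struct PointerEntry {
    std::size_t offset;  // where the cell starts in memory
    std::size_t length;  // actual number of characters in the string
};

// Read exactly `length` characters, ignoring the unused part of the cell.
std::string read_with_length(const std::string& memory, const PointerEntry& entry) {
    return memory.substr(entry.offset, entry.length);
}
```

In both cases the reader stops as soon as the string ends instead of scanning all 80 characters of the cell.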
Linked Storage • A computer must be able to correct and modify the printed matter, which usually means deleting, changing, and inserting words, phrases, sentences and even paragraphs in the text. Fixed-length memory cells do not easily lend themselves to these operations. For this reason strings are stored by means of linked lists.
Linked List
• A linked list, or one-way list, is a linear collection of data elements called nodes, where the linear order is given by means of pointers.
Linked Lists
[Figure: Head → A → B → C → ∅]
• A linked list is a series of connected nodes • Each node contains at least – A piece of data (any type) – Pointer to the next node in the list
• Head: pointer to the first node • The last node points to NULL
[Figure: a node A with a data field and a pointer field]
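The node structure described above might look like this in C++. It is a minimal sketch that assumes each node holds a single `char`; `Node`, `push_front`, `list_length` and `free_list` are hypothetical names used only for illustration.

```cpp
#include <cstddef>

struct Node {
    char data;   // a piece of data (a char here, but any type would do)
    Node* next;  // pointer to the next node; nullptr marks the end of the list
};

// Insert a new node in front of the current head and return the new head.
Node* push_front(Node* head, char value) {
    return new Node{value, head};
}

// Count the nodes by following the pointers until nullptr is reached.
std::size_t list_length(const Node* head) {
    std::size_t n = 0;
    for (const Node* p = head; p != nullptr; p = p->next) {
        ++n;
    }
    return n;
}

// Release every node of the list.
void free_list(Node* head) {
    while (head != nullptr) {
        Node* next = head->next;
        delete head;
        head = next;
    }
}
```

Pushing 'C', then 'B', then 'A' onto an empty list (head equal to `nullptr`) gives the list A, B, C, with the last node pointing to `nullptr`.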
Linked Storage • Strings may be stored in a linked list as follows. Each memory cell is assigned one character or a fixed number of characters, and a link contained in the cell gives the address of the cell containing the next character or group of characters in the string. For example: To be or not to be, that is the question.
Linked Storage
[Figure: the string stored as a linked list, first with one character per node, then with four characters per node]
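Storing a string with one character or a fixed number of characters per node can be sketched as below. `ChunkNode`, `build_list`, `collect`, `count_nodes` and `free_chunks` are hypothetical names, and using a `std::string` inside each node is a simplifying assumption rather than how a memory cell would really be laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct ChunkNode {
    std::string chars;  // one character, or a fixed number of characters
    ChunkNode* next;    // link to the node holding the next group
};

// Build a list holding `text` with `per_node` characters in each node
// (the last node may hold fewer).
ChunkNode* build_list(const std::string& text, std::size_t per_node) {
    assert(per_node >= 1);
    ChunkNode* head = nullptr;
    ChunkNode** tail = &head;  // the link where the next node gets attached
    for (std::size_t i = 0; i < text.size(); i += per_node) {
        *tail = new ChunkNode{text.substr(i, per_node), nullptr};
        tail = &(*tail)->next;
    }
    return head;
}

// Follow the links and glue the groups back together.
std::string collect(const ChunkNode* head) {
    std::string result;
    for (const ChunkNode* p = head; p != nullptr; p = p->next) {
        result += p->chars;
    }
    return result;
}

std::size_t count_nodes(const ChunkNode* head) {
    std::size_t n = 0;
    for (const ChunkNode* p = head; p != nullptr; p = p->next) {
        ++n;
    }
    return n;
}

void free_chunks(ChunkNode* head) {
    while (head != nullptr) {
        ChunkNode* next = head->next;
        delete head;
        head = next;
    }
}
```

With four characters per node the first node of the example sentence holds "To b". Fewer nodes mean fewer links to store, at the cost of more work when text is inserted in the middle of a node.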
String Operations
• Substring (substr(pos, len))
– Accessing a substring from a given string requires two pieces of information:
1. The position of the first character of the substring, and
2. The length of the substring.
• Indexing (find())
– Indexing refers to finding the position at which a substring first appears in a string.
find(string)
find(string, positionFirstChar)
find(string, positionFirstChar, len)
rfind() (finds the last occurrence of a string or substring)
• Concatenation
– String concatenation is the operation of joining two character strings end to end. For example, the strings "snow" and "ball" may be concatenated to give "snowball".
• Length (length(), size())
– The number of characters in the string is called the length or size of the string.
Example
string s = "abc def abc";
string s2 = "abcde uvwxyz";
char c;
char ch[] = "aba daba do";
char *cp = ch;
• Substring
s = s2.substr(1,4);
s = s2.substr(1,50);
• Length
i = s.length();
i = s.size();
• Concatenation
s2 = s2 + "x";
s2 += "x";
• Find
i = s.find("ab",4);
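The calls in this example can be wrapped in small functions to see what they return. The wrappers `substring`, `index_of`, `last_index_of` and `concatenate` are hypothetical names; the `std::string` members they call (`substr`, `find`, `rfind`, `operator+`) are standard. Note that positions in `std::string` are counted from 0.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Substring starting at 0-based position pos with at most len characters;
// std::string::substr clips len at the end of the string.
std::string substring(const std::string& s, std::size_t pos, std::size_t len) {
    assert(pos <= s.size());  // substr would throw std::out_of_range otherwise
    return s.substr(pos, len);
}

// Position of the first occurrence of pattern at or after `from`,
// or std::string::npos if there is none.
std::size_t index_of(const std::string& s, const std::string& pattern,
                     std::size_t from = 0) {
    return s.find(pattern, from);
}

// Position of the last occurrence of pattern, or std::string::npos.
std::size_t last_index_of(const std::string& s, const std::string& pattern) {
    return s.rfind(pattern);
}

// Join two strings end to end.
std::string concatenate(const std::string& a, const std::string& b) {
    return a + b;
}
```

With the declarations from the slide, `substring(s2, 1, 4)` gives "bcde", `substring(s2, 1, 50)` gives "bcde uvwxyz" because the length is clipped, and `index_of(s, "ab", 4)` gives 8, the start of the second "abc".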
Word Processing
• The operations usually associated with word processing are:
– Replacement
• Replacing one string in the text by another
replace(pos1, len1, string)
replace(pos1, len1, string, pos2, len2)
– Insertion
• Inserting a string in the middle of the text
insert()
– Deletion
• Deleting a string from the text
erase(positionFirstChar)
erase(positionFirstChar, len)
Example
string s = "abc def abc";
string s2 = "abcde uvwxyz";
char c;
char ch[] = "aba daba do";
char *cp = ch;
• Replace
s.replace(4,3,"x");
• Erase
s.erase(4,5);
s.erase(4);
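The effect of each word-processing call can be checked with the sketch below. The wrappers `replace_at`, `insert_at` and `erase_at` are hypothetical; they take the string by value so that each call starts from an unmodified copy. The `std::string` members `replace`, `insert` and `erase` are standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace len characters starting at 0-based position pos by `with`.
std::string replace_at(std::string s, std::size_t pos, std::size_t len,
                       const std::string& with) {
    assert(pos <= s.size());  // replace would throw std::out_of_range otherwise
    s.replace(pos, len, with);
    return s;
}

// Insert `what` so that it starts at 0-based position pos.
std::string insert_at(std::string s, std::size_t pos, const std::string& what) {
    assert(pos <= s.size());
    s.insert(pos, what);
    return s;
}

// Erase len characters starting at pos; by default erase to the end.
std::string erase_at(std::string s, std::size_t pos,
                     std::size_t len = std::string::npos) {
    assert(pos <= s.size());
    s.erase(pos, len);
    return s;
}
```

Starting from "abc def abc" each time, `replace_at(s, 4, 3, "x")` gives "abc x abc", `erase_at(s, 4, 5)` gives "abc bc", and `erase_at(s, 4)` gives "abc " (everything from position 4 onward is removed).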
Question
A. A text T and a pattern P are in memory. Write an algorithm which deletes every occurrence of P in T.
1. [Find the index of P] Set K = Find(T,P)
2. Repeat while K != 0
a) [Delete P from T] Set T = Delete(T, Find(T,P), Length(P))
b) [Update index] Set K = Find(T,P)
[End of loop]
3. Write T
4. Exit
B. A text T and patterns P and Q are in memory. Write an algorithm which replaces every occurrence of P in T by Q.
1. [Find the index of P] Set K = Find(T,P)
2. Repeat while K != 0
a) [Replace P by Q] Set T = Replace(T,P,Q)
b) [Update index] Set K = Find(T,P)
[End of loop]
3. Write T
4. Exit
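Both algorithms can be sketched in C++ with `std::string`. The function names `delete_all` and `replace_all` are hypothetical. Two details differ from the pseudocode and are assumptions of this sketch: "not found" is signalled by `std::string::npos` instead of 0, and `replace_all` resumes the search after the inserted Q so that the loop also ends when Q itself contains P.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Algorithm A: delete every occurrence of p in t.
std::string delete_all(std::string t, const std::string& p) {
    assert(!p.empty());  // an empty pattern would be found forever
    std::size_t k = t.find(p);
    while (k != std::string::npos) {
        t.erase(k, p.size());  // delete P from T
        k = t.find(p);         // update index, rescanning from the start
    }
    return t;
}

// Algorithm B: replace every occurrence of p in t by q.
std::string replace_all(std::string t, const std::string& p,
                        const std::string& q) {
    assert(!p.empty());
    std::size_t k = t.find(p);
    while (k != std::string::npos) {
        t.replace(k, p.size(), q);    // replace this occurrence of P by Q
        k = t.find(p, k + q.size());  // continue after Q (deviation from the pseudocode)
    }
    return t;
}
```

Because `delete_all` rescans from the beginning, occurrences that only appear after a deletion are removed as well: deleting "ab" from "aabb" first leaves "ab" and then the empty string.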
Pattern matching Algorithm • Given strings T (text) and P(pattern), the pattern matching problem consists of finding a substring of T equal to P • T: “the rain in spain stays mainly on the plain” • P: “n th”
• We assume that the length of pattern does not exceed the length of text. • Applications: – Text editors – Web search engines (e.g. Google)
The Brute Force Algorithm • Check each position in the text T to see if the pattern P starts in that position
[Figure: the pattern P = "rew" shown at successive alignments under the text T = "andrew"; P moves 1 char at a time through T]
The Brute Force Algorithm
• The first pattern matching algorithm is the one in which we compare a given pattern P with each of the substrings of T, moving from left to right, until we get a match.
• Let Wk denote the substring of T having the same length as P and beginning with the Kth character of T: Wk = Substring(T, K, Length(P))
• First we compare P, character by character, with the first substring W1.
• If all the characters are the same, then P = W1, so P appears in T and Index(T,P) = 1.
• If some character of P is not the same as the corresponding character of W1, then P is not equal to W1 and we can move on to the next substring W2.
• The process stops when we find a match of P with some substring Wk, so that P appears in T and Index(T,P) = K, or
• when we exhaust all the Wk with no match, which means P does not appear in T.
• The maximum value of K is Length(T) - Length(P) + 1.
The Brute Force Algorithm
• P and T are strings with lengths R and S, respectively, and are stored as arrays with one character per element. The algorithm finds the index of P in T.
1. [Initialize] Set K = 1 and MAX = S - R + 1
2. Repeat Steps 3 to 5 while K <= MAX
3. Repeat for L = 1 to R [Test each character of P]
If P[L] != T[K+L-1], then: Go to Step 5.
[End of inner loop]
4. [Success] Set INDEX = K, and Exit
5. Set K = K + 1
[End of Step 2 outer loop]
6. [Failure] Set INDEX = 0
7. Exit.
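The steps above translate fairly directly into C++. The sketch below keeps the 1-based K and L and the convention INDEX = 0 for failure, so array accesses are shifted by one to match the 0-based `std::string`. The name `brute_force_index` is hypothetical, and returning 0 for an empty pattern is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Returns the 1-based index of the first occurrence of p in t, or 0 on failure.
std::size_t brute_force_index(const std::string& t, const std::string& p) {
    const std::size_t s = t.size();  // S, the length of T
    const std::size_t r = p.size();  // R, the length of P
    if (r == 0 || r > s) {
        return 0;  // assumption: empty or too-long patterns count as failure
    }
    const std::size_t last_start = s - r + 1;  // MAX in the pseudocode
    for (std::size_t k = 1; k <= last_start; ++k) {
        std::size_t l = 1;
        // Test each character of P against the window W_k.
        while (l <= r && p[l - 1] == t[k + l - 2]) {
            ++l;
        }
        if (l > r) {
            return k;  // all R characters matched: success
        }
    }
    return 0;  // every window was tried without a match: failure
}
```

For T = "andrew" and P = "rew" the function returns 4, since the match starts at the fourth character of T.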
Analysis
• Brute force pattern matching runs in time O(mn) in the worst case, where m is the length of P and n is the length of T.
• But most searches of ordinary text take O(m+n), which is very quick.
• Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"
• Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
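The difference between the two cases can be made visible by counting character comparisons. The sketch below is an instrumented variant of the brute force search; `SearchResult` and `brute_force_count` are hypothetical names, and the index is 1-based with 0 meaning failure, as in the pseudocode.

```cpp
#include <cstddef>
#include <string>

struct SearchResult {
    std::size_t index;        // 1-based index of the match, or 0 on failure
    std::size_t comparisons;  // number of character comparisons performed
};

SearchResult brute_force_count(const std::string& t, const std::string& p) {
    SearchResult result{0, 0};
    const std::size_t s = t.size();
    const std::size_t r = p.size();
    if (r == 0 || r > s) {
        return result;  // assumption: treated as failure with no comparisons
    }
    for (std::size_t k = 1; k <= s - r + 1; ++k) {
        std::size_t l = 1;
        while (l <= r) {
            ++result.comparisons;  // one comparison of P[L] with T[K+L-1]
            if (p[l - 1] != t[k + l - 2]) {
                break;  // mismatch: move on to the next window
            }
            ++l;
        }
        if (l > r) {
            result.index = k;
            return result;
        }
    }
    return result;
}
```

For a text of 26 a's followed by h (the exact length is an assumption made here) and P = "aaah", every one of the 24 windows needs all 4 comparisons, 96 in total. In the ordinary-text example most windows are rejected after a single comparison, so the count stays close to the length of T.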
The Boyer-Moore Algorithm • The Boyer-Moore pattern matching algorithm is based on two techniques. • 1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end
• 2. The character-jump technique
– when a mismatch occurs at T[i] == x (the text character at position i is x) – the character in the pattern, P[j], is not the same as T[i]
• There are 3 possible cases, tried in order.