Models for Computation: Part III

Models of Language Generation: Grammars


11. Ambiguity of Context-free Grammars

Ambiguity in a language occurs either when a symbol or an expression has more than one meaning (e.g., the word "story"), or when an expression can be grammatically parsed in two different ways. The former is called lexical (or semantic) ambiguity, and the latter syntactic (or structural) ambiguity. For example, in natural language, the sentence "A man entered the room with a picture" can be interpreted (i.e., parsed) into two different grammatical structures, as follows.

Parse 1: A man [entered [the room] [with a picture]] — the man carried the picture.
Parse 2: A man [entered [the room with a picture]] — the picture is part of the room.

This sentence is syntactically ambiguous. With no further information, it is impossible to know which way the sentence should be translated. In formal languages, given a grammar G and a sentence x (i.e., a string, in the formal language jargon), parsing shows how x can be derived by the grammar. If x can be derived in two structurally different ways, grammar G is ambiguous. Parsing is one of the main functions of the compiler of a programming language. In this chapter we will study syntactic ambiguity.


Ambiguity

11.1 Parse Tree
11.2 Parse Tree and Ambiguity
11.3 Eliminating Ambiguity of an Ambiguous CFG
  - Using parentheses
  - Fixing the order of rule applications
  - Eliminating redundant rules
  - Setting up precedence and associativity
Rumination
Exercises

Break Time: Dear Dad & Dear Son

Dear Dad,
$chool i$ really great. I am making lot$ of friend$ and $tudying very hard. With all my $tuff, I $imply can't think of anything I need, $o if you would like, you can ju$t $end me a card, a$ I would love to hear from you.
Love, Your $on

The Reply:
Dear Son,
I kNOw that astroNOmy, ecoNOmics, and oceaNOgraphy are eNOugh to keep even an hoNOr student busy. Do NOt forget that the pursuit of kNOwledge is a NOble task, and you can never study eNOugh.
Love, Dad
- Adrea -


Ambiguity

11.1 Parse Tree

In a formal language, the syntax (i.e., structure) of a string generated by a grammar depends on the rules applied as well as on their order of application. Syntax provides critical information for the compiler to translate the string (i.e., a program) into object code. Thus, if a string can be derived in two different ways, it is impossible to give it a unique translation. A parse tree is an efficient data structure for representing the syntax of a string derived by the grammar and for translating it by the compiler. For example, figure (b) below shows a parse tree for the string p∨q∧r generated by grammar G in figure (a).

(a) G: S → S∨S | S∧S | ¬S | A
       A → p | q | r

(b)        S
         / | \
        S  ∨  S
        |   / | \
        A  S  ∧  S
        |  |     |
        p  A     A
           |     |
           q     r


Parse Tree

Given a derivation (i.e., the sequence of rules applied) for a string w, the parse tree for w with respect to the derivation is constructed as follows. First, put down the root node labeled with the start symbol S. Then, for each leaf node of the current tree with a nonterminal label, say A, recursively expand the tree as follows: suppose that A → α is the rule applied next to derive w. For each symbol X appearing in α, in order from left to right, add a child node with label X under A. This procedure repeats until the tree has no leaf node labeled with a nonterminal symbol left to expand. Reading all the leaf nodes left to right on the final tree gives the string w. This string of terminal symbols is called the yield of the parse tree.

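The procedure just described can be sketched in code. The following is a minimal sketch (not from the text) that builds the parse tree for p∨q∧r by repeatedly expanding the leftmost nonterminal leaf according to a supplied derivation; the `Node` class, the ASCII operator symbols (`v` for ∨, `^` for ∧), and the rule encoding are illustrative assumptions.

```python
# Build a parse tree by repeatedly expanding the leftmost nonterminal
# leaf, following a given derivation (a sequence of rules A -> alpha).

class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

NONTERMINALS = {"S", "A"}

def build_parse_tree(start, derivation):
    root = Node(start)
    for head, body in derivation:
        leaf = leftmost_nonterminal_leaf(root)
        assert leaf is not None and leaf.label == head
        leaf.children = [Node(sym) for sym in body]
    return root

def leftmost_nonterminal_leaf(node):
    # Depth-first, left-to-right search for an unexpanded nonterminal.
    if not node.children:
        return node if node.label in NONTERMINALS else None
    for child in node.children:
        found = leftmost_nonterminal_leaf(child)
        if found is not None:
            return found
    return None

def yield_of(node):
    # Read the leaf labels left to right: the yield of the tree.
    if not node.children:
        return node.label
    return "".join(yield_of(c) for c in node.children)

# A leftmost derivation of p v q ^ r (ASCII: v = OR, ^ = AND):
derivation = [
    ("S", ["S", "v", "S"]),
    ("S", ["A"]), ("A", ["p"]),
    ("S", ["S", "^", "S"]),
    ("S", ["A"]), ("A", ["q"]),
    ("S", ["A"]), ("A", ["r"]),
]
tree = build_parse_tree("S", derivation)
print(yield_of(tree))  # pvq^r
```

Reading the leaves of the finished tree reproduces the derived string, exactly as the text's definition of the yield requires.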


In general, given source code in a programming environment, the compiler constructs a parse tree of the code and then, traversing the tree bottom up, left to right, generates object code (a machine language program). For example, suppose that for the string p∨q∧r the compiler has generated the parse tree in figure (b) below. The compiler generates machine language instructions that will access the values of variables q and r, compute q∧r and store the result (usually in a register), access the value of p, and finally execute the OR operation with the stored result of q∧r.

Because the tree is traversed bottom up, left to right, to generate the object code, the order of the logical operations depends on the tree. For this example, with the grammar and parse tree of figures (a) and (b) repeated from the previous page, the compiler generates object code that evaluates the logical expression p∨q∧r in the order p∨(q∧r).
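The bottom-up, left-to-right code generation can be sketched as a post-order traversal. This is an illustrative sketch, not the book's scheme: it emits stack-machine-style instructions (the names `LOAD`, `AND`, `OR` are invented), so the operand order differs slightly from the accumulator-style description above, but the operator order (∧ before ∨) is exactly the one the tree dictates.

```python
# Post-order (bottom-up, left-to-right) traversal of the parse tree
# for p v (q ^ r), emitting stack-machine instructions. The tree is
# given directly as nested tuples (operator, left, right).

def gen(node, code):
    if isinstance(node, str):          # a leaf: a variable name
        code.append(("LOAD", node))
    else:
        op, left, right = node
        gen(left, code)                # left subtree first
        gen(right, code)               # then right subtree
        code.append(("AND",) if op == "^" else ("OR",))
    return code

# p v (q ^ r): the AND is emitted before the OR, so it executes first.
tree = ("v", "p", ("^", "q", "r"))
for instr in gen(tree, []):
    print(instr)
```

Because the instruction for an operator is emitted only after both of its subtrees, the deepest operators of the tree always execute first.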


11.2 Parse Tree and Ambiguity

Figures (b) and (c) below show two parse trees yielding the same string p∨q∧r. In other words, the string can be derived by grammar G in figure (a) in two ways. Consequently, the two parse trees imply that the expression p∨q∧r can be evaluated in two different ways, i.e., as p∨(q∧r) and as (p∨q)∧r. This implies that for grammar G, the operator precedence between the two logical operations ∨ (OR) and ∧ (AND) is ambiguous.

(a) G: S → S∨S | S∧S | ¬S | A
       A → p | q | r

(b) p∨(q∧r)             (c) (p∨q)∧r

       S                       S
     / | \                   / | \
    S  ∨  S                 S  ∧  S
    |   / | \             / | \   |
    A  S  ∧  S           S  ∨  S  A
    |  |     |           |     |  |
    p  A     A           A     A  r
       |     |           |     |
       q     r           p     q


As we saw in the previous example, the existence of two parse trees yielding the same string is a problematic property of a CFG, called ambiguity, that should be eliminated. In real applications, we cannot expect correct results from a program written in the language of an ambiguous grammar. Before we discuss how to eliminate ambiguity from a CFG, we need a formal definition.

Definition (Ambiguity): A CFG G is ambiguous if there is a string x ∈ L(G) for which there are two different parse trees yielding x.

Unfortunately, it is an unsolvable problem to decide whether an arbitrary CFG is ambiguous or not. Also, there is no algorithm that, given an ambiguous CFG, converts it to an unambiguous grammar. However, for certain restricted constructs it is possible to solve these problems. In this section we will present several techniques with some examples.
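Although ambiguity is undecidable in general, for one fixed short string we can detect it directly by counting parse trees, since parse trees correspond one-to-one to leftmost derivations. The following sketch (an illustration, not an algorithm from the text) counts the leftmost derivations of a target string under grammar G, written with ASCII operators (`v`, `^`, `~` for ∨, ∧, ¬); the search terminates because no rule of G is contracting.

```python
# Count leftmost derivations of a fixed string under the ambiguous
# grammar G: S -> SvS | S^S | ~S | A,  A -> p | q | r.

RULES = {
    "S": [["S", "v", "S"], ["S", "^", "S"], ["~", "S"], ["A"]],
    "A": [["p"], ["q"], ["r"]],
}

def count_leftmost_derivations(target, form=("S",)):
    # Prune: no rule shrinks the sentential form, so a form longer
    # than the target can never derive it.
    if len(form) > len(target):
        return 0
    for i, sym in enumerate(form):
        if sym in RULES:               # leftmost nonterminal: expand
            return sum(
                count_leftmost_derivations(
                    target, form[:i] + tuple(body) + form[i + 1:])
                for body in RULES[sym])
        if sym != target[i]:           # terminal prefix must match
            return 0
    return 1 if "".join(form) == target else 0

print(count_leftmost_derivations("pvq^r"))  # 2: the grammar is ambiguous
```

Two derivations are found for `pvq^r`, matching the two parse trees shown above; an unambiguous grammar would yield exactly one for every string in its language.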


11.3 Eliminating Ambiguity of a CFG

(1) Binding with parentheses.

Example: We know that the CFG G1 below is ambiguous because there are two parse trees yielding the same string p∨q∧r. The ambiguity occurs because G1 can generate the same string by applying S → S∨S followed by S → S∧S, or vice versa, as shown in figures (a) and (b), respectively.

G1: S → S∨S | S∧S | ¬S | A
    A → p | q | r

(Figures (a) and (b) repeat the two parse trees for p∨q∧r shown in Section 11.2.)

Eliminating Ambiguity


This ambiguity can be eliminated by parenthesizing the right sides of those two rules, as shown in G2 below. The parentheses make the yields of the two parse trees different, as shown in figures (a) and (b).

Ambiguous G1:   S → S∨S | S∧S | ¬S | A      A → p | q | r
Unambiguous G2: S → (S∨S) | (S∧S) | (¬S) | A    A → p | q | r

(a) the tree of figure (b) in Section 11.2, now yielding (p∨(q∧r))
(b) the tree of figure (c) in Section 11.2, now yielding ((p∨q)∧r)


The parenthesizing technique is simple but has a serious drawback: we are altering the language by adding new terminal symbols, the parentheses. Nevertheless, it is a popular technique in programming languages, which use other notations instead of the parentheses, for example, "begin" and "end" in Pascal, and the braces '{' and '}' in C and C++.

(2) Fixing the order of applying rules.

Example 1. The language generated by the CFG G3 below is {b^i c b^j | i, j ≥ 0}. This grammar is ambiguous because, for example, the string bcb can be derived by generating the left-side b first and then the right-side b, or vice versa, as shown below.

G3: S → bS | Sb | c

      S              S
     / \            / \
    b   S          S   b
       / \        / \
      S   b      b   S
      |              |
      c              c


We can simply modify grammar G3 to G4, shown below, such that the left-side b's, if any, are always generated first. Figure (b) shows the only parse tree for the string bcb. Grammar G4 is unambiguous.

Ambiguous G3:   S → bS | Sb | c
Unambiguous G4: S → bS | A
                A → Ab | c

(a) the two G3 parse trees for bcb shown above

(b)    S
      / \
     b   S
         |
         A
        / \
       A   b
       |
       c


Example 2. Using the technique of fixing the order of derivation, the ambiguous CFG G1 that we examined can be converted to the unambiguous grammar G5 shown in figure (a). Notice that this grammar generates the operators left to right, in the order they appear in the string.

(a) Ambiguous G1:   S → S∨S | S∧S | ¬S | A
                    A → p | q | r
    Unambiguous G5: S → A∨S | A∧S | ¬S | A
                    A → p | q | r

(b) the unique G5 parse tree for the string p∨q∧¬q∨p


(3) Eliminating redundant rules.

The CFG G6 below is ambiguous because it can generate ab either through B or through D. We can simply delete one of the two derivations to make the grammar unambiguous (see G7).

Ambiguous G6:   S → B | D    B → ab | b    D → ab | d
Unambiguous G7: S → B | D    B → ab | b    D → d

The CFG G8 below is ambiguous because it can generate ε in two ways. Applying the technique for minimizing the number of ε-production rules, we can convert it to the unambiguous grammar G9.

Ambiguous G8:   S → B | D        B → bBc | ε     D → dDe | ε
Unambiguous G9: S → B | D | ε    B → bBc | bc    D → dDe | de


(4) Implementing operator precedence and associativity.

Operator precedence and associativity are important rules for evaluating mathematical expressions. In programming languages, these rules are defined by the grammar. As we know, multiplication (*) and division (/) are given higher precedence than addition (+) and subtraction (-), and the assignment operator (=) is given the lowest. Operator = is right associative, and all the others are left associative. According to this order of precedence and associativity, the expression in figure (a) will be evaluated as shown in figure (b).

(a) a = b + c * d - e / f

(b) a = ((b + (c * d)) - (e / f))
    evaluation order: (1) c * d, (2) e / f, (3) the addition, (4) the subtraction, (5) the assignment
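The grouping in figure (b) can be checked mechanically, for instance in Python, whose arithmetic operators follow the same precedence and left-associativity conventions; the sample values below are arbitrary.

```python
# With *, / above +, - and left associativity, the right-hand side of
# a = b + c*d - e/f groups as (b + (c*d)) - (e/f). Sample values only.
b, c, d, e, f = 5.0, 2.0, 3.0, 8.0, 4.0

a = b + c * d - e / f
assert a == (b + (c * d)) - (e / f)
print(a)  # 9.0
```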


Example. Assume that the logical operators ¬, ∧, and ∨ are right associative, with precedence given in that order (i.e., ¬ is at the top, followed by ∧, with ∨ at the bottom). The ambiguous CFG G1 (repeated below) can be modified into the unambiguous G10 by implementing the precedence and associativity in the production rules. Notice that every OR (∨) operator in the string must be generated by S → D∨S before the others are generated. Then the AND (∧) operators, if any, are generated by D → C∧D, and finally the NOT (¬) operators. Also notice that from D there is no way to derive ∨, and from C neither ∨ nor ∧ can be derived.

(a) Ambiguous G1:    S → S∨S | S∧S | ¬S | A    A → p | q | r
    Unambiguous G10: S → D∨S | D    D → C∧D | C
                     C → ¬C | A    A → p | q | r

(b) the G10 parse tree for ¬p∨q∧r∨p
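The layered grammar G10 translates directly into a recursive-descent parser, one function per nonterminal: S handles ∨ (lowest precedence), D handles ∧, and C handles ¬; the right recursion in S and D gives right associativity for free. This is an illustrative sketch with ASCII operator symbols (`v`, `^`, `~` standing in for ∨, ∧, ¬) and a tuple representation of the tree that is not from the text.

```python
# Recursive-descent parser mirroring G10:
#   S -> D v S | D,  D -> C ^ D | C,  C -> ~C | A,  A -> p | q | r

def parse(text):
    pos = 0
    def peek():
        return text[pos] if pos < len(text) else None
    def eat(ch):
        nonlocal pos
        assert peek() == ch, f"expected {ch!r} at {pos}"
        pos += 1
        return ch
    def S():                     # S -> D v S | D   (OR, lowest)
        left = D()
        if peek() == "v":
            eat("v")
            return ("v", left, S())   # right recursion: right assoc
        return left
    def D():                     # D -> C ^ D | C   (AND)
        left = C()
        if peek() == "^":
            eat("^")
            return ("^", left, D())
        return left
    def C():                     # C -> ~C | A      (NOT, highest)
        if peek() == "~":
            eat("~")
            return ("~", C())
        assert peek() in ("p", "q", "r")
        return eat(peek())
    tree = S()
    assert pos == len(text), "trailing input"
    return tree

# ~p v q^r v p groups as (~p) v ((q^r) v p): ^ binds tighter than v,
# and v associates to the right.
print(parse("~pvq^rvp"))
```

Each nonterminal of G10 becomes one function, so the parser cannot build a tree the grammar does not allow: the precedence and associativity are enforced by construction.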


This fixed order of rule applications helps the compiler generate object code that evaluates the string (i.e., a logical expression) according to the operator precedence and associativity. For example, because the NOT (¬) operators are generated last, they appear below every node generating ∧ or ∨ in the tree. Hence the compiler, traversing the tree bottom up, left to right, generates the instructions executing the NOT (¬) operators before the instructions executing the other operators. Similarly, we can see how the compiler generates the instructions executing the AND (∧) operators before those executing the OR (∨) operators.

G10: S → D∨S | D    D → C∧D | C    C → ¬C | A    A → p | q | r

(a) ¬p∨q∧r∨p is evaluated as ((¬p) ∨ ((q∧r) ∨ p)): first ¬p, then q∧r, then the right ∨, then the left ∨.

(b) the corresponding G10 parse tree


Now, for the associativity, notice that identical operators are generated left to right, in the order they appear in the string. For example, all OR (∨) operators are derived by applying S → D∨S recursively. Thus the compiler, traversing the parse tree bottom up, left to right, generates object code that evaluates the logic expression following right associativity. If we want an operator of the language to be left associative instead, we can simply reverse the right side of the rule that derives it. For example, we use S → S∨D instead to make the operator ∨ left associative.

G10: S → D∨S | D    D → C∧D | C    C → ¬C | A    A → p | q | r

Rumination (1): Ambiguity

We learned that for a simple CFG it is possible to analyze whether it is ambiguous and, if it is, to convert it to an unambiguous one. However, as we mentioned before, most problems concerning the ambiguity of CFG's in general are very difficult or even unsolvable. Here are some interesting facts.

• There are languages that can only be generated by ambiguous context-free grammars. Such languages are called inherently ambiguous. The following language L is an example.

  L = {a^n b^n c^m | m, n ≥ 1} ∪ {a^n b^m c^m | m, n ≥ 1}

• It is an unsolvable problem to tell whether an arbitrary CFG is ambiguous or not.
• It is an unsolvable problem to convert an arbitrary ambiguous CFG, whose language is not inherently ambiguous, to an unambiguous CFG.

Break Time: Today's Quote

Be slow in choosing a friend, slower in changing.
- Benjamin Franklin -

Exercises


11.1 Show that each of the following CFG's is ambiguous, and convert it to an unambiguous CFG.

(a) S → Sa | A | a
    A → a | bA

(b) S → AB | ε
    A → a | ε
    B → b | ε

11.2 (a) Show that the following CFG is ambiguous.

    G: S → S + S | S * S | T
       T → a | b

(b) In the CFG G above, let + and * be the addition and multiplication operators, respectively, and let a and b be integer variables. Convert G to a CFG G′ that satisfies the following conditions.
    (i) L(G) = L(G′)
    (ii) Operator * has higher precedence than operator +.
    (iii) Both operators are left associative.
You should also present a parse tree showing that your grammar meets the required order of operator precedence and associativity.


The following problem concerns the ambiguity of the if-statement, which appears often in the text. Question (a) is easy, but question (b) is challenging.

11.3 The syntax flow graph below defines a simplified <statement> of the Pascal programming language. Following the convention, symbols in a circle or oval are terminals, and the words in rectangles correspond to nonterminal symbols.

(a) Transform the flow graph to a CFG G and show that G is ambiguous.
(b) Convert grammar G to an unambiguous CFG and explain in detail how you got your answer.

[Syntax flow graph: <statement> ::= IF ( <expression> ) then <statement> [ else <statement> ], with example expressions a-b, c d, a+b]

Hierarchy of the Models


12. Hierarchy of the Models: Proper Containment

In Chapter 7 we saw the Chomsky hierarchy among languages, automata, and other models, and proved the horizontal relations (i.e., the characterization) at the lowest level of the hierarchy. Figures (a) and (b) below, respectively, are copies of the relations among the classes of languages and automata presented in Chapter 7. Recall that if a language is a member of a class of languages at a lower level, then it is also a member of an upper class, and if a language is recognizable by an automaton at a lower level, then it is also recognizable by an automaton at an upper level. The reverse of these relations does not hold: an upper-level language class has a member that does not belong to a lower-level class, and there is an automaton belonging to an upper level whose language cannot be recognized by any automaton belonging to a lower level of the hierarchy.

(a) Containment relations of the language classes:
    Type-3L ⊆ Type-2L ⊆ Type-1L ⊆ Type-0L

(b) Containment relations of automata capability:
    FA ⊆ PDA ⊆ LBA ⊆ TM

Review: The Chomsky Hierarchy

Languages (grammars)                  | Machines
Recursively enumerable sets (type 0)  | Turing machines (TM)
Context-sensitive languages (type 1)  | Linear-bounded automata (LBA)
Context-free languages (type 2)       | Pushdown automata (PDA)
Regular languages (type 3)            | Finite state automata (FA), regular expressions

Other models characterizing type 0: Post systems, Markov algorithms, μ-recursive functions.
(Each level is contained in the one above it; each language class is characterized by the machine class to its right.)

Proving the proper containment relations of the hierarchy requires the in-depth logical arguments that we developed in Chapter 1. In this chapter we will prove the containment relations up to the class of context-sensitive languages. In Chapter 15 we will complete the proof of the hierarchy by proving the characterizations of context-free, context-sensitive, and phrase-structured languages, together with the proof of the last proper containment relation, between type 0 and type 1. The logic involved in these proofs is so elegant that it is worth the challenge.

12.1 Relationship of the language classes: proper containment
12.2 The pumping lemma
12.3 Applications of the pumping lemma
12.4 The pumping lemma for context-free languages
12.5 Application of the pumping lemma for context-free languages
12.6 Ogden's lemma
12.7 A proof of the pumping lemma for context-free languages
Rumination
Exercises


Hierarchy: Proper Containment

12.1 The Containment Relations

Before the proof of the proper containment relations (⊋) between levels of the hierarchy, we will prove the containment relations (⊇) with the following theorem.

Theorem 12.1 Let Type-iL denote the type i (i = 0, 1, 2, 3) language class. Then for all i < 3, the following relation holds: Type-iL ⊇ Type-(i+1)L.

Proof. Let α → β be a rule of a grammar G = (VT, VN, P, S). We know that by definition, type 3 (regular) grammar rules have the form of type 2 (context-free) grammar rules (i.e., |α| = 1) with the additional restriction that either β = xB or β = x, where x ∈ (VT)* and B ∈ VN. It follows that every type 3 grammar is a type 2 grammar, and hence all type 3 languages belong to the class of type 2 languages, i.e., Type-2L ⊇ Type-3L.


Containment Relations


Proof (continued). By definition, type 1 grammar (CSG) rules are type 0 grammar rules with the additional restriction that the left side of each rule cannot be longer than the right side. It follows that every type 1 (context-sensitive) grammar is also a type 0 (phrase-structured) grammar. That is, Type-0L ⊇ Type-1L.

Now we prove that the type 1 language class contains the type 2 (context-free) language class. Recall that type 1 grammar rules are non-contracting. In other words, type 1 grammar rules are type 0 grammar rules with the restriction that for any rule α → β, the left side cannot be longer than the right side, i.e., |α| ≤ |β|. Type 2 grammar rules are type 0 grammar rules having one nonterminal symbol as their left side, i.e., |α| = 1.

Both type 1 and type 2 grammars are defined by adding a restriction on type 0 grammar rules. So for the proof of the containment relation between type 1 and type 2 language classes, we cannot apply the simple logic that we used for the other relations between other language classes.


Clearly, since every CFG rule has exactly one nonterminal symbol on its left side, as far as the left sides of grammar rules are concerned, CFG rules qualify as CSG rules. The only problem is the ε-production rules allowed in CFG's, which violate the requirement that the right side of a CSG rule cannot be shorter than the left side, except for S → ε. All the other CFG rules, which do not produce ε, are CSG rules. Fortunately, we have a way to overcome this problem. Recall Theorem 10.1, repeated below.

Theorem 10.1 Given a CFG G, we can construct a CFG G′ that satisfies the following conditions.
• L(G) = L(G′).

• If ε ∉ L(G), G′ has no ε-production rule. Otherwise, S → ε is the only ε-production rule in G′.

Grammar G′ is a CSG because every rule in G′ is non-contracting except for the rule S → ε, if any. This implies that every CFL can be generated by a CSG. It follows that Type-1L ⊇ Type-2L.
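The construction behind Theorem 10.1 can be sketched with the standard nullable-set algorithm (a common textbook construction, assumed here rather than taken from this book's proof): compute which nonterminals derive ε, replace each rule body by all variants with nullable symbols deleted, and keep S → ε only when ε is in the language.

```python
# Standard epsilon-production elimination: every resulting rule is
# non-contracting except possibly S -> eps, as Theorem 10.1 requires.
from itertools import product

def eliminate_epsilon(rules, start="S"):
    # rules: dict nonterminal -> list of bodies (tuples of symbols);
    # the empty tuple () denotes an epsilon body.
    nullable, changed = set(), True
    while changed:                       # fixed point: A is nullable
        changed = False                  # if some body is all-nullable
        for head, bodies in rules.items():
            if head not in nullable and any(
                    all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    new_rules = {head: set() for head in rules}
    for head, bodies in rules.items():
        for body in bodies:
            # each nullable symbol may independently stay or be dropped
            choices = [((sym,), ()) if sym in nullable else ((sym,),)
                       for sym in body]
            for pick in product(*choices):
                variant = tuple(s for part in pick for s in part)
                if variant:              # drop empty bodies ...
                    new_rules[head].add(variant)
    if start in nullable:                # ... except S -> eps if needed
        new_rules[start].add(())
    return nullable, new_rules

# A G8-style grammar: S -> B | D,  B -> bBc | eps,  D -> dDe | eps
rules = {"S": [("B",), ("D",)],
         "B": [("b", "B", "c"), ()],
         "D": [("d", "D", "e"), ()]}
nullable, new_rules = eliminate_epsilon(rules)
print(sorted(nullable))        # ['B', 'D', 'S']
print(sorted(new_rules["B"]))  # [('b', 'B', 'c'), ('b', 'c')]
```

Every rule the sketch produces has a non-empty right side at least as long as its left side, so apart from S → ε the result is a set of context-sensitive rules, which is exactly the point of the containment argument above.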


In summary, we have shown the following relations:

Type-0L ⊇ Type-1L ⊇ Type-2L ⊇ Type-3L

Now we will show that all these containment relations are proper, i.e.,

Type-0L ⊋ Type-1L ⊋ Type-2L ⊋ Type-3L
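The containment chain mirrors a purely syntactic check on the rules. The following sketch (an illustration, not from the text) classifies a grammar by the most restrictive type all of its rules satisfy, under simplifying conventions: uppercase letters are nonterminals, type 3 means right-linear (A → xB or A → x), and an empty right side is tolerated as the usual S → ε exception.

```python
# Classify a grammar by the most restrictive Chomsky type its rules
# satisfy. rules: list of (left, right) pairs of symbol strings.

def grammar_type(rules):
    def is_nt(c): return c.isupper()
    def type2(l, r): return len(l) == 1 and is_nt(l)
    def type3(l, r):        # at most one nonterminal, at the right end
        return type2(l, r) and all(not is_nt(c) for c in r[:-1])
    def type1(l, r): return 1 <= len(l) <= len(r)   # non-contracting
    if all(type3(l, r) for l, r in rules): return 3
    if all(type2(l, r) for l, r in rules): return 2
    if all(type1(l, r) for l, r in rules): return 1
    return 0

print(grammar_type([("S", "aS"), ("S", "b")]))      # 3: right-linear
print(grammar_type([("S", "aSb"), ("S", "ab")]))    # 2: a^i b^i grammar
print(grammar_type([("S", "aSBc"), ("S", "abc"),
                    ("cB", "Bc"), ("bB", "bb")]))   # 1: a^n b^n c^n
```

Because each check only adds a restriction to the one below it, a grammar passing the type 3 test automatically passes the type 2 test, which is the syntactic content of the containment chain.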

For the proof it is enough to show that for each i = 0, 1, 2, there is a language in class Type-iL that does not belong to class Type-(i+1)L. In this chapter we will first show that the familiar type 2 language {a^i b^i | i ≥ 0} does not belong to Type-3L, the class of regular languages. Then we will show that the language {a^i b^i c^i | i ≥ 0}, which was presented as a typical type 1 language in Chapter 2, does not belong to Type-2L. These two results prove the following relations:

Type-1L ⊋ Type-2L ⊋ Type-3L

We defer the proof of the last part, Type-0L ⊋ Type-1L, until Chapter 15.


12.2 The Pumping Lemma

To prove the relation Type-2L ⊋ Type-3L, it is enough to find a context-free language that is not regular. Our approach is first to find a common property of all regular languages, and then to exhibit a context-free language that does not satisfy this property. Since every finite language is regular (we leave the proof of this claim to the reader), the property we are looking for must be found among the infinite regular languages. The following lemma presents such a property.

Pumping Lemma: For every infinite regular language L, there is a constant integer n that satisfies the following conditions. Let Σ be the alphabet of the language. Then for every string z ∈ L whose length is greater than or equal to n, there are u, v, w ∈ Σ* such that
• z = uvw, |uv| ≤ n, |v| ≥ 1, and
• for all i ≥ 0, uv^i w ∈ L.

Pumping Lemma


Proof: For the proof we need two terms: a transition path is a directed path in a state transition graph, and a path label is the string constructed by picking up the input symbols along a transition path. For example, in the state transition graph shown below, aabba is the path label of the transition path 1 → 2 → 4 → 3 → 2 → 4. Given the string aabba as input, the DFA takes a sequence of transitions along this path. To help the reader follow the argument, we will use this state transition graph throughout the proof, together with other figures when needed.

[State transition graph with states 1 (start), 2, 3, 4; the transitions used by path label aabba are 1 -a→ 2, 2 -a→ 4, 4 -b→ 3, and 3 -b→ 2.]


Given an infinite regular language L, let M be a DFA that recognizes L and let n be the number of states of this automaton. Since L is infinite and the number of states n is finite, there must be a string z ∈ L such that |z| ≥ n; otherwise L could not be infinite. For such an input string z, examine the state transition path with path label z. Notice that because M is deterministic, there is only one transition path with path label z.

Let m be the length of z (i.e., |z| = m ≥ n) and let z = a1a2…am.


Since z ∈ L, the transition path with label z must end in an accepting state. Because |z| ≥ n, the path involves a cycle, and hence there must be a state, say q, visited a second time upon reading some j-th symbol aj (j ≤ n) of z, as illustrated in figure (b) below. (This can be proved by the pigeonhole principle; refer to the application example of the pigeonhole principle in Chapter 1.)

[(a) Transition path with path label aabba. (b) Transition path with path label z = a1a2…am, m ≥ n, in which state q is entered on reading ai and entered again on reading aj.]


As shown in figure (b), let v be the path label of the cyclic transition path, and let u and w be, respectively, the prefix and the suffix of z cut off by v, so that z = a1…ai ai+1…aj aj+1…am with u = a1…ai, v = ai+1…aj, and w = aj+1…am. (If q is the start state, then u = ε, and if q is the last state on the path, then w = ε.) Since the cycle involves at least one edge, |v| ≥ 1, and since j ≤ n, we have |uv| ≤ n. In the example of figure (a), u = a, v = abb, and w = a.


Since uvw is in L, M accepts uvw. It follows that all of the following strings, corresponding to the path labels of the transition paths that repeat the cycle zero or more times, are also accepted by M:

uw = uv^0 w, uvvw = uv^2 w, uvvvw = uv^3 w, . . .

That is, for all i ≥ 0, uv^i w ∈ L. In the example of figure (a), with u = a, v = abb, w = a, we get a(abb)^i a ∈ L for all i ≥ 0.


Summarizing the observations, we get the pumping lemma, repeated here.

Pumping Lemma: For every infinite regular language L, there is a constant integer n that satisfies the following conditions. Let Σ be the alphabet of the language. Then for every string z ∈ L whose length is greater than or equal to n, there are u, v, w ∈ Σ* such that
• z = uvw, |uv| ≤ n, |v| ≥ 1, and
• for all i ≥ 0, uv^i w ∈ L.
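The cycle-pumping step of the proof can be sketched concretely on the running example. Only the transitions used by the path label aabba are reconstructed below, and state 4 is assumed to be the accepting state (an assumption consistent with aabba being accepted); the cycle 2 → 4 → 3 → 2 carries the label abb, so u = a, v = abb, w = a.

```python
# Pumping the example DFA: the path for aabba is
# 1 -a-> 2 -a-> 4 -b-> 3 -b-> 2 -a-> 4, so u = a, v = abb, w = a.

DELTA = {(1, "a"): 2, (2, "a"): 4, (4, "b"): 3, (3, "b"): 2}
START, ACCEPT = 1, {4}

def accepts(s):
    state = START
    for ch in s:
        if (state, ch) not in DELTA:   # undefined transition: reject
            return False
        state = DELTA[(state, ch)]
    return state in ACCEPT

u, v, w = "a", "abb", "a"
for i in range(5):                     # uv^iw is accepted for all i
    assert accepts(u + v * i + w)
print([u + v * i + w for i in range(3)])  # ['aa', 'aabba', 'aabbabba']
```

Repeating the cycle any number of times (including zero) still ends the run in the same accepting state, which is exactly the content of the lemma's second condition.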

Break Time: Job Applicant Bloopers

Interviewer: "Do you think you can handle a variety of tasks?"
Applicant: "I should say so. I've had nine totally different jobs in the past five months."
- Jim -


12.3 Application of the Pumping Lemma

Now, as an application of the pumping lemma, we will show that there is a CFL which is not regular. Specifically, we will prove that the language {a^i b^i | i ≥ 1} does not satisfy the pumping lemma, establishing Type-2L ⊋ Type-3L, the proper containment between the type 2 and type 3 classes of the Chomsky hierarchy. For convenience of application, the lemma is rewritten below to expose its logical structure. (Notice the quantified parts.)

(1) For every infinite regular language L,          // ∀ infinite regular language L,
(2) there exists a constant n such that             // ∃ a constant n such that . . .
(3) for every string z ∈ L such that |z| ≥ n,       // ∀ string z ∈ L, |z| ≥ n
(4) there exist strings u, v, w ∈ Σ* such that      // ∃ u, v, w that satisfy . . .
    (i) z = uvw, (ii) |uv| ≤ n, (iii) |v| ≥ 1,
    (iv) for all i ≥ 0, uv^i w ∈ L.                 // ∀ i ≥ 0, uv^i w ∈ L.


Application of the Pumping Lemma


Before using the pumping lemma for the proof, recall the techniques (from Section 1.3) for dealing with existential quantification (∃) and universal quantification (∀) in a statement to be proved. In this example, we are going to prove that for the given context-free language the pumping lemma does not hold. Hence, for the existentially quantified parts of the lemma, i.e., the constant n and the strings u, v, and w, we must consider all possible cases. For the universally quantified parts, i.e., the regular language L, the string z ∈ L, and the integer i ≥ 0, it is enough to consider a single case, which we may choose for our convenience. The argument will run as a game between the proponent of the lemma and us, who are going to show that the lemma does not hold for the given language.



Now we show that our familiar CFL L = {a^i b^i | i ≥ 1} does not satisfy the pumping lemma, and hence is not regular. Suppose that L is a regular language. (From this sentence the reader will promptly recognize that we are going to use the proof-by-contradiction technique.) We will carefully analyze the pumping lemma to find a part that contradicts our supposition. (The numbers in the following argument match the ones that we put in the lemma.)

The pumping lemma says: (1) for every infinite regular language L, (2) there exists a constant n such that (3) . . . (4) . . .

Our argument: L is an infinite regular language. Hence there should be a constant n such that conditions (3) and (4) are satisfied. Let n be just that constant. Using n as a variable, we consider all possible constant values of n in our argument.


The pumping lemma says: (3) for all strings z ∈ L such that |z| ≥ n, (4) there exist strings u, v, and w that satisfy (i) z = uvw, (ii) |uv| ≤ n, (iii) |v| ≥ 1, and (iv) for all i ≥ 0, uv^i w ∈ L.

Our argument: We choose the string z = a^n b^n in L and examine whether, for all strings u, v, w satisfying conditions (i)-(iii), condition (iv) is satisfied. Let z = a^n b^n = uvw according to (i). Then by (ii) and (iii), v contains only a's (and at least one), because the prefix uv, of length at most n, falls entirely within the block of n a's. It follows that the string uv^2 w contains more a's than b's. That is, for i = 2 we claim that uv^2 w ∉ L. The language L therefore does not satisfy the pumping lemma. This contradicts the assumption that L is regular. Therefore, L is not regular.


We know that L = {a^i b^i | i ≥ 1} is a CFL. Since we have just shown that L is not regular, the containment relation Type-2L ⊇ Type-3L established in the previous section can be refined to the proper containment Type-2L ⊋ Type-3L.
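The case analysis above can also be checked exhaustively for any small candidate constant n: every admissible split of z = a^n b^n places v inside the a-block, and pumping with i = 2 always leaves the language. This brute-force check (an illustration of the argument, not a proof) confirms there is no valid split.

```python
# For z = a^n b^n, every split z = uvw with |uv| <= n and |v| >= 1
# puts v inside the a-block, so uv^2w has more a's than b's.

def in_L(s):
    half = len(s) // 2
    return (len(s) >= 2 and len(s) % 2 == 0
            and s == "a" * half + "b" * half)

for n in range(1, 8):                        # candidate constants n
    z = "a" * n + "b" * n
    for uv_len in range(1, n + 1):           # |uv| <= n
        for v_len in range(1, uv_len + 1):   # |v| >= 1
            u = z[:uv_len - v_len]
            v = z[uv_len - v_len:uv_len]
            w = z[uv_len:]
            assert not in_L(u + v * 2 + w)   # pumping with i = 2 fails
print("no split of a^n b^n survives pumping for n = 1..7")
```

Of course the real proof must handle every n at once, which is what the symbolic argument above does; the enumeration only illustrates it.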

Break Time: Secrets of Success

The other day I had the opportunity to drop by my department head's office. He's a friendly guy, and on the rare occasions that I have to pay him a visit, we have enjoyable conversations. While I was in his office yesterday I asked him, "Sir, what is the secret of your success?" He said, "Two words." "And, Sir, what are they?" "Right decisions." "But how do you make right decisions?" "One word," he responded. "And, Sir, what is that?" "Experience." "And how do you get experience?" "Two words." "And, Sir, what are they?" "Wrong decisions."
- Rubin -


12.4 The pumping lemma for context-free languages Now, we prove the proper containment relation between the next upper two levels of the language classes, i.e., Type-1L  Type-2L. We use the same approach that we took for proving the containment relation Type-2L  Type-3L. With no proof we first present a common property of infinite context-free languages, called the pumping lemma for context-free languages (see below). Then, as an application of the lemma, we will show that the context-sensitive language {aibici | i ≥ 1 }, which we introduced in Section 2.3, does not satisfy the lemma. We shall present the proof of the lemma after showing the application. The pumping lemma for context-free languages. For every infinite CFL L, there is a constant integer p that satisfies the following conditions: Let  be the alphabet of the language. Then for every string z  L whose length is greater than or equal to p, there are u, v, w  * such that • z = uvwxy, |vwx|  p, |vx|  1 • for all i  0, uviwxiy  L


Pumping Lemma for CFL’s


The pumping lemma for CFL's looks very similar to the one for regular languages; the comparison below shows the difference. Both lemmas claim the existence of a constant (n for regular languages and p for CFL's). The lemma for regular languages divides string z into three segments (z = uvw), while the lemma for CFL's divides it into five segments (z = uvwxy). The lemma for regular languages provides one site, named v, to pump, while the lemma for CFL's provides two sites, v and x, to pump simultaneously.

The lemma for regular languages:
• z = uvw, |uv| ≤ n, |v| ≥ 1
• for all i ≥ 0, u v^i w ∈ L

The lemma for CFL's:
• z = uvwxy, |vwx| ≤ p, |vx| ≥ 1
• for all i ≥ 0, u v^i w x^i y ∈ L


Recall that when we applied the pumping lemma for regular languages, we took advantage of the fact that the pump site v is restricted to the prefix of length n of string z, by the condition |uv| ≤ n. Unfortunately, in the pumping lemma for CFL's the segment vwx, which contains the two pump sites, can be flanked by u to the left and by y to the right. So the pump sites can be located anywhere in z, within a window of width |vwx| ≤ p. When we apply the lemma to prove that a language is not context-free, we must therefore consider all possible locations of the window vwx, from the left end of z up to the right end.


12.5 Application of the pumping lemma for CFL's

Again, for clarity of the argument, we rewrite the lemma as follows.
(1) For every infinite CFL L, // ∀ infinite CFL L
(2) there exists a constant p such that // ∃ a constant p such that . . .
(3) for every string z ∈ L such that |z| ≥ p, // ∀ string z ∈ L, |z| ≥ p
(4) there exist strings u, v, w, x and y that satisfy the following conditions. // ∃ u, v, w, x, y that . . .
  (i) z = uvwxy, (ii) |vwx| ≤ p, (iii) |vx| ≥ 1, (iv) for all i ≥ 0, u v^i w x^i y ∈ L. // ∀ i ≥ 0, u v^i w x^i y ∈ L


Application of CFL Pumping Lemma


Now we show that the CSL L = {a^i b^i c^i | i ≥ 1}, which we introduced in Section 2.3, does not satisfy the pumping lemma for CFL's. Our argument is similar to the one for proving that the language {a^i b^i | i ≥ 1} is not regular. Suppose L is a CFL. Then it should satisfy the lemma. With L we walk along the lemma and come out with a contradiction.

The lemma's claim, parts (1)-(2): for every infinite CFL L, there exists a constant p.

Our claim: L is an infinite CFL. Hence, there should be a constant p such that conditions (3) and (4) are satisfied. Let p be that constant. Using p as a variable, we consider all possible constant values of p in our argument.



The lemma's claim, parts (3)-(4): for every string z ∈ L such that |z| ≥ p, there exist strings u, v, w, x, y that satisfy conditions (i)-(iv).

Our claim: We choose the string z = a^p b^p c^p in L and examine whether, for all strings u, v, w, x, y satisfying conditions (i)-(iii), condition (iv) is satisfied. By (i), let z = a^p b^p c^p = uvwxy, and examine what kinds of symbols can appear in the two pump sites v and x. Recall that vwx, whose length is at most p, can be located anywhere in z.


By condition (ii), |vwx| ≤ p, the pump sites v and x together can contain no more than two different symbols (a's and b's, or b's and c's). By condition (iii), |vx| ≥ 1, the strings v and x together contain at least one symbol. It follows that in u v^2 w x^2 y the numbers of a's, b's, and c's cannot all be the same, which implies that u v^2 w x^2 y ∉ L. The language does not satisfy condition (iv) of the lemma when i = 2. Hence, L is not context-free.

(i) z = uvwxy, (ii) |vwx| ≤ p, (iii) |vx| ≥ 1, (iv) for all i ≥ 0, u v^i w x^i y ∈ L.
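For small p, this window-by-window case analysis can also be checked exhaustively. The sketch below (our own illustration, not part of the text) tries every window vwx of width at most p and every split of it into v, w, x, and confirms that pumping z = a^p b^p c^p with i = 2 always leaves L.

```python
def in_L(s):
    """Membership in L = {a^i b^i c^i | i >= 1}."""
    k = len(s) // 3
    return k >= 1 and len(s) % 3 == 0 and s == "a" * k + "b" * k + "c" * k

def cfl_pumping_fails(p):
    z = "a" * p + "b" * p + "c" * p
    n = len(z)
    for i in range(n + 1):                        # window vwx = z[i:l]
        for l in range(i, min(i + p, n) + 1):     # condition (ii): |vwx| <= p
            for j in range(i, l + 1):             # v = z[i:j]
                for k in range(j, l + 1):         # w = z[j:k], x = z[k:l]
                    if (j - i) + (l - k) >= 1:    # condition (iii): |vx| >= 1
                        pumped = z[:i] + z[i:j] * 2 + z[j:k] + z[k:l] * 2 + z[l:]
                        if in_L(pumped):          # condition (iv) with i = 2
                            return False
    return True

assert all(cfl_pumping_fails(p) for p in range(1, 6))
```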


For the language {a^i b^i c^i | i ≥ 1}, it was fairly straightforward to apply the lemma and show that the language is not context-free. However, this lemma is so weak that there are many non-context-free languages for which it is either hard or impossible to apply. For example, consider the following language L, which is not context-free. Whichever string z (of length greater than or equal to p) we pick from this language, it is impossible to pump and arrive at a contradiction.

L = {x a^i x | i > 0, x ∈ {0,1}+} ∪ {x | x ∈ {0,1}+}

For example, suppose that we picked the string z = 1^p 0^p aaa 1^p 0^p to see if z = uvwxy can be pumped to produce a contradiction. For the case of v = a, w = a, and x = a, we have u v^i w x^i y ∈ L for all i ≥ 0. Even when there is only one a (i.e., z = 1^p 0^p a 1^p 0^p), if we set v = a and w = x = ε, we get u v^i w x^i y ∈ L for all i ≥ 0. Notice that picking a binary string for z from the second part of the language doesn't work either. Ogden introduced a lemma to remedy this weakness.


12.6 Ogden’s Lemma

Hierarchy: Proper Containment

Ogden's lemma, shown below, is very similar to the original lemma except for the additional parts restricting the locations of the pump sites.
(1) For every infinite CFL L,
(2) there exists a constant p such that
(3) for every string z ∈ L such that |z| ≥ p, if we mark p symbols in z,
(4) there exist strings u, v, w, x and y that satisfy the following conditions.
  (i) z = uvwxy, (ii) |vwx| ≤ p, (iii) |vx| ≥ 1, and the string vx contains at least one marked symbol, (iv) for all i ≥ 0, u v^i w x^i y ∈ L.
We omit the proof of this lemma. Interested readers are referred to the book "Introduction to Automata Theory, Languages, and Computation" by Hopcroft, Motwani, and Ullman (2001, Addison-Wesley).


Now we can apply Ogden's lemma to show that the following language, which we failed to handle with the original lemma, is not a CFL.

L = {x a^i x | i > 0, x ∈ {0,1}+} ∪ {x | x ∈ {0,1}+}

Suppose L is a CFL. Then it should satisfy Ogden's lemma. Let p be the constant. Pick the string z = 1^p 0^p a 1^p 0^p, let z = uvwxy, and mark the prefix 1^p. According to the lemma, |vwx| ≤ p, and the string vx must contain at least one marked symbol. It follows that vx can never include the symbol a in the middle. We see that the string z' = u v^2 w x^2 y ∉ L, because in z' the binary string on the left side of a must differ from the string on the right side of a. We have a contradiction. It follows that L is not context-free.

With these two applications of the pumping lemmas, for regular languages and for context-free languages, we have proven the proper containment relations between the lower two levels of the Chomsky hierarchy (repeated below). We will complete the proofs of the remaining relations in Chapter 15.
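The marked-pumping argument can likewise be checked exhaustively for small p. In the sketch below (our own illustration), the marked positions are those of the prefix 1^p; every admissible split whose vx touches a marked position is pumped with i = 2 and must produce a string outside L.

```python
import re

def in_L(s):
    """Membership in L = {x a^i x | i > 0, x in {0,1}+} U {x | x in {0,1}+}."""
    if s and set(s) <= set("01"):
        return True
    m = re.fullmatch(r"([01]+)a+([01]+)", s)
    return bool(m) and m.group(1) == m.group(2)

def ogden_fails(p):
    z = "1" * p + "0" * p + "a" + "1" * p + "0" * p
    marked = set(range(p))                        # mark the prefix 1^p
    n = len(z)
    for i in range(n + 1):
        for l in range(i, min(i + p, n) + 1):     # |vwx| <= p
            for j in range(i, l + 1):
                for k in range(j, l + 1):
                    vx_pos = set(range(i, j)) | set(range(k, l))
                    if vx_pos & marked:           # vx has a marked symbol
                        pumped = z[:i] + z[i:j] * 2 + z[j:k] + z[k:l] * 2 + z[l:]
                        if in_L(pumped):
                            return False
    return True

assert all(ogden_fails(p) for p in (1, 2, 3))
```

Because vx must touch the marked prefix and |vwx| ≤ p, the window lies entirely in the left binary block, so pumping makes the two binary blocks around the a unequal.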


Review

Chomsky Hierarchy

Languages (grammars) and the machines that characterize them:
• Recursively enumerable sets (type 0): Turing machines (TM)
• Context-sensitive languages (type 1): linear-bounded automata (LBA)
• Context-free languages (type 2): pushdown automata (PDA)
• Regular languages (type 3): finite state automata (FA), regular expressions

Other models: Post systems, Markov algorithms, μ-recursive functions

Each language class is contained in the class above it, and each class is characterized by its machine model. (In the original figure, colored arrows indicate the relations proven so far.)


12.7 Proof of the pumping lemma for CFL's

In this section we present a proof of the lemma, using a CFG to support the argument with some illustrations. (The detailed formal proof is omitted.) Consider the CFG G below, which is in Chomsky normal form (CNF) and has no ε-production rule. (Recall that every CFL can be generated by a CFG in CNF with no ε-production rules, except for S → ε in the case when the language contains ε.)

G:  S → AB    A → DE    B → FD    E → FC
    C → DB | c    F → HC    D → d    H → h

Let's examine the parse tree of the grammar yielding the following string.

d(hd)^3 hc(d)^3 dhcdhcd


[Parse tree of G yielding d(hd)^3 hc(d)^3 dhcdhcd; figure omitted.]

Notice that on the longest trunk of the parse tree there appears a repeating subsequence of nonterminals, S-A-E-F-C-B-F-C-B-F-C-B-F-C. Also notice that in the yield of this tree, d(hd)^3 hc(d)^3 dhcdhcd, the substrings produced by the repeating nonterminals also repeat. Since every nonterminal has two children, it can produce "fruits" (i.e., a terminal string) either on its left-side branches or on its right-side branches. In the example, F, C and B produce h, d and d, respectively.


On the longest trunk, F-C-B repeats. This repeating segment can be inserted (or deleted), proportionally increasing (or decreasing) the "fruits" on both sides of the trunk. Notice that each repeating segment F-C-B bears hd to the left and d to the right.

For the CFG G, it follows that for all i ≥ 0, the string d(hd)^i hc(d)^i dhcdhcd is in the language L(G).

In the pumping-lemma segmentation of the example string,

z = d (hd)^3 hc (d)^3 dhcdhcd, with u = d, v = (hd)^3, w = hc, x = (d)^3, y = dhcdhcd.


We can make the trunk of the parse tree grow by inserting more repetitions of the subsequence F-C-B, or cut it short by deleting them. Since the number of repetitions can be arbitrary, there can be an infinite number of parse trees. Every repeating trunk segment has a repeating pattern of branches to its left or right, and every branch must bear "fruit" (i.e., a terminal symbol), since there is no ε-production rule except S → ε. Hence, the parse tree yields a repeating substring either to the left (hd in the example) or to the right (d in the example) of the trunk. For the CFG G, we can claim that for all i ≥ 0, the string d(hd)^i hc(d)^i dhcdhcd is a yield of a parse tree of G, i.e., d(hd)^i hc(d)^i dhcdhcd ∈ L(G).

We can generalize the above observation as follows. Let G be a CFG in Chomsky normal form whose language is infinite. Consider a string z in L(G) whose length is long enough (i.e., at least some constant determined by the size of the nonterminal alphabet) that the parse tree yielding z has a trunk on which some nonterminal symbol A repeats.



Let αAβA be a pattern appearing on a trunk of a parse tree, where A is a repeating nonterminal symbol and α and β are strings of nonterminals. In general this parse tree has the following structure (figure omitted): let v and x be, respectively, the yields of the left branches and the right branches hanging off the trunk between the two occurrences of A; let w be the yield of the branches below the "tip" A; and let u and y be, respectively, the yields of the left subtrees and the right subtrees, if any, of the trunk, so that the whole tree yields uvwxy.


We see that the string uvwxy ∈ L(G) and, for all i ≥ 0, u v^i w x^i y ∈ L(G), because if αAβA is a legal pattern on a trunk, then for all i ≥ 0, (αAβ)^i A is also a legal pattern. Let V_N be the nonterminal alphabet of the grammar. The length of a string z for which every parse tree has a trunk with a repeating pattern is bounded by some constant p determined by |V_N|, i.e., |z| ≥ p. It follows that |vwx| ≤ p and |vx| ≥ 1.


Summarizing the above observations, we have the following pumping lemma for CFL's.

The pumping lemma for context-free languages. For every infinite CFL L, there is a constant integer p that satisfies the following conditions. Let Σ be the alphabet of the language. Then for every string z ∈ L whose length is greater than or equal to p, there are u, v, w, x, y ∈ Σ* such that
• z = uvwxy, |vwx| ≤ p, |vx| ≥ 1
• for all i ≥ 0, u v^i w x^i y ∈ L


Rumination (1): Application of the Pumping Lemma

To prove that {a^i b^i | i ≥ 1} is not regular, we showed that for this language the pumping lemma does not hold. Let's review how we dealt with the parts of the lemma quantified existentially (i.e., prefixed with "there is," "there exist" or "there are") or universally (i.e., prefixed with "for all" or "for every"). To show that the lemma does not hold for the language, we must consider all possible cases for each existentially quantified part. On the contrary, for each universally quantified part it is enough to pick just one case (of our choosing) that leads to a contradiction.

The lemma uses existential quantification in the following two parts:
(2) There is a constant n . . .
(4) There are strings u, v and w . . .
So, in the proof, by treating n as a variable, we took into consideration all possible constant values of n, and examined all possible strings u, v, and w that satisfy conditions (i)-(iii) of the lemma.

In the following three parts, the lemma uses universal quantification:
(1) For every infinite regular language,
(3) For every string z ∈ L,
(4)-(iv) For all i ≥ 0,
For part (1) we picked the language {a^i b^i | i ≥ 1} that we supposed to be regular, for part (3) we picked z = a^n b^n, and for part (4)-(iv) we picked i = 2. (If we wanted, we could have chosen something other than z = a^n b^n and i = 2.)


• Here is an application example of the pumping lemma with a logical bug that we often come across while grading homework assignments.

Question: Is the language L = {abc a^i abc | i ≥ 0} regular? Justify your answer.

Answer: L is not regular. Suppose L is regular. Then L should satisfy the pumping lemma. Let n be the constant of the lemma, and choose the string z = abc a^n abc, which is in L and satisfies |z| ≥ n. Let z = uvw, and consider the case where u = a, v = b, and w = c a^n abc. Then we get u v^2 w = abbc a^n abc, which is not in L. It follows that L does not satisfy the pumping lemma. L is not a regular language!!

In fact, we can show that L is regular by exhibiting an FA that recognizes it (an FA reading a, b, c, then a loop of a's, then a, b, c). What is wrong in the above argument? It is in the handling of the existential quantification over the strings u, v, and w in part (4) of the lemma, repeated below. The argument took u, v, and w only for the specific values u = a, v = b, and w = c a^n abc, without considering the other cases satisfying conditions (i)-(iii). We should show that u v^i w ∉ L for all possible locations of the "pump site" v in the prefix of length n of string z and for all possible lengths of v (i.e., 1 ≤ |v| ≤ n). Actually, for the case where v contains some a's of the substring a^n, we have u v^i w ∈ L for all i ≥ 0. The proof fails.

(4) There exist strings u, v and w that satisfy the following conditions.
  (i) z = uvw, (ii) |uv| ≤ n, (iii) |v| ≥ 1, (iv) for all i ≥ 0, u v^i w ∈ L.


• The pumping lemma presents a common property of all infinite regular languages. Recall the last claim of the lemma, "for all i ≥ 0, u v^i w ∈ L", which is based on the existence of a cyclic path in the state transition graph of every DFA that recognizes an infinite regular language. Notice that every finite language is regular: if L = {w_1, w_2, . . . , w_k} is a finite language, we can simply construct a regular grammar with the rules S → w_1 | w_2 | . . . | w_k. Therefore, every non-regular language is infinite.

• We know that every infinite regular language satisfies the pumping lemma. Is the converse of this statement true? In other words, can we say that if an infinite language satisfies the pumping lemma, then the language is regular? We did not prove that every infinite language satisfying the lemma is regular; we proved the lemma only for infinite regular languages. In fact, Appendix E shows a non-regular CFL that satisfies the lemma. Thus, it is wrong to prove the regularity of a language by showing that it satisfies the pumping lemma.

Application of the pumping lemma for CFL

• When we apply the pumping lemma for CFL's, we should keep in mind the difference between the lemmas for regular languages and for CFL's. Recall that for regular languages the region v, where we can pump, is confined to the prefix of length n of z, while in the lemma for CFL's the two regions v and x, where we can pump, can be anywhere in z within a window of width |vwx| ≤ p. Thus, for CFL's more cases must be analyzed.


Rumination (2): Application of Ogden's Lemma

• Often we hear that programming languages are context-free. However, they are not cleanly context-free, and carry some context-dependency in their syntax. For example, in C it is not allowed to declare a variable more than once in a block, and in FORTRAN there are statements requiring a matching label (e.g., 10 in the following DO-loop).

      ....
      DO 10 I = 1, 100, 1
      SUM = SUM + I
   10 PRO = PRO * I
      ....

The formal language L below has a context-dependency similar to that of programming languages. The string x corresponds to the label (i.e., a string of decimal digits) and a^i corresponds to the loop body. Using Ogden's lemma we showed that L is not a context-free language.

L = {x a^i x | i > 0, x ∈ {0, 1}+} ∪ {x | x ∈ {0, 1}+}


Exercises


12.1 Using the pumping lemma for regular languages, we are going to prove that L = {x x^R | x ∈ {a, b, c}*} is not regular, as follows. Complete the proof by answering each of the following questions.

Proof. Suppose that L is regular, and let n be the constant of the lemma. All of the following strings are in L.
(a) abccba    (b) a^100 b^100 b^100 a^100    (c) a^(n/2) b^(n/2) b^(n/2) a^(n/2)    (d) a^n b^n b^n a^n

Question 1: Which of the strings above are you going to pick as the string z for your proof? Why are the other strings not your choice? If you don't like any of them, choose another string in L and explain why.
Question 2: Consider strings u, v, w ∈ {a, b, c}* that satisfy the conditions (i) z = uvw, (ii) |uv| ≤ n, and (iii) |v| ≥ 1. What will be in v? Briefly explain all possible contents of string v.
Question 3: To show that L is not regular, how are you going to use the last part of the lemma, "for all i ≥ 0, the string u v^i w ∈ L"? Write your argument.

12.2 Which of the following languages are regular? Justify your answer.
L1 = {a^i b^j c^k | i, j, k > 0}    L2 = {aaa a^i b^j c^i | i, j > 0}

12.3 For the symbols a, b and c, and a string x, let #a(x), #b(x) and #c(x) be, respectively, the number of a's, b's and c's in the string x. Which of the following languages are context-free? Justify your answer.
L3 = {x | x ∈ {a,b,c}+, and #a(x) > #b(x)}
L4 = {x | x ∈ {a,b,c}+, and #a(x) = #b(x) = #c(x)}
L5 = {x | x ∈ {a,b,c}+, and #a(x) < #b(x) < #c(x)}

12.4 We learned that the language {w#w^R | w ∈ {a, b}*} is context-free. The following languages are not context-free. Prove why.
L6 = {w#w | w ∈ {a, b}*}    L7 = {xyx | x, y ∈ {a, b}*}

Practical Application: Parsing


13. Parsing

Parsing is one of the major functions of the compiler of a programming language. Given a source code w, the parser examines w to see whether it can be derived by the grammar of the programming language and, if it can, constructs a parse tree yielding w. Based on this parse tree, the compiler generates an object code. So the parser acts as a membership-test algorithm designed for a given grammar G that, given a string w, tells us whether w is in L(G) or not, and, if it is, outputs a parse tree.

Notice that the parser tests membership based on the given grammar. Recall that when we practiced constructing a PDA for a given language, say {a^i b^i | i > 0}, we used structural information about the language, such as that the a's come first, then the b's, and that the numbers of a's and b's are the same. Consider the two CFG's G1 and G2 shown below in figure (a), which generate the same language {a^i b^i | i > 0}. Figure (b) shows a PDA that recognizes this language. For an input string w, this PDA gives no information about the grammar or about how the string w is derived. Hence, we need a different approach to construct a parser based on the grammar, not the language.

There are several algorithms available for parsing that, given an arbitrary CFG G and a string x, tell whether x ∈ L(G) or not and, if it is, output how x is derived. (The CYK algorithm is a typical example, shown in Appendix F.) However, these algorithms are too slow to be practical. (For example, the CYK algorithm takes O(n^3) time for an input string of length n.) Thus, we restrict CFG's to a subclass for which we can build a fast practical parser. This chapter presents two parsing strategies applicable to such restricted grammars, together with several design examples. Finally, the chapter briefly introduces Lex (the lexical analyzer generator) and YACC (the parser generator).
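As an aside, the CYK algorithm mentioned above is short enough to sketch. The version below is our own minimal illustration, not the book's Appendix F code; the grammar shown is one CNF version of S → aSb | ab (the language {a^i b^i | i > 0}), chosen here for demonstration.

```python
def cyk(rules, start, w):
    """CYK membership test in O(n^3) time; rules is a list of (lhs, rhs)
    pairs with rhs a tuple of one terminal or two nonterminals (CNF)."""
    n = len(w)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals deriving the substring w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(w):
        table[i][0] = {A for A, rhs in rules if rhs == (c,)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for A, rhs in rules:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        table[i][span - 1].add(A)
    return start in table[0][n - 1]

# A CNF version of S -> aSb | ab:  S -> AT | AB, T -> SB, A -> a, B -> b
RULES = [("S", ("A", "T")), ("S", ("A", "B")), ("T", ("S", "B")),
         ("A", ("a",)), ("B", ("b",))]

assert cyk(RULES, "S", "aaabbb")
assert not cyk(RULES, "S", "aabbb")
```

The triple loop over span, start position, and split point is where the O(n^3) running time comes from.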

(a) G1: S → aSb | ab    G2: S → aA, A → Sb | b

(b) A PDA recognizing {a^i b^i | i > 0}; its transition labels, from the figure, are (a, Z0/aZ0), (a, a/aa), (b, a/ε), (b, a/ε), and (ε, Z0/Z0) into the accepting configuration.

Parsing
13.1 Derivation 354
    Leftmost derivation, Rightmost derivation; Derivations and parse trees
13.2 LL(k) parsing strategy 357
13.3 Designing an LL(k) parser 367
    Examples; Definition of LL(k) grammars
13.4 LR(k) parsing strategy 379
13.5 Designing LR(k) parsers 387
    Examples; Definition of LR(k) grammars
13.6 Lex and YACC 404
Rumination 409
Exercises 412

Memories

Break Time

Two very elderly ladies were enjoying the sunshine on a park bench in Miami. They had been meeting at the park every sunny day, for over 12 years, chatting, and enjoying each others friendship. One day, the younger of the two ladies, turns to the other and says, “Please don't be angry with me dear, but I am embarrassed, after all these years... What is your name ? I am trying to remember, but I just can't.” The older friend stares at her, looking very distressed, says nothing for 2 full minutes, and finally with tearful eyes, says, “How soon do you have to know ?” - overheard by Rubin -


13.1 Derivation

The parser of a grammar generates a parse tree for a given input string. For convenience, the tree is commonly presented as a sequence of rules applied in one of the following two ways to derive the input string starting with S.

• Leftmost derivation: each rule is applied to the leftmost nonterminal symbol in the current sentential form.
• Rightmost derivation: each rule is applied to the rightmost nonterminal symbol in the current sentential form.

Example: G: S → ABC, A → aa, B → a, C → cC | c
Leftmost derivation: S ⇒ ABC ⇒ aaBC ⇒ aaaC ⇒ aaacC ⇒ aaacc
Rightmost derivation: S ⇒ ABC ⇒ ABcC ⇒ ABcc ⇒ Aacc ⇒ aaacc



The derivation sequences, either leftmost or rightmost, are more convenient to deal with than the tree data structure. However, to generate an object code there must be a simple way to translate the derivation sequence into its unique parse tree. The following two observations show how this can be done.

Observation 1: The sequence of rules applied according to the leftmost derivation corresponds to the order of the nodes visited when you traverse the parse tree top-down, left-to-right. (See the following example.)

G: S → ABC, A → aa, B → bD, C → cC | c, D → bd
Leftmost derivation (rule applications numbered 1-6):
S ⇒(1) ABC ⇒(2) aaBC ⇒(3) aabDC ⇒(4) aabbdC ⇒(5) aabbdcC ⇒(6) aabbdcc
[Parse tree figure omitted: traversed top-down, left-to-right, the nonterminal nodes S, A, B, D, C, C are met in the order 1 2 3 4 5 6.]


Observation 2: The reverse of the sequence of rules applied according to the rightmost derivation corresponds to the order of the nodes visited when you traverse the parse tree bottom-up, left-to-right. (See the following example.)

G: S → ABC, A → aa, B → bD, C → cC | c, D → bd
Rightmost derivation (rule applications numbered 1-6):
S ⇒(1) ABC ⇒(2) ABcC ⇒(3) ABcc ⇒(4) AbDcc ⇒(5) Abbdcc ⇒(6) aabbdcc
[Parse tree figure omitted: traversed bottom-up, left-to-right, the nodes are met in the order 6 5 4 3 2 1.]


13.2 LL(k) parsing strategy

We know that parsers are different from PDA's, because their membership test must be based on the given CFG. Let's try to build a conventional DPDA which, with the grammar G stored in the finite control, tests whether the input string x is in L(G) and, if it is, outputs the sequence of rules applied to derive x. We equip the finite control with an output port for this purpose. Our first strategy is to derive the input string x in the stack. Because any string must be derived starting with the start symbol, we let the machine push S onto the stack and enter state q1 for the next move. For convenience, we assign a number to each rule as shown below.

G: (1) S → AB  (2) S → AC  (3) A → aaaaaaaaaa
   (4) B → bB  (5) B → b   (6) C → cC  (7) C → c

L(G) = {a^10 x | x = b^i or x = c^i, i ≥ 1}

[Figure omitted: the DPDA with G in its finite control and an output port, reading input aaaaaaaaaabbb with stack contents SZ0, asking which rule to apply to S.]


Now we ask which rule, (1) or (2), the machine should apply to S to eventually derive the string on the input tape. If the input string is derived using rule (1) (respectively, rule (2)), then there must be the symbol b (respectively, the symbol c) after the 10-th a. Unfortunately, our conventional DPDA model cannot look ahead at the input before reading it. Recall that conventional DPDA's decide whether to read the input depending on the stack-top symbol; only after reading an input symbol does the machine know what it is. Thus, without reading up to the 11-th input symbol, there is no way for the machine to identify the symbol at that position.


To overcome this problem, we equip the finite-state control with a "telescope" with which the machine can look some finite k cells ahead on the input tape. For the grammar G, it is enough to have a telescope with a range of 11 cells. (Notice that the cell under the head is included in the look-ahead range.) With this new capability, the machine scans the input string ahead within the range and, based on what it sees ahead, takes its next move. While looking ahead, the input head does not move.


Now the parser, looking ahead 11 cells, sees aaaaaaaaaab. Since there is a b at the end, the machine chooses rule (1) (i.e., S → AB), rewrites the stack top S with AB, and outputs rule number (1). Let q, α, and β be, respectively, the current state, the remaining input portion to read, and the current stack contents. From now on, for convenience, we shall use the triple (q, α, β), called the configuration, instead of drawing the cumbersome machine diagram.


Looking ahead 11 cells in the current configuration (q1, aaaaaaaaaabbb, SZ0), the parser applies rule (1) by rewriting the stack top S with the rule's right side AB. Consequently, the configuration changes as follows.

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0)

Now, with the nonterminal symbol A at the stack top, the parser must find a rule to apply. Since A has only one rule, rule (3), there is no choice. So the parser applies rule (3), changing the configuration as follows.

(q1, aaaaaaaaaabbb, ABZ0) ⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0)


Notice that the terminal symbols appearing at the stack top after applying rule (3) correspond to the leftmost terminal symbols appearing in the leftmost derivation. Thus, a terminal symbol appearing at the stack top must match the next input symbol if the input string is generated by the grammar. So the parser, seeing a terminal symbol at the stack top, reads the input and, if they match, pops the stack top. The following sequence of configurations shows how the parser successfully pops all the terminal symbols pushed onto the stack by applying rule (3).

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0) ⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⊢ . . . ⊢ (q1, abbb, aBZ0) ⊢ (q1, bbb, BZ0)


(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0) ⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⊢ . . . ⊢ (q1, abbb, aBZ0) ⊢ (q1, bbb, BZ0) ⊢ ?

Now the parser must choose one of B's rules, either (4) or (5). If there remains only one b on the input tape, rule (5) is the choice. Otherwise (i.e., if there is more than one b), rule (4) must be applied. It follows that the parser needs to look two cells ahead, and it proceeds as follows.

(q1, bbb, BZ0) ⊢(4) (q1, bbb, bBZ0) ⊢ (q1, bb, BZ0) ⊢(4) (q1, bb, bBZ0) ⊢ (q1, b, BZ0) ⊢(5) (q1, b, bZ0) ⊢ (q1, ε, Z0)


In summary, our parser works as follows, where the numbers over ⊢ are the rules applied, in order, during the parse.

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0) ⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⊢ . . . ⊢ (q1, abbb, aBZ0) ⊢ (q1, bbb, BZ0) ⊢(4) (q1, bbb, bBZ0) ⊢ (q1, bb, BZ0) ⊢(4) (q1, bb, bBZ0) ⊢ (q1, b, BZ0) ⊢(5) (q1, b, bZ0) ⊢ (q1, ε, Z0)

Notice that the last configuration above implies a successful parse: it shows that the sequence of rules applied on the stack generates exactly the string originally written on the input tape. If the parser fails to reach the accepting configuration, we say the input is rejected. In the above example, the sequence of rules applied to the nonterminal symbols appearing at the stack top matches the sequence of rules applied in the leftmost derivation of the input string, shown below.

S ⇒(1) AB ⇒(3) aaaaaaaaaaB ⇒(4) aaaaaaaaaabB ⇒(4) aaaaaaaaaabbB ⇒(5) aaaaaaaaaabbb


For the other input strings, those ending with c's, the parser can apply the same strategy and successfully parse by looking ahead at most 11 cells (see below). This parser is called an LL(11) parser, named after the following properties: the input is read Left-to-right, the order of rules applied matches the order of the Leftmost derivation, and the longest look-ahead range is 11 cells. For a grammar G, if we can build an LL(k) parser for some constant k, we call G an LL(k) grammar.

(q0, aaaaaaaaaaccc, Z0) ⊢ (q1, aaaaaaaaaaccc, SZ0) ⊢(2) (q1, aaaaaaaaaaccc, ACZ0) ⊢(3) (q1, aaaaaaaaaaccc, aaaaaaaaaaCZ0) ⊢ . . . ⊢ (q1, accc, aCZ0) ⊢ (q1, ccc, CZ0) ⊢(6) (q1, ccc, cCZ0) ⊢ (q1, cc, CZ0) ⊢(6) (q1, cc, cCZ0) ⊢ (q1, c, CZ0) ⊢(7) (q1, c, cZ0) ⊢ (q1, ε, Z0)

Formally, an LL(k) parser is defined by a parse table with the nonterminal symbols on the rows and the look-ahead contents on the columns. The table entries are the right sides of the rules applied. Blank entries are for the rejecting cases.

The parse table below is constructed based on our observations while analyzing how the parser should work for the given input string. In the look-ahead contents, X is a don't-care symbol, and ε means no look-ahead is needed.

G: (1) S → AB  (2) S → AC  (3) A → aaaaaaaaaa  (4) B → bB  (5) B → b  (6) C → cC  (7) C → c

Contents of the 11-cell look-ahead (a^10 abbreviates aaaaaaaaaa; X: don't care, B: blank, ε: no look-ahead needed):

Stack top S: look-ahead a^10 b → AB;  look-ahead a^10 c → AC
Stack top A: ε → a^10
Stack top B: look-ahead b b X^9 → bB;  look-ahead b B^10 → b
Stack top C: look-ahead c c X^9 → cC;  look-ahead c B^10 → c

Parse Table
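The choices recorded in the parse table above can be exercised with a small program. Below is a minimal Python sketch of a table-driven LL(11) parser for this grammar; the function names and the prefix tests standing in for the table columns are our own, not part of the text.

```python
# Table-driven LL(11) parser sketch for
#   (1) S -> AB  (2) S -> AC  (3) A -> a^10
#   (4) B -> bB  (5) B -> b   (6) C -> cC  (7) C -> c
A10 = 'a' * 10

def choose(nonterminal, look):
    """Pick a right side from the 11-cell look-ahead (None = reject)."""
    if nonterminal == 'S':
        if look.startswith(A10 + 'b'): return 'AB'   # rule (1)
        if look.startswith(A10 + 'c'): return 'AC'   # rule (2)
        return None
    if nonterminal == 'A':
        return A10                                   # rule (3), no look-ahead
    if nonterminal == 'B':
        return 'bB' if look[1:2] == 'b' else 'b'     # rules (4)/(5)
    if nonterminal == 'C':
        return 'cC' if look[1:2] == 'c' else 'c'     # rules (6)/(7)
    return None

def ll11_parse(s):
    stack, i = ['S'], 0                  # stack top is the end of the list
    while stack:
        top = stack.pop()
        if top.isupper():                # nonterminal: consult the table
            rhs = choose(top, s[i:i + 11])
            if rhs is None:
                return False
            stack.extend(reversed(rhs))  # push right side, left end on top
        elif i < len(s) and s[i] == top:
            i += 1                       # terminal: match and pop
        else:
            return False
    return i == len(s)
```

For instance, ll11_parse('a'*10 + 'bbb') applies rules (1), (3), (4), (4), (5), mirroring the parsing history above.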

13.3 Designing an LL(k) Parser

Example 1. Design an LL(k) parser with minimum k for the following CFG.

(1) S → aSb   (2) S → aabbb

The language of this grammar is {a^i aabbb b^i | i ≥ 0}. Every string generated by this grammar has aabbb at the center. As we did in the preceding section, let's examine how an LL(k) parser will parse the input aaaaabbbbbb with the shortest possible look-ahead range k. To parse the input string successfully, the machine should apply the rules in the order (1), (1), (1), (2), which is the same order applied in the following leftmost derivation.

S ⇒(1) aSb ⇒(1) aaSbb ⇒(1) aaaSbbb ⇒(2) aaaaabbbbbb


Pushing the start symbol S into the stack in the initial configuration, the parser gets ready to parse the string as shown below. With S at the stack top, it must apply one of S's two rules. To choose one of them, the parser needs to look ahead for supporting information. What is the shortest range it must look ahead?

(1) S → aSb   (2) S → aabbb

(q0, aaaaabbbbbb, Z0) ⊢ (q1, aaaaabbbbbb, SZ0) ⊢ ?

If aabbb is ahead, rule (2) must be applied, so it appears that k = 5. But the parser does not have to see that whole string. If aaa is ahead, the leftmost a must have been generated by rule (1). Otherwise, if aab is ahead, the leftmost a must have been generated by rule (2). So it is enough to look ahead 3 cells (i.e., k = 3). Thus, in the current configuration, since the 3-cell look-ahead contains aaa, the parser applies rule (1), then reads the input to match and pop the terminal symbol a from the stack top as follows.

(q1, aaaaabbbbbb, SZ0) ⊢(1) (q1, aaaaabbbbbb, aSbZ0) ⊢ (q1, aaaabbbbbb, SbZ0)


Again, with S on the stack top, the parser looks ahead 3 cells and, seeing aaa, applies rule (1), repeating the same procedure until it sees aab ahead, as follows.

(1) S → aSb   (2) S → aabbb

(q1, aaaaabbbbbb, SZ0) ⊢(1) (q1, aaaaabbbbbb, aSbZ0) ⊢ (q1, aaaabbbbbb, SbZ0) ⊢(1) (q1, aaaabbbbbb, aSbbZ0) ⊢ (q1, aaabbbbbb, SbbZ0) ⊢(1) (q1, aaabbbbbb, aSbbbZ0) ⊢ (q1, aabbbbbb, SbbbZ0) ⊢ ?

Now, the parser finally applies rule (2), and keeps reading and match-and-popping until it enters the accepting configuration as follows.

(q1, aabbbbbb, SbbbZ0) ⊢(2) (q1, aabbbbbb, aabbbbbbZ0) ⊢ . . . ⊢ (q1, ε, Z0)


The parser applied the rules in the order (1), (1), (1), (2), which is the same order applied in the leftmost derivation of the input string aaaaabbbbbb.

S ⇒(1) aSb ⇒(1) aaSbb ⇒(1) aaaSbbb ⇒(2) aaaaabbbbbb

Given an arbitrary input string, the parser, applying the same procedure, will end up in the accepting configuration if and only if the input belongs to the language of the grammar. The parser needs to look ahead at least 3 cells. Hence, the grammar is LL(3). The parse table is shown below.

(1) S → aSb   (2) S → aabbb

Stack top | 3-cell look-ahead: aaa | aab
S         | aSb                    | aabbb

Parse Table
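The two-entry parse table above translates into a very small program. The following is a minimal Python sketch (the function name and loop structure are ours) of the LL(3) parser for this grammar.

```python
# LL(3) parser sketch for (1) S -> aSb  (2) S -> aabbb
def ll3_parse(s):
    stack, i = ['S'], 0                    # stack top is the end of the list
    while stack:
        top = stack.pop()
        if top == 'S':
            look = s[i:i + 3]              # 3-cell look-ahead
            if look == 'aaa':
                stack.extend(reversed('aSb'))    # rule (1)
            elif look == 'aab':
                stack.extend(reversed('aabbb'))  # rule (2)
            else:
                return False               # blank table entry: reject
        elif i < len(s) and s[i] == top:
            i += 1                         # terminal: match and pop
        else:
            return False
    return i == len(s)
```

Running it on aaaaabbbbbb applies rules (1), (1), (1), (2), exactly as in the trace above.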

Example 2. Construct an LL(k) parser with minimum k for the following CFG.

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

As we did for Example 1, we pick a typical string, ababaaaa, derivable by the grammar, and examine how it can be parsed according to the LL(k) parsing strategy with minimum k. Then, based on the analysis, we will construct a parse table. The order of the rules applied by the parser should be the same as the one applied in the following leftmost derivation.

S ⇒(1) abA ⇒(3) abSaa ⇒(1) ababAaa ⇒(3) ababSaaaa ⇒(2) ababaaaa

Pushing the start symbol S onto the top of the stack, the parser must choose either rule (1) or rule (2), whichever leads to deriving the input string. For the choice, is there any useful information ahead on the input tape?

(q0, ababaaaa, Z0) ⊢ (q1, ababaaaa, SZ0) ⊢ ?

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

If the input is not empty, the parser, with S at the stack top, should choose rule (1). Then, as shown below, for each terminal symbol appearing at the stack top, the parser reads the next input symbol and, if they match, pops the stack top, until A appears. If the input tape were empty, the parser would simply pop S (i.e., rewrite S with ε) and enter the accepting configuration. Now, with A at the stack top, the parser should choose between rules (3) and (4).

(q1, ababaaaa, SZ0) ⊢(1) (q1, ababaaaa, abAZ0) ⊢ . . ⊢ (q1, abaaaa, AZ0) ⊢ ?

If rule (4) had been used to derive the input, the next input symbol ahead would be b, not a. Seeing symbol a ahead, the parser applies rule (3); consequently, having S on the stack top as before, it needs to look ahead to choose the next rule. Up to this point, it appears that a 1-cell look-ahead is sufficient.

(q1, abaaaa, AZ0) ⊢(3) (q1, abaaaa, SaaZ0) ⊢ ?


But this time, with S at the stack top, it is uncertain which rule to apply. Seeing a ahead, the parser can apply either rule (1) or rule (2), because in either case it will successfully match the stack top a with the next input symbol a (see below). To resolve this uncertainty, the parser needs one more cell of look-ahead. (We could instead have the parser look down into the stack, but we have chosen to extend the look-ahead range, a straightforward solution. Later in this chapter we will discuss parsers that are allowed to look down the stack to some finite depth.)

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

(q1, abaaaa, SaaZ0) ⊢(1) (q1, abaaaa, abAaaZ0)
(q1, abaaaa, SaaZ0) ⊢(2) (q1, abaaaa, aaZ0)

Now, looking ab ahead in the extended range, which must have been generated by rule (1), the parser applies that rule and repeats the previous procedure until S appears at the stack top again.

(q1, abaaaa, SaaZ0) ⊢(1) (q1, abaaaa, abAaaZ0) ⊢ . . ⊢ (q1, aaaa, AaaZ0) ⊢(3) (q1, aaaa, SaaaaZ0) ⊢ ?

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

Looking aa ahead with S on the stack top, the parser applies rule (2). Then, for each a appearing at the stack top, it keeps reading the next input symbol, matching and popping the stack top, eventually entering the accepting configuration.

(q1, aaaa, SaaaaZ0) ⊢(2) (q1, aaaa, aaaaZ0) ⊢ . . . ⊢ (q1, ε, Z0)

In summary, the parser parses the input string ababaaaa as follows.

(q1, ababaaaa, SZ0) ⊢(1) (q1, ababaaaa, abAZ0) ⊢ . . ⊢ (q1, abaaaa, AZ0) ⊢(3) (q1, abaaaa, SaaZ0) ⊢(1) (q1, abaaaa, abAaaZ0) ⊢ . . ⊢ (q1, aaaa, AaaZ0) ⊢(3) (q1, aaaa, SaaaaZ0) ⊢(2) (q1, aaaa, aaaaZ0) ⊢ . . . ⊢ (q1, ε, Z0)


The input string that we have just examined is one derived by applying rule (2) last. For the other typical string, ababbaa, which is derived by applying rule (4) last, the LL(2) parser works as follows.

(q1, ababbaa, SZ0) ⊢(1) (q1, ababbaa, abAZ0) ⊢ . . ⊢ (q1, abbaa, AZ0) ⊢(3) (q1, abbaa, SaaZ0) ⊢(1) (q1, abbaa, abAaaZ0) ⊢ . . ⊢ (q1, baa, AaaZ0) ⊢(4) (q1, baa, baaZ0) ⊢ . . . ⊢ (q1, ε, Z0)

From the analysis of the two parsing examples, we construct the following parse table. (Notice that although a 1-cell look-ahead is enough when A is at the stack top, the entries are listed under the 2-cell look-ahead columns.)

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

Stack top | 2-cell look-ahead: ab | aa  | bX | BB    (B: blank, X: don't care)
S         | abA                   | ε   |    | ε
A         | Saa                   | Saa | b  |

Parse Table
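The parse table above, including the ε-rule for S, can be simulated directly. Here is a minimal Python sketch (names and structure are ours) of the LL(2) parser for this grammar.

```python
# LL(2) parser sketch for (1) S -> abA  (2) S -> eps  (3) A -> Saa  (4) A -> b
def ll2_parse(s):
    stack, i = ['S'], 0                  # stack top is the end of the list
    while stack:
        top = stack.pop()
        look = s[i:i + 2]                # 2-cell look-ahead
        if top == 'S':
            if look.startswith('ab'):
                stack.extend(reversed('abA'))   # rule (1)
            elif look.startswith('aa') or look == '':
                pass                            # rule (2): S -> epsilon
            else:
                return False
        elif top == 'A':
            if look.startswith('a'):
                stack.extend(reversed('Saa'))   # rule (3)
            elif look.startswith('b'):
                stack.append('b')               # rule (4)
            else:
                return False
        elif i < len(s) and s[i] == top:
            i += 1                       # terminal: match and pop
        else:
            return False
    return i == len(s)
```

On ababaaaa it applies (1), (3), (1), (3), (2), matching the leftmost derivation traced above.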


For a given input string, the basic strategy of LL(k) parsing is to generate the same string on the top of the stack by rewriting every nonterminal symbol appearing at the stack top with the right side of one of that nonterminal's rules. If the nonterminal symbol has more than one rule, the parser picks the right one based on the prefix of the input string appearing in the k cells looked ahead. Whenever a terminal symbol appears on the stack top, the machine reads the next input symbol and pops the stack top if they match. The sequence of rules applied in a successful parsing according to this strategy is the same as the one applied in the leftmost derivation of the input string.

The class of CFG's that can be parsed by the LL(k) parsing strategy is limited. The CFG G1 below is an example for which no LL(k) parser exists. However, G2, which generates the same language, is an LL(k) grammar. We will shortly explain why.

G1: S → A | B   A → aA | 0   B → aB | 1
G2: S → aS | D   D → 0 | 1
L(G1) = L(G2) = {a^i t | i ≥ 0, t ∈ {0, 1}}


Consider the first working configuration illustrated below, with the start symbol S on top of the stack. The parser should choose one of S's two rules, S → A or S → B. But it is impossible to choose the correct rule, because the right-end symbol 0 (or 1), which is essential for the correct choice, can be located arbitrarily far to the right. It is impossible for any LL(k) parser to identify it with its "telescope" of a finite range k. For the grammar G2, in contrast, we can easily design an LL(1) parser.

G1: S → A | B   A → aA | 0   B → aB | 1
G2: S → aS | D   D → 0 | 1
L(G1) = L(G2) = {a^i t | i ≥ 0, t ∈ {0, 1}}

[Figure: the parser for G1, in state q1 with stack contents SZ0, faces the input aaaa.....aa0 and cannot decide which rule of S to apply.]
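To see why G2 is easy, note that a single look-ahead cell already selects the rule: a means S → aS, while 0 or 1 means S → D. The following is a minimal Python sketch (our own naming, not from the text) of such an LL(1) parser for G2; no analogous choice function can exist for G1, for any fixed look-ahead.

```python
# LL(1) parser sketch for G2:  S -> aS | D,  D -> 0 | 1
def ll1_parse_g2(s):
    stack, i = ['S'], 0
    while stack:
        top = stack.pop()
        look = s[i] if i < len(s) else ''    # 1-cell look-ahead
        if top == 'S':
            if look == 'a':
                stack.extend(['S', 'a'])     # S -> aS (reversed: a on top)
            elif look in ('0', '1'):
                stack.append('D')            # S -> D
            else:
                return False
        elif top == 'D':
            if look in ('0', '1'):
                stack.append(look)           # D -> 0 or D -> 1
            else:
                return False
        elif i < len(s) and s[i] == top:
            i += 1                           # terminal: match and pop
        else:
            return False
    return i == len(s)
```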

Definition of LL(k) Grammars

We have just seen two CFG's that generate the same language, where for one no LL(k) parser exists, while for the other we can design an LL(k) parser. So we may ask: what is the property of LL(k) grammars?

For a string x, let (k)x denote the prefix of length k of string x. If |x| < k, then (k)x = x. For example, (2)ababaa = ab, and (3)ab = ab.

Definition (LL(k) grammar). Let G = (VT, VN, P, S) be a CFG. Grammar G is an LL(k) grammar if it satisfies the following condition. Consider two arbitrary leftmost derivations of the following forms, where α, β, γ ∈ (VT ∪ VN)*, w, x, y ∈ VT*, and A ∈ VN.

S ⇒* wAα ⇒ wβα ⇒* wx
S ⇒* wAα ⇒ wγα ⇒* wy

If (k)x = (k)y, then it must be that β = γ. That is, in the above two derivations, the same rule of A must have been applied whenever (k)x = (k)y. This condition implies that with a nonterminal symbol A on the stack top, the parser can identify which of A's rules to apply by looking ahead k cells. If G has this property, we can build an LL(k) parser.


13.4 LR(k) Parsing

Recall that whenever a nonterminal symbol A appears on the stack top, an LL(k) parser replaces it with the right side of one of A's rules. This LL(k) parsing strategy is not powerful enough to parse commonly used programming languages. LR(k) parsing (also called bottom-up parsing) uses a more powerful strategy applicable to parsing programming languages. In a sense, the LR(k) strategy works in reverse of LL(k): an LR(k) parser tries to collect the right side of a rule, say A → β, on the stack top portion, and then replaces it (β) by the left side (A). (Notice that because the string β is pushed into the stack so that its right end appears at the stack top, it appears on the stack as β^R, the reverse of β.)

[Figure: with β^R on the stack top portion and the rule A → β in G, the parser replaces β^R by A.]

If there is uncertainty because another rule generates the same string β, for example B → β, the parser looks some constant k cells ahead on the input string to resolve the uncertainty. The difference between the two strategies can be summarized as follows: an LL(k) parser, seeing a nonterminal A on the stack top, must choose between A's rules A → α | β and replaces A by the right side chosen; an LR(k) parser, seeing β^R on the stack top portion, must choose between the rules A → β and B → β and replaces β^R by the left side chosen.

Let's take a simple example of designing an LR(k) parser for the CFG shown below. As we did for designing LL(k) parsers, we first pick a typical string that can be derived by the grammar and, with this string, observe how the machine should work according to the LR(k) strategy. Based on the observation, we will build an LR(k) parser.

(1) S → AC  (2) S → BD  (3) A → aab  (4) B → aab  (5) C → c  (6) D → d

Suppose the string aabc is given on the input tape. The basic strategy of an LR(k) parser is to shift the input (i.e., read the input and push it) onto the stack until the right side of a rule of the grammar appears on some stack top portion.

(q0, aabc, Z0) ⊢ ?

(1) S → AC  (2) S → BD  (3) A → aab  (4) B → aab  (5) C → c  (6) D → d

For the example, the parser shifts in the input up to aab, until it sees baa (notice that it is reversed) on the stack top portion. Two rules, (3) and (4), are applicable: baa must be replaced by either A or B. To resolve this uncertainty, the machine needs to look ahead; for this example, it is enough to look one cell ahead. Seeing c ahead, the parser applies rule (3), rewriting the baa on the top portion of the stack by A as follows.

(q0, aabc, Z0) ⊢ . . . ⊢ (q1, c, baaZ0) ⊢(3) (q1, c, AZ0) ⊢ ?

(1) S → AC  (2) S → BD  (3) A → aab  (4) B → aab  (5) C → c  (6) D → d

After applying rule (3), the parser sees no string on the stack top portion that is reducible (i.e., generated by a rule). So the parser shifts in the next symbol c from the input and applies rule (5) to rewrite it with C. (This time no look-ahead is needed.) Finally, the parser applies rule (1), S → AC, and rewrites the AC appearing on the stack top portion (the parser reads the stack from the inside out to the top) by the start symbol S, entering the accepting configuration (q1, ε, SZ0).

(q0, aabc, Z0) ⊢ . . . ⊢ (q1, c, baaZ0) ⊢(3) (q1, c, AZ0) ⊢ (q1, ε, cAZ0) ⊢(5) (q1, ε, CAZ0) ⊢(1) (q1, ε, SZ0)

The following shows the LR(1) parsing history for the other string, aabd, generated by the grammar.

(1) S → AC  (2) S → BD  (3) A → aab  (4) B → aab  (5) C → c  (6) D → d

(q0, aabd, Z0) ⊢ . . . ⊢ (q1, d, baaZ0) ⊢(4) (q1, d, BZ0) ⊢ (q1, ε, dBZ0) ⊢(6) (q1, ε, DBZ0) ⊢(2) (q1, ε, SZ0)

Notice that the sequence of rules applied by the parser, (4)(6)(2), is exactly the reverse of the order of the rightmost derivation of aabd (see below). The same holds for the former example aabc, and indeed for every LR(k) parsing. (Recall that the reverse of the order of the rules applied in the rightmost derivation of a string x corresponds to the bottom-up, left-to-right traversal of the parse tree yielding x.)

S ⇒(2) BD ⇒(6) Bd ⇒(4) aabd
S ⇒(1) AC ⇒(5) Ac ⇒(3) aabc

Formally, an LR(k) parser is defined by a parse table (also called a reduction table), where each row corresponds to the right side of a rule and each column corresponds to the look-ahead contents, as in the LL(k) parse tables. The table entries are the left sides of the rules applied. Based on our observations while tracing the parser on the two sample strings according to the LR parsing strategy, we can build the parse table of an LR(1) parser as shown below. The two parsing histories are copied below for review.

(1) S → AC  (2) S → BD  (3) A → aab  (4) B → aab  (5) C → c  (6) D → d

(q0, aabc, Z0) ⊢ . . . ⊢ (q1, c, baaZ0) ⊢(3) (q1, c, AZ0) ⊢ (q1, ε, cAZ0) ⊢(5) (q1, ε, CAZ0) ⊢(1) (q1, ε, SZ0)
(q0, aabd, Z0) ⊢ . . . ⊢ (q1, d, baaZ0) ⊢(4) (q1, d, BZ0) ⊢ (q1, ε, dBZ0) ⊢(6) (q1, ε, DBZ0) ⊢(2) (q1, ε, SZ0)

Stack top portion | 1-cell look-ahead: ε | c | d
AC                | S                    |   |
BD                | S                    |   |
aab               |                      | A | B
c                 | C                    |   |
d                 | D                    |   |

Parse (reduction) Table
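The reduction table above drives a simple shift-reduce loop. Below is a minimal Python sketch (our own small simulation, not the textbook's PDA notation) of the LR(1) parser: each iteration either reduces a stack top portion per the table or shifts the next input symbol.

```python
# LR(1) shift-reduce sketch for
#   (1) S -> AC  (2) S -> BD  (3) A -> aab  (4) B -> aab  (5) C -> c  (6) D -> d
def lr1_parse(s):
    stack, i = [], 0                       # stack top is the end of the list
    while True:
        top3 = ''.join(stack[-3:])
        look = s[i] if i < len(s) else ''  # 1-cell look-ahead
        if top3 == 'aab':                  # rules (3)/(4): look-ahead decides
            stack[-3:] = ['A'] if look == 'c' else ['B']
        elif stack[-1:] == ['c']:          # rule (5): C -> c
            stack[-1:] = ['C']
        elif stack[-1:] == ['d']:          # rule (6): D -> d
            stack[-1:] = ['D']
        elif ''.join(stack[-2:]) in ('AC', 'BD'):
            stack[-2:] = ['S']             # rules (1)/(2)
        elif i < len(s):
            stack.append(s[i]); i += 1     # nothing reducible: shift
        else:
            break
    return stack == ['S']
```

On aabc the reductions fire in the order (3), (5), (1), and on aabd in the order (4), (6), (2), matching the two histories above.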

With a simple modification to the LR(1) grammar just examined, we can change it into an LR(k) grammar for any given constant k. For example, by prefixing eeee to the right sides of the rules of C and D as shown below, we change the grammar to LR(5). Below are the parsing histories of an LR(5) parser for two typical strings derivable from the modified grammar, and the parse table constructed from the histories.

S → AC | BD   A → aab   B → aab   C → eeeec   D → eeeed

(q0, aabeeeec, Z0) ⊢ . . . ⊢ (q1, eeeec, baaZ0) ⊢ (q1, eeeec, AZ0) ⊢ . . . ⊢ (q1, ε, ceeeeAZ0) ⊢ (q1, ε, CAZ0) ⊢ (q1, ε, SZ0)
(q0, aabeeeed, Z0) ⊢ . . . ⊢ (q1, eeeed, baaZ0) ⊢ (q1, eeeed, BZ0) ⊢ . . . ⊢ (q1, ε, deeeeBZ0) ⊢ (q1, ε, DBZ0) ⊢ (q1, ε, SZ0)

Stack top portion | 5-cell look-ahead: eeeec | eeeed | ε
AC                |                          |       | S
BD                |                          |       | S
aab               | A                        | B     |
eeeec             |                          |       | C
eeeed             |                          |       | D

Parse (reduction) Table

13.5 Designing LR(k) Parsers

We shall now practice designing LR(k) parsers for a couple of grammars. As with LL(k), we first pick a typical string that can be generated by the given grammar and observe how an LR(k) parser parses it with minimum k. Based on the observation, we build the parse table.

Example 1. Construct an LR(k) parser for the following CFG with minimum k.

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

The language of this grammar is {aaaddd} ∪ {aaadddc^i | i ≥ 1}. We pick the string aaadddccc, which is derived by the rightmost derivation shown below. For a successful LR parsing, which ends in the accepting configuration (q1, ε, SZ0), our LR(k) parser should apply the rules in the reverse of the order applied in this derivation, i.e., (3), (4), (6), (5), (5), (1).

Rightmost derivation:
S ⇒(1) ADC ⇒(5) ADCc ⇒(5) ADCcc ⇒(6) ADccc ⇒(4) Adddccc ⇒(3) aaadddccc

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

The production rule applied last in the rightmost derivation of any string generated by this grammar is rule (3) or rule (2). So, before applying one of these rules, the parser should shift in the input until it sees either aaa or aaaddd on the stack top portion. But we have a problem. Consider the following two configurations with aaa shifted onto the stack. Cases (a) and (b) are, respectively, for the input strings aaaddd and aaadddccc.

(a) (q0, aaaddd, Z0) ⊢ (q1, aaddd, aZ0) ⊢ . . . ⊢ (q1, ddd, aaaZ0) ⊢ ?
(b) (q0, aaadddccc, Z0) ⊢ (q1, aadddccc, aZ0) ⊢ . . . ⊢ (q1, dddccc, aaaZ0) ⊢ ?

For case (a), it is wrong to apply rule (3) and reduce aaa to A; here the parser should shift in the remaining ddd and then apply rule (2). For case (b), rule (3) must be applied in the current configuration. How can the parser make the right move? It needs to look ahead.

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

The parser needs to look 4 cells ahead to see whether there is dddB or dddc ahead (where B denotes the blank symbol). If it is dddB (case (a) below), the parser keeps shifting (i.e., reading and pushing) the input symbols onto the stack until it sees aaaddd on the top portion of the stack (read from the inside out), and then applies rule (2) as shown below, entering the accepting configuration. If the 4-cell look-ahead is dddc (case (b) below), the parser applies rule (3) in the current configuration.

(a) (q0, aaaddd, Z0) ⊢ (q1, aaddd, aZ0) ⊢ . . . ⊢ (q1, ddd, aaaZ0) ⊢ . . . ⊢ (q1, ε, dddaaaZ0) ⊢(2) (q1, ε, SZ0)
(b) (q0, aaadddccc, Z0) ⊢ (q1, aadddccc, aZ0) ⊢ . . . ⊢ (q1, dddccc, aaaZ0) ⊢(3) (q1, dddccc, AZ0) ⊢ ?

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

Now the parser sees no content on the stack top that can be reduced, so it resumes shifting in the input until ddd appears on the stack top portion, which can be reduced by rule (4). But caution is needed before applying rule (4), because in case (a) above the parser also has ddd on the stack before applying rule (2). This confusion is resolved by looking one cell ahead to see whether the symbol c is there.

(q1, dddccc, AZ0) ⊢ . . . ⊢ (q1, ccc, dddAZ0) ⊢(4) (q1, ccc, DAZ0) ⊢ ?

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

After applying rule (4), the parser, seeing no string on the stack that is reducible, shifts in the next symbol c from the input. This first c from the input is the one generated by rule (6). (Note that if rule (5) were C → cC, with the right side reversed, the last c would have been generated by rule (6).) Thus, the parser applies rule (6) and reduces the c on the stack top to C as follows.

(q0, aaadddccc, Z0) ⊢ (q1, aadddccc, aZ0) ⊢ . . . ⊢ (q1, dddccc, aaaZ0) ⊢(3) (q1, dddccc, AZ0) ⊢ . . . ⊢ (q1, ccc, dddAZ0) ⊢(4) (q1, ccc, DAZ0) ⊢ (q1, cc, cDAZ0) ⊢(6) (q1, cc, CDAZ0) ⊢ ?

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

Now the parser has ADC on the stack, which would be reducible by rule (1). But it is too early for this, because there are more c's to be processed. Here it needs to look one cell ahead, and shift in the next c so that rule (5) can reduce the Cc on the stack to C. The parser repeats this reduction until it sees no more c's ahead, and finally applies rule (1) to reduce ADC to S, entering the accepting configuration.

(q1, cc, CDAZ0) ⊢ (q1, c, cCDAZ0) ⊢(5) (q1, c, CDAZ0) ⊢ (q1, ε, cCDAZ0) ⊢(5) (q1, ε, CDAZ0) ⊢(1) (q1, ε, SZ0)

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

Below is the summary of our observations. The order of the rules applied is exactly the reverse of the order of the rules applied in the rightmost derivation of the input string (see below).

(q0, aaadddccc, Z0) ⊢ . . ⊢ (q1, adddccc, aaZ0) ⊢ (q1, dddccc, aaaZ0) ⊢(3) (q1, dddccc, AZ0) ⊢ . . . ⊢ (q1, ccc, dddAZ0) ⊢(4) (q1, ccc, DAZ0) ⊢ (q1, cc, cDAZ0) ⊢(6) (q1, cc, CDAZ0) ⊢ (q1, c, cCDAZ0) ⊢(5) (q1, c, CDAZ0) ⊢ (q1, ε, cCDAZ0) ⊢(5) (q1, ε, CDAZ0) ⊢(1) (q1, ε, SZ0)

Rightmost derivation:
S ⇒(1) ADC ⇒(5) ADCc ⇒(5) ADCcc ⇒(6) ADccc ⇒(4) Adddccc ⇒(3) aaadddccc

(1) S → ADC  (2) S → aaaddd  (3) A → aaa  (4) D → ddd  (5) C → Cc  (6) C → c

Based on the observations made while tracing how an LR(k) parser successfully parses the chosen input string, we can construct the following parse table for an LR(4) parser. (X: don't care, B: blank, ε: no look-ahead needed.)

Stack top portion | 4-cell look-ahead: dddc | dddB     | BBBB | cXXX     | ε
aaa               | A                       | shift-in |      |          |
ddd               |                         |          |      | D        |
aaaddd            |                         |          | S    |          |
c                 |                         |          |      |          | C
Cc                |                         |          |      |          | C
ADC               |                         |          |      | shift-in | S

Parse Table
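The table above can be turned into a shift-reduce loop in a few lines. The following Python sketch is our own condensed simulation (it inspects the flattened stack with string operations rather than keeping a separate table), under the reduction choices worked out above.

```python
# LR(4) shift-reduce sketch for
#   (1) S -> ADC  (2) S -> aaaddd  (3) A -> aaa  (4) D -> ddd
#   (5) C -> Cc   (6) C -> c
def lr4_parse(s):
    stack, i = [], 0
    while True:
        t = ''.join(stack)                 # stack read inside out, top at end
        if t.endswith('aaa') and s[i:i+4] == 'dddc':
            stack[-3:] = ['A']             # rule (3), look-ahead dddc
        elif t.endswith('aaaddd') and i == len(s):
            stack[-6:] = ['S']             # rule (2), look-ahead blank
        elif t.endswith('ddd') and s[i:i+1] == 'c':
            stack[-3:] = ['D']             # rule (4), look-ahead c
        elif t.endswith('Cc'):
            stack[-2:] = ['C']             # rule (5)
        elif t.endswith('Dc'):
            stack[-1:] = ['C']             # rule (6): first c after D
        elif t.endswith('ADC') and i == len(s):
            stack[-3:] = ['S']             # rule (1), look-ahead blank
        elif i < len(s):
            stack.append(s[i]); i += 1     # nothing reducible: shift
        else:
            break
    return stack == ['S']
```

On aaadddccc the reductions fire in the order (3), (4), (6), (5), (5), (1), the reverse of the rightmost derivation.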

Example 2. Construct an LR(k) parser for the following CFG with minimum k.

(1) S → EAF  (2) S → EBF  (3) E → aE  (4) E → a  (5) A → aaab  (6) B → aaac  (7) F → d

This is not an LL(k) grammar. (We leave the proof to the reader; recall the non-LL(k) grammar discussed at the end of the last section.) The language of this grammar is {a^i x d | i ≥ 1, x ∈ {aaab, aaac}}. Again, we first trace how an LR parser should parse a typical string from this language in order to construct a parse table. Let's choose the string aaaaaabd for the analysis; its rightmost derivation is shown below. Our LR(k) parser, given this string as input, should parse it by applying the rules in the reverse order of the derivation, i.e., (4), (3), (3), (5), (7), (1).

S ⇒(1) EAF ⇒(7) EAd ⇒(5) Eaaabd ⇒(3) aEaaabd ⇒(3) aaEaaabd ⇒(4) aaaaaabd

(1) S → EAF  (2) S → EBF  (3) E → aE  (4) E → a  (5) A → aaab  (6) B → aaac  (7) F → d

Every rightmost derivation in this grammar must end by applying rule (4). Thus, our LR(k) parser should shift in the a generated by this rule. The language of this grammar is {a^i x d | i ≥ 1, x ∈ {aaab, aaac}}, and the rightmost a in the prefix a^i is the one generated by rule (4). So the parser should shift the whole string a^i onto the stack to bring the rightmost a (of a^i) to the top of the stack. But here we have a problem, because there are more a's to the right, belonging to x ∈ {aaab, aaac}. To identify the last a in a^i, the parser needs to look ahead. If it sees either aaab or aaac ahead, then the a on top of the stack is the last a in a^i, which should be reduced by rule (4) as follows.

(q0, aaaaaabd, Z0) ⊢ (q1, aaaaabd, aZ0) ⊢ . . . ⊢ (q1, aaabd, aaaZ0) ⊢(4) (q1, aaabd, EaaZ0) ⊢ ?

(1) S → EAF  (2) S → EBF  (3) E → aE  (4) E → a  (5) A → aaab  (6) B → aaac  (7) F → d

Now we have aE on the stack top portion (reading from the inside out), which is reducible by rule (3). The parser performs this reduction twice, and then, since nothing on the stack top is reducible, shifts in the input until the aaab on the stack is reduced to A by rule (5). The next input symbol d is then shifted in and reduced by rule (7), and the resulting EAF is in turn reduced by rule (1), entering the accepting configuration (q1, ε, SZ0).

(q1, aaabd, EaaZ0) ⊢(3) (q1, aaabd, EaZ0) ⊢(3) (q1, aaabd, EZ0) ⊢ (q1, aabd, aEZ0) ⊢ . . . ⊢ (q1, d, baaaEZ0) ⊢(5) (q1, d, AEZ0) ⊢ (q1, ε, dAEZ0) ⊢(7) (q1, ε, FAEZ0) ⊢(1) (q1, ε, SZ0)

(1) S → EAF  (2) S → EBF  (3) E → aE  (4) E → a  (5) A → aaab  (6) B → aaac  (7) F → d

Here is a summary of our analysis, together with the parse table. The order of rules applied matches the reverse order of the rightmost derivation, which is (4), (3), (3), (5), (7), (1).

(q0, aaaaaabd, Z0) ⊢ (q1, aaaaabd, aZ0) ⊢ . . . ⊢ (q1, aaabd, aaaZ0) ⊢(4) (q1, aaabd, EaaZ0) ⊢(3) (q1, aaabd, EaZ0) ⊢(3) (q1, aaabd, EZ0) ⊢ (q1, aabd, aEZ0) ⊢ . . . ⊢ (q1, d, baaaEZ0) ⊢(5) (q1, d, AEZ0) ⊢ (q1, ε, dAEZ0) ⊢(7) (q1, ε, FAEZ0) ⊢(1) (q1, ε, SZ0)

Stack top portion | 4-cell look-ahead: aaab | aaac | others   | ε
a                 | E                       | E    | shift-in |
aE                |                         |      |          | E
aaab              |                         |      |          | A
aaac              |                         |      |          | B
d                 |                         |      |          | F
EAF               |                         |      |          | S
EBF               |                         |      |          | S

Parse Table

Example 3. Construct an LR(k) parser for the following CFG with minimum k.

(1) S → bS  (2) S → Accc  (3) A → bAc  (4) A → bc

We shall trace the parser with the input string bbbbbcccccc, which is derived by the rightmost derivation shown below. Thus, the LR(k) parser should parse it by applying the rules in the order (4), (3), (3), (2), (1), (1).

S ⇒(1) bS ⇒(1) bbS ⇒(2) bbAccc ⇒(3) bbbAcccc ⇒(3) bbbbAccccc ⇒(4) bbbbbcccccc

The language of the grammar is {b^i b^n c^n ccc | i ≥ 0, n ≥ 1}, and every rightmost derivation ends with rule (4), which generates bc. Notice that this bc is located at the center of the b^n c^n part, and can be brought onto the stack top simply by shifting in the input until bc appears on the stack. Then the LR(k) parser reduces it by rule (4) as follows.

(q0, bbbbbcccccc, Z0) ⊢ . . . ⊢ (q1, ccccc, cbbbbbZ0) ⊢(4) (q1, ccccc, AbbbbZ0)

(1) S → bS  (2) S → Accc  (3) A → bAc  (4) A → bc

(q0, bbbbbcccccc, Z0) ⊢ . . . ⊢ (q1, ccccc, cbbbbbZ0) ⊢(4) (q1, ccccc, AbbbbZ0) ⊢ (q1, cccc, cAbbbbZ0) ⊢ ?

After rule (4) is applied, with no stack top portion reducible, the parser shifts in the next input symbol c as above. Now the parser sees bAc on the stack, which is reducible by rule (3). However, we should be careful: we cannot let the parser repeat this reduction without looking ahead, because it needs to save the last three c's of the input string for the final reduction by rule (2). Thus the parser needs to look 3 cells ahead. In the current configuration, since ccc is ahead, it can reduce the bAc on the stack top portion by rule (3). The parser can do this shift-in and reduction once more, as shown below.

(q1, cccc, cAbbbbZ0) ⊢(3) (q1, cccc, AbbbZ0) ⊢ (q1, ccc, cAbbbZ0) ⊢(3) (q1, ccc, AbbZ0) ⊢ (q1, cc, cAbbZ0) ⊢ ?

(1) S → bS  (2) S → Accc  (3) A → bAc  (4) A → bc

(q0, bbbbbcccccc, Z0) ⊢ . . . ⊢ (q1, ccccc, cbbbbbZ0) ⊢(4) (q1, ccccc, AbbbbZ0) ⊢ (q1, cccc, cAbbbbZ0) ⊢(3) (q1, cccc, AbbbZ0) ⊢ (q1, ccc, cAbbbZ0) ⊢(3) (q1, ccc, AbbZ0) ⊢ (q1, cc, cAbbZ0) ⊢ ?

Now the parser sees two c's ahead, which must be shifted onto the stack for the reduction by rule (2), because those last three c's, together with A, were generated by rule (2). After this reduction, the parser reduces the stack top twice by rule (1) and enters the final configuration as follows.

(q1, cc, cAbbZ0) ⊢ . . . ⊢ (q1, ε, cccAbbZ0) ⊢(2) (q1, ε, SbbZ0) ⊢(1) (q1, ε, SbZ0) ⊢(1) (q1, ε, SZ0)

Below are the summary of our analysis and the parse table of the LR(3) parser.

(1) S → bS  (2) S → Accc  (3) A → bAc  (4) A → bc

(q0, bbbbbcccccc, Z0) ⊢ . . . ⊢ (q1, ccccc, cbbbbbZ0) ⊢(4) (q1, ccccc, AbbbbZ0) ⊢ (q1, cccc, cAbbbbZ0) ⊢(3) (q1, cccc, AbbbZ0) ⊢ (q1, ccc, cAbbbZ0) ⊢(3) (q1, ccc, AbbZ0) ⊢ (q1, cc, cAbbZ0) ⊢ . . . ⊢ (q1, ε, cccAbbZ0) ⊢(2) (q1, ε, SbbZ0) ⊢(1) (q1, ε, SbZ0) ⊢(1) (q1, ε, SZ0)

Stack top portion | 3-cell look-ahead: ccc | ccB      | ε
bc                | A                      |          |
bAc               | A                      | shift-in |
Accc              |                        |          | S
bS                |                        |          | S

Parse Table
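The LR(3) table above also admits a compact simulation. The following Python sketch (again our own condensed form, checking the flattened stack with string operations) reduces bc and bAc only when the 3-cell look-ahead is ccc, saving the last three c's for the reduction by rule (2).

```python
# LR(3) shift-reduce sketch for
#   (1) S -> bS  (2) S -> Accc  (3) A -> bAc  (4) A -> bc
def lr3_parse(s):
    stack, i = [], 0
    while True:
        t = ''.join(stack)                 # stack read inside out, top at end
        look = s[i:i + 3]                  # 3-cell look-ahead
        if t.endswith('bc') and look == 'ccc':
            stack[-2:] = ['A']             # rule (4)
        elif t.endswith('bAc') and look == 'ccc':
            stack[-3:] = ['A']             # rule (3)
        elif t.endswith('Accc'):
            stack[-4:] = ['S']             # rule (2)
        elif t.endswith('bS'):
            stack[-2:] = ['S']             # rule (1)
        elif i < len(s):
            stack.append(s[i]); i += 1     # shift (includes look-ahead ccB)
        else:
            break
    return stack == ['S']
```

On bbbbbcccccc it applies (4), (3), (3), (2), (1), (1), the reverse of the rightmost derivation above.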

Definition of LR(k) Grammars

LR(k) grammars are defined as follows. The definition is quite similar to that of LL(k) grammars, except that here we are concerned with rightmost derivations instead of leftmost derivations.

Definition (LR(k) grammar). Let G = (VT, VN, P, S) be a CFG, and consider sentential forms αβx and γβy appearing in rightmost derivations of strings in L(G), where α, β, γ ∈ V*, x, y ∈ VT*, and A ∈ VN. If there is a constant k for which the following condition is satisfied, then grammar G is an LR(k) grammar (i.e., we can construct an LR(k) parser for G):

(i) if S ⇒* αAx ⇒ αβx and S ⇒* γAy ⇒ γβy, such that (k)x = (k)y,
(ii) then α = γ.

In other words, if this condition is satisfied, it is possible for an LR(k) parser to establish β on the stack top portion and reduce it with the rule A → β by looking at most k cells ahead. (The proof, which can be done by induction, is a little challenging for the level of this text; we omit it.)

Parsing

13.6 Lex and YACC Lexical analyzer and parser are the major components of a compiler. A lexical analyzer, scanning the source program, identifies the basic program components, called tokens, such as reserved words, operators, and variable names. A parser, based on the grammar of the programming language, constructs the parse tree (see Section 11.1) whose yield is the sequence of tokens found by the lexical analyzer. This section briefly introduces two software tools, Lex and YACC, which automatically generate a lexical analyzer and a parser, respectively.

Lex Given tokens in terms of regular expression, Lex constructs a lexical analyzer. (Actually, this lexical analyzer is a software implementation of an FA that recognizes the set of tokens.) The input to Lex consists of the following three parts, each delimited by %%. (1) Token definition (2) Token description and actions (3) User-written codes


(1) Token definition
Tokens are defined in terms of a variation of the regular expression that was introduced in the rumination section of Chapter 3. Here we repeat the examples presented in that rumination section.

(a | b)* = (a+b)             (a | b)+ = (a+b)(a+b)*
[a-z] = a + b + . . . + z    [ab] = (a+b)
(ab\.c)? = (ab.c + ε)        abc?d = {abcd, abd}

(2) Token descriptions and actions
Lex reports each identified token to the parser. This part of the input defines what to report for each token. For example,

{real}      return FLOAT;
{integer}   return INTEGER;

(3) User-written code
When the action for a token is too complex to describe in the action part, it is written as a C program and put in the part for "user-written code." (For further details, see a compiler text.)
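Lex itself emits a C scanner; as a rough sketch of the idea only (the token names and patterns below are illustrative, not taken from any real Lex input), the token table can be compiled into a single alternation of named regular expressions:

```python
import re

# A rough Python analogue of what Lex generates: token names paired with
# regular expressions, compiled into one scanner.
TOKENS = [
    ("FLOAT",   r"\d+\.\d+"),     # try FLOAT before INTEGER
    ("INTEGER", r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("SKIP",    r"\s+"),          # whitespace: matched but not reported
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKENS))

def scan(src):
    for m in SCANNER.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(scan("count 3.14 42")))
# [('ID', 'count'), ('FLOAT', '3.14'), ('INTEGER', '42')]
```

As in Lex, ordering matters: the FLOAT pattern is listed before INTEGER so that "3.14" is not split into two integer tokens.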

YACC


Given a CFG, YACC constructs a look-ahead LR (LALR) parser, a practical restriction of the LR(k) parsing that we learned about in this chapter. The input includes actions to be taken depending on the semantics of each rule, and YACC also produces code for those actions. The input to YACC consists of the following three parts.

(1) Declarations and definitions
(2) Grammar and actions
(3) User-written code

(1) Declarations and definitions
This part defines all tokens (except the single-symbol operators), operator precedence, and associativity. It also declares the variable names used in the program and their data types, so that proper links can be established between their occurrences in different parts of the program. Here are some simple examples.

%token ID           /* token for identifier */
%token NUMBER       /* token for numbers */
%token BEGIN END ARRAY FILE
#include "yylex.c"  /* include lexical scanner */
extern int yylval;  /* token values from yylex */
int tcount = 0;     /* a temporary integer variable */
%start S            /* grammar's start symbol */


(2) Grammar and actions
The input grammar is written in BNF (Backus-Naur Form) as follows.
• Every terminal symbol is put between a pair of single quotation marks, and nonterminal symbols are denoted by words (with no quotation marks).
• Instead of the arrow (→) of CFG rules, a colon (:) is used, and each rule is terminated by a semicolon.
• A blank denotes ε.
Here are some simple examples together with the corresponding CFG rules, where i is a variable name.

expr : expr '+' term | expr '-' term | term ;
term : term '*' fact | term '/' fact | fact ;
fact : '(' expr ')' | ID ;

E → E + T | E - T | T
T → T * F | T / F | F
F → (E) | i

(3) User-written code

This part, which has the following form, must include the function call yyparse(), together with other C code if needed. (For further details, refer to a compiler text.)

main ( )
{
  . . . .
  yyparse( );
  . . . .
}

Power Memory Class

Break Time

An elderly couple were experiencing declining memories, so they decided to take a power memory class, where they teach one to remember things by association. Later, the man was talking to a neighbor about how much the class helped them. "Who was the instructor?" the neighbor asked. "Oh, let's see," pondered the man. "Umm... what's that flower, you know, the one that smells real nice but has those thorns...?" "A rose?" offered the neighbor. "Right," said the man. He then turned toward his house and shouted, "Hey, Rose, what's the name of the guy we took that memory class from?" - DonTravels -


Rumination (1): Parsing

• The LL and LR parsers work based on a given grammar. The following three context-free grammars generate the same language {aⁱX | i ≥ 1, X ∈ {b, c}}. Grammar G1 is neither LL(k) nor LR(k) for any k; G2 is an LR(1) grammar but not LL(k) for any k; and G3 is LL(2) and LR(0). (We leave the proofs for the reader.) Thus, LL(k) and LR(k) refer to the grammar, not to the language.

G1 : S  Ab | Bc A Aa | a B Ba | a G2 : S  Ab | Bc A aA | a B aB | a G3 : S  aS | aA A  b | c  Ambiguous CFG‟s have more than one parse tree that yields the same string x, and hence, there are more than one leftmost or rightmost derivation for x. Thus, it is impossible to parse x and give the exact sequence of rules that have been applied for the leftmost or rightmost derivation. However, if we allow the parser to choose arbitrarily one of the multiple derivations, we can design a parser. As an example, consider the following grammar, which is ambiguous. We can make an LL(0) or LR(0) parser for this grammar by letting it choose either one of following two derivations; S  A  a and S  B  a.

SA| B

A a

Ba


• One extra working state (q1) was enough for all the LL(k) parsers that we have designed in this chapter. We can show that this is true for every LL(k) grammar. However, this is not true for LR(k) parsing, though all the LR(k) parsers that we have presented in this chapter use two states including the start state q0. The following grammar is an example that requires three states.

S aaA | baB | caC A aA | a B aB | a

C aC | a

(q0, baa. . . . aa, Z0)  … . .  (q1, , aa. . . aabZ0) ?

The language of this grammar is {xaⁱ | i ≥ 2, x ∈ {a, b, c}}. Every string in this language begins with a, b or c, and depending on that first symbol, the last a in aⁱ is generated by A → a, B → a or C → a. Notice that for any rightmost derivation, one of these rules is the last one applied. It follows that the parser must shift in the whole input string to bring the last a onto the top of the stack, which can be done by looking one cell ahead (see above). Now, the problem is to correctly choose a rule among the three A → a, B → a and C → a to reduce the a on the stack top. The only information needed for the correct reduction is the symbol pushed in first, on top of the bottom-of-stack symbol Z0 (b in the above illustration). This symbol can be arbitrarily far from the top. (Recall that the parser is allowed to look down the stack only to a finite depth.) So the parser, reading the first symbol, needs to remember it in its finite state control and use it to choose the correct rule for the first reduction. The parser needs three states, say qa, qb, and qc, corresponding to keeping the first symbol a, b or c in its memory.


Below are the three LR(1) parsing histories showing how the parser can parse the above grammar with three states qa, qb and qc.

(q0, aaa. . .aa, Z0) ⊢ (qa, aa. . .aa, aZ0) ⊢ . . . ⊢ (qa, ε, aaa. . .aaaZ0) ⊢ (qa, ε, Aaa. . .aaaZ0) ⊢ (qa, ε, Aa. . .aaaZ0) ⊢ . . . ⊢ (qa, ε, AaaZ0) ⊢ (qa, ε, SZ0)

(q0, baa. . .aa, Z0) ⊢ (qb, aa. . .aa, bZ0) ⊢ . . . ⊢ (qb, ε, aa. . .aabZ0) ⊢ (qb, ε, Baa. . .aabZ0) ⊢ (qb, ε, Ba. . .aabZ0) ⊢ . . . ⊢ (qb, ε, BabZ0) ⊢ (qb, ε, SZ0)

(q0, caa. . .aa, Z0) ⊢ (qc, aa. . .aa, cZ0) ⊢ . . . ⊢ (qc, ε, aa. . .aacZ0) ⊢ (qc, ε, Caa. . .aacZ0) ⊢ (qc, ε, Ca. . .aacZ0) ⊢ . . . ⊢ (qc, ε, CacZ0) ⊢ (qc, ε, SZ0)
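The "remember the first symbol in the finite control" idea can be sketched as a toy recognizer (a stand-in for the parser, not the parser itself): the only unbounded part of the input is the run of a's, so a single remembered symbol determines which final reduction applies.

```python
# Toy recognizer for the grammar S -> aaA|baB|caC, A -> aA|a, B -> aB|a,
# C -> aC|a, whose language is {x a^i | i >= 2, x in {a,b,c}}. The state
# variable plays the role of the three states q_a, q_b, q_c.
def accept(w):
    if len(w) < 3 or w[0] not in "abc" or set(w[1:]) != {"a"}:
        return (False, None)
    state = "q" + w[0]   # the finite control remembers the first symbol
    final_rule = {"qa": "A -> a", "qb": "B -> a", "qc": "C -> a"}[state]
    return (True, final_rule)

print(accept("baaaa"))  # (True, 'B -> a')
print(accept("ba"))     # (False, None)
```

The returned rule name is the reduction the LR(1) parser would apply first after shifting the whole input, chosen purely from the remembered state.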


Exercises

13.1 Following steps (i) and (ii) below, construct an LL(k) parser with minimum k for each of the following grammars.

(a) S → aS | D    D → 0 | 1
(b) S → aaA    A → aAb | aaaaaaab
(c) S → BA | aaab    A → bbS | bbb    B → bBa | bbbb

(i) Choose a typical string w that is derivable by the grammar, and trace the LL(k) parsing for the input w, with a clear indication of where and why your parser needs to look k cells ahead.
(ii) Based on the analysis in step (i) above, construct an LL(k) parse table of your parser.

13.2 (a) Why can CFG G below not be an LL(k) grammar for any constant k? (b) Construct an LL(k) grammar G' whose language is L(G), and show an LL(k) parse table for G', together with the analysis that you took to build your parse table according to step (i) in problem 13.1 above.

G: S → aAb | aBb    A → aAb | c    B → aBb | d

13.3 Following steps (i) and (ii) below, construct an LR(k) parser with minimum k for each of the following grammars.

(a) S → ABC | BC | C    A → aaa    B → aa    C → a
(b) S → aSBb | a    B → aaaaab

(i) Choose a typical string w that is derivable by the grammar, and trace the LR(k) parsing for the input w, with a clear indication of where and why your parser needs to look k cells ahead.
(ii) Based on the analysis in step (i) above, construct an LR(k) parse table of your LR(k) parser.


Practical Application: Web Programming and Bioinformatics


14. Web Programming and Bioinformatics

In Section 3.2 we learned how the syntax of the Pascal programming language can be defined with a CFG. (Actually, Appendix A shows the whole definition of the Pascal programming language in terms of a syntax flow graph.) Other programming languages can also be defined formally with a grammar, except for a few cases of context dependency, such as the double declaration of a variable in a block, or the numeric labels of DO-loops in FORTRAN. This chapter first shows that the major parts of the popular Web programming language HTML (Hypertext Markup Language), and of XML (Extensible Markup Language), which is designed for e-commerce, can also be defined in terms of a CFG. Then the chapter presents a potential application of formal languages to molecular biology.

14.1 Hyper Text Markup Language (HTML) 415
14.2 Document Type Definition (DTD) and XML 418
14.3 Genetic code and grammar 425


Web Programming and Bioinformatics

14.1 Hyper Text Markup Language (HTML)

In this section, to see how our knowledge of formal languages can be applied to investigating the properties of programming languages, we first examine HTML. Below is an informal definition (itemized for convenience) of an HTML list as it would commonly appear in a text, followed by CFG rules translated from the informal definition.

(1) Char denotes a character.
(2) Text is a sequence of characters.
(3) Doc is a sequence of Elements.
(4) For a string x, we call <x> a tag, and </x> the matching tag of <x>.
(5) Element is a Text, or a Doc in-between a matching pair of tags, or a Doc with a tag at the front.
(6) ListItem is a document with the tag <LI> at the front. A ListItem is an item of a list.
(7) List is a sequence of zero or more ListItems.

(1) Char → a | A | . . . | z | Z | . .
(2) Text → Char Text | ε
(3) Doc → Element Doc | ε
(4) Tag → <Text> | </Text>
(5) Element → Text | a Doc in-between a matching pair of tags | Tag Doc
(6) ListItem → <LI>Doc
(7) List → ListItem List | ε


Notice that in the CFG rules, the bold-faced italic words denote nonterminal symbols. The blank between two nonterminals (e.g., Char Text) merely delimits the two nonterminals (it does not stand for the terminal blank symbol) and hence can be ignored. While translating, we found a context-dependent part of the language, namely "Element is a Doc in-between a matching pair of tags," which is impossible to translate into a context-free grammar rule (see rule (5) above).


According to the informal definition, a document Doc may appear between a matching pair of tags, like <x>Doc</x>, where x is a Text (i.e., a string). In the formal definition, we need a rule for Element that can generate <x>Doc</x> for an arbitrary Text x. Notice that Element → <Text>Doc</Text> does not work, because we cannot guarantee that the left-side Text and the right-side Text generate identical strings. Actually, it is not hard to see that the set of documents with matching tags has the same context dependency as the language L = {xcx | x ∈ {a, b}*}. Using the pumping lemma for context-free languages, we can prove that this language L is not context-free (see Exercise 12.4). This implies that the list part of HTML is not context-free. However, if we restrict the matching tags to a finite set of strings, say {w1, w2, . . ., wk} (e.g., <EM>Doc</EM>, <OL>Doc</OL>, etc.), the HTML list can be defined in terms of a "pure" CFG as shown below.

(1) Char → a | A | . . . | z | Z | . . .
(2) Text → Char Text | ε
(3) Doc → Element Doc | ε
(4) Tag → <Text> | </Text>
(5) Element → Text | Tag Doc | <w1>Doc</w1> | . . . . | <wk>Doc</wk>
(6) ListItem → <LI>Doc
(7) List → ListItem List | ε
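Although the matching-tag condition defeats a CFG over an unbounded tag alphabet, checking it procedurally is easy with a stack. A minimal sketch (the tag syntax is simplified to bare names):

```python
import re

# Stack-based check that every closing tag matches the most recent
# unmatched opening tag -- the "matching tag" condition discussed above.
def tags_match(doc):
    stack = []
    for m in re.finditer(r"</?([A-Za-z]+)>", doc):
        name = m.group(1)
        if m.group(0).startswith("</"):
            if not stack or stack.pop() != name:
                return False          # close with no matching open
        else:
            stack.append(name)        # remember the open tag
    return not stack                  # every open tag was closed

print(tags_match("<UL><LI>one</LI><LI>two</LI></UL>"))  # True
print(tags_match("<UL><LI>one</UL></LI>"))              # False
```

Note the contrast: with a finite set of tag names this nesting discipline is context-free (a PDA can do it), while for arbitrary tag strings the <x>. . .</x> matching is the {xcx} dependency shown above.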


14.2 DTD and XML

The major objective of HTML is to define the format of a document, and hence it is not easy to describe the semantics of the text, which is what is needed in the area of e-commerce. XML was introduced to overcome this drawback. Actually, XML is the language (i.e., the set of documents) written according to a DTD (Document Type Definition), which can be transformed into a CFG. All DTDs have the following form,

<!DOCTYPE name [
  element definitions
]>

where each element definition has the following format.

<!ELEMENT element-name (element description)>

This format can be transformed into the following CFG rules, where DTD corresponds to the start symbol S.

DTD → <!DOCTYPE name [ List-of-Element-Definitions ]>
List-of-Element-Definitions → Element-Definition List-of-Element-Definitions | Element-Definition
Element-Definition → <!ELEMENT element-name (Element-Description)>


    DTD  List-of-Element-Definitions  Element-Definition List-of-Element-Definitions | Element-Definition Element-Definition  The List-of-Element-Definitions and Element-Description are, respectively, delimited by the pairs of the matching tags and , and <element-name> and . Element description is a variation of the regular expression that is defined as follows. (1) Both (a) and (b) below are element descriptions. (a) An element-name. (b) Any text (with no tag) that is denoted by the special term #PCDATA. (2) If E1 and E2 are element descriptions, E1*, E1+, E1?, (E1, E2), and E1 | E2 are element descriptions. Recall that E1+ = E1(E1)*, E1? = E1+, E1 | E2 = E1+E2, and E1,E2 correspond to the concatenation of regular expression E1 and E2. We know that every language expressible with regular expression can also be defined in terms of a regular grammar, and every regular grammar is a CFG. Hence, Element-Description is context-free. 419


Here is a popular example of a DTD, which defines the specification of PCs.

<!DOCTYPE PcSpecs [
  <!ELEMENT PCS (PC*)>
  <!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>
  <!ELEMENT MODEL (#PCDATA)>
  <!ELEMENT PRICE (#PCDATA)>
  <!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>
  <!ELEMENT MANF (#PCDATA)>
  <!ELEMENT SPEED (#PCDATA)>
  <!ELEMENT RAM (#PCDATA)>
  <!ELEMENT DISK (HARDDISK | CD | DVD)>
  <!ELEMENT HARDDISK (MANF, MODEL, SIZE)>
  <!ELEMENT SIZE (#PCDATA)>
  <!ELEMENT CD (SPEED)>
  <!ELEMENT DVD (SPEED)>
]>


Here we show how to transform each element definition of PcSpecs into CFG rules (shown in the box).

<!ELEMENT PCS (PC*)>

PCS → <PCS>PCSS</PCS>
PCSS → PC PCSS | ε

<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>

PC → <PC>MODEL PRICE PROCESSOR RAM DISK+</PC>
DISK+ → DISKS
DISKS → DISK DISKS | DISK

<!ELEMENT MODEL (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>

MODEL → <MODEL>Text</MODEL>
PRICE → <PRICE>Text</PRICE>


<!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>
<!ELEMENT MANF (#PCDATA)>
<!ELEMENT MODEL (#PCDATA)>

PROCESSOR → <PROCESSOR>MANF MODEL SPEED</PROCESSOR>
MANF → <MANF>Text</MANF>
MODEL → <MODEL>Text</MODEL>
. . . . .

<!ELEMENT DISK (HARDDISK | CD | DVD)>

DISK → <DISK>HARDDISK</DISK> | <DISK>CD</DISK> | <DISK>DVD</DISK>
. . . . .


Here is an example of an XML document written according to the DTD PcSpecs.

<PCS>
  <PC>
    <MODEL>1234</MODEL>
    <PRICE>$3000</PRICE>
    . . . .
    <RAM>512</RAM>
    <DISK><HARDDISK>
      <MANF>Superdc</MANF>
      <MODEL>xx1000</MODEL>
      <SIZE>62Gb</SIZE>
    </HARDDISK></DISK>
    <DISK><CD><SPEED>32x</SPEED></CD></DISK>
    . . . .
  </PC>
</PCS>


Again in XML we see the same context dependency caused by the matching tags. As in the previous example, if we restrict the matching tags to those chosen from a finite set of reserved words, such as MODEL, PRICE, DISK, etc., XML turns out to be a CFL.

    Warning Signs

    Break Time

- On children's alphabet blocks: Letters may be used to construct words, phrases and sentences that may be deemed offensive. - On a microscope: Objects are smaller and less alarming than they appear. - On a revolving door: Passenger compartments for individual use only. - On work gloves: For best results, do not leave at crime scene. - On a palm sander: Not to be used to sand palms. - On a calendar: Use of term "Sunday" for reference only. No meteorological warranties express or implied. - On a blender: Not for use as an aquarium. - Dennis -


14.3 Genetic code and grammar

Quite a few linguistic terms, such as code, express, translate, edit, etc., have been adopted by biologists, especially in the field of molecular biology, ever since James Watson and Francis Crick proposed the structure of DNA (deoxyribonucleic acid) strands in 1953. This implies that in biology there are problems that we can approach with the knowledge of automata and formal languages. As an example, this section shows how it is possible to express genes in terms of a grammar. Before the example, we need a very brief introduction to the process of protein synthesis. A genome is a long double-helix DNA strand, consisting of four chemical bases: adenine, thymine, cytosine and guanine, denoted by A, T, C, and G, respectively.

[Figure: a genome is a DNA molecule in a double helix; each backbone carries a sequence of the bases A, G, T, C.]

    Gene Grammar


Each base in a double helix forms one of the four complementary pairs A-T, T-A, C-G, and G-C, as the following example shows. (Thus, given a DNA single strand, we can figure out its complementary strand.)

A C T T A A C G G C G A T
T G A A T T G C C G C T A

Proteins are synthesized by the following three steps. (1) A single-strand segment (i.e., a gene) of the genome is transcribed into an RNA. The RNA is the complementary strand of the source strand, with every base T replaced by U (uracil). Hence, RNAs consist of the four bases A, C, G and U.

DNA:  5' end  T C A A G C T  3' end
      3' end  A G T T C G A  5' end
RNA:  5' end  U C A A G C U  3' end

(1) RNA transcription
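The complement and transcription rules above can be sketched in a few lines (using the strands from the figures as examples):

```python
# Complementing a DNA strand (A<->T, C<->G) and transcribing a template
# strand into RNA (complement, then T -> U).
COMP = str.maketrans("ATCG", "TAGC")

def complement(strand):
    return strand.translate(COMP)

def transcribe(template):
    # the RNA is the complementary strand of the template, with T replaced by U
    return complement(template).replace("T", "U")

print(complement("ACTTAACGGCG"))  # TGAATTGCCGC
print(transcribe("AGTTCGA"))      # UCAAGCU
```

Note that complement(complement(s)) == s, reflecting the fact that either strand of the helix determines the other.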


(2) In eukaryotes, only the substrings useful for protein synthesis, called exons, are spliced (i.e., cut out and concatenated) to form an mRNA (messenger RNA). (3) Finally, ribosomes, large molecular assemblies traveling along the mRNA, translate the mRNA into a protein, which is an amino-acid sequence.

[Figure: the transcribed RNA consists of exons and introns; (2) splicing produces the messenger RNA; (3) translation produces the protein.]


There are 20 different amino acids. Hence, proteins can be represented as strings over an alphabet of 20 characters. The translation is done by triplets (i.e., groups of three adjacent bases), also called codons, according to the table shown below. The translation begins with the start codon ATG and continues until it meets one of the stop codons TAA, TAG, and TGA (see the table). For every codon scanned, a specific amino acid is attached to the next position of the protein being synthesized. Notice that because there are 4×4×4 = 64 different codons and 20 amino acids, several codons are translated into the same amino acid. (Following convention, we shall represent codons using the character T rather than U.)

Ala:  GCA GCG GCT GCC          Leu:  TTA TTG CTA CTG CTT CTC
Arg:  AGA AGG CGA CGG CGT CGC  Lys:  AAA AAG
Asn:  AAT AAC                  Met:  ATG (start)
Asp:  GAT GAC                  Phe:  TTT TTC
Cys:  TGT TGC                  Pro:  CCA CCG CCT CCC
Gln:  CAA CAG                  Ser:  AGT AGC TCA TCG TCT TCC
Glu:  GAA GAG                  Thr:  ACA ACG ACT ACC
Gly:  GGA GGG GGT GGC          Trp:  TGG
His:  CAT CAC                  Tyr:  TAT TAC
Ile:  ATA ATT ATC              Val:  GTA GTG GTT GTC

Stop: TAA TAG TGA
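The table-driven translation can be sketched directly (only a handful of table entries are included here for brevity; the DNA string is an invented example):

```python
# Translate a coding region: scan codons from the start codon ATG until a
# stop codon, attaching one amino acid per codon (T is used in place of U,
# following the convention above).
CODON = {
    "ATG": "Met", "GCA": "Ala", "GCG": "Ala", "TGG": "Trp",
    "AAA": "Lys", "TTT": "Phe",
    "TAA": "stop", "TAG": "stop", "TGA": "stop",
}

def translate(dna):
    start = dna.find("ATG")          # translation begins at the start codon
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON.get(dna[i:i+3], "?")
        if aa == "stop":
            return protein
        protein.append(aa)
    return protein

print(translate("CCATGGCAAAATTTTAA"))  # ['Met', 'Ala', 'Lys', 'Phe']
```

Note how the reading frame is fixed by the position of ATG: everything before it is skipped, and from there the string is consumed strictly three bases at a time.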


This three-step process for protein synthesis is called the central dogma of molecular biology, although, with so many exceptional cases, it is no longer accepted as a universally valid theory. However, to show a potential application of formal languages and automata theory to molecular biology, we shall present an example based on this dogma. As we did in Section 14.1 for HTML, we will transform an informal definition of genes into a formal definition in terms of a grammar. (This example is an excerpt, lightly edited, from David Searls' paper "The Linguistics of DNA," which appeared in the journal American Scientist, Vol. 80, 1992.)

• The transcript, which is the part of the gene that is copied into a messenger RNA, has flanking 5'- and 3'-untranslated-regions around a coding-region initiated by a start-codon.

<transcript> → <5'-untranslated-region><start-codon><coding-region><3'-untranslated-region>


• The coding-region consists of a codon followed by more coding-region, a recursion that ultimately terminates in a stop-codon. The coding-region also allows for a splice at any point in the recursion, resulting in the insertion of an intron, which is bracketed by the splicing signals GT and AG.

<coding-region> → <codon><coding-region> | <splice><coding-region> | <stop-codon>
<splice> → <intron>
<intron> → GT<intron-body>AG

• A stop-codon is any of the three triplets TAA, TAG and TGA, whereas a codon is any of the remaining 61 possible triplets, each translated into an amino acid according to the codon table.

<start-codon> → <Met>
<stop-codon> → TAA | TAG | TGA
<codon> → <Ala> | <Arg> | <Asn> | <Asp> | <Cys> | <Gln> | <Glu> | <Gly> | <His> | <Ile> | <Leu> | <Lys> | <Met> | <Phe> | <Pro> | <Ser> | <Thr> | <Trp> | <Tyr> | <Val>


• The first two bases of a codon partially specify the amino acid into which it is translated. Every codon starting with GC (CG) is translated into alanine (respectively, arginine), independent of the third base. Every codon starting with GA is translated into aspartic acid if the third base is a pyrimidine (i.e., C or T); otherwise, if the third base is a purine (i.e., A or G), it is translated into glutamic acid. Here are some formal representations of such translation rules specified in the codon table.

<Ala> → GC<base>        <Arg> → CG<base>        <Asp> → GA<pyrimidine>
<Cys> → TG<pyrimidine>  <Glu> → GA<purine>      <His> → CA<pyrimidine>

<base> → A | C | G | T    <pyrimidine> → C | T    <purine> → A | G

All the grammar rules presented up to this point are context-free. But the following biological fact requires a rule from an upper level of the Chomsky hierarchy.

• Introns can be transposed to the left or right. This calls for context-sensitive rules such as

<codon><intron> → <intron><codon>    <intron><codon> → <codon><intron>

The grammar below specifies the set of "genes" that we have developed in this section.


<transcript> → <5'-untranslated-region><start-codon><coding-region><3'-untranslated-region>
<coding-region> → <codon><coding-region> | <splice><coding-region> | <stop-codon>
---------------------------------------------------------------------------
<codon> → <Ala> | <Arg> | <Asn> | <Asp> | <Cys> | <Gln> | <Glu> | <Gly> | <His> | <Ile> | <Leu> | <Lys> | <Met> | <Phe> | <Pro> | <Ser> | <Thr> | <Trp> | <Tyr> | <Val>
<start-codon> → <Met>
<stop-codon> → TAA | TAG | TGA
----------------------------------------------------------------------------
<Ala> → GC<base>                 <Arg> → CG<base>
<Asn> → AA<pyrimidine>           <Asp> → GA<pyrimidine>
<Cys> → TG<pyrimidine>           <Gln> → CA<purine>
<Glu> → GA<purine>               <Gly> → GG<base>
<His> → CA<pyrimidine>           <Ile> → AT<pyrimidine> | ATA
<Leu> → TT<purine> | CT<base>    <Lys> → AA<purine>
<Met> → ATG                      <Phe> → TT<pyrimidine>
<Pro> → CC<base>                 <Ser> → AG<pyrimidine> | TC<base>
<Thr> → AC<base>                 <Trp> → TGG
<Tyr> → TA<pyrimidine>           <Val> → GT<base>
------------------------------------------------------------------------------
<base> → A | T | G | C    <pyrimidine> → C | T    <purine> → A | G
------------------------------------------------------------------------------
<splice> → <intron>
<intron> → GT<intron-body>AG
------------------------------------------------------------------------------
<codon><intron> → <intron><codon>    <intron><codon> → <codon><intron>


The grammar that we have just developed is incomplete, because no rules are defined for the nonterminal symbols <5'-untranslated-region>, <3'-untranslated-region> and <intron-body>. The two untranslated regions flanking the coding-region usually contain segments regulating gene expression; their contents and function are not yet completely understood. The intron-body parts generally consist of long repeating strings, whose exact role is also under investigation. We need more research before we can present a formal model for these regions.

    Love

    Break Time

What else is love but understanding and rejoicing in the fact that another person lives, acts, and experiences otherwise than we do...? - Friedrich Nietzsche - Just because you love someone doesn't mean you have to be involved with them. Love is not a bandage to cover wounds. - Hugh Elliot -


    Hierarchy of the Models


Review: Chomsky Hierarchy

Languages (grammars)                     Machines
Recursively Enumerable Sets (type 0)     Turing Machines (TM)
Context-sensitive Languages (type 1)     Linear-bounded Automata (LBA)
Context-free Languages (type 2)          Pushdown Automata (PDA)
Regular Languages (type 3)               Finite State Automata (FA), Regular Expression

Other models: Post Systems, Markov Algorithms, μ-recursive Functions

(In the original figure, downward arrows indicate containment between language classes and horizontal arrows indicate characterization of a language class by a machine class; colored arrows indicate the relations already proven.)

    15. Hierarchy of the Models: Final Proof In Chapters 7 and 12, respectively, we have partially proved the characterization relations and the containment relations in the hierarchy of languages and automata (i.e., Chomsky hierarchy, repeated in the preceding slide, where the colored arrows indicate proven parts). This chapter introduces two additional interesting classes of languages into the hierarchy. One is the class of languages that cannot be recognized by any TM, and the other is the class of languages that can be recognized by a TM that eventually halts. (These languages are called recursive languages.) Then the chapter completes the proof of the hierarchy including these new classes of languages. The proofs are quite challenging for undergraduate students. However, it is worth the challenge because the techniques involve elegant logic that can be followed with the knowledge that we have learned in Chapter 1.

15.1 Languages that TM's cannot recognize 437
     Enumeration
     Enumerating TM's
15.2 Universal TM 448
15.3 Enumerable Languages 450
15.4 Recursive Languages 459
15.5 Proof of characterizations 471
     Type 0 grammar and TM
     CSG and LBA
     CFG and PDA
Rumination 506
Exercises 507


    Hierarchy

15.1 Languages that TM's cannot recognize

At the top level of the Chomsky hierarchy are the type 0 languages. In this chapter we will prove that a language is type 0 if and only if it is recognizable by a TM. Before we prove this characterization, we may ask the following question: is there a language that cannot be recognized by any TM? The answer is positive, and we will prove this first. A popular approach that we will use in this chapter is the diagonalization technique introduced in Chapter 1. This technique requires an enumeration (i.e., a listing out) of all possible target objects, such as certain types of grammars, automata, or strings. Before using the diagonalization technique for the proof, we need to formally define the term enumeration.

Definition 15.1 (Enumeration). For a given set S, to enumerate S is to list out every element of S on a line. We call a TM that is capable of enumerating a set an enumerator. If a set is enumerable, then we call it an enumerable set.

For example, the set of positive integers and the set of character strings over a finite alphabet are enumerable, because we can build a TM that lists all the elements of these sets on its tape (see the illustration below). Notice that because these sets are infinite, the enumerator never stops.

    Non-type 0 Languages


The figure below shows how we can build a TM which enumerates the set of positive integers in binary. (The set Σ*, for a finite alphabet Σ, is also enumerable, because we can build a TM which enumerates the set in lexicographic order.)

start: write 1 on the blank tape, (B, 1, R); then repeat forever:
  • write a copy of the rightmost number in the list to its right, separated by two cells;
  • increment the copy by one.
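The two enumerations just mentioned can be sketched as (never-terminating) generators; the code below slices off only a finite prefix of each infinite list:

```python
from itertools import count, islice, product

# The positive integers in binary, in the order the TM above lists them.
def binary_enumeration():
    for n in count(1):
        yield bin(n)[2:]

# Sigma* in lexicographic (shorter-strings-first) order, here for {a, b}.
def lex_enumeration(alphabet):
    length = 0
    while True:                      # never stops: the set is infinite
        for tup in product(alphabet, repeat=length):
            yield "".join(tup)
        length += 1

print(list(islice(binary_enumeration(), 6)))   # ['1', '10', '11', '100', '101', '110']
print(list(islice(lex_enumeration("ab"), 7)))  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Both orders visit every element at a finite position, which is exactly what makes the sets countable.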

Enumerable sets are also called countable, because all the elements of an enumerable set can be counted, i.e., they can be put into one-to-one correspondence with the set of positive integers. There are uncountable sets; the following theorem shows an example.


Theorem 15.1. For a given finite alphabet Σ, the set of all languages over Σ (i.e., the set of all subsets of Σ*) is not enumerable.

Proof. We use the diagonalization technique. Suppose that all the languages over Σ are enumerable, i.e., we can list them out along an infinite line. Let Li be the i-th language in the list, and let wi be the i-th string in an enumeration of Σ*. As illustrated below, consider a matrix M, where M[i, j] = 1 if wi is in language Lj, and otherwise M[i, j] = 0. (Notice that this is a conceptual setting; we don't actually have to construct the whole matrix.)

M       L1  L2  L3  L4  L5  . . .
  w1     0   1   0   1   1
  w2     1   1   0   0   1
  w3     1   0   1   1   0
  w4     0   1   1   0   0
  . .


Now, with the diagonal entries we construct a language LD defined as follows.

LD = {wi | M[i, i] = 0}

We claim that LD is a language that does not appear in the list. Suppose that the i-th language Li = LD. If wi ∈ Li, then M[i, i] = 1 and hence wi ∉ LD. Otherwise, if wi ∉ Li, then wi ∈ LD. Either way, Li and LD differ on the string wi, so it must be that Li ≠ LD, a contradiction. Notice that this proof holds even for a singleton alphabet, for example, Σ = {a}.
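A toy computation makes the diagonal trick concrete (the first four matrix rows are those shown above; the fifth row is invented to round out the example):

```python
# The diagonal language L_D differs from every listed language L_j at
# the string w_j, so it cannot appear anywhere in the list.
M = [
    [0, 1, 0, 1, 1],   # membership of w1 in L1, ..., L5
    [1, 1, 0, 0, 1],   # w2
    [1, 0, 1, 1, 0],   # w3
    [0, 1, 1, 0, 0],   # w4
    [1, 1, 0, 0, 1],   # w5 (invented for the example)
]

# w_i belongs to L_D exactly when M[i][i] == 0: flip the diagonal
diag = [1 - M[i][i] for i in range(len(M))]
print(diag)  # [1, 0, 0, 1, 0] -- the membership row of L_D

# L_D disagrees with every listed L_j on w_j
assert all(diag[j] != M[j][j] for j in range(len(M)))
```

Of course the real argument runs over an infinite matrix; the point of the finite slice is only that the flipped diagonal disagrees with row j at column j, for every j.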


Enumerating all TM's

Now we go back to the question of whether there is a language that cannot be recognized by any TM. We learned that for a given alphabet Σ, the set of languages over Σ is not enumerable. If we can show that the set of all TM's is enumerable, then we have proved that there is a language that cannot be recognized by any TM, because for every TM there is only one language that the machine recognizes. Let M = (Q, Σ, Γ, δ, q0, F) be a TM, where Q, Γ, and Σ ⊆ Γ are, respectively, the set of states, the tape alphabet, and the input alphabet, all finite. We assume that M has only one accepting state, i.e., |F| = 1, and that Σ = {a, b}. (Once in an accepting state, a TM does not have to move. Hence, all accepting states can be merged into a single state without affecting the language accepted by a TM; see the rumination section at the end of Chapter 4.) We could use other alphabet sizes, but two symbols are enough for our proof. Notice that to list all TM's accepting languages over Σ*, we must consider all possible finite sets of states and tape alphabets. Here, finiteness implies that the size is fixed for a given TM, but there is no upper bound. Thus, there are infinitely many TM's, each recognizing a language over {a, b}*. If they are enumerable, how can we list them with an enumerator?


With no loss of generality, assume that every TM uses a finite number of state names and tape symbols, chosen, respectively, from the left end of the following lists (pools) of state names and tape symbols. (Notice that both lists are infinitely long.) As usual, we will use R, L, N for the direction of the tape head move.

pool of state names: {q0, q1, q2, . . . .}
pool of tape symbols: {t0, t1, t2, . . . . .}
direction of a move: {R, L, N}

We further assume that q0 is the start state, q1 is the accepting state, t0 is the blank symbol (B), t1 = a, and t2 = b. (Recall from the ruminations at the end of Chapters 4 and 5 that every TM can accept its language with one accepting state.) Finally, we choose a format for describing the state transition function δ. For example, we may list δ as follows.

δ(q0, a) = (q2, b, R); δ(q2, a) = (q5, a, L); δ(q0, b) = (q3, b, R); . . . . .


For simplicity, let's delete the parentheses and commas from the list as follows.

The transition function in conventional notation:
δ(q0, t1) = (q2, t2, R); δ(q2, t2) = (q5, t1, L); δ(q0, t2) = (q3, t3, R); . . . . .

The transition function in abbreviated notation:
q0t1=q2t2R;q2t2=q5t1L;q0t2=q3t3R; . . . . .

In this shortened list, every transition consists of six symbols and is delimited by a semicolon. Though poorly legible, it carries the same information as the original list. We are going to represent a TM's transition function δ in this format. Recall that once all the symbols in the set of states Q, the input alphabet Σ, the tape alphabet Γ, the start state q0, and the accepting state in F are defined, we can uniquely represent a TM by the transition function δ alone (or by the state transition graph). We will use this list for the convenience of converting it into a binary number in the next step.


Let E be an encoding function (i.e., a homomorphism) such that, given a symbol used in the transition function of a TM, E returns a binary code as follows, where 1^k denotes k consecutive 1's.

E(;) = 010, E(=) = 01^2 0, E(L) = 01^3 0, E(R) = 01^4 0, E(N) = 01^5 0,
E(qi) = 01^(6+2i) 0, E(ti) = 01^(7+2i) 0, for i ≥ 0.

Notice that every binary code is uniquely identified by the number of 1's between the two 0's. State codes have an even number of 1's and tape symbol codes an odd number. Here are some examples of the binary encoding according to the above encoding function E.

q0t1=q2t2R;  →  01^6 0 01^9 0 01^2 0 01^10 0 01^11 0 01^4 0 010
q2t2=q5t1L;  →  01^10 0 01^11 0 01^2 0 01^16 0 01^9 0 01^3 0 010
q0t2=q3t3R;  →  01^6 0 01^11 0 01^2 0 01^12 0 01^13 0 01^4 0 010

We extend the function E such that, given a TM M, E(M) gives the binary encoding of M's transition function δ according to the above definition.
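The encoding E is concrete enough to express in code. The following sketch is mine, not the book's; it builds each code 01^k 0 directly and assembles one encoded transition from the state and tape-symbol indices.

```python
# A sketch of the encoding homomorphism E (function and variable names are
# mine, not the book's). Each code is 0 1^k 0 for a suitable k.

def code(ones: int) -> str:
    """The binary code with `ones` consecutive 1's between two 0's."""
    return "0" + "1" * ones + "0"

FIXED = {";": 1, "=": 2, "L": 3, "R": 4, "N": 5}

def E_state(i: int) -> str:
    """E(q_i) = 0 1^(6+2i) 0 -- an even number of 1's."""
    return code(6 + 2 * i)

def E_tape(i: int) -> str:
    """E(t_i) = 0 1^(7+2i) 0 -- an odd number of 1's."""
    return code(7 + 2 * i)

def E_transition(qi: int, ti: int, qj: int, tj: int, d: str) -> str:
    """Encode one abbreviated transition  q_i t_i = q_j t_j D ;"""
    return (E_state(qi) + E_tape(ti) + code(FIXED["="]) +
            E_state(qj) + E_tape(tj) + code(FIXED[d]) + code(FIXED[";"]))
```

For example, `E_transition(0, 1, 2, 2, "R")` yields exactly the encoding of q0t1=q2t2R; shown in the text.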


It is possible to construct a TM MD that, given a binary string w, decides whether w is the binary encoding E(M) of some TM M. MD simply checks whether every substring in w delimited by 010 (the code for ;) is the binary encoding of a transition pa=qbX, for some states p and q, some tape symbols a and b, and X ∈ {R, L, N}, each belonging to the sets we have defined. Now, we are ready to construct an enumerator TE (i.e., a TM) that lists all TM's on its tape: M0, M1, M2, . . . To do this, our enumerator TE generates the binary numbers on the tape in the order of their length and value (0, 1, 00, 01, 10, 11, . . .), one by one, checking whether the current binary number written on the tape is the binary encoding of the transition function of a TM. If it is, the enumerator generates the next binary number to its right. Otherwise it erases the number before the next number is written. The enumerator repeats this process forever. Since for every TM M the length of E(M) is finite, E(M) will eventually appear on the list.
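The check performed by MD can be sketched with a regular expression, since the set of valid encodings is in fact regular. This illustration is my own, not the book's construction; the run lengths encode the symbols as defined above.

```python
# A sketch of MD's check (mine, not the book's): a string is a valid encoding
# iff it is a concatenation of encoded transitions. State codes are even-length
# runs of 1's (at least 6), tape-symbol codes odd-length runs (at least 7).
import re

TRANSITION = (
    "0(?:11){3,}0"    # E(q_i): even run of 1's, length >= 6
    "01(?:11){3,}0"   # E(t_i): odd run of 1's, length >= 7
    "0110"            # E(=)
    "0(?:11){3,}0"    # E(q_j)
    "01(?:11){3,}0"   # E(t_j)
    "01{3,5}0"        # E(L), E(R) or E(N)
    "010"             # E(;)
)

def is_encoding(w: str) -> bool:
    """Decide, like MD, whether w encodes the transition function of some TM."""
    return re.fullmatch(f"(?:{TRANSITION})+", w) is not None
```

The enumerator TE then simply walks the binary strings in canonical order, keeping those for which `is_encoding` holds.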


Now, we can claim the following theorem.

Theorem 15.2 The set of all TM's, each encoded by the function E, is enumerable.

The binary encoding function E is a one-to-one function, i.e., no two TM's are encoded to the same binary number. For each TM M, there is only one language accepted by M. Thus, the fact that all TM's are enumerable implies that all TM languages are enumerable. (Since a language can be recognized by more than one TM, a language may appear in the enumeration multiple times.)

Corollary. The set of all type 0 languages over a given finite alphabet Σ is enumerable.

Type 0 languages are also called recursively enumerable (R.E.) languages. By Theorem 15.1, we know that for a given finite alphabet Σ, it is impossible to enumerate all languages in Σ*. With this fact and the above Corollary, we can claim the following theorem.

Theorem 15.3 Let Σ be a finite alphabet. There is a language in Σ* that cannot be recognized by any TM.


Notice that this theorem also holds for a singleton alphabet, for example Σ = {a}. The same binary encoding technique that we have introduced for TM's is also applicable to grammars and other automata. Thus, we can also claim the following.

Theorem 15.4 The set of all binary encoded CSG's is enumerable.

    Life

    Break Time

- We make a living by what we get, we make a life by what we give. - Sir Winston Churchill -
- Life is something that happens when you can't get to sleep. - Fran Lebowitz -
- Life is something that everyone should try at least once. - Henry J. Tillman -
- Life is pleasant. Death is peaceful. It's the transition that's troublesome. - Isaac Asimov -
- Life is what happens to you while you're busy making other plans. - John Lennon -
- The secret of a good life is to have the right loyalties and hold them in the right scale of values. - Norman Thomas -


15.2 Universal TM

Definition 15.2 (Universal TM). We call a TM Tu universal if Tu, given an arbitrary TM M and an input x, simulates the computation of M on the input x.

Here is an idea of how to construct a universal TM Tu. The input M and x given to Tu must be represented in an agreed format. Suppose that they are binary encoded as E(M) and E(x), using the same function E that we developed in the previous section. Tu records E(M), E(x), the current state E(qi) of M, and the head position, as the following figure illustrates. Using this information on its tape, Tu follows M's moves one by one. If M enters an accepting state, so does Tu. If M meets an undefined transition, Tu terminates the simulation.

(Figure: the tape of Tu holds E(M), E(x), the current state E(qi) of M, and M's head position.)

A TM M recognizes only its language L(M). Thus M is a special-purpose computer built to recognize L(M). In contrast, the universal TM is a general-purpose computer in the sense that, given E(M) and E(x), the machine gives the result of the computation of M on input x. Here E(M) and E(x) correspond, respectively, to a program and an input represented in an agreed form, i.e., a programming language.
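The simulation loop of Tu is the familiar interpreter loop. The sketch below is my own simplification: instead of a binary-encoded E(M), the transition function δ is passed as a Python dictionary, but the bookkeeping (current state, head position, tape contents) is exactly what Tu records on its tape.

```python
# A sketch of the universal machine's loop (names and format are mine).
# delta maps (state, symbol) -> (new state, written symbol, move).

def run_tm(delta, x, start="q0", accept="q1", blank="B", max_steps=10_000):
    """Simulate the TM with transition function delta on input x, like Tu."""
    tape = dict(enumerate(x))            # sparse tape; blank everywhere else
    state, head = start, 0
    for _ in range(max_steps):
        if state == accept:
            return True                  # M entered an accepting state: so does Tu
        key = (state, tape.get(head, blank))
        if key not in delta:
            return False                 # undefined transition: M halts without accepting
        state, symbol, move = delta[key]
        tape[head] = symbol
        head += {"R": 1, "L": -1, "N": 0}[move]
    return None                          # step budget exhausted; a real Tu keeps simulating

# Example: a two-state TM accepting the strings in {a}* of even length.
delta = {
    ("q0", "a"): ("q2", "a", "R"),
    ("q2", "a"): ("q0", "a", "R"),
    ("q0", "B"): ("q1", "B", "N"),       # an even number of a's seen: accept
}
```

The `max_steps` cutoff is only a convenience for experimenting; the real Tu has no such bound and may run forever, just like M.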

Science & Research

Break Time

- The whole of science is nothing more than a refinement of everyday thinking. - Albert Einstein -
- Research is of mind, not of hands, a concentration of thought and not a process of experimentation. Research is the effort of the mind to comprehend relationships which no one had previously known. - DeForest Arnold -


15.3 Enumerable Languages

Now, we turn our attention from the topic of enumerating sets of automata and languages to enumerating a language itself. We first prove the following theorem, which claims that enumerability is another characterization of type 0 languages.

Theorem 15.5 (1) If a language L is enumerable, then L is a type 0 (R.E.) language. (2) If a language L is type 0 (R.E.), then L is enumerable.

Proof (1). Let Me be a TM (i.e., an enumerator) that enumerates L. With Me we can build a 2-tape TM M recognizing L as follows. Given an input x, using Me in its finite state control, M writes the next string wi in the enumeration and checks whether x = wi. If it does, M enters an accepting state and halts. Otherwise, it writes the next string wi+1 in the enumeration and repeats the process. M accepts x if and only if x ∈ L.

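The construction in Proof (1) is mechanical enough to sketch in code (a sketch of mine, with the enumerator modeled as a Python generator). Faithfully to the construction, the resulting recognizer only semi-decides L: on an input outside an infinite language, it never returns.

```python
# Sketch of Proof (1) (names are mine): turn an enumerator for L into a
# recognizer M for L by comparing the input against each enumerated string.

def recognizer_from_enumerator(enumerate_L):
    """Return a (semi-)decision procedure for L, given an enumerator of L."""
    def accepts(x):
        for w in enumerate_L():   # write the next string w_i of the enumeration
            if x == w:            # check whether x = w_i
                return True       # enter an accepting state and halt
        return False              # reachable only if the enumeration is finite
    return accepts

# Example: an enumerator for the infinite language { a^(2n) | n >= 0 }.
def even_as():
    n = 0
    while True:
        yield "a" * (2 * n)
        n += 1
```

`recognizer_from_enumerator(even_as)` answers True on "aaaa"; on "aaa" it would loop forever, exactly as the constructed TM M does.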


Proof (2). This part of the proof is not as simple as that of part (1). Let M be a TM recognizing L. We build an enumerator Me for L with M. Let Σ be the alphabet of language L. We may think of the following straightforward idea. We let Me, while writing the strings wi ∈ Σ* in the lexicographic order on one of its tape tracks, check whether wi is accepted by M. If it is accepted, we let it be the next string in the enumeration. Otherwise, we erase it and repeat the process with the next string in Σ*. There is a serious problem with this idea. What will happen to Me if M enters an infinite loop without accepting wi? We cannot simply let Me leave wi out of the enumeration, because it is impossible for Me to decide whether M will eventually halt or not (this is an unsolvable decision problem). Thus, the idea fails.


So we need a different approach. As before, while enumerating the strings in Σ* one by one in the lexicographic order, we let Me progressively simulate the computation of M on inputs w1, w2, . . ., as follows. Writing w1, Me simulates the first move of M on w1. Writing w2, it simulates the first move of M on w2, followed by the second move on w1. Writing w3, it simulates the first move of M on w3, followed by the second move on w2, followed by the third move on w1, and so on. Any string accepted during this procedure is moved onto the lower track (not shown), and we let it be the next string in the enumeration.
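This dovetailing schedule can be sketched as a generator. The modeling is my own: `computation(w)` stands for the step-by-step run of M on w, yielding False after each non-accepting move and True once M accepts; on some strings it may yield False forever.

```python
# Sketch of the dovetailed simulation (names are mine). At round t the
# machine starts w_t and advances every active computation by one move,
# so a non-halting computation cannot block the enumeration.
from itertools import count, islice

def dovetail(strings, computation):
    """Yield every w whose computation eventually accepts."""
    active = []                            # computations still in progress
    source = iter(strings)
    for _ in count():
        w = next(source)                   # write the next string w_t
        active.append((w, computation(w)))
        for pair in list(active):          # one more move of M on each string
            s, steps = pair
            if next(steps):                # True once M has accepted s
                active.remove(pair)
                yield s                    # s becomes the next enumerated string

# Toy stand-in for M: accept a^n after n moves when n is even, run forever
# when n is odd.
def computation(w):
    if len(w) % 2 == 0:
        for _ in range(len(w)):
            yield False
        while True:
            yield True
    else:
        while True:
            yield False
```

`list(islice(dovetail(("a" * n for n in count()), computation), 3))` produces "", "aa", "aaaa" in order of discovery, even though the simulated machine never halts on the odd-length strings.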


According to Church's hypothesis, no computational model exists that is more powerful than Turing machines. In other words, any computational model computes at best what Turing machines can compute. Then we may ask the following: Is there any limit on what Turing machines can compute? We have seen a proof that for a given alphabet Σ, there is a language in Σ* that cannot be recognized by any TM. That proof is non-constructive. The proof of the following theorem constructively exhibits a language that cannot be recognized by any TM.

Theorem 15.6 Let Σ be an arbitrary finite alphabet. There is a language in Σ* that cannot be recognized by any TM.

Proof. We know that the set Σ* and the set of all TM's are enumerable (Theorem 15.2). With these facts, we will prove the theorem using the diagonalization technique. We assume that every TM M and every string x ∈ Σ* are, respectively, expressed by the binary encodings E(M) and E(x) that we introduced in the previous section.


Proof (cont'd). We construct a matrix M such that for every i ≥ 0, the i-th row corresponds to the i-th string wi, and the i-th column corresponds to the i-th TM Mi in the enumeration. This is possible because the set of TM's and the set Σ* are both enumerable. For the matrix entries, we let M[i, j] = 1 if Mj accepts string wi; otherwise, we let M[i, j] = 0. Now, we claim that no TM recognizes the language L0 below, which contains the string wi if the i-th TM Mi in the enumeration does not accept the i-th string wi.

          M0  M1  M2  M3  . . .
    w0     0   1   1   1   0
    w1     1   1   0   0   1
    w2     1   0   0   0   0
    w3     0   0   1   1   1
    w4     1   1   1   0   0
    . . .

L0 = { wi | M[i, i] = 0, i ≥ 0 }


Suppose that there is a TM, say Mi in the enumeration, that recognizes L0. (We are using the proof-by-contradiction technique.) Consider what will happen if we give the i-th string wi as an input to Mi. (Notice that Mi and wi have the same subscript.) Either wi ∈ L0 or wi ∉ L0. We examine these two cases.

(1) When wi ∈ L0. By definition of the matrix, this case implies that M[i, i] = 0, i.e., Mi does not accept wi. It follows that wi ∉ L(Mi) = L0. We have a contradiction.

(2) When wi ∉ L0. By definition, this case implies that M[i, i] = 1. That is, Mi accepts wi. It follows that wi ∈ L(Mi) = L0. Again, we have a contradiction.

So we must give up the assumption that L0 = L(Mi) for some i ≥ 0. No TM exists that recognizes L0. Language L0 is not type 0 (R.E.). For the matrix above,

L0 = { wi | M[i, i] = 0, i ≥ 0 } = { w0, w2, w4, . . . }

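The heart of the diagonal argument is finitary and can be checked on a toy matrix. This illustration is mine, not part of the proof: whatever 0/1 matrix is chosen, the diagonal-complement set differs from every column.

```python
# A finite toy of the diagonal argument (mine). Rows stand for strings
# w_0..w_3, columns for machines M_0..M_3; entry M[i][j] = 1 iff M_j
# accepts w_i. L0 takes row i exactly when the diagonal entry is 0.

def diagonal_language(M):
    """L0 = { i | M[i][i] == 0 }, as a set of row indices."""
    return {i for i in range(len(M)) if M[i][i] == 0}

def column_language(M, j):
    """L(M_j) = { i | M[i][j] == 1 }, as a set of row indices."""
    return {i for i in range(len(M)) if M[i][j] == 1}

# The upper-left corner of the matrix in the text.
M = [[0, 1, 1, 1],
     [1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 1]]
```

Here `diagonal_language(M)` is {0, 2}, and it differs from `column_language(M, j)` for every j: the two always disagree on the string w_j itself, which is exactly the contradiction used above.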


Now, consider L1 = { wi | M[i, i] = 1, i ≥ 0 }, which is the complement of L0. For the matrix above, L1 = { w1, w3, . . . }. We will show that, in contrast to L0, the language L1 is R.E., i.e., type 0.

Theorem 15.7 L1 is a type 0 (R.E.) language.


Proof. We construct a TM MD that recognizes L1 as follows. Suppose an input x is given to MD. MD is equipped with two enumerators, one enumerating the set of all strings in Σ* and the other enumerating the set of all TM's. MD alternately enumerates M0, w0, M1, w1, . . . on its two extra tapes until the TM Mi and the string wi equal to x appear. Then MD simulates the computation of Mi on input wi. If Mi accepts x, so does MD, and it halts. (Notice that Mi may run forever without accepting; then so does MD.) It follows that MD is a TM which recognizes L1. Language L1 is type 0 (R.E.).


Since L0 is the complement of L1, we have the following corollary.

    Corollary. There is a type 0 language whose complement is not type 0.

    Good life

    Break Time

When you were born, you were crying and everyone around you was smiling. Live your life so that when you die, you're smiling and everyone around you crying, with tears of joy for having known you. - Anonymous -


15.4 Recursive Language

The Turing machine model that we have defined halts if the input is accepted by the machine. The definition does not explicitly say what the machine should do if the input is not accepted. The machine may either enter a dead state (if an undefined input symbol is read in a state) or never halt, going around an indefinite loop of non-accepting states. (Recall that, for convenience, we do not show transitions entering the dead state.) How about defining a TM model which, after some finite number of moves, always halts in either an accepting state or a non-accepting state, as the following figure illustrates? Such TM's and their languages are called recursive.

(Figure: a recursive TM always halts, entering either an accept-and-halt or a reject-and-halt state.)


This model could be more user-friendly than the original one. We may let it output a kind message before halting, like "Yes, I accept the input!" or "No, I regret that I have to reject it." For the original TM model, it is impossible to give any specific message when it does not accept the input, because there is no way to know whether it will enter an infinite loop or not. Interestingly, we will show that the recursive (i.e., halting) TM model is weaker than the original model in the sense that there is a type 0 language that cannot be recognized by any recursive TM. Before we show the proof, we define the following.

Definition 15.3 (Recursive TM, Recursive Language). A recursive TM is a TM that eventually (i.e., after a finite number of moves) halts in either an accepting or a non-accepting state. The language accepted by such a TM is called a recursive language.

In contrast to type 0 languages, the following theorem claims that the complement of a recursive language is also recursive.


    Theorem 15.8 The complement of a recursive language is also recursive.

Proof. Let L be a recursive language accepted by a recursive TM M. We simply change the accepting states of M to non-accepting states, and vice versa. This modified TM accepts the complement of L.

Let Type-Rec be the class of recursive languages. We know that every recursive language is type 0. According to the Corollary of Theorem 15.7, there is a type 0 language whose complement is not type 0. Thus, Theorems 15.7 and 15.8 imply the following theorem.

Theorem 15.9 Type-Rec is properly contained in Type-0L.
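Modeling a recursive TM as a total boolean function (an abstraction of mine; it is legitimate only because a recursive TM always halts), the state-swapping argument of Theorem 15.8 is one line:

```python
# Sketch of Theorem 15.8 (mine): swapping accepting and non-accepting
# states of a decider yields a decider for the complement language.

def complement(decide):
    """Given a decider for L, return a decider for the complement of L."""
    return lambda x: not decide(x)

# Example: L = strings over {a, b} of even length.
in_L = lambda x: len(x) % 2 == 0
in_complement = complement(in_L)
```

Notice why the same trick fails for arbitrary TM's: when M loops forever on x, the "swapped" machine loops as well and accepts nothing on x, so halting on every input is essential.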

(Figure: Type-Rec drawn properly inside Type-0L.)


Now that the top level (i.e., type 0) of the Chomsky hierarchy has been refined, we move down to the next lower level (i.e., type 1). We will first prove Theorem 15.10 below, which claims that every context-sensitive language is recursive. Then, using this theorem and Theorem 15.9, we will show that Type-1L ⊂ Type-0L.

Theorem 15.10 Every context-sensitive language is recursive.

Proof. Let G = (VT, VN, P, S) be an arbitrary context-sensitive grammar. We construct a recursive TM M that recognizes L(G) as follows. M has a two-track tape, with G stored in its finite state control and the input x given on the top track. The idea is to let M, using G, nondeterministically generate a terminal string on its second track and see whether it is equal to x. If it is, M accepts x and halts. Otherwise, M rejects it and halts. (Figure: M checks whether S ⇒ . . . ⇒ x on the second track.)


On its lower track, M writes a derivation sequence, i.e., a sequence of sentential forms w0, w1, w2, . . . that can be derived starting with the start symbol S of G. In the list, w0 = S, and w1 is a sentential form that can be generated by applying a rule of S; string w2 is generated by applying a rule whose left side is a substring of w1, and so on. If wi contains more than one substring corresponding to the left side of some rule of G, then M nondeterministically chooses one of the substrings to apply its rule. If the chosen substring has more than one rule (for example, bA → Ab | ba), M nondeterministically chooses one of those rules. M repeats this procedure until it sees that one of the following conditions is satisfied.

(1) wi = x,
(2) wi = wj, for some j < i,
(3) (|wi| > |x|) OR (wi ∈ (VT)+ AND wi ≠ x)


If condition (1) is satisfied, M halts in an accepting state. Case (2) implies that the derivation got into a loop, so M halts in a non-accepting state. Case (3) implies that, by the non-contracting property of context-sensitive grammar rules, it is impossible to derive the string x; in this case, too, M halts in a non-accepting state. M is a recursive (i.e., halting) TM that recognizes L(G). It follows that every context-sensitive language is recursive.
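The nondeterministic search above can be carried out deterministically as a breadth-first search over sentential forms; the non-contracting property is what makes it terminate. The sketch below is mine: the grammar format (a dict from left side to its alternatives) and the names are assumptions, but the three stopping conditions mirror the proof.

```python
# Sketch of the decision procedure of Theorem 15.10 (mine). Because CSG rules
# are non-contracting, a BFS over sentential forms of bounded length halts
# and decides membership.
from collections import deque

def csg_accepts(rules, start, x, terminals):
    """Decide whether the non-contracting grammar derives the string x."""
    seen, queue = {start}, deque([start])
    while queue:
        w = queue.popleft()
        if w == x:
            return True                           # condition (1): derived x
        if len(w) > len(x) or all(c in terminals for c in w):
            continue                              # condition (3): x no longer derivable
        for lhs, alternatives in rules.items():   # apply one rule at one position
            i = w.find(lhs)
            while i != -1:
                for rhs in alternatives:
                    v = w[:i] + rhs + w[i + len(lhs):]
                    if v not in seen:             # condition (2): skip repeated forms
                        seen.add(v)
                        queue.append(v)
                i = w.find(lhs, i + 1)
    return False

# Example: a standard CSG for { a^n b^n c^n | n >= 1 }.
rules = {
    "S": ["aSBC", "aBC"],
    "CB": ["BC"],
    "aB": ["ab"],
    "bB": ["bb"],
    "bC": ["bc"],
    "cC": ["cc"],
}
```

The `seen` set plays the role of the loop check (2), and the length bound plays the role of the non-contracting argument in (3).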

    Now, we may ask the following question: Is the converse of Theorem 15.10 true? (That is, is every recursive language context-sensitive?) In terms of automata, this is equivalent to the question of whether every language recognized by a recursive TM can also be recognized by an LBA. The next theorem shows that there is a recursive language that is not context-sensitive.


Theorem 15.11 There is a recursive language that is not context-sensitive.

Proof. Again, we will use the diagonalization technique. We know that the set of all CSG's with terminal alphabet VT = {a, b} is enumerable (see Theorem 15.4). As illustrated below, construct a matrix M such that the i-th column corresponds to the i-th grammar Gi in the enumeration, and the i-th row corresponds to the i-th string wi ∈ {a, b}* in the lexicographic order. For the entries of the matrix, we let M[i, j] = 1 if wi ∈ L(Gj); otherwise, M[i, j] = 0. We have L(Gi) = { wj | M[j, i] = 1, j ≥ 0 }.

          G0  G1  G2  G3  . . .
    w0     0   1   1   1   0
    w1     1   1   0   0   1
    w2     1   0   0   0   0
    w3     0   0   1   1   1
    w4     1   1   1   0   0
    . . .


Now, going along the diagonal of the matrix M, we pick all wi ∉ L(Gi) and construct the language

L0 = { wi | M[i, i] = 0, i ≥ 0 } = { w0, w2, w4, . . . }

We first prove that L0 is not context-sensitive. This proof is very similar to the one of Theorem 15.3, where we showed a language that cannot be recognized by any TM. Suppose that L0 is the language of a CSG Gi, i.e., L0 = L(Gi). Consider the string wi. (Notice that wi and Gi have the same subscript.) Either wi ∈ L0 or wi ∉ L0. We examine each of these cases.


(1) Case of wi ∈ L0: By the definition of L0, we have M[i, i] = 0, which implies that wi ∉ L(Gi) = L0. We are in a contradiction.
(2) Case of wi ∉ L0: Again by the definition of L0, we have M[i, i] = 1, which implies wi ∈ L(Gi) = L0. Again, we are in a contradiction.
So we must give up the supposition that L0 = L(Gi). Language L0 is not context-sensitive.

Now, we will show that L0 is recursive. The proof is similar to the proof of Theorem 15.7. We construct a recursive TM M0 which recognizes L0 as follows. Suppose that a string x is given on the input tape of M0.


The machine M0 has two enumerators, one for the set of context-sensitive grammars and another for the set Σ*. Using these two enumerators, M0 keeps alternately generating grammars and strings on two tapes until a string wi appears that is equal to x. Then M0 tests whether Gi can derive x, using the idea presented in the proof of Theorem 15.10. If this test shows that Gi can derive x, M0 rejects the input x and halts. Otherwise, M0 accepts x and halts. It follows that M0 is a recursive TM that recognizes L0. Language L0 is recursive.


In this chapter we have introduced two additional classes of languages: the class of recursive languages and the class of languages that are not type 0, and we have identified their levels in the Chomsky hierarchy. The latter is the class of languages that cannot be recognized by any TM, and the former is the class of languages recognized by recursive TM's. The next page shows the Chomsky hierarchy, including the class of recursive languages. (The class of non-type 0 languages is not shown.) In the figure, the red arrows show the characterization relations (↔) and the proper containment relations (⊂) that we have proved so far.

    Funny ads

    Break Time

    - Dinner Special: Turkey $2.35; Chicken or Beef $2.25; Children $2.00. - For Sale: Antique desk suitable for lady with thick legs and large drawers. - Now is your chance to have your ears pierced and get an extra pair to take home too! - No matter what your topcoat is made of, this miracle spray will make it really repellent. - Dog for Sale: Eats anything and is fond of children. - Tired of cleaning yourself? Let me do it! - Jason -


Chomsky Hierarchy

Languages                                Automata
Recursively Enumerable Sets (type 0)     Turing Machines
Recursive Languages                      Recursive Turing Machines
Context-sensitive Languages (type 1)     Linear-bounded Automata
Context-free Languages (type 2)          Pushdown Automata
Regular Languages (type 3)               Finite State Automata, Regular Expressions

⊂ : Containment    ↔ : Characterization    (* Colored arrows indicate proven relations.)


15.5 Proofs of Characterizations

Now, finally, we will complete the proof of the Chomsky hierarchy by proving the top three characterization relations between the types of grammars and automata. Since there is no known class of grammars that generates exactly the recursive languages, no characterization proof is given for this level.

Type 0 grammars and the Turing machines

Theorem 15.12 (1) Every language generated by a type 0 grammar can be recognized by a TM. (2) Every language recognized by a TM can be generated by a type 0 grammar.

Proof (1). Let G be an arbitrary type 0 grammar, and let {α1, α2, . . ., αn} be the set of strings used for the left sides of the rules of G. Let {βi1, βi2, . . ., βik} be the set of strings appearing on the right side of rule i, i.e., αi → βi1 | βi2 | . . . | βik. (Notice that n and k are some finite integers.) We construct a TM M that recognizes L(G) as follows. M has G in its finite state control, stored as a look-up table.


Suppose that a string x is given on its input tape. The machine M, using an extra work tape, tests whether x can be derived by grammar G. M starts the work by writing the start symbol S on the work tape. Suppose w is a sentential form that has been derived starting with S. M, reading w from left to right, nondeterministically chooses a substring αi and replaces it with one of its right sides βij, also chosen nondeterministically. M repeats this process until it sees that the current sentential form w is equal to the input string x, and, if it is, M halts in an accepting state. Notice that if x ∉ L(G), M may repeat the process forever, or see a sentential form w on which no rule is applicable and halt. M accepts x if and only if it is generated by G. It follows that M recognizes L(G), and hence every type 0 language is recognized by a TM.


Proof (2). For the proof, let M = (Q, Σ, Γ, q0, δ, F) be an arbitrary TM. We construct a type 0 grammar G that generates L(M). Let Q = {q0, q1, . . ., qn}, and let qf ∈ F. For the construction we shall use two distinct symbols #, $ ∉ Γ. Grammar G generates L(M) in the following three phases: (1) for an arbitrary string x ∈ Σ*, grammar G generates the string x#q0x$; (2) starting with the start state q0, G simulates M's computation on input x, using the copy of x generated to the right of q0 in the first phase; (3) if M enters the accepting state qf in the second phase, G erases all symbols except the left copy of x generated in the first phase. We will show how G carries out the three phases. Following the convention of using upper case letters for the nonterminal symbols of a grammar, we shall use Qi instead of state qi. In the derivation below, the concatenation of the strings u and v corresponds to the final tape contents of M when it accepts x.

S ⇒ . . . ⇒ x#Q0x$ ⇒ . . . ⇒ x#uQfv$ ⇒ . . . ⇒ x
     (1)              (2)               (3)


Let Σ = {a1, a2, . . ., ak} be the input alphabet of M. The following set of rules carries out phase (1), generating the string x#Q0x$ for every x ∈ Σ*. Notice that for x = ε, the rules generate #Q0$.

S → A$
A → a1Aa1 | a2Aa2 | . . . | akAak | #Q0


For phase (2), we set up a one-to-one correspondence between G's sentential forms and M's configurations. Symbol a ∈ Γ is positioned to the right of Qi in a sentential form of G if and only if M reads a in state qi (figure (a)). Symbol $ (symbol #) is to the right of Qi in a sentential form of G if and only if M, in state qi, reads the blank symbol next to the right end (respectively, the left end) of the current tape contents (figures (b)–(d)).

    For phase (2), we set up a one-to-one correspondence between G‟s sentential forms and M‟s configurations as the following figures illustrate. Symbol a   is positioned to the right of Qi in a sentential form of G, if and only if M reads a in state qi (figure (a)). Symbol $ (symbol #) is to the right of Qi in a sentential form of G, if and only if M, in state qi, reads the blank symbol next to the right end (respectively, to the left end) of the current tape contents (figures (b) – (d)). ...

    . . . bQia . . .

    b a

    ..

    qi

    (a)

    a .. ..

    . . . Qi# a . . qi (c)

    a .. ..

    . . . #Qia . . (b)

    qi ... a

    . . . aQi$ . . .

    (d)

    qi 475


Depending on M's transitions, we design G's rules as follows. Part (a) below is for the case of M reading a non-blank symbol a, and part (b) is for the case of reading the blank symbol next to either end of the current tape contents. Grammar G uses the two special symbols $ and # as boundary markers, respectively, to expand its sentential form to the right and to the left to accommodate M's current tape contents.

(a) M's transition: δ(qi, a) = (qj, b, D).  G's rule (where c ∈ Γ ∪ {#}):
    if D = R:  cQia → cbQj
    if D = L:  cQia → Qjcb
    if D = N:  cQia → cQjb

(b) M's transition: δ(qi, B) = (qj, b, D).  G's rules (where c ∈ Γ):
    if D = R:  Qi# → #bQj        Qi$ → bQj$
    if D = L:  Qi# → Qj#b        #Qi$ → Qj#b$,  cQi$ → Qjcb$
    if D = N:  Qi# → #Qjb        Qi$ → Qjb$


It is simple to design a set of rules for phase (3). By the two rules in (a) below, we let the nonterminal symbol Qf, which denotes the accepting state of M, move to the right until it meets $. Then, by the rules in part (b), we let QE keep moving to the left, erasing every symbol it meets up to and including #. For every symbol a ∈ Γ, include the following rules in G.

(a) Qf# → #Qf    Qfa → aQf
(b) Qf$ → QE    aQE → QE    #QE → ε

    Grammar G is type 0, and it derives a string x if and only if x is accepted by M. We proved that L(G) = L(M). It follows that every language recognized by a TM can be generated by a type 0 grammar.


CSG and LBA

Now, we prove the characterization relation between CSG's and LBA's. The approach is similar to the one for the characterization between TM's and type 0 grammars that we have just completed. For the proof we should take two restrictions into consideration: the tape space of an LBA is bounded by the input length, and the grammar rules are non-contracting.

Theorem 15.13 (1) Every language generated by a CSG can be recognized by an LBA. (2) For every language L recognizable by an LBA, there is a CSG G such that L(G) = L – {ε}.

Proof (1). The LBA uses a two-track tape, instead of two tapes, to conveniently limit the tape space. If the LBA sees that the next sentential form derived could be longer than x, it stops the derivation and rejects the input string x, because it is then impossible to derive x by applying non-contracting CSG rules.

(Figure: the LBA's two-track tape, with the input x on the top track and the current sentential form of the derivation from S toward x on the bottom track, between the boundary markers [ and ].)


For the special case when the input is ε (i.e., the tape holds []), the machine enters an accepting state if the grammar has the rule S → ε. (The details are omitted.)

Proof (2). Suppose L is the language of an LBA M = (Q, Σ, Γ, δ, q0, [, ], F), where Q = {q0, q1, . . ., qm} and Σ = {a1, a2, . . ., ak}, for some constants m and k. For the proof we will construct a CSG G such that L(G) = L(M) – {ε}. For the construction, we are inclined to use the same three-phase approach that we used to construct a type 0 grammar deriving all the input strings x accepted by a TM:

S ⇒ . . . ⇒ x#Q0x$ ⇒ . . . ⇒ x#uQfv$ ⇒ . . . ⇒ x

However, we have a problem following this approach, because CSG rules are non-contracting. Reviewing the type 0 grammar rules for the three-phase procedure, we find that among the rules for phase (3) (repeated below), all the rules in line (b) are contracting. These rules are used to erase the tape, leaving only the accepted input string x.

(a) Qf# → #Qf    Qfa → aQf
(b) Qf$ → QE    aQE → QE    #QE → ε


To overcome this problem, instead of generating two copies of x in phase (1), we merge them into one string of composite symbols as shown below. Let x = x1x2 . . . xn be the input string. CSG G derives the following sentential form, where each part delimited by either the angle brackets '<' and '>' or the parentheses '(' and ')' is a single composite symbol, and '[' and ']' are the input boundary markers.

<x1, q0, ([, x1)><x2, x2> . . . <xn, (xn, ])>

Now, we are ready to show how G can derive the string x if and only if M accepts it. After deriving a sentential form of the above format, CSG G begins to "simulate" the computation of M on the string [x1x2 . . . xn], going along the sentential form of composite symbols. The leftmost elements of the composite symbols delimited by the angle brackets are reserved for the final derivation, when M accepts the input. Seeing that M is in an accepting state, G erases all symbols in the current sentential form except for the reserved input symbols, consequently deriving the input string x = x1x2 . . . xn.


<x1, q0, ([, x1)><x2, x2> . . . <xn, (xn, ])> Rules (1)–(4) below derive all possible sentential forms of the above format. (Recall that each xi is a member of the input alphabet Σ = {a1, a2, …, ak}.) The rules in line (2) generate the initial sentential forms of all possible input strings of length one. We need the rule in line (5) for the case when M accepts ε. Notice that we can safely add this rule because no rule has S on its right side. (Recall that CSG's are allowed to use this contracting rule under the condition that no rule has S on its right side.)

(1) S → <a1, q0, ([, a1)>A | <a2, q0, ([, a2)>A | . . . | <ak, q0, ([, ak)>A
(2) S → <a1, q0, ([, a1, ])> | <a2, q0, ([, a2, ])> | . . . | <ak, q0, ([, ak, ])>
(3) A → <a1, a1>A | <a2, a2>A | . . . | <ak, ak>A
(4) A → <a1, (a1, ])> | <a2, (a2, ])> | . . . | <ak, (ak, ])>
(5) If the start state q0 is an accepting state, add S → ε.
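To make the format concrete, the sentential forms that rules (1)–(4) derive can be enumerated mechanically. The sketch below is our own illustration, not part of the text's construction: the function name and the tuple encoding of composite symbols are assumptions, with '[' and ']' standing for the boundary markers.

```python
from itertools import product

def initial_forms(sigma, q0, max_len=3):
    """Enumerate the initial sentential forms
    <x1,q0,([,x1)><x2,x2>...<xn,(xn,])> for every input string x of
    length 1..max_len over the alphabet sigma.  Each composite symbol
    is encoded as a Python tuple."""
    forms = []
    for n in range(1, max_len + 1):
        for x in product(sigma, repeat=n):
            if n == 1:                     # rule (2): both markers in one symbol
                forms.append([(x[0], q0, ('[', x[0], ']'))])
            else:                          # rules (1), (3), (4)
                form = [(x[0], q0, ('[', x[0]))]
                form += [(c, c) for c in x[1:-1]]
                form.append((x[-1], (x[-1], ']')))
                forms.append(form)
    return forms
```

For sigma = ('a', 'b') and max_len = 3 this yields 2 + 4 + 8 = 14 forms, one per input string of length at most 3.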


For the simulation, G assumes that M is currently reading, in state qi, the rightmost element of the parenthesized part of the composite symbol that appears immediately to the right of the symbol qi. The accompanying figures show the correspondence between the various contexts of the state symbol qi in a sentential form — <xi, qi, a> in the middle of the tape; <x1, qi, ([, a)> and <x1, qi, (a, [)> at the left boundary; <xn, qi, (a, ])> and <xn, qi, (], a)> at the right boundary; and the length-one forms <x1, qi, ([, a, ])>, <x1, qi, ([, ], a)>, and <x1, qi, (], a, [)> — and the corresponding positions of M's tape head between the markers [ and ]. (Recall that M doesn't keep the original input string x and may change it while computing.)


Below is the set of rules of G simulating each move of M. The reader is strongly recommended to refer to the figures above showing the correspondence between the context of qi and M's local configuration. In the rules, a, b ∈ Σ and c, d, e ∈ Γ; analogous rules are added for the composite symbols that contain the boundary markers.

M's transition δ(qi, d) = (qj, e, R):  G's rule <a, qi, d><b, c> → <a, e><b, qj, c>
M's transition δ(qi, c) = (qj, d, L):  G's rule <a, e><b, qi, c> → <a, qj, e><b, d>
M's transition δ(qi, e) = (qj, d, N):  G's rule <a, qi, e> → <a, qj, d>


Finally, to complete the construction for phase (2), we add the rules shown below for the special cases when M reads the boundary markers. Here the head turns around within a single composite symbol, so only the order of the elements in the parenthesized part changes (a ∈ Σ, c ∈ Γ):

M's transition δ(qi, ]) = (qj, ], L):  G's rule <a, qi, (c, ])> → <a, qj, (], c)>
M's transition δ(qi, [) = (qj, [, R):  G's rule <a, qi, (c, [)> → <a, qj, ([, c)>

with analogous rules for the length-one composite symbols, which contain both boundary markers.


Now we present the set of rules needed for the final phase. With qf indicating that M has accepted the input, the grammar erases all the symbols in the sentential form except the leftmost elements, which are reserved to generate the input string. Notice that all the rules satisfy the noncontracting condition. (Recall from Section 1.1 that composite symbols are introduced only for readability; they are ordinary symbols, and we are merely extending the alphabet.) In the following rules, a, b ∈ Σ and c, d ∈ Γ, with analogous rules for the other variants of the composite symbols.

(1) <a, qf, c><b, d> → <a, c><b, qf, d>
(2) <a, qf, (c, ])> → a
(3) <b, c>a → ba
(4) <b, ([, c)>a → ba
(5) <a, qf, ([, c, ])> → a   <a, qf, ([, ], c)> → a   <a, qf, (], c, [)> → a

The rule in line (1) keeps shifting qf to the right until it reaches the rightmost composite symbol. Rule (2) extracts the rightmost reserved input symbol (a in the rule). Rules (3) and (4) let the grammar, moving left along the sentential form, collect the reserved input symbols and finally derive the input string accepted by M. The rules in line (5) handle the special case where the input string has length 1. Grammar G derives a string x ∈ Σ* if and only if M accepts it, implying that L(G) = L(M).

    CFG’s and PDA’s

    Hierarchy

Now we finally complete the proofs of the characterizations in the Chomsky hierarchy by proving the characterization relation between CFG's and PDA's.

Theorem 15.14 (1) Every language generated by a CFG can be recognized by a PDA. (2) Every language recognized by a PDA can be generated by a CFG.

Proof (1). Let G be an arbitrary CFG. We construct an NPDA M that recognizes L(G) as follows. The basic approach is the same as the one we took for the proof of the characterization at the upper level: using G, the machine nondeterministically derives a terminal string in the stack to match the input string. M starts by pushing the start symbol S onto the stack and repeats the following until the stack is empty. Let A be a nonterminal symbol with some constant number k of rules:

A → α1 | α2 | . . . | αk

(a) If A is at the stack top, M nondeterministically chooses a rule A → αi, 1 ≤ i ≤ k, pops A, and pushes αi.
(b) If a terminal symbol, say a, appears at the stack top, M reads the input and pops the stack top if the input symbol and the stack top match.


(c) If the stack is empty (i.e., the stack top is Z0), M enters an accepting state.

The NPDA M enters an accepting state if and only if G derives the string given on the input tape. It follows that L(M) = L(G).

Example: Suppose the following CFG G is given, with L(G) = {ab^n c^n b^m c^m | n ≥ 1, m ≥ 0}.

G: S → aBC   B → bBc | bc   C → bCc | ε

We construct an NPDA recognizing L(G): from the start state, the move (ε, Z0/SZ0) pushes the start symbol and enters a working state, which loops with the expansion moves (ε, S/aBC), (ε, B/bBc), (ε, B/bc), (ε, C/bCc), (ε, C/ε) and the matching moves (a, a/ε), (b, b/ε), (c, c/ε); finally, the move (ε, Z0/Z0) takes M to an accepting state. Interestingly, for any CFG G, three states are enough to construct an NPDA recognizing L(G).
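The stack discipline of moves (a) and (b) can be simulated directly, with the NPDA's nondeterminism replaced by backtracking. The sketch below is our own illustration, not the text's construction: the sentential form is kept in a Python tuple that plays the role of the stack, with the top at index 0.

```python
def accepts(rules, start, s):
    """Backtracking simulation of the NPDA built from a CFG.
    rules maps each nonterminal to a list of right-hand sides (tuples of
    symbols); any symbol that is not a key of rules is a terminal.
    Assumes the grammar is not left-recursive (else recursion may not end)."""
    def go(stack, i):
        # stack: tuple of grammar symbols, top first; i: input position
        if sum(1 for sym in stack if sym not in rules) > len(s) - i:
            return False                 # more terminals pending than input left
        if not stack:
            return i == len(s)           # move (c): empty stack, input consumed
        top, rest = stack[0], stack[1:]
        if top in rules:                 # move (a): expand the nonterminal
            return any(go(tuple(rhs) + rest, i) for rhs in rules[top])
        # move (b): match the terminal at the stack top against the input
        return i < len(s) and s[i] == top and go(rest, i + 1)
    return go((start,), 0)
```

With the example grammar G above, accepts recognizes exactly {ab^n c^n b^m c^m | n ≥ 1, m ≥ 0}; for instance, it accepts "abc" and "abcbc" and rejects "ab".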


Proof (2). This proof is the last and most challenging among the four characterizations between the models for grammars and automata. However, once we understand the underlying idea of the approach, it requires only perseverance. Here again we will show how to construct a CFG G for a given PDA M such that L(G) = L(M). For the proof we assume that M has the following properties.

(1) M accepts the input by empty stack (i.e., Z0 is popped from the stack).
(2) In each move, M either pops the stack top or pushes exactly one symbol on top of it.

In Section 6.3, we showed that every context-free language can be recognized by a PDA that accepts with an empty stack (condition (1)). With an empty stack, no move is possible, so we may assume that there is one distinguished state that M enters after emptying the stack. (If there is more than one such state, we can merge them all into one.) In the rumination section of Chapter 4, we showed that every context-free language can be recognized by a DPDA whose stack operations are restricted to either popping or pushing exactly one symbol in a move (condition (2)); the same claim applies to NPDA's. Thus, without loss of generality, we can assume that M meets conditions (1) and (2) above. Otherwise, we can modify it without affecting the language.


Suppose that M, in a state p with a symbol A on top of its stack, takes some number of moves until the stack symbol below A (call it B) appears at the stack top for the first time. Let q be the state of M when B first appears at the stack top, before any operation involving B. Notice that B may never appear at the stack top, and that the stack height may change many times before B appears. Let [p, A, q], called the stack-top erasing set (STES), denote the set of all input segments that M reads from the time it sees A at the stack top in state p until it first sees B under A in state q. (Notice that [p, A, q] may be empty.) The accompanying figures illustrate this: (a) M in state p with stack contents Z0 . . . BA; (b) M in state q with stack contents Z0 . . . B, having read an input segment x; and (c) the resulting set [p, A, q] = {x, y, . . .}.


Figure (b) below lists the non-empty STES's of the PDA in figure (a), which has states 1–7, start state 1, and the moves (a, Z0/AZ0) from 1 to 2, (a, A/aA) from 2 to 5, (b, a/ε) from 5 to 6, (b, A/ε) from 6 to 7, (b, A/ε) from 2 to 3, (ε, Z0/ε) from 3 to 4, and (ε, Z0/ε) from 7 to 4. Notice that, among many others, [2, A, 5] and [1, Z0, 3] are empty STES's.

[5, a, 6] = {b}   [6, A, 7] = {b}   [7, Z0, 4] = {ε}
[2, A, 7] = {ab}  [2, A, 3] = {b}
[1, Z0, 4] = {aabb, ab}


Let M = (Q, Σ, Γ, δ, q0, Z0, ∅) be the NPDA. Here is an algorithm for constructing a context-free grammar G = (VT, VN, S, P) such that L(G) = L(M).

(1) Let VT = Σ and VN = { [p, A, q] | p, q ∈ Q, A ∈ Γ } ∪ {S}; that is, the nonterminal alphabet of G consists of all the composite symbols, each denoting an STES of M, together with the start symbol S.

(2) Let t be the state M enters in the move that empties the stack (i.e., pops Z0). Put the rule S → [q0, Z0, t] in P.

(3) Suppose that a rule with a nonterminal symbol [p, A, q] has newly been included in P, where p, q ∈ Q and A ∈ Γ. (If there is no such rule, the algorithm terminates.)

(a) If M has a move δ(p, a, A) = (r, BC) (where a ∈ Σ ∪ {ε}, r ∈ Q, and B, C ∈ Γ), then for every s ∈ Q, put the following rule in P:
[p, A, q] → a[r, B, s][s, C, q]

(b) If M has a move δ(p, a, A) = (q, ε) (where a ∈ Σ ∪ {ε}), put the following rule in P:
[p, A, q] → a


Now we need to understand the idea underlying the construction of the CFG rules from the PDA M. The main objective is to design the grammar rules so that x ∈ [p, A, q] (an STES of M) if and only if the string x is derivable from the nonterminal symbol [p, A, q], i.e., [p, A, q] ⇒ . . . ⇒ x in G.

The STES [q0, Z0, t] contains all the strings that M reads, starting in the start state and continuing until it enters state t by emptying the stack (i.e., popping Z0). Hence, L(M) = [q0, Z0, t]. It follows that G should be able to derive L(M) starting with the nonterminal symbol [q0, Z0, t]. However, to use S as the start symbol, following convention, step (2) of the algorithm puts the rule S → [q0, Z0, t] in P.


Next let us examine step (3) of the algorithm. Recall that [p, A, q] is the set of all possible input segments that M reads, starting in state p with A at the stack top, until the machine, in state q, first sees the symbol that was under A appear at the stack top. For example, suppose M has a popping move δ(p, a, A) = (q, ε): in state p with stack contents Z0 . . . BA, the machine reads a, pops A, and lands in state q with B exposed; hence a ∈ [p, A, q].


Now suppose that M has a pushing move, say δ(p, a, A) = (r1, DC). (Note that in this move, A is changed to C, and D is pushed on top of it.) Let s1 be the state M enters the first time it sees C at the stack top, and let q be the state M enters the first time it sees B (which was under A) at the stack top. Starting with stack contents Z0 . . BA in state p, M reads a and reaches Z0 . . BCD in state r1; reading some segment x, it first exposes C in state s1 (stack Z0 . . BC); and reading a further segment y, it first exposes B in state q. The set [p, A, q] must therefore contain every such string axy, i.e., [p, A, q] ⊇ a[r1, D, s1][s1, C, q].


The containment [p, A, q] ⊇ a[r1, D, s1][s1, C, q] can be read off M's state transition graph: from p, the move (a, A/DC) leads to r1; some path of moves made with D on top leads to a state where C is first exposed; and some path of moves made with C on top leads to q, where B is first exposed. Notice that there can be more than one state (call them s11, s12, . . .) that M may enter the first time it sees C at the stack top. So the following relation holds:

[p, A, q] ⊇ a[r1, D, s11][s11, C, q] ∪ a[r1, D, s12][s12, C, q] ∪ . . .


Suppose that M has multiple pushing moves in the same state p with the same stack top A, say δ(p, a1, A) = (r1, D1C1), δ(p, a2, A) = (r2, D2C2), and so on. Applying the same idea, [p, A, q] includes all the input segments that M reads until it first sees the symbol B that was under A. Thus, for this more general case, we have:

[p, A, q] ⊇ a1[r1, D1, s11][s11, C1, q] ∪ a1[r1, D1, s12][s12, C1, q] ∪ . . .
        ∪ a2[r2, D2, s21][s21, C2, q] ∪ a2[r2, D2, s22][s22, C2, q] ∪ . . .
        ∪ . . .


Now suppose that M also has some popping moves in state p with A at the stack top, for example δ(p, b1, A) = (q, ε), δ(p, b2, A) = (q, ε), . . . Since M has only popping and pushing moves on the stack, we finally get the following.

[p, A, q] = a1[r1, D1, s11][s11, C1, q] ∪ a1[r1, D1, s12][s12, C1, q] ∪ . . .
        ∪ a2[r2, D2, s21][s21, C2, q] ∪ a2[r2, D2, s22][s22, C2, q] ∪ . . .
        ∪ . . .
        ∪ {b1} ∪ {b2} ∪ . . .

Suppose that in state p, M has k1 pushing moves and k2 popping moves, and let n be the number of states of M. The above equation can be expressed as

[p, A, q] = ( ⋃_{i=1..k1} ⋃_{j=1..n} ai [ri, Di, sij][sij, Ci, q] ) ∪ ( ⋃_{i=1..k2} {bi} ),

where p, q, ri, sij ∈ Q, ai, bi ∈ Σ ∪ {ε}, and A, Di, Ci ∈ Γ.


We know that for each triple p, q, and A, the following equation holds.

[p, A, q] = a1[r1, D1, s11][s11, C1, q] ∪ a1[r1, D1, s12][s12, C1, q] ∪ . . .
        ∪ a2[r2, D2, s21][s21, C2, q] ∪ a2[r2, D2, s22][s22, C2, q] ∪ . . .
        ∪ . . . ∪ {b1} ∪ {b2} ∪ . . .

Thus, for every nonterminal [p, A, q] in grammar G, we make the following rules. This is exactly what step (3) of the algorithm does.

[p, A, q] → a1[r1, D1, s11][s11, C1, q] | a1[r1, D1, s12][s12, C1, q] | . . .
        | a2[r2, D2, s21][s21, C2, q] | a2[r2, D2, s22][s22, C2, q] | . . .
        | . . . | b1 | b2 | . . .


Let's see a couple of examples applying the algorithm.

Example 1. Consider the PDA M with states p, r, s, q, start state p, and the moves (a, Z0/AZ0) from p to r, (b, A/ε) from r to s, and (a, Z0/ε) from s to q. The algorithm constructs the CFG G below: rule (1) comes from step (2) of the algorithm, rule (2) from step (3)(a), and rules (3) and (4) from step (3)(b). We can show that L(M) = {aba} = L(G).

(1) S → [p, Z0, q]
(2) [p, Z0, q] → a[r, A, s][s, Z0, q]
(3) [r, A, s] → b
(4) [s, Z0, q] → a
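Steps (2) and (3) of the algorithm can be transcribed almost literally into code. The sketch below uses an encoding of our own (the text fixes no data format): a move is a tuple (p, a, A, r, push), where a is '' for an ε-move and push is either '' for a pop or a pair (B, C) meaning A is replaced by C with B pushed on top of it.

```python
def pda_to_cfg(states, moves, q0, Z0, t):
    """Construct the CFG rules of the algorithm.  Nonterminals are the
    triples (p, A, q); 'S' is the start symbol; t is the state entered
    when Z0 is popped.  Returns {nonterminal: [right-hand sides]}."""
    rules = {"S": [[(q0, Z0, t)]]}               # step (2): S -> [q0, Z0, t]
    for p, a, A, r, push in moves:
        for q in states:
            key = (p, A, q)
            if push == "":                       # step (3)(b): popping move
                if q == r:                       # the pop lands in state q
                    rules.setdefault(key, []).append([a] if a else [])
            else:                                # step (3)(a): pushing move
                B, C = push
                for s in states:
                    rhs = ([a] if a else []) + [(r, B, s), (s, C, q)]
                    rules.setdefault(key, []).append(rhs)
    return rules
```

Applied to the PDA of Example 1, it produces, among many useless variants (one per choice of s and q), exactly the rules (1)–(4) listed above.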


Example 2. The PDA below is a little more complex than the one in Example 1: states 1–7, start state 1, with the moves (a, Z0/AZ0) from 1 to 2, (a, A/aA) from 2 to 5, (b, a/ε) from 5 to 6, (b, A/ε) from 6 to 7, (ε, A/ε) from 2 to 3, (b, Z0/ε) from 3 to 4, and (ε, Z0/ε) from 7 to 4. Notice that this is an NPDA because of the two transitions on stack top A in state 2. The algorithm yields the grammar below, and L(M) = L(G) = {aabb, ab}.

S → [1, Z0, 4]
[1, Z0, 4] → a[2, A, 7][7, Z0, 4] | a[2, A, 3][3, Z0, 4]
[2, A, 7] → a[5, a, 6][6, A, 7]
[5, a, 6] → b    [6, A, 7] → b    [7, Z0, 4] → ε
[2, A, 3] → ε    [3, Z0, 4] → b


The grammar that we have just constructed has far more rules than this simple language needs. If we apply the techniques for minimizing the number of production rules and for eliminating unit production rules (recall Sections 10.1 and 10.2), we can drastically simplify the grammar:

S → [1, Z0, 4]
[1, Z0, 4] → a[2, A, 7][7, Z0, 4] | a[2, A, 3][3, Z0, 4]
[2, A, 7] → a[5, a, 6][6, A, 7]
[5, a, 6] → b    [6, A, 7] → b    [7, Z0, 4] → ε
[2, A, 3] → ε    [3, Z0, 4] → b

simplifies to

S → aabb | ab
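For a grammar this small, one way to check L(G) is to enumerate the short strings it derives. The breadth-first sketch below is our own helper (not from the text), with the composite nonterminals written as plain strings.

```python
from collections import deque

def derived_strings(rules, start, max_len):
    """Enumerate all terminal strings of length <= max_len that a CFG
    derives.  Nonterminals are exactly the keys of rules; assumes no
    cycle of nonterminals that adds no terminal symbols."""
    out, seen = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if sum(1 for sym in form if sym not in rules) > max_len:
            continue                       # already too many terminals
        i = next((j for j, sym in enumerate(form) if sym in rules), None)
        if i is None:
            out.add("".join(form))         # all symbols terminal: derived string
            continue
        for rhs in rules[form[i]]:         # expand the leftmost nonterminal
            new = form[:i] + tuple(rhs) + form[i + 1:]
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return out
```

For the grammar of Example 2, derived_strings returns {'ab', 'aabb'}, matching the simplification to S → aabb | ab.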


Example 3. Consider a PDA M accepting (by empty stack) our familiar language L = {a^i b^i | i ≥ 1}: states 1–4, start state 1, with the moves (a, Z0/AZ0) from 1 to 2, (a, A/AA) from 2 to 2, (b, A/ε) from 2 to 3, (b, A/ε) from 3 to 3, and (ε, Z0/ε) from 3 to 4. The algorithm produces the grammar G below. Again, it has too many rules to comprehend; we need to clean it up.

(1) S → [1, Z0, 4]
(2) [1, Z0, 4] → a[2, A, 3][3, Z0, 4] | a[2, A, 2][2, Z0, 4]
(3) [2, A, 4] → a[2, A, 3][3, A, 4] | a[2, A, 2][2, A, 4]
(4) [2, A, 3] → a[2, A, 3][3, A, 3] | a[2, A, 2][2, A, 3] | b
(5) [2, A, 2] → ε
(6) [3, A, 4] → ε
(7) [3, A, 3] → b
(8) [3, Z0, 4] → ε


Rules (5) and (6) are useless, because the STES's [2, A, 2] and [3, A, 4] are empty. In state 2, M makes only pushing moves, so while remaining in state 2 it never sees the stack symbol that was under the current top symbol A; hence [2, A, 2] is empty. The set [3, A, 4] is also empty, since the only move of M that pops A in state 3 returns to state 3, never to state 4. The ε these rules generate therefore corresponds to no computation of M, and we can eliminate rules (5) and (6), together with every other rule involving the two nonterminals [2, A, 2] and [3, A, 4], from the grammar.


Cleaning up those useless symbols from the grammar (see Section 10.3), we get the following simplified grammar.

(1) S → [1, Z0, 4]
(2) [1, Z0, 4] → a[2, A, 3][3, Z0, 4]
(4) [2, A, 3] → a[2, A, 3][3, A, 3] | b
(7) [3, A, 3] → b
(8) [3, Z0, 4] → ε
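The cleanup itself is the standard two-pass computation from Chapter 10: first keep the generating symbols (those deriving a terminal string), then keep the symbols reachable from S. A sketch of our own, in which nonterminals that received no rules (here [2, A, 2] and [3, A, 4], whose STES's are empty) are mapped to empty rule lists:

```python
def eliminate_useless(rules, start):
    """Remove non-generating and unreachable nonterminals from a CFG.
    rules maps every nonterminal to a list of right-hand sides (lists of
    symbols); symbols that are not keys of rules count as terminals."""
    generating, changed = set(), True
    while changed:                            # pass 1: generating symbols
        changed = False
        for A, rhss in rules.items():
            if A not in generating and any(
                    all(s in generating or s not in rules for s in rhs)
                    for rhs in rhss):
                generating.add(A)
                changed = True
    kept = {A: [r for r in rhss
                if all(s in generating or s not in rules for s in r)]
            for A, rhss in rules.items() if A in generating}
    reachable, stack = {start}, [start]       # pass 2: reachable symbols
    while stack:
        for rhs in kept.get(stack.pop(), []):
            for s in rhs:
                if s in kept and s not in reachable:
                    reachable.add(s)
                    stack.append(s)
    return {A: rhss for A, rhss in kept.items() if A in reachable}
```

Run on the grammar of Example 3, it leaves exactly rules (1), (2), (4), (7), and (8), with the dead alternatives of rules (2), (3), and (4) dropped.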


To simplify the grammar further, let us replace each composite symbol with a unique nonterminal symbol, as in figure (c) below. Eliminating unit productions and minimizing the number of ε-production rules (presented in Sections 10.1 and 10.2), we finally get the CFG in figure (d). Now we can see that this grammar generates the language L = {a^i b^i | i ≥ 1}.

(c)  (1) S → A   (2) A → aBC   (4) B → aBD | b   (7) D → b   (8) C → ε

(d)  (1) S → aB   (4) B → aBb | b

Rumination (1): Constructing a CFG with a PDA

Step (3)(a) of the algorithm (copied below) for constructing a CFG from a given PDA is the single source of the "flood" of rules entering the grammar; note in particular the quantification over every state s. This step is executed for every pushing move of the automaton. As we saw in Example 3, the algorithm may generate many useless nonterminals. The algorithm for eliminating useless symbols from a CFG that we learned in Chapter 10 can then be used to eliminate them.

(a) If M has a move δ(p, a, A) = (r, BC) (where a ∈ Σ ∪ {ε}, r ∈ Q, and B, C ∈ Γ), then for every s ∈ Q, put the following rule in P:
[p, A, q] → a[r, B, s][s, C, q]

    Our Language

    Break Time

    Did you know that "verb" is a noun? How can you look up words in a dictionary if you can't spell them? If a word is misspelled in a dictionary, how would we ever know? If two mouses are mice and two louses are lice, why aren't two houses hice? If Webster wrote the first dictionary, where did he find the words? If you wrote a letter, perhaps you bote your tongue? If you've read a book, you can reread it. But wouldn't this also mean that you would have to "member" somebody in order to remember them? - MoodyFan -


Exercises

15.1 Prove that the union S1 ∪ S2 of two enumerable sets S1 and S2 is also enumerable.
15.2 Show that the set of all real numbers is not enumerable.
15.3 Prove that the class of languages which are not type 0 is not enumerable.

