Formal Languages
Chapter 5 Context-Free Languages
Wuu Yang National Chiao-Tung University, Taiwan, R.O.C. September 15, 2008
1
Chapter Outline 1. Context-Free Grammars 2. Parsing and Ambiguity 3. Context-Free Grammars and Programming Languages
2
We have seen many languages that are not regular, for instance, {(n )n | n ≥ 0}, which is a special case of properly nested parentheses widely used in conventional programming languages. Context-free languages are mostly used in the specification of high-level computer programming languages, such as Java and Perl. To decide the membership problem (whether a string belongs to a context-free language) is called parsing, which is the front-end of a compiler.
3
§5.1 Context-Free Grammars Definition. A grammar G =def (V, T, S, P ) is a context-free grammar if all production rules in P have the form A → α, where A ∈ V and α ∈ (V ∪ T )∗ . A language L is context-free if and only if L = L(G) for some context-free grammar G. Note that a regular grammar satisfies the above definition and, hence, it is also a context-free grammar. Consequently, a regular language is also a context-free language. Example 5.1. The following grammar is context-free, but is not regular. S → aSa S → bSb S→λ Here is a sample derivation: S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabbaa. The language generated 4
by this grammar is {wwR | w ∈ Σ∗ }, which is context-free, but not regular. 2 Note that this grammar is linear (see slide 3-20) in that the right-hand side of each production rule contains at most one nonterminal. But it is not right-linear nor left-linear. From this example, we conclude that the family of regular languages is a proper subclass of the family of the context-free languages.
5
Example 5.2. The following grammar is context-free, but is not regular. S → abB A → aaBb B → bbAa A→λ The language generated by this grammar is {ab(bbaa)n bba(ba)n | n ≥ 0}. This language, which is similar to {en f n | n ≥ 0}, is not regular. Note that, though, similar to a right-linear grammar, the right-hand side of each production rule contains at most one nonterminal, the grammar is not right-linear (hence, not regular). 2
6
Example. The language L =def {w ∈ {a, b}∗ | na (w) = nb (w)} is context-free, but is not regular. We can derive a grammar for L: V → V aV bV V → V bV aV V →λ There are at least two derivations of the sentence abab: V ⇒ V aV bV ⇒ aV bV ⇒ abV ⇒ abV aV bV ⇒ abaV bV ⇒ ababV ⇒ abab and V ⇒ V aV bV ⇒ V aV b ⇒ V ab ⇒ V aV bV ab ⇒ aV bV ab ⇒ abV ab ⇒ abab and V ⇒ V aV bV ⇒ V aV b ⇒ aV b ⇒ aV bV aV b ⇒ aV bV ab ⇒ abV ab ⇒ abab. We way this grammar is ambiguous. 2
7
Example. The language L =def {w ∈ {a, b}∗ | na (w) ≥ nb (w)} is context-free, but is not regular. We can derive a grammar for L: T → T aT T →V V → V aV bV | V bV aV | λ This grammar is also ambiguous. 2 Example. The language L =def {w ∈ {a, b}∗ | na (w) > nb (w)} is context-free, but is not regular. We can derive a grammar for L: S → T aT T → T aT | V V → V aV bV | V bV aV | λ This grammar is also ambiguous. 2
8
Example. The language L =def {w ∈ {a, b}∗ | na (w) 6= nb (w)} is context-free, but is not regular. We can derive a grammar for L: S → T aT | U bU T → T aT | V U → U bU | V V → V aV bV | V bV aV | λ This grammar is also ambiguous. This language is the complement of a previous context-free langauge. 2
9
Example. The language L =def {an bm | n = m} is context-free, but is not regular. We can derive a grammar for L: V → aV b V →λ This grammar is unambiguous. 2 Example. The language L =def {an bm | n ≥ m} is context-free, but is not regular. We can derive a grammar for L: T → aT T →V V → aV b | λ The strings derived from T contains zero or more a’s than b’s. This grammar is unambiguous. 2
10
Example. The language L =def {an bm | n > m} is context-free, but is not regular. We can derive a grammar for L: S → aT T → aT | V V → aV b | λ The strings derived from S contains one or more a’s than b’s. This grammar is unambiguous. 2
11
Example 5.3. The language L =def {an bm | n 6= m} is context-free, but is not regular. We can derive a grammar for L: S → aT | U b T → aT | V U → Ub | V V → aV b | λ Either (1) the strings derived from S contains one or more a’s than b’s (if we take S → aT during the first derivation step) or (2) the strings derived from S contains one or more b’s than a’s (if we take S → U b during the first derivation step). This grammar is unambiguous. (2nd solution). Here is the grammar from the textbook: S → AV | V B A → aA | a B → Bb | b 12
V → aV b | λ This grammar is unambiguous. 2 How can we show that the two grammars generate the same language? Exercise 25. Find a linear grammar for this language.
13
Example 5.4. Consider the following grammar: S → aSb | SS | λ The language generated by this grammar is {w ∈ {a, b}∗ | na (w) = nb (w); na (v) ≥ nb (v), for any prefix v of w}. This is the language of properly nested parentheses commonly used in computer programming languages and mathematical expressions. 2 This language is not regular. Question. Is there a linear grammar for this language? (See chapter 8.)
14
Leftmost and Rightmost Derivations A derivation is a sequence of steps. In each step we expand a nonterminal A by replacing A with the right-hand side of an A-production rule. For example, consider the following grammar: S → AB A → aaA A→λ B → Bb B→λ The language generated by this grammar is {a2n bm | n ≥ 0, m ≥ 0}. The string aab is a sentence (or an element) of this language. Here are two derivations of this sentence: S ⇒ AB ⇒ aaAB ⇒ aaB ⇒ aaBb ⇒ aab S ⇒ AB ⇒ ABb ⇒ Ab ⇒ aaAb ⇒ aab 15
The result of each derivation step is called a sentential form. The derivation stops when non more nonterminal is left. A sentential form without nonterminals is called a sentence. The first derivation is a leftmost derivation in which the leftmost nonterminal is expanded first. Similarly, the second derivation is a rightmost derivation in which the rightmost nonterminal is expanded first. A derivation can be drawn as a derivation tree. A derivation tree is also called a syntax tree. For example:
16
S B
A a
a
A
B
A b
a
a
A
(b) a partial derivation tree Fig 5.1
(a) a derivation tree
A derivation tree is an ordered tree, which means that there is an ordering among siblings. The root of a derivation is labelled with the start symbol of the grammar. The leaves are labelled with an element of T ∪ {λ}. The internal nodes are labelled with a nonterminal (or a variable, which is an element of V ). A subtree of the derivation tree with some sub-subtrees removed is called a partial derivation tree. 17
Example 5.6. Consider the following grammar: S → aAB A → bBb B→A|λ The language generated by this grammar is {a(bb)m | m ≥ 1}. The string abbbb is a sentence of this language. Here is the leftmost derivation of this sentence: S ⇒ aAB ⇒ abBbB ⇒ abbB ⇒ abbA ⇒ abbbBb ⇒ abbbb
18
S a
B
b
B
A
B
A b
b
A
B
b
(b) a partial derivation tree b
B
b Fig 5.2
(a) a derivation tree
19
Theorem 5.1. There is an obvious correspondence between a derivation of a sentence w ∈ L(G) and its derivation tree.
20
§5.2 Parsing and Ambiguity There are two sides of a (context-free) grammar: • We may use a grammar to generate sentences (derivation). • We may ask whether a string can be generated by a grammar (parsing). A simple parsing method is to try all possible derivations and see if the string could be derived. We use a top-down, breadth-first, left-to-right approach.
21
0. exhaustive search 1. Input is a string w and a grammar G. 2. T = {S} (the start symbol of the grammar) 3. repeat 4. for each sentential form f in T do 5. locate the leftmost nonterminal, say A, 6. expand A with every A-rule 7. T := T − {f } ∪ { new sentential forms } 8. delete those sentential forms that cannot generate the required string. 9. end for 10. until we finds a leftmost derivation of the string or the collection of sentential becomes empty. If w ∈ L(G), then this algorithm always terminates and returns a leftmost derivation of w. If w 6∈ L(G), this algorithm may not terminate. 22
An alternative strategy of exhaustive search. We may do this by following the leftmost derivation. When expanding a nonterminal A, we try each A-rule in turn. Deriving a sentence stops whenever it is possible to decide whether the result is the required string. This exhaustive search method may not terminate, even if w ∈ L(G), due to left-recursive rules (that is, rules of the form L → Lα). This same problem occurs if we follow the rightmost derivation, due to right-recursive rules. This is a top-down, depth-first approach. 2
23
Recall the reverse of a grammar GR defined in §3.3. A leftmost derivation in G corresponds to a rightmost derivation in GR .
24
Example 5.7. Consider the string aabb and the grammar S → SS | aSb | bSa | λ In the 1st round, we will try the following derivations in turn: S ⇒ SS S ⇒ aSb S ⇒ bSa S⇒λ The last two derivations cannot lead to the string aabb. In the 2nd round, we have 8 sentential forms: S ⇒ SS ⇒ SSS S ⇒ SS ⇒ aSbS S ⇒ SS ⇒ bSaS 25
S ⇒ SS ⇒ λS S ⇒ aSb ⇒ aSSb S ⇒ aSb ⇒ aaSbb S ⇒ aSb ⇒ abSab S ⇒ aSb ⇒ aλb The 3rd, 7th, and 8th derivations cannot lead to the required string aabb. There are 5 sentential forms left. We may conduct the 3rd round and will find a leftmost derivation: S ⇒ aSb ⇒ aaSbb ⇒ aaλbb
26
Problems with exhaustive search: • It is inefficient. • It may not terminate if w 6∈ L(G). If we impose the additional constraint that there is no λ rules (that is, rules of the form A → λ) nor rules of the form A → B, then the above exhaustive search method always terminates with a correct, definite answer whether or not w ∈ L(G). We will see later that this constraint does not affect the power of context-free grammars in any significant way.
27
Example 5.8. The grammar in example 5.7 is equivalent to the following grammar (except the empty sentence), which satisfies the above constraint (no λ-rules): T → T T | aT b | bT a | ab | ba 2 Corollary. Let G be a context-free grammar which does not include rules of the forms A → λ and A → B where A, B ∈ V . Then the derivation of a sentence w ∈ L(G) takes at most 2|w| − 1 steps. Proof. Note that in such grammars, every derivation step increases the length of the derived sentential form by at least 1 or it changes a nonterminal to a terminal (with a rule A → a). 2
28
Theorem 5.2. Let G be a context-free grammar which does not include rules of the forms A → λ and A → B where A, B ∈ V . Then the exhaustive search method always terminate with a correct answer. Proof. Due to the above corollary, we can limit our search to at most 2|w| − 1 rounds (there is a derivation step per round), where w is the given string. If w ∈ L(G) we will find a (leftmost) derivation. Otherwise, the search will terminate with a NO answer. 2
29
Next we will consider the time complexity of exhaustive search. Initially, there is a single sentential form (which consists of the single start symbol S). In each round, a sentential form is expanded into at most |P | new sentential forms. There are at most 2|w| − 1 rounds. Hence the upper bound of the number of sentential forms is 2
3
|P | + |P | + |P | + . . . + |P |
2|w|−1
|P |2|w| − |P | = O(|P |2|w| ) = |P | − 1
This is an exponential function on the length of the input string |w|. There are more efficient general parsers, such as CYK and Earley’s parsers.
30
Theorem 5.3. Every context-free grammars have a O(n3 )-time parser. Context-free grammars and parsing are used mostly in programming languages and compilers. In practice we usually require a linear-time parser. Not all context-free grammars have a linear-time parser.
31
Definition. A (context-free) grammar G is ambiguous if and only if there is a sentence w ∈ L(G) that have two or more leftmost derivations. Equivalently, a (context-free) grammar G is ambiguous if and only if there is a sentence w ∈ L(G) that have two or more rightmost derivations. Equivalently, a (context-free) grammar G is ambiguous if and only if there is a sentence w ∈ L(G) that have two or more derivation trees. Example 5.10. The grammar S → aSb | SS | λ is ambiguous since the sentence aabb has the following two leftmost derivations: S ⇒ SS ⇒ S ⇒ aSb ⇒ aaSbb ⇒ aabb S ⇒ aSb ⇒ aaSbb ⇒ aabb 32
S
Fig 5.4 S
S a
S b
S a
S
b
a
b
S a
S
b
Sometimes it is possible to transform an ambiguous grammar into an unambiguous one. For instance, the above grammar is equivalent to the following unambiguous grammar: S→T |λ T → U | UT U → ab | aU b 33
It is very difficult to determine if a context-free grammar is ambiguous. (We will discuss this later) Example 5.11. The following grammar E → E + E | E ∗ E | (E) | a is ambiguous. This grammar is used to model the usual arithmetic expressions. Usually, we impose the additional stipulation that ∗ is performed before + (that is, ∗ has a higher precedence than +). We may use the following (unambiguous) grammar to show this precedence: Example 5.12. E →E+T | T T →T ∗F | F F → (E) | a
34
The above examples show that a context-free grammar can be used to impose precedence. Similarly, associativity can also be enforced by context-free grammars. For left-associative operations, such as +: L→L+E | E For right-associative operations, such as ∗∗: R → E ∗ ∗R | E
35
We have shown that ambiguity sometimes can be removed by properly transforming the grammar. However, this is not always possible. Certain context-free languages have only ambiguous grammars. They are called inherently ambiguous languages. Definition. Let L be a context-free grammar. If L has an unambiguous grammar, it is unambiguous. Otherwise, it is inherently ambiguous. Example 5.13. Consider the following language L =def {an bn cm } ∪ {an bm cm } The left part {an bn cm } can be generated by a grammar: S → Sc | A A → aAb | λ Similarly, the right part {an bm cm } can be generated by a grammar: 36
T → aT | B B → bBc | λ Their union is described by one additional rule: Q→S |T The string an bn cn , which belongs to both parts, have two derivations. Though this does not shown L is inherently ambiguous, it is quite possible that it is never possible to combine the two parts with a single unambiguous grammar.
37
§5.4 Context-Free Languages and Programming Languages The syntax of a programming language is usually specified by a context-free grammar. Due to the consideration of parsing efficiency, we are usually restricted to the subclass of LL(1) or LR(1) grammars. The following page contains C’s LALR(1) grammar.
38
39
Index ambiguous grammar, 7, 32 sentence, 16 sentential form, 16
39-1