Top-down and Bottom-up Parsing - a whirlwind tour
Simplified Compiler Structure Source code if (b == 0) a = b;
Understand source code
Intermediate code
Optimize
Front end (machineindependent)
Optimizer
Intermediate code Assembly code cmp $0,ecx cmovz edx,ecx
Generate assembly code
Back end (machinedependent)
Simplified Front-End Structure Source code if (b == 0) a = b; (character stream)
Lexical Analysis Token stream
if ( b == 0 ) a = b ;
Abstract Syntax Tree (AST)
== b
Syntax Analysis (Parsing)
if 0
= a
b
Semantic Analysis
Parse Tree vs. AST • Parse tree also called “concrete syntax” S E + S Parse Tree ( S ) E (Concrete E + S 5 Syntax) 1 E + S 2 E (S) E+S E 3 4
Abstract Syntax Tree + + 1
5 + +
2 3
4
Discards (abstracts) unneeded information
How to build an AST • Need to find a derivation for the program in the grammar • Want an efficient algorithm – should only read token stream once – exponential brute-force search out of question – even CKY is too slow
• Two main ways to parse: – top-down parsing (recursive descent) – bottom-up parsing (shift-reduce)
Parsing Top-down
S→E+S |E E → num | ( S )
Goal: construct a leftmost derivation of string while reading in token stream Partly-derived String
S → → → → → → → → →
Lookahead parsed part unparsed part
( E+S (S) +S (E+S)+S (1+S)+S (1+E+S)+S (1+2+S)+S (1+2+E)+S (1+2+(S))+S (1+2+(E+S))+S
( 1 1 2 2 2 ( 3 3
(1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5
Problem
S→E+S |E E → num | ( S )
• Want to decide which production to apply based on next symbol
(1) (1)+2
S → E → (S) → (E) → (1) S → E + S → (S) + S →(E) + S → (1)+E → (1)+2
• Why is this hard?
Grammar is Problem • This grammar cannot be parsed top-down with only a single look-ahead symbol • Not LL(1) = Left-to-right-scanning, Leftmost derivation, 1 look-ahead symbol • Is it LL(k) for some k? • Can rewrite grammar to allow top-down parsing: create LL(1) grammar for same language
Making a grammar LL(1) S S E E
→ → → →
E+S E num (S)
S → ES' S' → ε S' → + S E → num E→(S)
• Problem: can’t decide which S production to apply until we see symbol after first expression • Left-factoring: Factor common S prefix, add new non-terminal S' at decision point. S' derives (+E)*
Parsing with new grammar S → ES ' S → → → → → → → → → → → →
S'→ε|+S
E S' (S) S' (E S') S' (1 S') S' (1+E S' ) S' (1+2+(3+4))+5 (1+2 S') S' (1+2 + S) S' (1+2 + E S') S' (1+2 + (S) S') S' (1+2 + (E S' ) S') S' (1+2+(3+4))+5 (1+2 + (3 S') S') S' (1+2+(3+4))+5 (1+2 + (3 + E) S') S'
E → num | ( S )
( ( 1 1 +
(1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 2
+ ( ( 3
(1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 3 +
4
(1+2+(3+4))+5
Predictive Parsing • LL(1) grammar: – for a given non-terminal, the look-ahead symbol uniquely determines the production to apply – top-down parsing = predictive parsing – Driven by predictive parsing table of non-terminals × terminals → productions
Using Table S → → → → → →
E S' (S) S' (E S' ) S' (1 S') S' (1 + S) S' (1+E S' ) S' (1+2+(3+4))+5 → (1+2 S') S' S S' E
num →ES' → num
+ → +S
( ( 1 1 + 2
S→ES' S' →ε | +S E → num | ( S ) (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5
2 + (
(1+2+(3+4))+5 ) →ES' →ε →(S)
$ →ε
How to Implement? • Table can be converted easily into a recursive-descent parser S S' E
num →ES' → num
+ → +S
(
) →ES' →ε →(S)
$ →ε
• Three procedures: parse_S, parse_S’, parse_E
Recursive-Descent Parser lookahead token void parse_S () { switch (token) { case num: parse_E(); parse_S’(); return; case ‘(’: parse_E(); parse_S’(); return; default: throw new ParseError(); } } S S’ E
number → ES’ → number
+ → +S
( → ES’ →(S)
)
$
→ε
→ε
Recursive-Descent Parser void parse_S’() { switch (token) { case ‘+’: token = input.read(); parse_S(); return; case ‘)’: return; case EOF: return; default: throw new ParseError(); } } S S’ E
number → ES’ → number
+ → +S
( → ES’ →(S)
)
$
→ε
→ε
Recursive-Descent Parser void parse_E() { switch (token) { case number: token = input.read(); return; case ‘(‘: token = input.read(); parse_S(); if (token != ‘)’) throw new ParseError(); token = input.read(); return; default: throw new ParseError(); } } S S’ E
number → ES’ → number
+ → +S
( → ES’ →(S)
)
$
→ε
→ε
Call Tree = Parse Tree (1 + 2 + (3 + 4)) + 5
parse_S parse_E parse_S’ parse_S parse_S parse_E parse_S’ parse_S parse_Eparse_S’ parse_S parse_Eparse_S’ parse_S
S
E
S’
( S )
+S
E S’ 1 +S
5
E S’ 2 + S E S’ ( S ) ε E S’ 3 +S E 4
How to Construct Parsing Tables • There exists an algorithm for automatically generating a predictive parse table from a grammar (take 412 for details) S → ES’ S’ → ε | + S E → number | ( S )
N
+
(
)
$
S ES’ ES’ S’ +S ε E N (S)
ε
Summary for top-down parsing
• LL(k) grammars
– left-to-right scanning – leftmost derivation – can determine what production to apply from the next k symbols – Can automatically build predictive parsing tables
• Predictive parsers – Can be easily built for LL(k) grammars from the parsing tables – Also called recursive-descent, or top-down parsers
Top-Down Parsing Summary Language grammar Left-recursion elimination Left-factoring
LL(1) grammar predictive parsing table recursive-descent parser parser with AST generation
Now: Bottom-up Parsing • A more powerful parsing technology • LR grammars -- more expressive than LL – construct right-most derivation of program – virtually all programming languages – easier to express programming language syntax
• Shift-reduce parsers – Parsers for LR grammars – automatic parser generators (e.g. yacc,CUP)
Bottom-up Parsing • Right-most derivation -- backward – Start with the tokens – End with the start symbol
S→ S+E|E E → num | ( S )
(1+2+(3+4))+5 ← (E+2+(3+4))+5 ← (S+2+(3+4))+5 ←(S+E+(3+4))+5 ← (S+(3+4))+5 ← (S+(E+4))+5 ←(S+(S+4))+5 ← (S+(S+E))+5 ← (S+(S))+5 ←(S+E)+5 ← (S)+5 ← E+5 ← S+E ← S
(1+2+(3+4))+5 (1+2+(3+4))+5 (E+2+(3+4))+5 +2+(3+4))+5 (S+2+(3+4))+5 +2+(3+4))+5 (S+E+(3+4))+5 +(3+4))+5 (S+(3+4))+5 ← (S+(E+4))+5 ← (S+(S+4))+5 ← (S+(S+E))+5 ← (S+(S))+5 ← +5 (S+E)+5 ← +5 (S)+5 ← +5 E+5 ← +5 S+E ←
←
right-most derivation
Progress of Bottom-up ← Parsing (1 ←
(1
←
(1+2 (1+2+(3 +4))+5 (1+2+(3 +4))+5 (1+2+(3 +4))+5 (1+2+(3+4 ))+5 (1+2+(3+4 )) (1+2+(3+4)
)
(1+2+(3+4)
)
(1+2+(3+4)) (1+2+(3+4))+5
Bottom-up Parsing • (1+2+(3+4))+5 (E+2+(3+4))+5 (S+2+(3+4))+5 (S+E+(3+4))+5
← ← ← …
S→ S+E|E E → num | ( S )
• Advantage of bottom-up parsing: can postpone the selection of productions until more of the input is scanned
S S + E E 5 ( S ) S + E S+E (S) E 2 S+E 1 E 4 3
Top-down Parsing (1+2+(3+4))+5 S → S+E → E+E → (S)+E → (S+E)+E → (S+E+E)+E →(E+E+E)+E → (1+E+E)+E → (1+2+E)+E ...
• In left-most derivation, entire tree above a token (2) has been expanded when encountered
S→ S+E|E E → num | ( S )
S S + E E 5 ( S ) S + E S+E (S) E 2 S+E 1 E 4 3
Top-down vs. Bottom-up Bottom-up: Don’t need to figure out as much of the parse tree for a given amount of input
scanned unscanned
scanned unscanned
Top-down
Bottom-up
Shift-reduce Parsing • Parsing actions: is a sequence of shift and reduce operations • Parser state: a stack of terminals and nonterminals (grows to the right) •Derivation Current derivation step = always stack+input step stack unconsumed input (1+2+(3+4))+5 ← (1+2+(3+4))+5 (E+2+(3+4))+5 ← (E +2+(3+4))+5 (S+2+(3+4))+5 ← (S +2+(3+4))+5 (S+E+(3+4))+5 ← (S+E +(3+4))+5
Shift-reduce Parsing • Parsing is a sequence of shifts and reduces • Shift : move look-ahead token to stack stack input action ( 1+2+(3+4))+5 shift 1 (1 +2+(3+4))+5 • Reduce : Replace symbols γ from top of stack with non-terminal symbol X, corresponding to production X → γ (pop γ, push X) stack input action (S+E +(3+4))+5 reduce S → S+E (S +(3+4))+5
Shift-reduce Parsing S→ S+E|E E → num | ( S ) derivation stack (1+2+(3+4))+5 ← (1+2+(3+4))+5 ← ( (1+2+(3+4))+5 ← (1 →num
input stream action (1+2+(3+4))+5 shift 1+2+(3+4))+5 shift +2+(3+4))+5 reduce E
(E+2+(3+4))+5 ←
(E
+2+(3+4))+5
(S+2+(3+4))+5 ← (S+2+(3+4))+5 ← (S+2+(3+4))+5 ←
(S (S+ (S+2
+2+(3+4))+5 shift 2+(3+4))+5 shift +(3+4))+5 reduce E
(S+E+(3+4))+5 ← → S+ E (S+(3+4))+5 ← (S+(3+4))+5 ← (S+(3+4))+5 ← (S+(3+4))+5 ←
(S+E
+(3+4))+5
reduce S
(S (S+ (S+( (S+(3
+(3+4))+5 (3+4))+5 3+4))+5 +4))+5
shift shift shift reduce E
→E
→num
→num
reduce S
Problem • How do we know which action to take: whether to shift or reduce, and which production? • Issues: – Sometimes can reduce but shouldn’t – Sometimes can reduce in different ways
Action Selection Problem • Given stack σ and look-ahead symbol b, should parser: – shift b onto the stack (making it σb) – reduce X → γ assuming that stack has the form α γ (making it αX) • If stack has form α γ, should apply reduction X → γ (or shift) depending on stack prefix α α is different for different possible reductions, since γ’s have different length.
LR Parsing Engine • Basic mechanism: – Use a set of parser states – Use a stack with alternating symbols and states • E.g: 1 ( 6 S 10 + 5 – Use a parsing table to: • Determine what action to apply (shift/reduce) • Determine the next state
• The parser actions can be precisely determined from the table
The LR Parsing Table Terminals Non-terminals State
Next action and next state
Next state
Action table
Goto table
• Algorithm: look at entry for current state S and input terminal C If Table[S,C] = s(S’) then shift: push(C), push(S’) If Table[S,C] = X→α then reduce: pop(2*|α|), S’=top(), push(X), push(Table[S’,X])
LR Parsing Table Example ( s3
1 2 3
S→id
)
id s2
S→id
S→id
s3
, S→id
$
S g4
S→id
s2
g7
g5 4 5 6 7 8 9
accept
s6 S→(L) S→(L) L→S
L→S
s3 L→L,S L→L,S
s8 S→(L) L→S
s2 L→L,S
S→(L) S→(L) L→S
L→S
g9 L→L,S L→L,S
L
LR(k) Grammars • LR(k) = Left-to-right scanning, Right-most derivation, k look-ahead characters • Main cases: LR(0), LR(1), and some variations (SLR and LALR(1)) • Parsers for LR(0) Grammars: – Determine the actions without any lookahead symbol
Building LR(0) Parsing Tables • To build the parsing table: – Define states of the parser – Build a DFA to describe the transitions between states – Use the DFA to build the parsing table
Summary for bottom-up parsing
• LR(k) grammars
– left-to-right scanning – rightmost derivation – can determine whether to shift or reduce from the next k symbols – Can automatically build predictive parsing tables
• Shift-reduce parsers – Can be built for LR(k) grammars using automated parser generator tools, eg. CUP, yacc.
Top-down vs. Bottom-up again LL(k), recursive descent reduce
LR(k), shift-
scanned unscanned
scanned unscanned
Top-down
Bottom-up