Language Processing

  • Uploaded by: api-3806739
  • 0
  • 0
  • November 2019
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Language Processing as PDF for free.

More details

  • Words: 1,934
  • Pages: 38
Top-down and Bottom-up Parsing - a whirlwind tour

Simplified Compiler Structure Source code if (b == 0) a = b;

Understand source code

Intermediate code

Optimize

Front end (machineindependent)

Optimizer

Intermediate code Assembly code cmp $0,ecx cmovz edx,ecx

Generate assembly code

Back end (machinedependent)

Simplified Front-End Structure Source code if (b == 0) a = b; (character stream)

Lexical Analysis Token stream

if ( b == 0 ) a = b ;

Abstract Syntax Tree (AST)

== b

Syntax Analysis (Parsing)

if 0

= a

b

Semantic Analysis

Parse Tree vs. AST • Parse tree also called “concrete syntax” S E + S Parse Tree ( S ) E (Concrete E + S 5 Syntax) 1 E + S 2 E (S) E+S E 3 4

Abstract Syntax Tree + + 1

5 + +

2 3

4

Discards (abstracts) unneeded information

How to build an AST • Need to find a derivation for the program in the grammar • Want an efficient algorithm – should only read token stream once – exponential brute-force search out of question – even CKY is too slow

• Two main ways to parse: – top-down parsing (recursive descent) – bottom-up parsing (shift-reduce)

Parsing Top-down

S→E+S |E E → num | ( S )

Goal: construct a leftmost derivation of string while reading in token stream Partly-derived String

S → → → → → → → → →

Lookahead parsed part unparsed part

( E+S (S) +S (E+S)+S (1+S)+S (1+E+S)+S (1+2+S)+S (1+2+E)+S (1+2+(S))+S (1+2+(E+S))+S

( 1 1 2 2 2 ( 3 3

(1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5

Problem

S→E+S |E E → num | ( S )

• Want to decide which production to apply based on next symbol

(1) (1)+2

S → E → (S) → (E) → (1) S → E + S → (S) + S →(E) + S → (1)+E → (1)+2

• Why is this hard?

Grammar is Problem • This grammar cannot be parsed top-down with only a single look-ahead symbol • Not LL(1) = Left-to-right-scanning, Leftmost derivation, 1 look-ahead symbol • Is it LL(k) for some k? • Can rewrite grammar to allow top-down parsing: create LL(1) grammar for same language

Making a grammar LL(1) S S E E

→ → → →

E+S E num (S)

S → ES' S' → ε S' → + S E → num E→(S)

• Problem: can’t decide which S production to apply until we see symbol after first expression • Left-factoring: Factor common S prefix, add new non-terminal S' at decision point. S' derives (+E)*

Parsing with new grammar S → ES ' S → → → → → → → → → → → →

S'→ε|+S

E S' (S) S' (E S') S' (1 S') S' (1+E S' ) S' (1+2+(3+4))+5 (1+2 S') S' (1+2 + S) S' (1+2 + E S') S' (1+2 + (S) S') S' (1+2 + (E S' ) S') S' (1+2+(3+4))+5 (1+2 + (3 S') S') S' (1+2+(3+4))+5 (1+2 + (3 + E) S') S'

E → num | ( S )

( ( 1 1 +

(1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 2

+ ( ( 3

(1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 3 +

4

(1+2+(3+4))+5

Predictive Parsing • LL(1) grammar: – for a given non-terminal, the look-ahead symbol uniquely determines the production to apply – top-down parsing = predictive parsing – Driven by predictive parsing table of non-terminals × terminals → productions

Using Table S → → → → → →

E S' (S) S' (E S' ) S' (1 S') S' (1 + S) S' (1+E S' ) S' (1+2+(3+4))+5 → (1+2 S') S' S S' E

num →ES' → num

+ → +S

( ( 1 1 + 2

S→ES' S' →ε | +S E → num | ( S ) (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5 (1+2+(3+4))+5

2 + (

(1+2+(3+4))+5 ) →ES' →ε →(S)

$ →ε

How to Implement? • Table can be converted easily into a recursive-descent parser S S' E

num →ES' → num

+ → +S

(

) →ES' →ε →(S)

$ →ε

• Three procedures: parse_S, parse_S’, parse_E

Recursive-Descent Parser lookahead token void parse_S () { switch (token) { case num: parse_E(); parse_S’(); return; case ‘(’: parse_E(); parse_S’(); return; default: throw new ParseError(); } } S S’ E

number → ES’ → number

+ → +S

( → ES’ →(S)

)

$

→ε

→ε

Recursive-Descent Parser void parse_S’() { switch (token) { case ‘+’: token = input.read(); parse_S(); return; case ‘)’: return; case EOF: return; default: throw new ParseError(); } } S S’ E

number → ES’ → number

+ → +S

( → ES’ →(S)

)

$

→ε

→ε

Recursive-Descent Parser void parse_E() { switch (token) { case number: token = input.read(); return; case ‘(‘: token = input.read(); parse_S(); if (token != ‘)’) throw new ParseError(); token = input.read(); return; default: throw new ParseError(); } } S S’ E

number → ES’ → number

+ → +S

( → ES’ →(S)

)

$

→ε

→ε

Call Tree = Parse Tree (1 + 2 + (3 + 4)) + 5

parse_S parse_E parse_S’ parse_S parse_S parse_E parse_S’ parse_S parse_Eparse_S’ parse_S parse_Eparse_S’ parse_S

S

E

S’

( S )

+S

E S’ 1 +S

5

E S’ 2 + S E S’ ( S ) ε E S’ 3 +S E 4

How to Construct Parsing Tables • There exists an algorithm for automatically generating a predictive parse table from a grammar (take 412 for details) S → ES’ S’ → ε | + S E → number | ( S )

N

+

(

)

$

S ES’ ES’ S’ +S ε E N (S)

ε

Summary for top-down parsing

• LL(k) grammars

– left-to-right scanning – leftmost derivation – can determine what production to apply from the next k symbols – Can automatically build predictive parsing tables

• Predictive parsers – Can be easily built for LL(k) grammars from the parsing tables – Also called recursive-descent, or top-down parsers

Top-Down Parsing Summary Language grammar Left-recursion elimination Left-factoring

LL(1) grammar predictive parsing table recursive-descent parser parser with AST generation

Now: Bottom-up Parsing • A more powerful parsing technology • LR grammars -- more expressive than LL – construct right-most derivation of program – virtually all programming languages – easier to express programming language syntax

• Shift-reduce parsers – Parsers for LR grammars – automatic parser generators (e.g. yacc,CUP)

Bottom-up Parsing • Right-most derivation -- backward – Start with the tokens – End with the start symbol

S→ S+E|E E → num | ( S )

(1+2+(3+4))+5 ← (E+2+(3+4))+5 ← (S+2+(3+4))+5 ←(S+E+(3+4))+5 ← (S+(3+4))+5 ← (S+(E+4))+5 ←(S+(S+4))+5 ← (S+(S+E))+5 ← (S+(S))+5 ←(S+E)+5 ← (S)+5 ← E+5 ← S+E ← S

(1+2+(3+4))+5 (1+2+(3+4))+5 (E+2+(3+4))+5 +2+(3+4))+5 (S+2+(3+4))+5 +2+(3+4))+5 (S+E+(3+4))+5 +(3+4))+5 (S+(3+4))+5 ← (S+(E+4))+5 ← (S+(S+4))+5 ← (S+(S+E))+5 ← (S+(S))+5 ← +5 (S+E)+5 ← +5 (S)+5 ← +5 E+5 ← +5 S+E ←



right-most derivation

Progress of Bottom-up ← Parsing (1 ←

(1



(1+2 (1+2+(3 +4))+5 (1+2+(3 +4))+5 (1+2+(3 +4))+5 (1+2+(3+4 ))+5 (1+2+(3+4 )) (1+2+(3+4)

)

(1+2+(3+4)

)

(1+2+(3+4)) (1+2+(3+4))+5

Bottom-up Parsing • (1+2+(3+4))+5 (E+2+(3+4))+5 (S+2+(3+4))+5 (S+E+(3+4))+5

← ← ← …

S→ S+E|E E → num | ( S )

• Advantage of bottom-up parsing: can postpone the selection of productions until more of the input is scanned

S S + E E 5 ( S ) S + E S+E (S) E 2 S+E 1 E 4 3

Top-down Parsing (1+2+(3+4))+5 S → S+E → E+E → (S)+E → (S+E)+E → (S+E+E)+E →(E+E+E)+E → (1+E+E)+E → (1+2+E)+E ...

• In left-most derivation, entire tree above a token (2) has been expanded when encountered

S→ S+E|E E → num | ( S )

S S + E E 5 ( S ) S + E S+E (S) E 2 S+E 1 E 4 3

Top-down vs. Bottom-up Bottom-up: Don’t need to figure out as much of the parse tree for a given amount of input

scanned unscanned

scanned unscanned

Top-down

Bottom-up

Shift-reduce Parsing • Parsing actions: is a sequence of shift and reduce operations • Parser state: a stack of terminals and nonterminals (grows to the right) •Derivation Current derivation step = always stack+input step stack unconsumed input (1+2+(3+4))+5 ← (1+2+(3+4))+5 (E+2+(3+4))+5 ← (E +2+(3+4))+5 (S+2+(3+4))+5 ← (S +2+(3+4))+5 (S+E+(3+4))+5 ← (S+E +(3+4))+5

Shift-reduce Parsing • Parsing is a sequence of shifts and reduces • Shift : move look-ahead token to stack stack input action ( 1+2+(3+4))+5 shift 1 (1 +2+(3+4))+5 • Reduce : Replace symbols γ from top of stack with non-terminal symbol X, corresponding to production X → γ (pop γ, push X) stack input action (S+E +(3+4))+5 reduce S → S+E (S +(3+4))+5

Shift-reduce Parsing S→ S+E|E E → num | ( S ) derivation stack (1+2+(3+4))+5 ← (1+2+(3+4))+5 ← ( (1+2+(3+4))+5 ← (1 →num

input stream action (1+2+(3+4))+5 shift 1+2+(3+4))+5 shift +2+(3+4))+5 reduce E

(E+2+(3+4))+5 ←

(E

+2+(3+4))+5

(S+2+(3+4))+5 ← (S+2+(3+4))+5 ← (S+2+(3+4))+5 ←

(S (S+ (S+2

+2+(3+4))+5 shift 2+(3+4))+5 shift +(3+4))+5 reduce E

(S+E+(3+4))+5 ← → S+ E (S+(3+4))+5 ← (S+(3+4))+5 ← (S+(3+4))+5 ← (S+(3+4))+5 ←

(S+E

+(3+4))+5

reduce S

(S (S+ (S+( (S+(3

+(3+4))+5 (3+4))+5 3+4))+5 +4))+5

shift shift shift reduce E

→E

→num

→num

reduce S

Problem • How do we know which action to take: whether to shift or reduce, and which production? • Issues: – Sometimes can reduce but shouldn’t – Sometimes can reduce in different ways

Action Selection Problem • Given stack σ and look-ahead symbol b, should parser: – shift b onto the stack (making it σb) – reduce X → γ assuming that stack has the form α γ (making it αX) • If stack has form α γ, should apply reduction X → γ (or shift) depending on stack prefix α α is different for different possible reductions, since γ’s have different length.

LR Parsing Engine • Basic mechanism: – Use a set of parser states – Use a stack with alternating symbols and states • E.g: 1 ( 6 S 10 + 5 – Use a parsing table to: • Determine what action to apply (shift/reduce) • Determine the next state

• The parser actions can be precisely determined from the table

The LR Parsing Table Terminals Non-terminals State

Next action and next state

Next state

Action table

Goto table

• Algorithm: look at entry for current state S and input terminal C If Table[S,C] = s(S’) then shift: push(C), push(S’) If Table[S,C] = X→α then reduce: pop(2*|α|), S’=top(), push(X), push(Table[S’,X])

LR Parsing Table Example ( s3

1 2 3

S→id

)

id s2

S→id

S→id

s3

, S→id

$

S g4

S→id

s2

g7

g5 4 5 6 7 8 9

accept

s6 S→(L) S→(L) L→S

L→S

s3 L→L,S L→L,S

s8 S→(L) L→S

s2 L→L,S

S→(L) S→(L) L→S

L→S

g9 L→L,S L→L,S

L

LR(k) Grammars • LR(k) = Left-to-right scanning, Right-most derivation, k look-ahead characters • Main cases: LR(0), LR(1), and some variations (SLR and LALR(1)) • Parsers for LR(0) Grammars: – Determine the actions without any lookahead symbol

Building LR(0) Parsing Tables • To build the parsing table: – Define states of the parser – Build a DFA to describe the transitions between states – Use the DFA to build the parsing table

Summary for bottom-up parsing

• LR(k) grammars

– left-to-right scanning – rightmost derivation – can determine whether to shift or reduce from the next k symbols – Can automatically build predictive parsing tables

• Shift-reduce parsers – Can be built for LR(k) grammars using automated parser generator tools, eg. CUP, yacc.

Top-down vs. Bottom-up again LL(k), recursive descent reduce

LR(k), shift-

scanned unscanned

scanned unscanned

Top-down

Bottom-up

Related Documents