Week 3

  • May 2020
Week 3 – Lexical Analysis

  • Tasks of Lexical Analysis
  • Why separate lexical analysis and parsing?
  • Tokens, Patterns and Lexemes
  • Complex tokens like identifiers and numerals are described using regular-expression notation.
  • Attributes for Tokens: provide the actual value for a lexeme (e.g., a pointer to an entry in the symbol table).
  • A symbol table may contain word symbols, names, and numerals.

Buffering Techniques: Buffer Pairs and Sentinels



Buffer Pairs: a buffer consists of two N-character halves. Each read() system call brings N characters into one half. Two pointers are maintained: forward and lexeme_beginning. Drawback: a token longer than N characters cannot be recognized.

    [ i | = | i | + | 1 | ; | e | n | d | EOF ]

(Figure: the buffer halves holding the input "i = i + 1; end", with EOF marking the end of the input.)

    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else forward := forward + 1
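The reload logic above can be sketched in executable form. The following Python is an illustration only: the class and method names are my own, not from the notes, and a file-like object stands in for the raw input.

```python
import io

N = 4  # half size; a real lexer would use something like 4096

class DoubleBuffer:
    """Sketch of buffer pairs: two N-character halves, each filled by
    one read() call, with a forward position advanced one character at
    a time (illustrative names, not from the notes)."""

    def __init__(self, source):
        self.source = source                  # file-like object
        self.halves = [source.read(N), ""]    # first half preloaded
        self.half = 0                         # which half forward is in
        self.pos = -1                         # offset of forward in that half

    def advance(self):
        """Advance forward; return the next character, or "" at end of input."""
        half = self.halves[self.half]
        if self.pos == len(half) - 1:
            # forward is at the end of the current half: reload the other one
            other = 1 - self.half
            self.halves[other] = self.source.read(N)
            self.half, self.pos = other, 0
        else:
            self.pos += 1
        half = self.halves[self.half]
        return half[self.pos] if self.pos < len(half) else ""
```

Note the drawback stated above: once a half is reloaded, a lexeme that began in it is lost, so no token longer than N can be recognized.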

Sentinels: reduce the two comparisons per advance to one.

    [ i | = | i | + | 1 | ; | EOF ]

(Figure: each buffer half ends with an EOF sentinel.)

    forward := forward + 1;
    if *forward = EOF then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half
        end
        else  (* EOF within a half marks the true end of input *)
            terminate lexical analysis
    end
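The sentinel version can likewise be sketched in Python. Here a NUL character stands in for the notes' EOF mark, and the input is assumed not to contain NUL; class and method names are illustrative.

```python
import io

N = 4              # half size
SENTINEL = "\0"    # stands in for the notes' EOF mark

class SentinelBuffer:
    """Buffer pairs with a sentinel slot after each half: the hot path
    makes a single comparison (against the sentinel) per character."""

    def __init__(self, source):
        self.source = source
        # layout: [half 0][sentinel][half 1][sentinel]
        self.buf = [SENTINEL] * (2 * N + 2)
        self.forward = -1
        self._load(0)

    def _load(self, half):
        start = half * (N + 1)
        chunk = self.source.read(N)
        for i in range(N):
            # pad a short final read with sentinels: hitting one of
            # these mid-half means genuine end of input
            self.buf[start + i] = chunk[i] if i < len(chunk) else SENTINEL
        self.buf[start + N] = SENTINEL    # end-of-half sentinel

    def advance(self):
        """Return the next character, or None at end of input."""
        self.forward += 1
        ch = self.buf[self.forward]
        if ch == SENTINEL:                    # the one hot-path comparison
            if self.forward == N:             # end of first half
                self._load(1)
                self.forward += 1
                ch = self.buf[self.forward]
            elif self.forward == 2 * N + 1:   # end of second half
                self._load(0)
                self.forward = 0
                ch = self.buf[self.forward]
        return None if ch == SENTINEL else ch
```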



Token Specification

Strings and Languages: a language denotes any set of strings over some fixed alphabet. The empty set, and the set containing only the empty string, are both languages. Operations on languages: union, concatenation, and closure.

Regular Expressions: a regular expression over an alphabet sigma is defined as follows:
1) epsilon is a regular expression that denotes {epsilon}.
2) If a is a symbol in the alphabet, then a is a regular expression that denotes {a}.
3) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
   a. (r) | (s) is a regular expression denoting L(r) U L(s).
   b. (r)(s) is a regular expression denoting L(r)L(s).
   c. (r)* is a regular expression denoting (L(r))*.
   d. (r) is a regular expression denoting L(r).

A language denoted by a regular expression is said to be a regular set.
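As a quick illustration, the three operations can be written directly in the syntax of Python's re module, which matches the definition above. The patterns here are arbitrary examples, not taken from the notes.

```python
import re

# Union, concatenation, and Kleene closure in Python's re syntax.
union = "ab|c"             # denotes the set {"ab", "c"}
concat_closure = "(ab)c*"  # "ab" followed by zero or more "c"

assert re.fullmatch(union, "ab") is not None
assert re.fullmatch(union, "c") is not None
assert re.fullmatch(union, "abc") is None      # not in the union
assert re.fullmatch(concat_closure, "abccc") is not None
```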

• Regular Definitions
A regular definition over sigma is a sequence of definitions of the form
    d1 -> r1
    d2 -> r2
    …
    dn -> rn
where each di is a distinct name and each ri is a regular expression over the symbols in sigma U {d1, d2, …, di-1}.

• Non-regular Sets
Some languages cannot be described by any regular expression.
1) Balanced or nested constructs: e.g., the set of all strings of balanced parentheses cannot be described by a regular expression. However, it can be specified by a context-free grammar.
2) Repeating strings: {wcw | w is a string of a's and b's} cannot be described by any regular expression, nor can it be specified by a context-free grammar.

• Recognition of Tokens
Consider the Pascal grammar. The terminals "program", ";", etc., generate sets of strings given by the following regular definitions (with two added definitions to recognize names and numerals, where digit = {0, 1, 2, …, 9} and letter = {A, B, …, Z, a, b, …, z}):
    Numeral -> digit+


    Name -> letter (letter | digit)*
    program -> program
    ; -> ;
    const -> const
    …
    TypeName -> Name
    VariableName -> Name
    ProcedureName -> Name
    FieldName -> Name
    ConstantName -> Name

We also assume lexemes are separated by white space, consisting of nonnull sequences of blanks, tabs and newlines. The lexical analyzer recognizes white space by comparing a string against the regular definition ws:
    delim -> blank | tab | newline
    ws -> delim+

Our goal is to isolate the lexeme for the next token in the input buffer and produce as output a pair consisting of the appropriate token and attribute value, using the translation table:

    Regular Expression   Token       Attribute Value
    ws                   –           –
    Numeral              Numeral     pointer to table entry
    Name                 Name        pointer to table entry
    program              program     PROGRAM
    ;                    Semicolon   SEMICOLON
    const                const       CONST
    …                    …           …
    TypeName             Name        pointer to table entry
    VariableName         Name        pointer to table entry
    ProcedureName        Name        pointer to table entry
    FieldName            Name        pointer to table entry
    ConstantName         Name        pointer to table entry
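The translation table can be read as a specification for a small tokenizer. The Python sketch below is illustrative: the tokenize function and the word-symbol dictionary are my own names, only two word symbols are shown, and the symbol-table index plays the role of "pointer to table entry".

```python
import re

# Word symbols pre-entered, mapped to (token, attribute) pairs.
WORD_SYMBOLS = {"program": ("program", "PROGRAM"),
                "const": ("const", "CONST")}

# The regular definitions ws, Numeral, Name, and ';' as named groups.
TOKEN_RE = re.compile(r"""
      (?P<ws>[ \t\n]+)
    | (?P<Numeral>[0-9]+)
    | (?P<Name>[A-Za-z][A-Za-z0-9]*)
    | (?P<Semicolon>;)
""", re.VERBOSE)

def tokenize(text, symtab):
    """Return a list of (token, attribute) pairs; symtab maps lexemes
    to their table index (the 'pointer to table entry')."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"lexical error at position {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "ws":
            continue                      # ws matches but yields no token
        lexeme = m.group()
        if kind == "Name" and lexeme in WORD_SYMBOLS:
            tokens.append(WORD_SYMBOLS[lexeme])
        elif kind in ("Name", "Numeral"):
            index = symtab.setdefault(lexeme, len(symtab))
            tokens.append((kind, index))  # attribute = table index
        else:
            tokens.append(("Semicolon", "SEMICOLON"))
    return tokens
```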

The PROGRAM, SEMICOLON, and CONST attributes are predefined constants; a pointer to a table entry may simply be the index of the entry in the table.

• Transition Diagrams: start state, accepting states (double circle), and asterisk (retract the forward pointer). Each edge is labeled with a character. If failure occurs in all transition diagrams, a lexical error has been detected and an error-recovery routine is invoked. For performance, look for frequently occurring tokens before less frequently occurring ones, e.g., white space first.


(Transition diagram: from the start state 0, a '{' edge enters state 4, which loops on "other" characters until '}' (a comment); a delimiter edge enters state 1, which loops on delimiters (white space); a '[' edge enters a state from which ']' leads to an accepting state with return(Bar, BAR), while any "other" character leads to an accepting state marked * with return(LeftSquare, LSQUARE), retracting the forward pointer.)

A technique for separating word symbols from names is to initialize the symbol table with the word symbols.

• Implementing a Transition Diagram:

    procedure
    begin
        c := nextChar();
        switch (c) {
            case c in delimiter:
                while c in delimiter do c := nextChar();
            case '{':
                while c != '}' do c := nextChar();
            case '[':
                c := nextChar();
                if c = ']' then return(Bar, BAR)
                else begin retract(); return(LeftSquare, LSQUARE) end;
        }
    end;
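Rendered as executable Python, the same diagram-driven procedure might look like the sketch below; index-based scanning replaces nextChar/retract, and the token names follow the diagram (the function name is my own).

```python
DELIMITERS = " \t\n"

def next_token(text, pos=0):
    """Return ((token, attribute), new_pos), skipping delimiters and
    { } comments, per the transition diagram sketched above."""
    while pos < len(text):
        c = text[pos]
        if c in DELIMITERS:                 # delimiter loop
            pos += 1
        elif c == "{":                      # comment: skip to '}'
            pos += 1
            while pos < len(text) and text[pos] != "}":
                pos += 1
            pos += 1                        # step past '}'
        elif c == "[":
            if pos + 1 < len(text) and text[pos + 1] == "]":
                return ("Bar", "BAR"), pos + 2           # '[]' is one token
            return ("LeftSquare", "LSQUARE"), pos + 1    # retract: only '[' consumed
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    return ("EOF", None), pos
```

The "retract" of the diagram shows up as returning pos + 1 rather than pos + 2: the lookahead character is left unconsumed.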

    procedure nextToken()
    begin
        skipSeparator();
        switch (ch) {
            case letter: scanWord();
            case digit: scanNumeral();
            case '+': scanPlus();
            …
            case EOF: return(EOF);
        }
    end;

    procedure skipSeparator()
    begin
        while ch in delimiter or ch = '{' do
            switch (ch) {
                case space: nextChar();
                case newLine: lineNo++; nextChar();
                case tab: nextChar();
                case '{': comment();
            }
    end;

    procedure comment()
    begin
        nextChar();                        { skip the opening '{' }
        while ch != '}' do begin
            if ch = '{' then comment()     { nested comment }
            else if ch = newLine then lineNo++;
            nextChar()
        end;
        if ch = '}' then nextChar()
        else error("comment");
    end;

Searching techniques for the symbol table:
  • Linear search
  • Letter indexing
  • Hashing
  • Direct chaining
  • Linear probing
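Of the search techniques listed, linear probing is the easiest to sketch. The toy table below is an illustration only: fixed size, no deletion, no resizing.

```python
SIZE = 8  # toy capacity; a real table would resize before filling up

class ProbingTable:
    """Open-addressing symbol table with linear probing: on a
    collision, scan forward (wrapping around) to the next free slot."""

    def __init__(self):
        self.slots = [None] * SIZE   # each slot: (name, info) or None

    def lookup_or_insert(self, name, info=None):
        i = hash(name) % SIZE
        while self.slots[i] is not None:     # probe linearly on collision
            if self.slots[i][0] == name:
                return self.slots[i][1]      # found: return stored info
            i = (i + 1) % SIZE
        self.slots[i] = (name, info)         # free slot: insert
        return info
```

Initializing such a table with the word symbols, as suggested above, makes separating word symbols from names a plain lookup.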

