Week 3 – Lexical Analysis • Tasks of Lexical Analysis • Why separating lexical analysis and parsing? • Tokens, Patterns and Lexemes • Complex tokens like identifier and numeral are described using regularexpression notations. • Attributes for Tokens: provide actual value for a lexeme (a pointer to an entry in symbol table). • A symbol table may contain word symbols, names, and numerals. •
Buffering Techniques: Buffer Pairs and Sentinels
•
Buffer Paris: a buffer contains 2 N-character halves. Each system read system call brings N characters into buffer. Two pointers are maintained: forward and lexeme_beginning. Drawback: can not recognize token with length more than N.
• i
=
i
+
1
;
e
n
d
EOF
e
n
d
EOF
if forward at end of first half then begin reload second half; forward := forward + 1 end else if forward at end of second half then begin reload first half; move forward to beginning of first half end else forward := forward +1
Sentinels: reduce two comparisons to on e. i
=
i
+
1
;
EOF
forward := forward +1 if *forward = EOF then begin if forward at end of first half then begin reload second half; forward := forward + 1 end else if forward at end of second half then begin reload first half; move forward to beginning of first half end else terminate lexical analysis 1
end • • • • • •
•
Token Specification Strings and Languages A language denotes any set of strings over some fixed alphabet. The empty set or the set that contains empty string are languages. Operations on Languages: union, concatenation, and closure. Regular Expressions Regular expressions over alphabet sigma is defined as follows: 1) epsilon is a regular expression that denotes {epsilon} 2) if a is a symbol in alphabet, then a is a regular expression that denotes {a}. 3) suppose r and s are regular expressions denoting the languages L(r) and L(s). then a. (r) | (s) is a regular expression denoting L(r) U L(s). b. (r) (s) is a regular expression denoting L(r)L(s). c. (r)* is a regular expression denoting (L(r))*. d. (r) is a regular expression denoting L(r).
A language denoted by a regular expression is said to be a regular set.
• Regular Definitions A regular definition over sigma is a sequence of definitions of the form d1 -> r1 d2 -> r2 … dn -> rn where each di is a distinct name and each ri is a reqular expression over the symbols in sigma U {d1, d2, …, di-1}. • Non-regular sets Some language can not be described by any regular expression. 1) Balanced or nested construct: e.g., the set of all strings of balanced parentheses can not be described by a regular expression. However, it can be specified by a context- free grammar. 2) Repeating strings cannot be described by regular expressions. {wcw| w is a string of a’s and b’s} can not be described by any regular expression, nor can it be specified by a cont ext- free grammar. • Recognition of Tokens Consider the Pascal- grammar. The terminals “program”, “;”, etc., generate sets of strings given by the following regular definitions. (add Line 41 and 42 to recognize names and numbers where digit = {0, 1, 2, …, 9} and letter = {A, B,…, Z, a, b, …, z}) Numeral -> digit+
2
Name -> letter (letter | digit)* program -> program ; -> ; const -> const … TypeName -> Name VariableName -> Name ProcedureName -> Name FieldName -> Name ConstantName -> Name We also assume lexemes are separated by white space, consisting of nonnull sequences of blanks, tabs and newlines. The lexical analyzer will do so by comparing a string against the regular definition ws: delim -> blank | tab | newline ws -> delim+ Our goal is to isolate the lexeme for the next token in the input buffer and produce as output a pair consisting of the appropriate token and attribute-value using the translation table: Regular Expression Ws Numeral Name program ; const … TypeName VariableName ProcedureName FieldName ConstantName
Token Numeral Name program Semicolon const
Attribute Value pointer to table entry pointer to table entry PROGRAM SEMICOLON CONST
Name Name Name Name Name
Pointer to table entry Pointer to table entry Pointer to table entry Pointer to table entry Pointer to table entry
The PROGRAM, SEMICOLON, CONST are predefined constants; the pointer to table entry may be the index of the table. • Transition Diagrams: Start state, accepting state (double cirle), and asterisk (retract forward pointer). Each edge is associated with a character. If failure occurs in all transition diagrams, a lexical error has been detected and an errorrecovery routine is invoked. For performance consideration, looking for frequently occurring tokens before less frequently occurring ones, e.g., white space.
3
other 4 { start
}
return(Bar,BAR)
[ 0
]
2
1 other
delimiter
3
*
return(LeftSquare, LSQUARE)
A technique for separating word symbols and names is to initialize a symbol table with word symbols. • Implementing a Transition Diagram: Procedure Begin c := nextChar(); If c = ‘[] then c := nextChar(); switch (c) { case ‘]’: return(Bar, BAR); case c in delimiter: while c in delimiter c:= nextC har(); case ‘{’: while c != ‘}’ c := nextChar(); default: retract(); return(LeftSquare, LSQUARE); } endif; end;
Procedure nextToken() begin skipSeparator(); switch (ch) {
4
case letter: scanWord(); case digit: scanNumeral(); case ‘+’: scanPlus(); … case EOF: return(EOF); end;
procedure skipSeparator() begin while ch in delimiter and ‘{‘ do switch (ch) { case space: nextChar(); case newLine: lineNo++; nextChar(); case tab: nextChar(); case ‘{‘: comment(); } } end; procedure comment() begin nextChar(); while ch != ‘}’ do if ch = ‘{‘ then comment(); else if ch = newLine then lineNo++; nextChar(); if ch = ‘}’ then nextChar(); else error(“comment”); end; Searching Linear search Letter indexing Hashing Direct chaining Linear Probing
5