Details of Lex
1
EXAMPLE 1 As a rst exer ise, we shall write the previous example as a lex s ript: %% a*b printf("Token 1 found\n");
+ printf("Token 2 found\n"); %% main() { yylex(); }
Assume that this example is stored in a le example1.l. To make a lexi al analyser from this lex spe i ation: >lex example1.l
This reates a le alled lex.yy. . This has to be ompiled and linked to the lex library. >
-o example1 lex.yy. -ll
Now one an invoke the lexi al analyser: >example1 aabb
ab Token 1 found Token 1 found Token 2 found Token 1 found Token 2 found 2
Note the following: The ore of the lex s ript lies between the two %%s.
This onsists of a series of pattern a tion pairs. a*b is an example of a pattern. An a tion is any valid C statement. Lex takes a lex s ript and produ es a le alled lex.yy. . Among other things, this ontains the de nition of a fun tion yylex(), whi h is the lexi al analyser. The part after the se ond %% ontains the fun tion main alling yylex. Linking to the lex library is required be ause lex.yy. uses ertain prede ned fun tions like yywrap, yyless, and yymore. An invo ation of yylex goes through the mat h pattern ! perform a tion y le repeatedly till the entire input string is over. Thus all tokens in the input string are returned in a single invo ation of yylex. As we have seen, it is sometimes ne essary to return a single token for every invo ation of yylex. To do this: %% a*b {printf("Token 1 found\n"); return;}
+ {printf("Token 2 found\n"); return;} %% main() 3
{ yylex();yylex }
This behaves as: >example1 aabb
ab Token 1 found Token 1 found
Segments of the input string whi h do not mat h
any pattern are opied dire tly onto the output. abb sd f Token 1 found Token 1 found Token 2 found sdToken 2 found f
By default, yylex reads its input from stdin, and
writes it output to stdout.
4
STRUCTURE OF A LEX SPECIFICATION A lex program has three parts: . . . de nition se tion. . .
%%
. . . rules se tion. . .
%%
. . . user de ned fun tions. . . The de nition se tion may ontain, among other
things, literal blo ks and pattern names. Literal blo ks provide a way for initialising variables used in the a tions. While pattern names are shorthand for omplex patterns. As stated earlier, a rule is a pattern { a tion pair. We shall examine the set of allowed patterns later. We have also seen the fun tion main() being de ned in the user de ned se tion. We shall see examples of other fun tions whi h an go in at this pla e. Why annot these fun tion be pla ed in the literal blo k?
5
Structure of lex.yy.c Global Definitions: The C definitions are inserted verbatim over here. Other lex defined functions and data structures also inserted here. yylex(); uses patterns and actions are used here.
User defined functions: main () yywrap()
6
EXAMPLE 2 We want to ount the number of hara ters, words and lines in a pie e of text. The resulting lexi al analyser should be able to read its input from a le whose name is to be supplied along with the ommand to invoke the lexi al analyser. %{ /* Initialisation of variables to be used in a tions */ unsigned harCount = 0, wordCount = 0, lineCount = 0; %} word [^ \t\n℄+ eol \n %% {word} {wordCount++; harCount += yyleng;} {eol} { harCount++; lineCount++;} .
harCount++; %% main(arg , argv) int arg ;
har ** argv; { if (arg >1) { FILE* file; file = fopen(argv[1℄, "r"); if (!file){ fprintf(stderr, " ould not open %s\n", argv[1℄); 7
exit(1);
} yyin = file;
}
} yylex(); printf("%d %d %d\n", harCount, wordCount, lineCount); return 0;
Assume that the resulting lexi al analyser is in a le
alled example2. >example2 lex.yy. >36269 5143 1526
Based on the program we an make the following observations: The portion of the ode between %{ and %} is inserted verbatim in the early part of the generated lexi al analyser. Thse generally onsist of de lrations that would be used by the later parts. The de nition part also ontains names for omplex regular expressions. The names are substituted by the orresponding regular expressions in the rules se tion. The example ontains some examples of patterns. They are 1. [\^ \t\n℄+ { One or more repetetions of any
hara ter ex ept a blank, tab or newline. 8
2. . { any hara ter ex ept a newline To read from a le instead of stdin, rst use fopen() to get a le pointer to the le named in the ommand line argument. This le pointer is assigned to yyin.
EXAMPLE 3 We want to extend the predvious example so that the
ounting an be done over multiple les. %{ unsigned harCount = 0, wordCount = 0, lineCount = 0; %} word [^ \t\n℄+ eol \n %% {word} {wordCount++; harCount += yyleng;} {eol} { harCount++; lineCount++;} .
harCount++; %% int urrentfile = 1;
har **filelist; main(arg , argv) 9
int arg ;
har ** argv; {filelist = argv; if (arg ==1) yylex(); else {FILE * file; file = fopen(filelist[ urrentfile℄,"r"); yyin = file; yylex(); }; printf("%d %d %d \n", harCount, wordCount, lineCount); } int yywrap() { FILE* file; /* f lose(yyin); */ if (filelist[++ urrentfile℄ != ( har*) 0) { file = fopen(filelist[ urrentfile℄, "r"); yyin = file; return 0; } else return 1; }
This example illustrates a non-trivial use of a user
de ned fun tion. The fun tion yywrap has a defualt de nition in the lex library. We want override this de nition for our 10
purpose. When yylex hits the end of the urrent le, it alls yywrap. If yywrap returns 1, yylex exits. Otherwise it assumes that there is more input to be read from the le asso iated with yyin. The key idea is to rede ne yywrap so that when it is alled, yyin is bound to the next le supplied in the ommand line and 0 is returned. This is done until there are no more les to be read, when a 1 is returned.
11
PATTERNS We have seen the de nition se tion and the se tion ontaining the user de ned fun tions in a lex s ript. We shall now examine the rules se tion. As stated earlier, the rules se tion onsist of pattern { a tion pairs. It is not hard to see that the power of the generated lexi al analyser depends on the des riptive power of patterns. 1. Pattern { x, x is a single hara ter not in the lex meta- hara ter set. Meaning { The hara ter x itself. Example { a 2. Pattern { \x, x is a single hara ter in the lex meta hara ter set. Meaning { The hara ter x itself. Example { \. 3. Pattern { "s", s is a string of hara ters. Meaning { The string s literally. Example { "[℄", the string [℄ 4. Pattern { [s℄, s is a string of hara ters.. Meaning { Any single hara ter o
urring in s. Example { [ab ℄, the hara ter a, b or . 5. Pattern { [x-y℄, x and y are hara ters. Meaning { A shorthand for a hara ter lass. Denotes any hara ter in the range x to y. Example { [a- ℄, the hara ter a, b or . 6. Pattern { [^s℄, s is a string of hara ters. Meaning { Any hara ters ex ept those in string s. 12
7. 8.
9.
10.
11.
12.
Example { [^ \t\n℄, All hara ters ex ept for a blank, a tab or a newline. Pattern { . Meaning { Any hara ter ex ept for a newline Example { See Example 3. Pattern { p{m,n}, p is a pattern, m and nare integers Meaning { m through n o
urren es of the pattern p. Example { a{1,3}, The strings a, aa and aaa Pattern { (p), p is a pattern Meaning { The same as p. Parenthesis is used to
larify asso iation Example { (ab )+, One or more o
urren es of ab Pattern { p1|p2, p1 and p2 are patterns Meaning { The strings represented by p1 or p2 Example { (ab )+|de, One or more o
urren es of ab or the string de Pattern { p1p2, p1 and p2 are patterns Meaning { The strings represented by p1 on atenated with those represented by p2 Example { (ab )+de, One or more o
urren es of ab followed by de Pattern { p?, p is a pattern Meaning { Zero or one o
urren es of the strings represented by p Example { (ab )*, Zero or one p o
urren es of ab . 13
13. Pattern { p*, p is a pattern Meaning { Zero or more o
urren es of the strings represented by p Example { (ab )*, Zero or more o
urren es of ab . 14. Pattern { p+, p is a pattern Meaning { One or more o
urren es of the strings represented by p Example { (ab )*, One or more o
urren es of ab . 15. Pattern { {n}, n is a name de ned in the de nition se tion Meaning { The strings de ned by the pattern orresponding to n Example { See Example 2.
14
EXAMPLES OF PATTERN USAGE 1. We want to represent real numbers through patterns. Here are some examples of reals: .36, 45., 3.414, .21e-6. The lex s ript de ning reals is: DIGIT [0-9℄ %% ({DIGIT}+\.{DIGIT}*|{DIGIT}*\.{DIGIT}+)(e-?{DIGIT}+)? %%
Another way of doing the same thing is: DIGIT [0-9℄ %% {DIGIT}+\.{DIGIT}*(e-?{DIGIT}+)? {DIGIT}*\.{DIGIT}+(e-?{DIGIT}+)? %%
|
2. In C, a omment is de ned as follows: The hara ters /* introdu e a omment, whi h terminates with the
hara ters */. Comments do not nest. %% "/*"[^*℄*(\*([^/*℄[^*℄*)?)*"*/" {ECHO;printf("\n");} %% main () { yylex(); }
For the input 15
/*///*/****/ /*///**/ /*///***/ /*///*a*b//**/ /**/ /*/*
The output is: /*///*/ ****/ /*///**/ /*///***/ /*///*a*b//**/ /**/ /*/*
To understand the pattern for omments, note that the most general form of a omment is: 1. The string \* 2. A (possibly empty) string of hara ters not ontaining *. 3. Zero or more repetetions of: a. A * followed by zero or one o
urren es of: i. Any single hara ter other than \ or * . ii. A (possibly empty) string not ontaining *. 4. The string *\. 16
3. As a third example we shall onsider the de nition of strings. A string is a sequen e of hara ters en losed within " and ". A " within a string is represented as ". %% \"[^"℄*(\"\"[^"℄*)*\" {ECHO;printf("\n");} %% main () { yylex(); }
The input "" """" "jghg""nbv" """""" "fgb"""
gives as output "" """" "jghg""nbv" """""" "fgb""" 17
MORE PATTERNS Certain patterns are ontext dependent in the sense that are a tive only if ertain onditions ar ful lled. 1. Pattern { ^p, p is a pattern Meaning { Any string represented by p, provided it o
urs at the beginning of a line. Example { Consider the lex spe i ation: %% ^#define %%
{printf("<");ECHO;printf(">\n");}
For the input: #define
#define
The output is: <#define> #define
18
2. Pattern { p$, p is a pattern Meaning { Any string represented by p, provided it o
urs at the end of a line Example { end$, the string end provided it o
urs at the end of a line. 3. Pattern { p1/p2, p1 and p2 are patterns Meaning { Any string represented by p1, provided it is followed by a string represented by p2 Example { Re all that FORTRAN ignores all spa es ex ept those in omments and Hollerith strings. Therefore it is usual in a Fortran ompiler to remove all spa es in a prepro essor pass, and then start lexi al analysis. Then a DO statement will appear as: DO50k=1,20,1
And an assignment statement might look like: DO50k=1.20
19
Here is a lex s ript that distinguishes between the two: letter [a-zA-Z℄ digit [0-9℄ num {digit}+ id {letter}({letter}|{digit})* %% DO/{num}{id}=({num}|{id}), {printf("<"); ECHO;printf(">\n");} {id} {printf("<"); ECHO;printf(">\n");} %% main () { yylex(); }
For the input: DO101J=1,25 DO101J=1.25
The output is: 101<J> =1,25 =1.25 20
START STATES Sometimes the ontext ondition annot be aptured by the patterns given above. For example, the interpretation of a token might depend on the tokens already seen. In su h a ase one needs states As an example, in C, the string <stdio.h> should be interpreted as a le name, if it is pre eded #in lude. This is done by introdu ing a state, say INCLSTATE. We then want to say that when the lexi al analyser is in the state INCLSTATE, interpret the string within the angular bra kets as a lename. %s INCLSTATE %% ^#in lude {BEGIN INCLSTATE;} "<"[^>\n℄+">" {..pro ess filename..} \n {BEGIN INITIAL;} %%
In a "normal" state, the patterns "<"[^>\n℄+">" and
are ina tive. However, the rest of the patterns are valid when the s anner is in INCLSTATE. As an example, we also had a token onsisting of some hara ters en losed within < and >. Then the following s ript does not work. \n
21
%s INCLSTATE %% \<.*\> {printf("some token found\n");} ^#in lude {BEGIN INCLSTATE;} "<"[^>\n℄+">" {printf("<"); ECHO; printf(">\n");} \n {BEGIN INITIAL;} %%
For the input: #in lude <stdio.h>
We get the output: some token found
To re tify the situation, we use ex lusive states. When the s anner is in an ex lusive state s, only the patterns beginning with <s> are valid. %x INCLSTATE %% \<.*\> {printf("some token found\n");} ^#in lude {BEGIN INCLSTATE;} "<"[^>\n℄+">" {printf("<"); ECHO; printf(">\n");} \n {BEGIN INITIAL;} %%
So for the same input: #in lude <stdio.h>
We get the output: <<stdio.h>> 22
PUTTING IT ALL TOGETHER We now show a omplete lexi al analyser for a small programming language: %{ #define #define #define #define #define #define #define #define #define #define #define #define #define #define
LT 1 LE 2 EQ 3 NE 4 GT 5 GE 6 IF 7 THEN 8 ELSE 9 ID 10 NUMBER 11 RELOP 12 LPAREN 13 RPAREN 14
#define SYMTABSIZE 211 typedef stru t { har* idstring; int other_attributes;} symtab_entry; symtab_entry symtab[SYMTABSIZE℄; extern int yylval; %} 23
ws [ \t\n℄+ letter [A-Za-z℄ digit [0-9℄ id {letter}({letter}|{digit})* number {digit}+(\.{digit}+)?(E[+\-℄?{digit}+)? %% {ws} if then else {id} {number} "<" "<=" "=" "<>" ">" ">=" "(" ")"
{/*no a tion, no return*/} {return(IF);} {return(THEN);} {return(ELSE);} {yylval=install_id(); return(ID);} {yylval=atof(yytext); return(ID);} {yylval=LT; return(RELOP);} {yylval=LE; return(RELOP);} {yylval=EQ; return(RELOP);} {yylval=NE; return(RELOP);} {yylval=GT; return(RELOP);} {yylval=GE; return(RELOP);} {return(LPAREN);} {return(RPAREN);}
%%
24
int hash(s) /* given a string s returns the hash value */
har* s; {
har* p; unsigned h = 0, g; for (p = s; *p != '\0'; p++) { h = (h << 4) + (*p); if (g = h&0xf0000000) { h = h ^ (g >> 24); h = h ^ g; } } return h % SYMTABSIZE; } int install_id () { /* installs the identifier string in yytext in the symbol table symtab */
}
int symtab_index; symtab_index = hash(yytext); str py(symtab[symtab_index℄.idstring, yytext); symtab[symtab_index℄.other_attributes = yyleng; return(symtab_index);
ODDS AND ENDS 25
We end by explaining a some of the fun tions and ma ros whi h are provided by lex but have not appeared in the example: 1 The fun tion yyless(n) pushes ba k all but the rst n hara ters of the urrent token into the input stream. 2 The fun tion yymore(n) appends the next token to the urrent one. Instead of overwriting yytext, the next token is appended to it. As an example, onsider a pattern to mat h quoted strings, where a quotation mark within a string an be es aped with ba kslash. %% \"[^"℄*\"
{if(yytext[yyleng-2℄=='\\'){ yyless(yyleng-1); yymore(); } else printf("<%s>\n", yytext); }
%%
3 The ma ro REJECT tells the lexi al analyser to exe ute the rule that ontains it, and then, before exe uting the next rule, restore the position of the input pointer to where it was before the urrent rule was exe uted. This is used in situations when one token is a substring of another. 26
%% she {printf("<she found>\n");REJECT;} he {printf("\n");} %%
The ee t of REJECT is not the same as yyless(0). (Why?)
27