Lexical analyzer converts stream of input characters into a stream of tokens. The different tokens that our lexical analyzer identifies are as follows: KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf, scanf, case, break, return, typedef, void IDENTIFIERS: main, fopen, getch etc NUMBERS: positive and negative integers, positive and negative floating point numbers. OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&. BRACKETS: [ ], { }, ( ). STRINGS : Set of characters enclosed within the quotes COMMENT LINES: Ignores single line, multi line comments For tokenizing into identifiers and keywords we incorporate a symbol table which initially consists of predefined keywords. The tokens are read from an input file. If the encountered token is an identifier or a keyword the lexical analyzer will look up in the symbol table to check the existence of the respective token. If an entry does exist then we proceed to the next token. If not then that particular token along with the token value is written into the symbol table. The rest of the tokens are directly displayed by writing into an output file. The output file will consist of all the tokens present in our input file along with their respective token values. ADVERTISEMENT
INTRODUCTION Lexical analysis involves scanning the program to be compiled and recognizing the tokens that make up the source statements Scanners or lexical analyzers are usually designed to recognize keywords , operators , and identifiers , as well as integers, floating point numbers , character strings , and other similar items that are written as part of the source program . The exact set of tokens to be recognized of course, depends upon the programming language being used to describe it. A sequence of input characters that comprises a single token is called a lexeme. A lexical analyzer can insulate a parser from the lexeme representation of tokens. Following are the list of functions that lexical analyzers perform. Removal of white space and comment Many languages allow “white space” to appear between tokens. Comments can likewise be ignored by the parser and translator , so they may also be
treated as white space. If white space is eliminated by the lexical analyzer, the parser will never have to consider it. Constants An integer constant is a sequence of digits, integer constants can be allowed by adding productions to the grammar for expressions, or by creating a token for such constants . The job of collecting digits into integers is generally given to a lexical analyzer because numbers can be treated as single units during translation. The lexical analyzer passes both the token and attribute to the parser. Recognizing identifiers and keywords Languages use identifiers as names of variables, arrays and functions. A grammar for a language often treats an identifier as a token. Many languages use fixed character strings such as begin, end , if, and so on , as punctuation marks or to identify certain constructs. These character strings, called keywords, generally satisfy the rules for forming identifiers. System Analysis The operating system used is MS-DOS MS-DOS : The MS-DOS is a single user, single process, single processor operating system. Due to the confinement of the device dependent code into one layer, the porting of MS-DOS has theoretically been reduced to writing of the BIOS code for the new hardware. At the command level it provides a hierarchical file system, input output redirection, pipes and filters. User written commands can be invoked in the same way as the standard system commands, giving the appearance as if the basic system functionality has been extended. Being a single user system, it provides only file protection and access control. The disk space is allocated in terms of a cluster of consecutive sectors. The command language of MS-DOS has been designed to allow the user to interact directly with the operating system using a CRT terminal. The principal use of the command language is to initiate the execution of commands and programs. A user program can be executed under MS-DOS by simply typing the name of the executable file of the user program at the DOS command prompt. The programming language used here is C programming ADVERTISEMENT
SYSTEM DESIGN Process The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the
parser uses for syntax analysis. This interaction, summarized schematically in fig below.
Upon receiving a “get next token “command from the parser, the lexical analyzer reads the input characters until it can identify next token. Sometimes , lexical analyzers are divided into a cascade of two phases, the first called “scanning”, and the second “lexical analysis”. The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the more complex operations. The lexical analyzer which we have designed takes the input from a input file. It reads one character at a time from the input file, and continues to read until end of the file is reached. It recognizes the valid identifiers, keywords and specifies the token values of the keywords. It also identifies the header files, #define statements, numbers, special characters, various relational and logical operators, ignores the white spaces and comments. It prints the output in a separate file specifying the line number .
BLOCK DIAGRAM
Compiler Design Looking for GATE Preparation Material? Join & Get here now!
<
Next>>
Overview Of Module And Algorithm
Functions
Verify(): The lexical analyzer uses the verify operation to determine whether there is entry for a lexeme in the symbol table. Data structures Structure : A structure in C is a collection of variables which contains related data items of similar and /or dissimilar data types but logically related items. Each variable in the structure represents an item and is called as a member or field of the structure. Complex data can be important represented in more meaningful way using structures and is one of the very features available in C. An array of structures is basically a array of records. Each an every record has the same format consisting of similar or dissimilar data types but logically related entities. ADVERTISEMENT
Algorithms Procedure main begin if symb=’#’ then begin advance to next token in input file if symb=’i’ then begin advance to next token in input file while symb!=’\n’ do begin advance to next token in input file end {while } print symb is a preprocessor directive end {if symb=’i’} if symb=’d’ then begin advance to next token input file while symb!=’ ‘ do begin advance to next token in input file end{while} advance to next token in input file print symb is a constant advance to next token in input file while symb!=’\n’ do begin advance to the next token in input file end {while} end {if symb=’d’} end {if symb=’#’} if symb is a alphabet or symb=’_’ then begin advance to the next token in input file while symb is a digit or alphabet or symb=’_’ do begin advance to the next token of input file end {while} call function verify to check whether symb is a identifier or keyword end {if} if symb=’+’ then begin advance to the next token in input file if symb=’+’ print symb is ++ operator else ungetc symb from the input file print symb is + operator end {if} if symb=’-’ then begin advance to the next token in input file if symb=’-’ print symb is -- operator else ungetc symb from the input file print symb is - operator end {if} if symb=’|’ then begin advance to the next token in input file if symb=’|’ print symb is logical or operator else ungetc symb from the input file print symb is bitwise or operator end {if} if symb=’*’ then begin print symb is a multiplication operator end {if} if symb=’?’ then begin print symb is a conditional operator end{if} if symb=’!’or symb=’>’or symb=’<’then begin advance to the next token in input file
Compiler Design Looking for GATE Preparation Material? Join & Get here now!
<
Next>>
Overview Of Module And Algorithm Continue...
Contd... if symb=’=’ print symb is a relational operator else ungetc symb from output file print symb is a operator end{if} if symb=’=’ begin advance to next token in input file if symb=’=’then print symb is equal to operator else ungetc symb from output file print symb is assignment operator end{if} if symb=’&’ then begin advance to next token in input file if symb=’&’ then print symb is a logical and operator
else print & symb is an address operator end{if} if symb=’/’ then begin advance to next token in input file if symb=’*’ then begin advance to next token in input file while symb!=’/’ do advance to next token in input file end{while} end{if} else if symb=’/’ then begin advance to next token in input file while symb!=’\n’ do advance to next token in input file end{while} end{if} else ungetc symb from output file print symb is a division operator end{if} if symb is a digit then begin
advance to next token in input file while symb is a digit or symb=’.’ then begin advance to next token in input file end {while} print symb is a number end{if} if symb =’”’ then begin advance to next token in input file while symb!=’”’ do begin advance to next token in input file end{while} print symb is a string end{if}} if symb= ‘{‘ then print open brace if symb=’}’ then print close brace if symb=’[‘ then print open bracket if symb=’]’ then print close bracket if symb=’(‘ then print open parenthesis
if symb=’)’ then print close parenthesis end {procedure main} procedure verify begin scan the symbol table to check if encountered token exists if exists return token value end{procedure} ADVERTISEMENT
USER MANUAL The code for modules appears in two files: lex.c and output.c. The file lex.c contains the main source code of the lexical analyzer. And the input to the lexical analyzer is contained in test.c. Under the DOS operating system, the program is compiled by using alt F9, and is executed by using ctrl F9. The output i.e token types are stored in the output file, output.txt Sample Input #include<stdio.h> #include<stdlib.h> #define abc 100 void main() { int a_,b=30; printf("enter 2 no.s\n"); // printf statement scanf("%d%d",&a,&b); /* scanf statement*/ if(a<20)
a=a+1; } Sample Output: LINE NO TOKENS ----------------------------------------------1:
#include<stdio.h> is a header file
2:
#include<stdlib.h> is a header file
3:
#define statement: abc is a constant
4:
void: token value : 7 main :identifier, token value : 18 (: open parenthesis ): close parenthesis
5:
{: open brace
6:
int: token value : 1 a_ :identifier, token value : 18 , : comma b :identifier, token value : 18 =: assignment operator 30 is a number ; : semi colon
7:
printf: token value : 5 (: open parenthesis enter 2 no.s\n : is a string ): close parenthesis ;: semi colon
8:
scanf: token value : 6 (: open parenthesis
%d%d : is a string ,: comma &a: address operator , : comma &b: address operator ): close parenthesis ;: semi colon 9: 10: 11:
if: token value : 8 (: open parenthesis a :identifier, token value : 18 <: less than operator 20 is a number ): close parenthesis
12:
a: token value : 18 =: assignment operator a: token value : 18 +: plus operator 1 is a number ;: semi colon
13:
}: close parenthesis
CONCLUSION Generally, when syntactic analysis is being carried out by the parser it may call upon the scanner for tokenizing the input. But the LEXICAL ANALYZER designed by us is an independent program. It takes as input a file with an executable code in C. There fore, the parser cannot make use of the designed
scanner as and when required. Consider as an example an array ch[20].The designed lexical analyzer will tokenize 'ch' as an identifier,'[' as an opening brace,'20' as a number, and ']' as a closing brace. But the parser might require a[5] to be identified as an array. Similarly, there may arise a number of cases where the parser has to identify a token by a different mannerism than the one specified and designed. Hence, we conclude that the LEXICAL ANALYZER so designed is an independent program which is not flexible.