Manual On Compiler Design

MODULE MAP

COURSE CODE AND TITLE: ITED 308 - Compiler Design

COURSE DESCRIPTION: This is a three-unit course. It is designed as an introduction to the design and construction of compilers and interpreters. It opens with a discussion of translators related to compilers, followed by an overview of the phases and data structures involved in compilation. Topics including lexical analysis, parsing, semantic analysis, code generation, and optimization are then covered in depth, with a series of projects assigned to illustrate practical issues. The performance of the student will be evaluated on the basis of quizzes, machine problems, a term project, and modular examinations.

PREREQUISITES:
- CS 19C – Numerical Method

OBJECTIVES:
General Objective: To study the theory and techniques of compiler construction. The student completing the course should have gained an understanding of the major ideas and techniques in compiler writing and a further development of programming skills.
Specific Objectives:
- To explain the basic concepts and principles of compilers.
- To discuss the problem issues in designing and implementing lexical analyzers.
- To listen critically and purposively to the basic concepts of compilers.
- To participate actively in the group term project.
- To construct a program that acts as a recognizer for the set of strings defined by a regular expression or context-free grammar.

COURSE CONTENT:

Module I.    Introduction to Compilers
Module II.   Lexical Analysis
Module III.  The Syntactic Specification of Programming Languages
Pre-Midterm Examination
Module IV.   Basic Parsing Techniques
Module V.    Syntax-Directed Translation
Midterm Examination
Module VI.   Symbol Tables
Module VII.  Run-Time Storage
Module VIII. Error-Detection and Recovery
Pre-Final Examination
Module IX.   Introduction to Code Optimization
Module X.    Code Generation
Final Examination

MODULE I. INTRODUCTION TO COMPILERS

Compiler writing spans programming languages, machine architecture, language theory, algorithms, and software engineering. Fortunately, a few basic compiler-writing techniques can be used to construct translators for a wide variety of languages and machines.

A compiler is a program that reads a program written in one language – the source language – and translates it into an equivalent program in another language – the target language – as illustrated in Figure 1.1. An important part of the translation process is that the compiler reports to its user the presence of errors in the source program.

Figure 1.1 A Compiler (source program → compiler → target program, with error messages reported to the user)

At first glance, the variety of compilers may appear overwhelming. There are thousands of source languages, ranging from traditional programming languages such as Fortran and Pascal to specialized languages that have arisen in virtually every area of computer application. Target languages are equally varied; a target language may be another programming language, or the machine language of any computer from a microprocessor to a supercomputer.

A compiler translates a source program into machine language. An interpreter program reads individual words of a source program and immediately executes corresponding machine-language segments. Interpretation occurs each time the program is used. Thus, once it has been compiled, a program written in a compiled language will run more quickly than a program in an interpreted language.

An interpreter is a computer program that translates commands written in a high-level computer language into machine-language commands that the computer can understand and execute. An interpreter's function is thus similar to that of a compiler, but the two differ in their mode of operation. A compiler translates a complete set of high-level commands, producing a corresponding set of machine-language commands that are then executed, whereas an interpreter translates and executes each command as the user enters it. Interpretive languages, such as the widely used BASIC, are relatively easy to learn, and programs written in them are easy to edit and correct. Compiled programs, on the other hand, operate more rapidly and efficiently.

THE ANALYSIS-SYNTHESIS MODEL OF COMPILATION

There are two parts of compilation: analysis and synthesis. The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. The synthesis part constructs the desired target program from the intermediate representation.

During analysis, the operations implied by the source program are determined and recorded in a hierarchical structure called a tree. It is a special kind of tree called a syntax tree, in which each node represents an operation and the children of a node represent the arguments of the operation. An example is shown in Figure 1.2.

Figure 1.2 Syntax tree for position := initial + rate * 60 (the root := has children position and +; + has children initial and *; * has children rate and 60).

Software tools that manipulate source programs:

1. Structure editors – take as input a sequence of commands to build a source program. The structure editor not only performs the text-creation and modification functions of an ordinary text editor, but it also analyzes the program text, imposing on it an appropriate hierarchical structure of the source program.
2. Pretty printers – analyze a program and print it in such a way that the structure of the program becomes clearly visible. For example, comments may appear in a special font, and statements may appear with an amount of indentation proportional to the depth of their nesting in the hierarchical organization of the statements.
3. Static checkers – read a program, analyze it, and attempt to discover potential bugs without running the program. For example, a static checker may detect that parts of the source program can never be executed, or that a certain variable might be used before being defined.
4. Interpreters – perform the operations implied by the source program. For an assignment statement, for example, an interpreter might build a tree like the one in Figure 1.2 and then carry out the operations at the nodes as it "walks" the tree. Interpreters are frequently used to execute command languages, since each operator executed in a command language is usually an invocation of a complex routine such as an editor or a compiler.

ANALYSIS OF THE SOURCE PROGRAM

It consists of three phases:

1. Linear analysis, in which the stream of characters making up the source program is read from left to right and grouped into tokens, sequences of characters having a collective meaning.
2. Hierarchical analysis, in which characters or tokens are grouped hierarchically into nested collections with collective meaning.
3. Semantic analysis, in which certain checks are performed to ensure that the components of a program fit together meaningfully.

THE PHASES OF A COMPILER

Conceptually, a compiler operates in phases, each of which transforms the source program from one representation to another. A typical decomposition of a compiler is shown in Figure 1.3.

Figure 1.3 Phases of a Compiler (source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program, with the symbol-table manager and the error handler interacting with all phases)

LEXICAL ANALYSIS

In a compiler, linear analysis is called lexical analysis or scanning. For example, in lexical analysis the characters in the assignment statement

   position := initial + rate * 60

would be grouped into the following tokens:
1. The identifier position.
2. The assignment symbol :=.
3. The identifier initial.
4. The plus sign.
5. The identifier rate.
6. The multiplication sign.
7. The number 60.

The blanks separating the characters of these tokens would normally be eliminated during lexical analysis.

SYNTAX ANALYSIS

Hierarchical analysis is called parsing or syntax analysis. It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. Usually, a parse tree such as the one shown in Figure 1.4 represents the grammatical phrases of the source program.

Figure 1.4 Parse tree for position := initial + rate * 60 (the root assignment statement has children identifier position, :=, and an expression that derives initial + rate * 60).

SEMANTIC ANALYSIS

The semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase. It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.

An important component of semantic analysis is type checking. Here the compiler checks that each operator has operands that are permitted by the source language specification. For example, many programming language definitions require a compiler to report an error every time a real number is used to index an array. However, the language specification may permit some operand coercions, for example, when a binary arithmetic operator is applied to an integer and a real. In this case, the compiler may need to convert the integer to a real.

SYMBOL-TABLE MANAGEMENT

An essential function of a compiler is to record the identifiers used in the source program and collect information about various attributes of each identifier. These attributes may provide information about the storage allocated for an identifier, its type, its scope, and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument, and the type returned, if any.

A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. When the lexical analyzer detects an identifier in the source program, the identifier is entered into the symbol table. However, the attributes of an identifier cannot normally be determined during lexical analysis. For example, in a Pascal declaration like

   var position, initial, rate : real ;

the type real is not known when position, initial, and rate are seen by the lexical analyzer.
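To make the symbol-table idea concrete, the following is a minimal C sketch of a table kept as a linear list of records; the field set, sizes, and function names are illustrative assumptions, and a production compiler would normally use a hash table and richer attribute records.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical attribute set: a real symbol table would also record
       scope, storage, and, for procedures, parameter information. */
    struct symbol {
        char name[32];      /* the identifier's lexeme */
        char type[16];      /* e.g. "real", filled in after the declaration is parsed */
    };

    static struct symbol table[256];
    static int nsyms = 0;

    /* Enter an identifier when the lexical analyzer first sees it;
       its type is not yet known at that point. */
    int st_insert(const char *name) {
        for (int i = 0; i < nsyms; i++)
            if (strcmp(table[i].name, name) == 0)
                return i;                              /* already present */
        strncpy(table[nsyms].name, name, sizeof table[nsyms].name - 1);
        strcpy(table[nsyms].type, "unknown");
        return nsyms++;
    }

    /* Later phases fill in attributes, e.g. the type from a declaration. */
    void st_set_type(int idx, const char *type) {
        strncpy(table[idx].type, type, sizeof table[idx].type - 1);
    }

    int main(void) {
        int p = st_insert("position");
        st_insert("initial");
        st_insert("rate");
        st_set_type(p, "real");                        /* after seeing ": real" */
        printf("%s has type %s\n", table[p].name, table[p].type);
        return 0;
    }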


INTERMEDIATE CODE GENERATION

After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program. We can think of this intermediate representation as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce, and easy to translate into the target program.

CODE OPTIMIZATION

The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result.

CODE GENERATION

The final phase of the compiler is the generation of target code, consisting normally of relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program. Then, intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.

ERROR HANDLING

Each phase can encounter errors. However, after detecting an error, a phase must somehow deal with that error, so that compilation can proceed, allowing further errors in the source program to be detected. A compiler that stops when it finds the first error is not as helpful as it could be.

The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language. Errors where the token stream violates the structure rules (syntax) of the language are determined by the syntax analysis phase. During semantic analysis the compiler tries to detect constructs that have the right syntactic structure but no meaning to the operation involved.

COMPILER-WRITING TOOLS

The compiler writer, like any programmer, can profitably use software tools such as debuggers, version managers, profilers, and so on. Shortly after the first compilers were written, systems to help with the compiler-writing process appeared. These systems have often been referred to as compiler-compilers, compiler-generators, or translator-writing systems. Largely, they are oriented around a particular model of languages, and they are most suitable for generating compilers of languages similar to the model.

Some general tools have been created for the automatic design of specific compiler components. These tools use specialized languages for specifying and implementing the component, and many use algorithms that are quite sophisticated. The most successful tools are those that hide the details of the generation algorithm and produce components that can be easily integrated into the remainder of a compiler. The following is a list of some useful compiler-construction tools:

1. Parser generators. These produce syntax analyzers, normally from input that is based on a context-free grammar. In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler, but also a large fraction of the intellectual effort of writing a compiler. This phase is now considered one of the easiest to implement. Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by hand.
2. Scanner generators. These automatically generate lexical analyzers, normally from a specification based on regular expressions. The basic organization of the resulting lexical analyzer is in effect a finite automaton.
3. Syntax-directed translation engines. These produce collections of routines that walk the parse tree, generating intermediate code. The basic idea is that one or more "translations" are associated with each node of the parse tree, and each translation is defined in terms of translations at its neighbor nodes in the tree.
4. Automatic code generators. Such a tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for the target machine. The rules must include sufficient detail to handle the different possible access methods for data; e.g., variables may be in registers, in a fixed (static) location in memory, or may be allocated a position on a stack. The basic technique is "template matching": the intermediate code statements are replaced by "templates" that represent sequences of machine instructions, in such a way that the assumptions about storage of variables match from template to template.
5. Data-flow engines. Much of the information needed to perform good code optimization involves "data-flow analysis," the gathering of information about how values are transmitted from one part of a program to each other part. Different tasks of this nature can be performed by essentially the same routine, with the user supplying details of the relationship between intermediate code statements and the information being gathered.

FORMAL SPECIFICATIONS AND GENERATION OF COMPILER MODULES

   Compiler subtask                        Specification mechanism       Automaton type
   Lexical analysis                        Regular expressions           Deterministic finite automata
   Syntax analysis                         Context-free grammars         Deterministic pushdown automata
   Semantic analysis                       Attribute grammars            -
   Efficiency-increasing transformations   Tree → tree transformations   Finite tree transducers
   Code selection in code generation       Regular tree grammars         Finite tree automata

MODULE II. LEXICAL ANALYSIS

THE ROLE OF THE LEXICAL ANALYZER

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.

Figure 1.5 Interaction of the Lexical Analyzer with the Parser (the lexical analyzer reads the input and returns the next token on each "get next token" request from the parser; both components use the symbol table and the error handler)

Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab, and newline characters. Another is correlating error messages from the compiler with the source program.

Sometimes lexical analyzers are divided into a cascade of two phases, the first called "scanning" and the second "lexical analysis". The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the more complex operations.

ISSUES IN LEXICAL ANALYSIS

Reasons for separating the analysis phase of compiling into lexical analysis and parsing:
1. Simpler design is perhaps the most important consideration. The separation of lexical analysis from syntax analysis often allows us to simplify one or the other of these phases.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to construct a specialized and potentially more efficient processor for the task.
3. Compiler portability is enhanced. Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer.

TOKENS, PATTERNS, LEXEMES

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

Examples of tokens:

   Token      Sample Lexemes          Informal Description of Pattern
   const      const                   const
   if         if                      if
   relation   <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
   id         pi, count, D2           letter followed by letters and digits
   num        3.1416, 0, 6.02E23      any numeric constant
   literal    "core dumped"           any characters between " and " except "

Tokens are treated as terminal symbols in the grammar for the source language, using boldface names to represent tokens. The lexemes matched by the pattern for a token represent strings of characters in the source program that can be treated together as a lexical unit. In most languages, the following constructs are treated as tokens: keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.

A pattern is a rule describing the set of lexemes that can represent a particular token in a source program. The pattern for the token const is just the single string const that spells out the keyword.

ATTRIBUTES FOR TOKENS

When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. The lexical analyzer collects information about tokens into their associated attributes. The tokens influence parsing decisions; the attributes influence the translation of tokens.

LEXICAL ERRORS

Few errors are discernible at the lexical level alone, because a lexical analyzer has a very localized view of the source program. But suppose a situation does arise in which the lexical analyzer is unable to proceed because none of the patterns for the tokens matches a prefix of the remaining input. Perhaps the simplest recovery strategy is "panic mode" recovery.

Other possible error-recovery actions are:
1. deleting an extraneous character
2. inserting a missing character
3. replacing an incorrect character by a correct character
4. transposing two adjacent characters

Error transformations like these may be tried in attempting to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by just a single error transformation. One way of finding the errors in a program is to compute the minimum number of error transformations required to transform the erroneous program into one that is syntactically well formed.

INPUT BUFFERING

Three general approaches to the implementation of a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

The three are listed in order of increasing difficulty for the implementor. Since the lexical analyzer is the only phase of the compiler that reads the source program character by character, it is possible to spend a considerable amount of time in the lexical analysis phase, even though the later phases are conceptually more complex.

Buffer Pairs

There are many times when the lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match can be announced. Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character. One scheme divides the input buffer into two halves; a lexeme_beginning pointer marks the start of the lexeme being matched while a forward pointer scans ahead (an input buffer in two halves, shown holding E := M * C * * 2 eof).

Sentinels

Except at the ends of the buffer halves, each advance of the forward pointer requires two tests: one for the end of a buffer half and one for the character read. The two tests can be reduced to one if each buffer half is extended to hold a sentinel character, such as eof, at its end (sentinels at the end of each buffer half).
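A minimal C sketch of the sentinel scheme just described; the buffer size N, the fill helper, reading from stdin, and the assumption that the byte 0xFF never occurs in the source are all illustrative choices, not part of any particular compiler.

    #include <stdio.h>

    #define N 4096                     /* size of each buffer half (hypothetical) */

    static char buf[2 * N + 2];        /* two halves, each followed by a sentinel slot */
    static char *forward;              /* scans ahead of the current lexeme */

    /* Read up to N source characters into one half and plant the EOF sentinel.
       A short read leaves the sentinel early, which marks the true end of input. */
    static void fill(char *half) {
        size_t n = fread(half, 1, N, stdin);
        half[n] = (char)EOF;
    }

    void buffers_init(void) {
        fill(buf);                     /* first half occupies buf[0..N]      */
        buf[2 * N + 1] = (char)EOF;    /* second half occupies buf[N+1..2N+1] */
        forward = buf;
    }

    /* Advance forward one character.  Because each half ends in a sentinel,
       the common case costs a single comparison; the extra tests run only
       when a sentinel is actually read. */
    int next_char(void) {
        char c = *forward++;
        if (c != (char)EOF)
            return (unsigned char)c;
        if (forward == buf + N + 1) {          /* hit end of first half  */
            fill(buf + N + 1);                 /* reload second half     */
            return next_char();                /* forward already points there */
        }
        if (forward == buf + 2 * N + 2) {      /* hit end of second half */
            fill(buf);                         /* reload first half      */
            forward = buf;
            return next_char();
        }
        return EOF;                            /* real end of input      */
    }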

SPECIFICATION OF TOKENS

Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions will serve as names for sets of strings.

Strings and Languages

An alphabet or character class denotes any finite set of symbols. For example, the set {0, 1} is the binary alphabet. A string is a finite sequence of symbols drawn from the alphabet. In language theory, the terms sentence and word are often used as synonyms for the term "string". The length of a string s, usually written |s|, is the number of occurrences of symbols in s. The empty string, denoted ∈, is a special string of length zero. The term language denotes any set of strings over some fixed alphabet.

   Term                                      Definition
   Prefix of s                               A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
   Suffix of s                               A string formed by deleting zero or more of the leading symbols of s; nana is a suffix of banana.
   Substring of s                            A string obtained by deleting a prefix and a suffix from s; nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s.
   Proper prefix, suffix, or substring of s  Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s is not equal to x.
   Subsequence of s                          Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.

Operations on Languages

There are several important operations that can be applied to languages. For lexical analysis, the primary operations are union, concatenation, and closure.

   Operation                                Definition
   Union of L and M, written L ∪ M          L ∪ M = { s | s is in L or s is in M }
   Concatenation of L and M, written LM     LM = { st | s is in L and t is in M }
   Kleene closure of L, written L*          L* denotes "zero or more concatenations of" L
   Positive closure of L, written L+        L+ denotes "one or more concatenations of" L

REGULAR EXPRESSIONS

A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed by combining in various ways the languages denoted by the subexpressions of r.

Rules that define regular expressions:
1. ∈ is a regular expression that denotes { ∈ }, that is, the set containing the empty string.
2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, that is, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
   b) (r) (s) is a regular expression denoting L(r) L(s)
   c) (r)* is a regular expression denoting (L(r))*
   d) (r) is a regular expression denoting L(r)

A language denoted by a regular expression is said to be a regular set.

PRACTICE SET

1. What is the input alphabet of each of the following languages?

   a. Pascal
   b. C
   c. C++
   d. Visual Basic
2. What are the conventions regarding the use of blanks in each language in number 1?
3. In a string of length n, how many of the following are there?
   a. Prefixes
   b. Suffixes
   c. Substrings
   d. Proper prefixes
   e. Subsequences
4. Describe the languages denoted by the following regular expressions:
   a. 0(0|1)*
   b. (0|1)*0(0|1)(0|1)
   c. 0*10*10*
   d. (00|11)*((01|10)(00|11))
5. Write regular definitions for the following languages:
   a. all strings of letters that contain the five vowels in order
   b. all strings of letters in which the letters are in ascending lexicographic order
   c. comments consisting of a string surrounded by /* and */, without an intervening */, unless it appears inside the quotes " and "
   d. all strings of digits with at most one repeated digit

REGULAR EXPRESSIONS & REGULAR LANGUAGES

• "Familiar" algebraic properties/identities
 - The alphabet is always a finite set. It can be ANY alphabet, but in the theory of languages it is usually chosen to consist of two or three characters, typically a and b, or 0 and 1.

   Σ = {a, b} = {a} U {b} = (a + b) = (a | b) = a U b

   are all IDENTICAL regular expression notations for Σ.

Definition
Consider a string w made up from the symbols in the alphabet Σ = {a, b}, such as "a" or "b" or "ab" or "ababbbbb" or "aaa" or "abbbaaabb", etc. We then define the string "e", called the empty string, such that for any string w:

   ew = we = w

The symbol "e" is then the "neutral element" for strings under the concatenation operation. (This is similar to defining 0 as the neutral element for real numbers under addition, or 1 as the neutral element under multiplication.) It is important to notice that e is never part of the alphabet.

a* = {a}* = {e, a, aa, aaa, aaaa, ...}; this is the Kleene star, or simply the star operation, on one symbol of the alphabet.

Σ* = {a, b}* = (a, b)* = (a | b)* = an infinite set given by {e, a, b, aa, ab, ba, bb, aaa, aab, aba, baa, abb, bab, bba, bbb, aaaa, ...}, that is, all possible strings of a's and b's. This is the star operation on the entire alphabet. It can be proven that another regular expression for this same set is (a*b*)*.

Ø = { } ≠ e; notice that Ø is a set while e is a symbol.

Ø* = {e} ≠ Ø; the set Ø* has one string in it, namely e, the empty string, and thus it is NOT the empty set.

a | a*b denotes the set containing the string a or all strings of zero or more a's followed by a b. Note: the notation "a | a*b" can also be written as "(a + a*b)", as we shall see when reviewing the regular expressions. The symbol "|" or the symbol "+" in this case means the so-called "or" operation used in set theory. Therefore, the expression a*b means the following: a*b = {e, a, aa, aaa, ...}b = {b, ab, aab, aaab, ...}, that is, the set of all strings of zero or more a's followed by a b.

• Define a set of strings (i.e., a language L) over a given alphabet Σ using a set of construction rules:
 - The empty string e is a regular expression.
 - Every symbol a in the alphabet Σ is a regular expression.
 - Given the regular expressions r and s denoting the languages L(r) and L(s), then:
   • (r) + (s) is the regular expression denoting L(r) U L(s)
   • (r)°(s) is the regular expression denoting L(r)°L(s)
   • (r)* is a regular expression denoting (L(r))*
   • (r) is a regular expression denoting L(r)

• Precedence rules to avoid parentheses:
 - the unary operator * has the highest precedence
 - concatenation ° has the second highest precedence
 - + (also "|", U (union), "or") has the lowest precedence

• Algebraic properties of regular expressions (r & s):
 - r+s = s+r  =>  + is commutative
 - r+(s+t) = (r+s)+t  =>  + is associative
 - (r°s)°t = r°(s°t)  =>  ° is associative
 - r°(s+t) = r°s + r°t  and  (s+t)°r = s°r + t°r  =>  ° distributes over +
 - e°r = r and r°e = r  =>  e is the identity element for °
 - r* = (r+e)*  =>  relation between * and e
 - r** = r*  =>  * is idempotent

NOTE: If two regular expressions denote the same regular language, we say that they are equivalent.

Examples of regular expressions: strings over the alphabet Σ = {0, 1}

• L = 0*1(0 | 1)*0 = 0*1Σ*0; the set of non-zero binary numbers that are multiples of 2 (they end in 0).
• L = (0 + 1)* = Σ*; the set of all strings of 0's and 1's.
• L = (0 + 1)*00(0 + 1)* = Σ*00Σ*; the set of all strings of 0's and 1's with at least two consecutive 0's.
• L = (0 + 1)*011 = Σ*011; the set of all strings of 0's and 1's ending in 011.
• L = letter (letter | digit)*; the set of all strings of letters and digits beginning with a letter.
• (e + 00)* is equivalent to (00)*.
• 1 + 01 is equivalent to (e + 0)1.
• 0* + 0*(1 + 11)(00*(1 + 11))*0*; the set of all strings that do not have the substring 111.

MODULE III. THE SYNTACTIC SPECIFICATION OF PROGRAMMING LANGUAGES

FINITE AUTOMATA

• Machine model that recognizes regular languages.
• The finite automaton (FA) machine M is defined by the 5-tuple M = {∑, S, s0, δ, F}, where the alphabet is ∑ = {0, 1}; the set of states is S = {s0, s1, s2}; the starting state is s0; the set of final states is F = {s2}; and the transitions δ are defined by the table below.

   δ    | 0        | 1
   s0   | s0       | s1
   s1   | {s1, s2} | s1
   s2   | s2       | Ø

M can also be represented by a transition graph (big circles represent states, double circles represent accepting or final states, and the state with an unlabeled arrow coming from its left is the starting state; here s0 has a 0-loop and a 1-edge to s1, s1 has a 1,0-loop and a 0-edge to s2, and s2 has a 0-loop).

This figure (which corresponds to the transition table above) is a nondeterministic finite automaton, NFA. (A brief list of the differences between a Deterministic Finite Automaton, DFA, and a Non-deterministic Finite Automaton, NFA, is given below.)

NFA vs. DFA

Deterministic Finite Automaton (DFA)
• For every state there is exactly one outgoing edge per alphabet symbol (the transition function IS a function).
• For each symbol in the alphabet there is a corresponding output and there is only one.

Non-Deterministic Finite Automaton (NFA)
• At least one of the states has more than one outgoing edge for the same alphabet symbol (the transition function IS NOT a function; it is a relation).
• There may be e-transitions (transitions that occur without the presence of an input symbol from the alphabet).

Are NFAs more powerful than DFAs?
• Not at all. Both NFAs and DFAs recognize regular languages and nothing more.
• Therefore, NFAs can be converted to DFAs, which in turn can be "programmed". (Only a DFA can be programmed directly on a computer.)

Example
The FAs below are equivalent and they recognize the same regular language. A regular expression (there may be many regular expressions) corresponding to the regular language can be easily captured from the NFA construction: 0*1Σ*00*; that is, the regular language includes all strings of 0's and 1's with at least one 1 that is eventually followed by a 0, regardless of how many characters there are between them or how many characters follow or precede them.

(Transition diagrams of the equivalent NFA, with states s0, s1, s2, and DFA, with states A, B, C.)

ALGORITHM TO SIMULATE A DFA

Algorithm 1
Input: A string x ended by an EOF character, and a DFA defined as a 5-tuple with s0 as its initial state and F as its set of accepting states.
Output: "YES" if the DFA accepts x, "NO" otherwise.
Method: Apply the pseudo-code algorithm below to the input string x. The function move(s, c) gives the state to which there is a transition from state s on input character c. The function nextchar returns the next character of the input string x.

   s = s0;
   c = nextchar;
   while c != EOF
      s = move(s, c);
      c = nextchar;
   end
   if s is in F then return "YES" else return "NO";

Finite Automata Parsing
• Accept the input string iff there exists at least one path from the initial state (s0) to a final state (sf) that spells the input string.
• If no path exists, reject the string.
• Example: Use the input string x = '11000'

(Step-by-step trace of the automaton on the input 11000, marking the current state after each input symbol is consumed.)
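As a concrete rendering of Algorithm 1, the following C sketch simulates a small DFA over {0, 1} using a transition table. The particular two-state DFA (it accepts binary strings ending in 0) and the array encoding are illustrative assumptions, not the automaton from the figures.

    #include <stdio.h>

    #define NSTATES 2
    #define NSYMS   2          /* input alphabet {0, 1} mapped to indices 0 and 1 */

    /* Hypothetical example DFA: accepts binary strings that end in 0.
       move[s][c] gives the state reached from state s on input symbol c. */
    static const int move[NSTATES][NSYMS] = {
        /* state 0 */ {1, 0},
        /* state 1 */ {1, 0},
    };
    static const int start_state = 0;
    static const int accepting[NSTATES] = {0, 1};    /* state 1 is accepting */

    /* Algorithm 1: simulate the DFA on the string x and answer YES/NO. */
    const char *simulate(const char *x) {
        int s = start_state;
        for (const char *p = x; *p != '\0'; p++) {   /* '\0' plays the role of EOF */
            int c = *p - '0';                        /* map '0'/'1' to 0/1 */
            if (c < 0 || c >= NSYMS) return "NO";    /* symbol not in the alphabet */
            s = move[s][c];
        }
        return accepting[s] ? "YES" : "NO";
    }

    int main(void) {
        printf("11000 -> %s\n", simulate("11000"));  /* YES: ends in 0 */
        printf("1101  -> %s\n", simulate("1101"));   /* NO             */
        return 0;
    }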

Example
Consider the NFA N = {{q0, ..., q4}, {0, 1}, q0, δ, {q2, q4}} shown below (transition diagram omitted), with transition table:

   δ    | 0        | 1
   q0   | {q0, q3} | {q0, q1}
   q1   | Ø        | {q2}
   q2   | {q2}     | {q2}
   q3   | {q4}     | Ø
   q4   | {q4}     | {q4}

We show next the proliferation of states of the NFA under the input string 01001

(Tree showing the proliferation of active state sets of the NFA as the symbols of 01001 are consumed; since the input contains 00, a path reaches the accepting state q4 and the string is accepted.)
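The proliferation of states can also be simulated directly. The C sketch below runs this NFA on an input string by keeping the set of active states as a bit mask; the transition masks follow the transition table above, while the encoding itself is an illustrative choice.

    #include <stdio.h>

    /* On-the-fly simulation of the example NFA (states q0..q4, accepting {q2, q4});
       the set of active states is kept as a 5-bit mask.  A sketch, not library code. */

    #define NQ 5
    static const int delta[NQ][2] = {      /* bit masks: bit i = state qi          */
        /* q0 */ {(1<<0)|(1<<3), (1<<0)|(1<<1)},
        /* q1 */ {0,             (1<<2)},
        /* q2 */ {(1<<2),        (1<<2)},
        /* q3 */ {(1<<4),        0},
        /* q4 */ {(1<<4),        (1<<4)},
    };
    static const int accepting = (1<<2) | (1<<4);

    int nfa_accepts(const char *x) {
        int cur = 1 << 0;                          /* start in {q0}                */
        for (const char *p = x; *p; p++) {
            int c = *p - '0', next = 0;
            for (int i = 0; i < NQ; i++)           /* union of moves of all states */
                if (cur & (1 << i))
                    next |= delta[i][c];
            cur = next;
        }
        return (cur & accepting) != 0;
    }

    int main(void) {
        printf("01001 -> %s\n", nfa_accepts("01001") ? "accepted" : "rejected");
        printf("0101  -> %s\n", nfa_accepts("0101")  ? "accepted" : "rejected");
        return 0;
    }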

DEFINITIONS RELATED TO THE e-closure ALGORITHM

Definition. Given a state s of some NFA N, e-closure(s) is the set of states in N reachable from s on e-transitions only.

Definition. Given a set of states T of some NFA N, e-closure(T) is the set of states in N reachable from some state s in T on e-transitions alone.

Example (for the e-NFA shown below):
   X = e-closure({0}) = {0, 1, 2, 4, 7}
   Y = e-closure({2}) = {2}
   Z = e-closure({6}) = {6, 1, 2, 4, 7} = {1, 2, 4, 6, 7}
   T = e-closure({1, 2, 5}) = e-closure({1}) U e-closure({2}) U e-closure({5}) = {1, 2, 5, 4, 6, 7} = {1, 2, 4, 5, 6, 7}

NFA (with e-transitions): states 0 to 7, with e-edges 0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, 6→7 and labeled edges 2 --a--> 3 and 4 --b--> 5, as used in the closure computations above.

THE e-closure ALGORITHM

The computation of e-closure(T) is a typical process of searching a graph for the nodes reachable from a given set T of nodes by following the e-labeled edges of the NFA. The (pseudo-code) algorithm below, used to compute e-closure(T), uses a stack to hold the states whose edges have not yet been checked for e-labeled transitions.

Algorithm (e-closure)
   push all states in T onto stack;
   initialize e-closure(T) to T;
   while stack is not empty
      pop t, the top element, off stack;
      for each state u with an edge from t to u labeled e
         if u is not in e-closure(T)
            add u to e-closure(T);
            push u onto stack
      end
   end

ALGORITHM TO BUILD THE DFA FROM AN NFA

Potentially, the total number of states of the DFA is the size of the power set of the set of states of the NFA (2^N).

Input: An NFA N with starting state s0, accepting some regular language.
Output: A DFA D accepting the same language as N.
Method: The algorithm constructs a transition table DTran for D. We call Dstates the set of states in D. We will use the definitions given for the e-closure algorithm plus:

   move(T, a) = set of states to which there is a transition on input symbol a in Σ from some NFA state s in T.

Algorithm 2 ("subset construction")
   initially, e-closure(s0) is the only state in Dstates and it is unmarked;
   while there is an unmarked state T in Dstates
      mark T;
      for each input symbol a in Σ
         U = e-closure(move(T, a));
         if U is not in Dstates then
            add U as an unmarked state to Dstates;
         DTran(T, a) = U
      end
   end
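A small C sketch of the e-closure computation using an explicit stack, following the pseudocode above. It is applied to the e-NFA with states 0 to 7 described earlier; the adjacency representation (eps[s] lists the e-successors of state s, terminated by -1) is an illustrative assumption.

    #include <stdio.h>

    #define NSTATES 8                 /* states 0..7 of the e-NFA above */

    /* eps[s] lists the e-successors of state s, terminated by -1 (assumed encoding). */
    static const int eps[NSTATES][4] = {
        {1, 7, -1}, {2, 4, -1}, {-1}, {6, -1},
        {-1},       {6, -1},    {1, 7, -1}, {-1},
    };

    /* Compute e-closure(T) into closure[]; in_closure[] marks membership. */
    int e_closure(const int *T, int nT, int *closure) {
        int in_closure[NSTATES] = {0}, stack[NSTATES], top = 0, n = 0;
        for (int i = 0; i < nT; i++) {            /* initialize closure(T) to T   */
            stack[top++] = T[i];
            in_closure[T[i]] = 1;
            closure[n++] = T[i];
        }
        while (top > 0) {
            int t = stack[--top];                 /* pop t                        */
            for (int k = 0; eps[t][k] != -1; k++) {
                int u = eps[t][k];
                if (!in_closure[u]) {             /* add newly reached state u    */
                    in_closure[u] = 1;
                    closure[n++] = u;
                    stack[top++] = u;
                }
            }
        }
        return n;
    }

    int main(void) {
        int T[] = {0}, closure[NSTATES];
        int n = e_closure(T, 1, closure);
        printf("e-closure({0}) =");               /* expect {0, 1, 2, 4, 7}       */
        for (int i = 0; i < n; i++) printf(" %d", closure[i]);
        printf("\n");
        return 0;
    }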

Algorithm for Subset Construction (taken from Parson's textbook)

Consider an NFA N and a DFA D accepting the same regular language. [Equivalent to Algorithm 2.]
1) Initially the list of states in D is empty. Create the starting state as e-closure(s0), named after the initial state {s0} in N. That is, the new start state is state(1) = e-closure(s0).
2) While (there is an uncompleted row in the transition table for D) do:
   a) Let x = {s1, s2, ..., sk} be the state for this row.

   b) For each input symbol a do:
      i) Find e-closure({s1, s2, ..., sk}, a) = N({s1}, a) U N({s2}, a) U ... U N({sk}, a) = some set we'll call T.
      ii) Create the D-state y = {T}, corresponding to T.
      iii) If y is not already in the list of D-states, add it to the list. (This results in a new row in the table.)
      iv) Add the rule D(x, a) = y to the list of transition rules for D.
3) Identify the accepting states in D. (They must include the accepting states in N.) That is, make state(j) a final state iff at least one state in state(j) is a final state in the NFA.

Yet another algorithm for the construction of a DFA equivalent to an NFA:

   states[0] = { };
   states[1] = e-closure(s0);
   p = 1; j = 0;
   while (j <= p) {
      foreach c in ∑ do {
         e = DFAedge(states[j], c);
         if (e == states[i] for some i <= p) {
            trans[j, c] = i;
         } else {
            p = p + 1;
            states[p] = e;
            trans[j, c] = p;
         }
      }
      j = j + 1;
   }

Example
Find all possible theoretical states for the NFA given below.

(Transition diagram of the NFA with states 0, 1, and 2.)

There are in total 2^3 = 8 states for the corresponding DFA. This is the power set of the set S = {0, 1, 2}, represented by 2^|S|. That is, the 8 states of the DFA are 2^S = {Ø, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}}, and graphically, the transition diagram is:

(Transition diagram over the eight subset states.)

Trivially, the states {2} and Ø, plus the states {0, 2} and {0, 1}, which have no incoming transitions, can be eliminated. After the 1st elimination cycle is complete, {0, 1, 2} has no incoming transition and can be eliminated. Only A = {0}, B = {1}, and C = {1, 2} remain.


Example
The NFA below represents the regular expression letter (letter | digit)*. Find its corresponding DFA. (The theoretical number of states in this case is 2^10 = 1,024.)

(e-NFA with states 1 to 10; state 10 is the accepting state.)

Solution:
   A = e-closure({1}) = {1}
      move(A, letter) = {2}   (function move defined in Alg. 2)
      move(A, digit) = Ø
   B = e-closure({2}) = {2, 3, 4, 5, 7, 10}
      move(B, letter) = {6}
      move(B, digit) = {8}
   C = e-closure({6}) = {6, 9, 10, 4, 5, 7} = {4, 5, 6, 7, 9, 10}
      move(C, letter) = {6}
      move(C, digit) = {8}
   D = e-closure({8}) = {8, 9, 10, 4, 5, 7} = {4, 5, 7, 8, 9, 10}
      move(D, letter) = {6}
      move(D, digit) = {8}

State A is the start state of the DFA, and all states that include state 10 of the NFA are accepting states of the DFA. The transition diagram of the DFA is given below.

(Transition diagram of the DFA: start state A goes to B on letter and to the dead state Ø on digit; B, C, and D each go to C on letter and to D on digit; B, C, and D are accepting since they contain NFA state 10.)
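To make Algorithm 2 concrete, here is a small C sketch that runs the subset construction on this example, representing sets of NFA states as bit masks. The transition lists (eps and sym_edge) encode a Thompson-style NFA consistent with the closures computed in the Solution above, but they are an illustrative assumption, not taken directly from the manual's figure.

    #include <stdio.h>

    enum { LETTER, DIGIT, NSYM };
    #define BIT(i) (1u << (i))

    /* e-edges and labeled edges of the assumed 10-state NFA (state i = bit i). */
    static const unsigned eps[11] = {
        0, 0, BIT(3), BIT(4)|BIT(10), BIT(5)|BIT(7),
        0, BIT(9), 0, BIT(9), BIT(4)|BIT(10), 0
    };
    static const unsigned sym_edge[11][NSYM] = {
        {0,0}, {BIT(2),0}, {0,0}, {0,0}, {0,0},
        {BIT(6),0}, {0,0}, {0,BIT(8)}, {0,0}, {0,0}, {0,0}
    };

    static unsigned e_closure(unsigned T) {
        unsigned closure = T, frontier = T;
        while (frontier) {                     /* expand until no new states appear */
            unsigned next = 0;
            for (int i = 1; i <= 10; i++)
                if (frontier & BIT(i))
                    next |= eps[i];
            frontier = next & ~closure;
            closure |= next;
        }
        return closure;
    }

    static unsigned move(unsigned T, int a) {
        unsigned m = 0;
        for (int i = 1; i <= 10; i++)
            if (T & BIT(i))
                m |= sym_edge[i][a];
        return m;
    }

    int main(void) {
        unsigned dstates[64];
        int ndstates = 0, marked = 0;
        dstates[ndstates++] = e_closure(BIT(1));          /* start state A */
        while (marked < ndstates) {
            unsigned T = dstates[marked++];               /* mark T */
            for (int a = 0; a < NSYM; a++) {
                unsigned U = e_closure(move(T, a));
                int found = 0;
                for (int i = 0; i < ndstates; i++)
                    if (dstates[i] == U) { found = 1; break; }
                if (!found)
                    dstates[ndstates++] = U;              /* new DFA state */
                printf("DTran(%d, %s) = state with mask %#x\n",
                       marked - 1, a == LETTER ? "letter" : "digit", U);
            }
        }
        printf("%d DFA states (including the dead state)\n", ndstates);
        return 0;
    }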

Example
The NFA below represents the regular expression letter(letter | digit)*. Find its corresponding DFA.

(e-NFA with states 0 to 10 and the corresponding DFA; transition diagrams omitted.)

AN NFA FROM A REGULAR EXPRESSION (Thompson's construction)

Algorithm 3 (Thompson's construction)
Input: A regular expression R over an alphabet Σ.
Output: An NFA N accepting L(R).
Method: Resolve R into its component parts and then construct NFAs for each of the basic symbols using rules (1) and (2) below. Next combine the component basic NFAs using rule (3) until an NFA for the entire expression R is obtained.

Rule 1: For e, construct the NFA

   start → (i) --e--> ((f))

State i is a new start state and state f is a new accepting state. Clearly this NFA recognizes {e}.

Rule 2: For each symbol a in Σ, construct the NFA

   start → (i) --a--> ((f))

The same comment applies for states i and f. This NFA recognizes {a}.

Rule 3: Suppose that s and t are regular expressions which are component parts of R. Call N(s) and N(t) their corresponding NFA’s. (a) For the regular expression s + t (also s | t, s or t, s U t), which corresponds to the union of two sets, construct the composite NFA M = N(s) U N(t) given below.

   NFA M: a new start state i with e-transitions to the old start states of N(s) and N(t)

Here a new start state i has been added. There are e-transitions from i to the two old start states of N(s) and N(t), respectively. The old accepting states of N(s) and N(t) are the accepting states of the combined NFA M. Notice that any path from the start state i to a final accepting state f can pass through either N(s) or N(t) exclusively. Thus, it is clear that the composite NFA M recognizes the regular language L(s) U L(t), or equivalently corresponds to the regular expression s + t. Do not use other possible constructions for N(s) U N(t).

(b) For the regular expression s°t (also st or s•t), which corresponds to the concatenation of two regular sets, construct the composite NFA M = N(s)°N(t) given below.

   NFA M: N(s) followed by N(t), with e-transitions from the old accepting states of N(s) to the old start state of N(t)

Here the only start state is i, the start state of N(s), the first NFA in the concatenation. The only accepting states are the accepting states of N(t), the last NFA in the concatenation chain. No new states have been added. Also, there are e-transitions from each one of the old accepting states of N(s) to the old start state of N(t). Notice that any path from the start state i to a final accepting state f must pass first through N(s) and then through N(t). Thus, it is clear that the composite NFA M recognizes the regular language L(s)°L(t), or equivalently corresponds to the regular expression s°t. It should be noted that there are other constructions for the concatenation, mainly one combining all the accepting states of N(s) and the start state of N(t) into a single state. We will not use that construction.

(c) For the regular expression s*, the Kleene star operation on a set, construct the composite NFA M = (N(s))* given below.

   NFA M: a new start state i (which is also accepting), with an e-transition from i to the old start state of N(s) and e-transitions from the old accepting states of N(s) back to the old start state of N(s)

Here a new state has been added: a new start state i, which is also a new accepting state. There is an e-transition from the new start state i to the old start state of N(s), and e-transitions from the old accepting states of N(s) to the old start state of N(s). The old accepting states of N(s) are still accepting states of the new NFA M. Notice that in the resulting NFA M, we can accept directly at state i, and also if we go through the NFA along an e edge, passing through N(s) one or more times. Thus, it is clear that the composite NFA M recognizes the regular language (L(s))*, or equivalently corresponds to the regular expression s*.

Example
Convert the regular expression (ab U a)* to an NFA.

(Step-by-step Thompson construction: NFAs for a and b, then ab, then ab U a, and finally (ab U a)*; diagrams omitted.)

Example
Convert the regular expression (a U b)*aba to an NFA.

(Thompson construction: the NFA for (a U b)* followed by the NFAs for a, b, and a in concatenation; diagrams omitted.)

MINIMIZING THE NUMBER OF STATES OF A DFA

This is an important theoretical result: every regular set is recognized by a minimum-state DFA.

Algorithm 4
Input: A DFA M with set of states S, start state s0, and a set of accepting states F.
Output: A DFA M' accepting the same language as M and having a minimum number of states.
Method:
1. Construct an initial partition P of the set of states with two groups: the accepting states F and the non-accepting states S – F.
2. Apply the procedure given below to P and construct a new partition Pnew.
   for each group G of P
      • partition G into subgroups such that two states s and t of G are in the same subgroup if and only if, for all input symbols a, states s and t have transitions on a to states in the same group of P; /* at worst, a state will be in a subgroup by itself */
      • replace G in Pnew by the set of all subgroups formed
   end
3. If Pnew = P, let Pfinal = P and continue to step 4. Otherwise, repeat step 2 with P = Pnew.
4. Choose one state in each group of the partition Pfinal as the representative for that group. The representatives will be the states of the reduced DFA M'. Let s be a representative state, and suppose on input a there is a transition of M from s to t. Let r be the representative of t's group (r may be t). Then M' has a transition from s to r on a. Let the start state of M' be the representative of the group containing the start state s0 of M, and let the accepting states of M' be the representatives that are in F. Note that each group of Pfinal either consists only of states in F or has no states in F.
5. If M' has a dead state, that is, a state d that is not accepting and that has transitions to itself on all input symbols, then remove d from M'. Also, remove any states not reachable from the start state. Any transitions to d from other states become undefined.

Example
Consider the DFA given by the following transition table (transition diagram omitted).

        a    b
   A  | B  | C
   B  | B  | D
   C  | B  | C
   D  | B  | E
   E  | B  | C


Since E is the only accepting state, we have the initial partition [ABCD] and [E]. Under a, [ABCD] goes to B, which is in [ABCD], and under b, [ABCD] goes to [CDE]; thus we have the new table:

            a    b
   [ABCD] | B  | [CDE]
   E      | B  | C

We must separate D from the subgroup [ABCD], since D goes to E under b. We now build the following table:

            a    b
   [ABC]  | B  | [CD]
   D      | B  | E
   E      | B  | C

Now we separate B, which goes to D under b. We can build the following table:

            a    b
   [AC]   | B  | C
   B      | B  | D
   D      | B  | E
   E      | B  | C

No further splitting is possible, so the final table (and corresponding transition diagram) is:

            a    b
   [AC]   | B  | [AC]
   B      | B  | D
   D      | B  | E
   E      | B  | [AC]

Thus the minimal DFA has only four states versus five states in the initial DFA.
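A compact C sketch of Algorithm 4's partition refinement applied to this five-state example (states A to E encoded as 0 to 4, with E accepting); the encoding and variable names are illustrative. Each pass splits groups whose members disagree on the group of some successor, and the loop stops when a pass produces no new groups.

    #include <stdio.h>

    #define NS 5
    #define NA 2

    /* Transition table of the example DFA: delta[s][a] with a=0 for 'a', a=1 for 'b'. */
    static const int delta[NS][NA] = {
        /* A */ {1, 2},   /* a -> B, b -> C */
        /* B */ {1, 3},
        /* C */ {1, 2},
        /* D */ {1, 4},
        /* E */ {1, 2},
    };

    int main(void) {
        int group[NS], newgroup[NS];
        for (int s = 0; s < NS; s++) group[s] = (s == 4);   /* {ABCD} = 0, {E} = 1 */

        int ngroups = 2, changed = 1;
        while (changed) {
            changed = 0;
            int nnew = 0;
            for (int s = 0; s < NS; s++) newgroup[s] = -1;
            for (int s = 0; s < NS; s++) {
                if (newgroup[s] != -1) continue;
                newgroup[s] = nnew;                 /* s starts a new subgroup           */
                for (int t = s + 1; t < NS; t++) {  /* keep t with s iff same behaviour  */
                    if (newgroup[t] != -1 || group[t] != group[s]) continue;
                    int same = 1;
                    for (int a = 0; a < NA; a++)
                        if (group[delta[s][a]] != group[delta[t][a]]) same = 0;
                    if (same) newgroup[t] = nnew;
                }
                nnew++;
            }
            if (nnew != ngroups) changed = 1;       /* refinement only splits groups     */
            ngroups = nnew;
            for (int s = 0; s < NS; s++) group[s] = newgroup[s];
        }

        printf("minimal DFA has %d states\n", ngroups);     /* expect 4: [AC], B, D, E  */
        for (int s = 0; s < NS; s++)
            printf("state %c is in group %d\n", 'A' + s, group[s]);
        return 0;
    }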

CONSTRUCTING YOUR SCANNER

Steps:
1) Identify the regular patterns to match
2) Build the NFA for each pattern
3) Merge the individual token NFAs into a single NFA
4) Convert the NFA into a single DFA
5) Implement the resulting DFA as a function

To implement the last step we have three general approaches:

(a) Use a lexical-analyzer generator (such as Lex, to be briefly discussed later) to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
(b) Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
(c) Write the lexical analyzer in assembly language and explicitly manage the reading of input.

The three choices are listed in order of increasing difficulty for the implementor. Unfortunately, the harder-to-implement approaches often yield faster lexical analyzers. Since the lexical analyzer is the only phase of the compiler that reads the source program character by character, it is possible to spend a considerable amount of time in the lexical analysis phase, even though the later phases are conceptually more complex. Thus, the speed of lexical analysis is a concern in compiler design. While the bulk of our discussion is devoted to the first approach, the design and use of an automatic generator is also briefly considered.

Example
Problem: Recognize the "IF" and "ELSE" keywords, integer values, and strings of letters (other than the keywords, i.e., identifiers), that is, recognize the 4 patterns.
Solution: Implement a function scan that scans the input looking for the regular expressions "if", "else", [0-9]+, and [a-z]+ and returns an integer code for each matched token.

Accepting DFA states require look-ahead to distinguish identifiers from keywords. That is, we need to distinguish tokens that can appear as a prefix of another, longer token; for example, we need to distinguish strings such as:
   "i = 0" vs. "if(t)" vs. "iflast = 1"
   "1" vs. "1e+3" vs. "1+e+3"

• Build the NFA: the four pattern NFAs for if, else, [0-9]+, and [a-z]+ are merged with e-transitions from a common start state (e-NFA with states 0 to 12; diagram omitted).

• Convert to DFA: the resulting DFA has states s0 to s8, with transitions such as a-z/[i,e] and a-z/f that record which prefix of a keyword has been seen so far (diagram omitted).
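The DFA can also be coded directly by hand. Below is a minimal C sketch of such a scan function for the four patterns; the token codes, the sample input, the buffer limit, and the fallback handling of other characters are illustrative assumptions. Note how the keyword check is made only after the longest [a-z]+ match, which is exactly the look-ahead issue described above.

    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>

    enum { IF_TOK, ELSE_TOK, INT_TOK, ID_TOK, EOF_TOK };

    static const char *src;          /* forward pointer into the input */
    static char lexeme[64];          /* text of the most recent token  */

    int scan(void) {
        while (*src == ' ' || *src == '\t' || *src == '\n') src++;   /* skip blanks */
        if (*src == '\0') return EOF_TOK;

        int n = 0;
        if (isdigit((unsigned char)*src)) {                  /* [0-9]+ */
            while (isdigit((unsigned char)*src) && n < 63) lexeme[n++] = *src++;
            lexeme[n] = '\0';
            return INT_TOK;
        }
        if (islower((unsigned char)*src)) {                  /* [a-z]+ : longest match */
            while (islower((unsigned char)*src) && n < 63) lexeme[n++] = *src++;
            lexeme[n] = '\0';
            /* only after the longest match do we check for a keyword */
            if (strcmp(lexeme, "if") == 0)   return IF_TOK;
            if (strcmp(lexeme, "else") == 0) return ELSE_TOK;
            return ID_TOK;
        }
        lexeme[0] = *src++; lexeme[1] = '\0';                /* character outside the   */
        return ID_TOK;                                       /* four patterns: crude fallback */
    }

    int main(void) {
        src = "if iflast 42 else x";
        for (int t; (t = scan()) != EOF_TOK; )
            printf("token %d  lexeme \"%s\"\n", t, lexeme);
        return 0;
    }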

A LEXICAL ANALYZER GENERATOR: Lex (flex - fast lex)
[mostly taken from Kenneth Louden's "Compiler Construction, Principles and Practice", 1997]

• Lex is a program that takes as its input a text file containing regular expressions, together with actions to be taken when each expression is matched. ("yytext" (the lexeme beginning) is a pointer to the first character of a lexeme; "yyleng" tells how long the lexeme is.)
• Lex produces an output file that contains C source code defining a procedure "yylex" that is a table-driven implementation of a DFA corresponding to the regular expressions of the input file, and that operates like a "getToken" procedure.
• The Lex output file, usually called "lex.yy.c" or "lexyy.c", is then compiled and linked to a main program to get a running program. ("yylval" is a variable whose definition appears in the Lex output lex.yy.c and which is also available to the parser. Its purpose is to hold the lexical value returned.)
• The lexical analyzer created by Lex works in concert with the parser: the lexical analyzer reads its input one character at a time, until it finds the longest prefix of the input that is matched by one of the regular expressions (disambiguating by longest match), and then it executes the specified action, returning control to the parser (provided that all white space has been processed).

Example:

Using Lex

tokens.lex (token codes, regular expressions, and actions):

   #define IF_TOK   0
   #define ELSE_TOK 1
   #define INT_TOK  2
   #define ID_TOK   3
   .............
   union {int ival; char *sval;} yylval;
   %%  /* Lex Definitions */
   digits   [0-9]+
   letter   [a-z]+
   %%  /* Lex Reg. Exprs. & actions */
   if       {return IF_TOK;}
   else     {return ELSE_TOK;}
   digit+   {yylval.ival = atoi(yytext); return INT_TOK;}
   letter+  {yylval.sval = String(yytext); return ID_TOK;}

lex.yy.c (generated by Lex: a finite-automaton implementation, with transition-table generation and state minimization):

   int yyleng;
   char *yytext;
   union {int ival; char *sval;} yylval;
   int yylex( ) { ............. }

The parser file that uses the generated scanner:

   #include "tokens.h"
   extern union {....} yylval;
   extern char *yytext;
   extern int yyleng;

   int parse() {
      while (not done) {
         token = yylex();
         switch (token) {
            case INT_TOK:  itotal += yylval.ival; break;
            case ID_TOK:   idx = search(yylval);
                           if (idx == 0) {idx = insert(yylval.sval);}
                           hist[idx]++;
                           break;
            case IF_TOK:   numIfs++; break;
            case ELSE_TOK: numElses++; break;
         }
      }
   }
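In practice the generated scanner is produced and linked along these lines (a hedged sketch of the usual workflow, not a prescription from this manual): running lex tokens.lex (or flex tokens.lex) produces lex.yy.c, which is then compiled and linked together with the file containing parse() and main(), for example cc lex.yy.c main.c -ll (with flex, the support library is -lfl); the exact commands vary by system.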

Note: extern is a C/C++ storage class specifier; a storage class specifier determines an identifier's storage, scope, and linkage. Global variables and function names default to the storage class extern.

LIMITATIONS OF THE FINITE AUTOMATA

Observation: If automaton M has n states, then for an input string in the language that M recognizes longer than n, M must enter a given state more than once.

Consequence: Once a given state is entered a second time, the information regarding the number of times the FA entered the state is lost.

Result: Regular languages cannot describe unbounded nested structure (e.g., balanced arithmetic expressions). Moreover, as we know, no FA will recognize languages such as L = {w = a^n b^n | n > 0}, which is a Context-Free Language (CFL).

Context-free Grammars

A programming language can be defined by describing what its programs look like, or how a program can be built (the syntax of the language), and what its programs mean (the semantics of the language). Context-free grammars, or BNF, are the most widely used notation at present. Besides specifying the syntax of a language, a context-free grammar can be used to help guide the translation of programs. A grammar-oriented compiling technique, known as syntax-directed translation, is helpful for organizing a compiler front end. The lexical analyzer converts the stream of input characters into a stream of tokens that becomes the input to the following phase:

Character stream → Lexical Analyzer → token stream → Syntax-directed translator → intermediate representation

Structure of a compiler front end. The syntax-directed translator is a combination of a syntax analyzer and an intermediate code generator. One reason for starting with expressions consisting of digits and operators is to make lexical analysis initially very easy; each input character forms a single token.

Syntax definition

A grammar naturally describes the hierarchical structure of many programming language constructs. For example, an if-else statement in C has the form

   if (expression) statement else statement

A context-free grammar has four components:
1. A set of tokens, known as terminal symbols.
2. A set of nonterminals, which denote language constructs.
3. A set of productions, where each production consists of a nonterminal, called the left side of the production, an arrow, and a sequence of tokens and/or nonterminals, called the right side of the production.
4. A designation of one of the nonterminals as the start symbol.

   list  → list + digit
   list  → list – digit
   list  → digit
   digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

A production is for a nonterminal if the nonterminal appears on the left side of the production. A string of tokens is a sequence of zero or more tokens. The string containing zero tokens, written as ∈, is called the empty string.

A grammar derives strings by beginning with the start symbol and repeatedly replacing a nonterminal by the right side of a production for that nonterminal. The token strings derived from the start symbol form the language defined by the grammar.

Derivations and Parse Trees

A parse tree shows how the start symbol of a grammar derives a string in the language. If nonterminal B has a production B → XYZ, then a parse tree may have an interior node labeled B with three children labeled X, Y, and Z, from left to right.

(Interior node B with children X, Y, and Z.)

Properties of a parse tree:
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a token or by ∈.
3. Each interior node is labeled by a nonterminal.
4. If B is the nonterminal labeling some interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then B → X1 X2 ... Xn is a production. Here X1, X2, ..., Xn stand for symbols that are either terminals or nonterminals. If B → ∈, then a node labeled B may have a single child labeled ∈.

   list  → list + digit
   list  → list – digit
   list  → digit
   digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Example: Parse tree for 5 – 7 + 3

(Parse tree: the root list derives list + digit; its left child list derives list – digit; the leaves, read from left to right, form 5 – 7 + 3.)

The root is labeled list, the start symbol of the grammar. The children of the root are labeled, from left to right, list, +, and digit, so

   list → list + digit

is a production in the grammar, and the same holds with – at the left child of the root; the three nodes labeled digit each have one child that is labeled by a digit. The leaves of a parse tree read from left to right form the yield of the tree, which is the string generated or derived from the nonterminal at the root of the parse tree.

Parsing is the process of finding a parse tree for a given string of tokens. It is also the process of determining whether a string of tokens can be generated by a grammar.


MODULE IV. BASIC PARSING TECHNIQUES

ADVANTAGES OFFERED BY GRAMMARS TO BOTH LANGUAGE DESIGNERS AND COMPILER WRITERS
1. A grammar gives a precise, yet easy to understand, syntactic specification of a programming language.
2. From certain classes of grammars, we can automatically construct an efficient parser that determines if a source program is syntactically well formed.
3. A properly designed grammar imparts a structure to a programming language that is useful for the translation of source programs into correct object code and for the detection of errors. Tools are available for converting grammar-based descriptions of translations into working programs.
4. Languages evolve over a period of time, acquiring new constructs and performing additional tasks.

THE ROLE OF THE PARSER

   source program → Lexical Analyzer → (token / get_next_token) → Parser → parse tree → Rest of Front End → intermediate representation
   (All of these phases consult the Symbol Table.)
   The position of the parser in the compiler model

There are three general types of parsers for grammars. Universal methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar, but these methods are too inefficient to use in production compilers. The most efficient top-down and bottom-up methods work only on subclasses of grammars, but several of these subclasses, such as the LL and LR grammars, are expressive enough to describe most syntactic constructs in programming languages.

SYNTAX ERROR HANDLING
If a compiler had to process only correct programs, its design and implementation would be greatly simplified. But programmers frequently write incorrect programs, and a good compiler should assist the programmer in identifying and locating errors. It is striking that although errors are so commonplace, few languages have been designed with error handling in mind.

EXAMPLES OF ERRORS
1. Lexical, such as misspelling an identifier, keyword, or operator.
2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
3. Semantic, such as an operator applied to an incompatible operand.
4. Logical, such as an infinitely recursive call.

Often much of the error detection and recovery in a compiler is centered on the syntax analysis phase. One reason for this is that many errors are syntactic in nature or are exposed when the stream of tokens coming from the lexical analyzer disobeys the grammatical rules defining the programming language.

GOALS OF THE ERROR HANDLER IN A PARSER
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.

ERROR RECOVERY STRATEGIES
1. Panic Mode. The simplest method to implement, and usable by most parsing methods. On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found.
2. Phrase-Level Recovery. On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue.
3. Error Productions. If we have a good idea of the common errors that might be encountered, the grammar for the language at hand can be augmented with productions that generate the erroneous constructs.


4. Global Correction. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction.
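As a rough illustration of the panic-mode strategy, the following C sketch discards input symbols until a synchronizing token such as a semicolon is found. The token codes, the canned token stream, and the helper next_token are illustrative assumptions, not part of this manual's code.

   #include <stdio.h>

   /* Hypothetical token codes and a canned token stream for illustration. */
   enum token { SEMI, END_TOK, IF, ID, NUM, PLUS, EOF_TOK };

   static enum token input[] = { IF, ID, PLUS, PLUS, NUM, SEMI, ID, EOF_TOK };
   static int pos = 0;

   static enum token next_token(void) { return input[pos++]; }

   /* Panic mode: on an error, discard input symbols one at a time until a
      designated synchronizing token (here ; or end) is found. */
   static enum token panic_mode_recover(enum token lookahead)
   {
       while (lookahead != SEMI && lookahead != END_TOK && lookahead != EOF_TOK)
           lookahead = next_token();
       return lookahead;
   }

   int main(void)
   {
       enum token t = next_token();        /* pretend an error was detected here */
       t = panic_mode_recover(t);
       printf("resumed at token code %d\n", t);   /* prints 0 (SEMI) */
       return 0;
   }

The parser would then resume normal operation from the synchronizing token.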

REGULAR EXPRESSIONS VS. CONTEXT-FREE GRAMMARS
Every construct that can be described by a regular expression can also be described by a grammar. Since every regular set is a context-free language, it is reasonable to ask why regular expressions are used to define the lexical syntax of a language:
1. The lexical rules of a language are frequently quite simple, and to describe them we do not need a notation as powerful as grammars.
2. Regular expressions generally provide a more concise and easier to understand notation for tokens than grammars.
3. More efficient lexical analyzers can be constructed automatically from regular expressions than from arbitrary grammars.
4. Separating the syntactic structure of a language into lexical and nonlexical parts provides a convenient way of modularizing the front end of a compiler into two manageable-sized components.

ELIMINATION OF LEFT RECURSION
A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒ Aα for some string α. Top-down parsing methods cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed.

LEFT FACTORING
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing. The basic idea is that when it is not clear which of two alternative productions to use to expand a nonterminal A, it is better to rewrite the A-productions so as to defer the decision until enough of the input has been seen to make the right choice.

NON-CONTEXT-FREE LANGUAGE CONSTRUCTS
It should come as no surprise that some languages cannot be generated by any grammar. A few syntactic constructs found in many programming languages cannot be specified using grammars alone. For example, L1 = {wcw | w is in (a|b)*}. L1 consists of all words composed of a repeated string of a's and b's separated by a c, such as aabcaab. It can be proven that this language is not context free. This language abstracts the problem of checking that identifiers are declared before their use in a program.

TOP-DOWN PARSING
RECURSIVE-DESCENT PARSING
Top-down parsing can be viewed as an attempt to find a leftmost derivation for an input string. Equivalently, it can be viewed as an attempt to construct a parse tree for the input starting from the root and creating the nodes of the parse tree in preorder. Backtracking is rarely needed to parse programming language constructs. Consider the grammar
   S → cAd
   A → ab | a
and the input string w = cad. To construct a parse tree for this string top down, a tree is initially created consisting of a single node labeled S, and an input pointer points to c, the first symbol of w.

        S              S               S
      / | \          / | \           / | \
     c  A  d        c  A  d         c  A  d
                      / \              |
                     a   b             a
       (a)            (b)             (c)

The leftmost leaf, labeled c, matches the first symbol of w, so the input pointer is advanced to a, the second symbol of w, and the next leaf, labeled A, is considered. A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to go into an infinite loop.

PREDICTIVE PARSERS
A grammar that can be parsed by a recursive-descent parser can often be obtained by carefully writing the grammar, eliminating left recursion from it, and left factoring the resulting grammar. To construct a predictive parser, we must know, given the current input symbol a and the nonterminal A to be expanded, which one of the alternatives of the production A → α1 | α2 | … | αn is the unique alternative that derives a string beginning with a. That is, the proper alternative must be detectable by looking at only the first symbol it derives.

NON-RECURSIVE PREDICTIVE PARSING
It is possible to build a nonrecursive predictive parser by maintaining a stack explicitly, rather than implicitly via recursive calls. The key problem during predictive parsing is that of determining the production to be applied for a nonterminal.

   INPUT:  a + b $
   STACK:  X Y Z $   (X on top, $ at the bottom)
   A predictive parsing program, driven by parsing table M, produces the OUTPUT.
   MODEL OF A NONRECURSIVE PREDICTIVE PARSER

A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output stream. The input buffer contains the string to be parsed, followed by $, a symbol used as a right endmarker to indicate the end of the input string. The stack contains a sequence of grammar symbols with $ on the bottom, indicating the bottom of the stack. Initially, the stack contains the start symbol of the grammar on top of $. The parser is controlled by a program that behaves as follows. The program considers X, the symbol on top of the stack, and a, the current input symbol. These two symbols determine the action of the parser. There are three possibilities.
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a nonterminal, the program consults entry M[X, a] of the parsing table M.

FIRST AND FOLLOW
The construction of a predictive parser is aided by two functions associated with a grammar G. These functions, FIRST and FOLLOW, allow us to fill in the entries of a predictive parsing table for G, whenever possible. If α is any string of grammar symbols, let FIRST(α) be the set of terminals that begin the strings derived from α. Define FOLLOW(A), for nonterminal A, to be the set of terminals a that can appear immediately to the right of A in some sentential form; that is, the set of terminals a such that there exists a derivation of the form S ⇒ αAaβ for some α and β.

ERROR RECOVERY IN PREDICTIVE PARSING
The stack of a nonrecursive predictive parser makes explicit the terminals and nonterminals that the parser hopes to match with the remainder of the input. An error is detected during predictive parsing when the terminal on top of the stack does not match the next input symbol, or when nonterminal A is on top of the stack, a is the next input symbol, and the parsing table entry M[A, a] is empty.

BOTTOM-UP PARSING
In this section, we introduce a general style of bottom-up syntax analysis known as shift-reduce parsing. An easy-to-implement form of shift-reduce parsing is called operator-precedence parsing. Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top). We can think of this process as one of "reducing" a string w to the start symbol of a grammar. At each reduction step, a particular substring matching the right side of a production is replaced by the symbol on the left of the production, and if the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse.


Example 4.21. Consider the grammar
   S → aABe
   A → Abc | b
   B → d
The sentence abbcde can be reduced to S by the following steps:
   abbcde
   aAbcde
   aAde
   aABe
   S
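The same reductions can be carried out by a shift-reduce parser. One possible stack/input/action trace (a sketch consistent with the steps above, not part of the original example) is:

   STACK     INPUT      ACTION
   $         abbcde$    shift
   $a        bbcde$     shift
   $ab       bcde$      reduce by A → b
   $aA       bcde$      shift
   $aAb      cde$       shift
   $aAbc     de$        reduce by A → Abc
   $aA       de$        shift
   $aAd      e$         reduce by B → d
   $aAB      e$         shift
   $aABe     $          reduce by S → aABe
   $S        $          accept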

IMPLEMENTATION OF LR PARSING TABLES

LR PARSERS
This section presents an efficient, bottom-up syntax analysis technique that can be used to parse a large class of context-free grammars. The technique is called LR(k) parsing; the "L" is for left-to-right scanning of the input, the "R" for constructing a rightmost derivation in reverse, and the "k" for the number of input symbols of lookahead used in making parsing decisions. When k is omitted, k is assumed to be 1.
LR parsing is attractive for a variety of reasons. An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input. LR parsers can be constructed to recognize virtually all programming-language constructs for which context-free grammars can be written. The LR parsing method is the most general nonbacktracking shift-reduce parsing method known; yet it can be implemented as efficiently as other shift-reduce methods. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.

THE LR PARSING ALGORITHM
An LR parser consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action and goto). The parsing program reads characters from an input buffer one at a time and uses a stack to store a string of the form s0 X1 s1 X2 s2 … Xm sm, where sm is on top; each Xi is a grammar symbol and each si is a state. In an implementation, the grammar symbols need not appear on the stack; however, we shall always include them in our discussion to help explain the behavior of an LR parser.
The parsing table consists of two parts, a parsing action function action and a goto function goto. The parsing action table entry action[sm, ai] for state sm and input ai can have one of four values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept, and
4. error.
The function goto takes a state and a grammar symbol as arguments and produces a state. The goto function of a parsing table constructed from a grammar G using the SLR, canonical LR, or LALR method is the transition function of a deterministic finite automaton that recognizes the viable prefixes of G.
A configuration of an LR parser is a pair whose first component is the stack contents and whose second component is the unexpended input:
   ( s0 X1 s1 X2 s2 … Xm sm ,  ai ai+1 … an $ )
This configuration represents the right-sentential form X1 X2 … Xm ai ai+1 … an. The configurations resulting after each of the four types of move are as follows:
1. If action[sm, ai] = shift s, the parser executes a shift move, entering the configuration ( s0 X1 s1 X2 s2 … Xm sm ai s , ai+1 … an $ ).
2. If action[sm, ai] = reduce A → β, then the parser executes a reduce move, popping the symbols and states for β off the stack and pushing A followed by the state given by the goto function.
3. If action[sm, ai] = accept, parsing is completed.


4. If action[sm, ai] = error, the parser has discovered an error and calls an error recovery routine.

MODULE V. SYNTAX-DIRECTED TRANSLATION

   input string → parse tree → dependency graph → evaluation order for semantic rules
   Conceptual view of syntax-directed translation

SYNTAX-DIRECTED DEFINITIONS
A syntax-directed definition is a generalization of a context-free grammar in which each grammar symbol has an associated set of attributes, partitioned into two subsets called the synthesized and inherited attributes of that grammar symbol. The process of computing the attribute values at the nodes is called annotating or decorating the parse tree. The value of an attribute at a parse tree node is defined by a semantic rule associated with the production used at the node. The value of a synthesized attribute at a node is computed from the values of attributes at the children of that node in the parse tree; the value of an inherited attribute is computed from the values of attributes at the siblings and the parent of that node.
Annotated parse tree - a parse tree showing the values of attributes at each node.
Annotating or decorating - the process of computing the attribute values at the nodes.
Attribute grammar - a syntax-directed definition in which the functions in semantic rules cannot have side effects.

FORM OF A SYNTAX-DIRECTED DEFINITION

In a syntax-directed definition, each grammar production A → α has associated with it a set of semantic rules of the form b := f(C1, C2, …, Ck), where f is a function, and either
1. b is a synthesized attribute of A and C1, C2, …, Ck are attributes belonging to the grammar symbols of the production, or
2. b is an inherited attribute of one of the grammar symbols on the right side of the production, and C1, C2, …, Ck are attributes belonging to the grammar symbols of the production.

INHERITED ATTRIBUTES
An inherited attribute is one whose value at a node in a parse tree is defined in terms of attributes at the parent and/or siblings of that node. Inherited attributes are convenient for expressing the dependence of a programming language construct on the context in which it appears.

DEPENDENCY GRAPHS
If an attribute b at a node in a parse tree depends on an attribute c, then the semantic rule for b at that node must be evaluated after the semantic rule that defines c. The interdependencies among the inherited and synthesized attributes at the nodes in a parse tree can be depicted by a directed graph called a dependency graph.

(Figure: parse tree for the declaration real id1, id2, id3, with T.type = real and the inherited attribute L.in = real at each node labeled L.)
Parse tree with inherited attribute in at each node labeled L.
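The semantic rules that produce such a tree can be written as follows (a reconstruction of the standard declaration example suggested by the figure, not text taken from this manual; addtype is an assumed helper that records a type in the symbol table):

   D → T L        { L.in := T.type }
   T → int        { T.type := integer }
   T → real       { T.type := real }
   L → L1 , id    { L1.in := L.in ;  addtype(id.entry, L.in) }
   L → id         { addtype(id.entry, L.in) }

The inherited attribute L.in carries the declared type down the list of identifiers.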


TOP-DOWN TRANSLATION

ELIMINATING LEFT RECURSION FROM A TRANSLATION SCHEME
Since most arithmetic operators associate to the left, it is natural to use left-recursive grammars for expressions. When the underlying grammar of a translation scheme is transformed to eliminate left recursion, the semantic actions must be carried along; the transformation applies to translation schemes with synthesized attributes.
Example: The translation scheme below is transformed into the translation scheme that follows it. The new scheme produces the annotated parse tree for the expression 9-5+2. The arrows in the figure suggest a way of determining the value of the expression.

   E → E1 + T    { E.val := E1.val + T.val }
   E → E1 - T    { E.val := E1.val - T.val }
   E → T         { E.val := T.val }
   T → ( E )     { T.val := E.val }
   T → num       { T.val := num.val }

Translation scheme with a left-recursive grammar. For top-down parsing, we can assume that an action is executed at the time that a symbol in the same position would be expanded. The transformed scheme is:

   E → T { R.i := T.val }  R  { E.val := R.s }
   R → + T { R1.i := R.i + T.val }  R1  { R.s := R1.s }
   R → - T { R1.i := R.i - T.val }  R1  { R.s := R1.s }
   R → ∈ { R.s := R.i }
   T → ( E ) { T.val := E.val }
   T → num { T.val := num.val }

(Annotated parse tree for 9-5+2 under the transformed scheme: the leaves give num.val = 9, 5, 2 and T.val = 9, 5, 2; the inherited attribute starts at R.i = 9, becomes R.i = 4 after the subtraction of 5, then R.i = 6 after the addition of 2; the synthesized attribute R.s passes 6 back up, so E.val = 6.)

In order to adapt other left-recursive translation schemes for predictive parsing, we shall express the use of the attributes R.i and R.s more abstractly. Suppose we have the following translation scheme:
   A → A1 Y    { A.a := g(A1.a, Y.y) }
   A → X       { A.a := f(X.x) }
Each grammar symbol has a synthesized attribute written using the corresponding lower-case letter, and f and g are arbitrary functions. Eliminating left recursion gives the grammar
   A → X R
   R → Y R | ∈
Taking the semantic actions into account, the transformed scheme becomes
   A → X { R.i := f(X.x) }  R  { A.a := R.s }
   R → Y { R1.i := g(R.i, Y.y) }  R1  { R.s := R1.s }
   R → ∈ { R.s := R.i }


(Figure: two ways of computing the attribute value A.a = g(g(f(X.x), Y1.y), Y2.y). In the parse tree for the original left-recursive grammar, each interior A node computes A.a bottom-up from its children. In the parse tree for the transformed grammar A → X R, R → Y R | ∈, the inherited attribute R.i accumulates f(X.x), then g(f(X.x), Y1.y), then g(g(f(X.x), Y1.y), Y2.y). Two ways of computing an attribute value.)
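The bottom-up way of computing the value can also be seen operationally. The following C sketch evaluates 9-5+2 with the transformed scheme E → T R, R → + T R1 | - T R1 | ∈, carrying R.i as a function parameter and returning R.s. The function names and the way the lookahead pointer is advanced are illustrative assumptions, not code from this manual.

   #include <stdio.h>

   static const char *lookahead;

   static int T(void)            /* T -> digit   { T.val := digit } */
   {
       int v = *lookahead - '0';
       lookahead++;
       return v;
   }

   static int R(int i)           /* i plays the role of R.i; the return value is R.s */
   {
       if (*lookahead == '+') { lookahead++; return R(i + T()); }
       if (*lookahead == '-') { lookahead++; return R(i - T()); }
       return i;                 /* R -> epsilon : R.s := R.i */
   }

   static int E(void)            /* E -> T R    { R.i := T.val; E.val := R.s } */
   {
       return R(T());
   }

   int main(void)
   {
       lookahead = "9-5+2";
       printf("%d\n", E());      /* prints 6 */
       return 0;
   }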

The same transformation can be applied to the translation scheme that builds syntax trees for expressions:

   E → E1 + T    { E.nptr := mknode('+', E1.nptr, T.nptr) }
   E → E1 - T    { E.nptr := mknode('-', E1.nptr, T.nptr) }
   E → T         { E.nptr := T.nptr }
   T → ( E )     { T.nptr := E.nptr }
   T → id        { T.nptr := mkleaf(id, id.entry) }
   T → num       { T.nptr := mkleaf(num, num.val) }

Transformed for predictive parsing, it becomes

   E → T { R.i := T.nptr }  R  { E.nptr := R.s }
   R → + T { R1.i := mknode('+', R.i, T.nptr) }  R1  { R.s := R1.s }
   R → - T { R1.i := mknode('-', R.i, T.nptr) }  R1  { R.s := R1.s }
   R → ∈ { R.s := R.i }
   T → ( E ) { T.nptr := E.nptr }
   T → id    { T.nptr := mkleaf(id, id.entry) }
   T → num   { T.nptr := mkleaf(num, num.val) }

BOTTOM-UP EVALUATION OF INHERITED ATTRIBUTES

The method is capable of handling all L-attributed definitions considered in the previous section, in that it can implement any L-attributed definition based on an LL(1) grammar. It can also implement many (but not all) L-attributed definitions based on LR(1) grammars.

REMOVING EMBEDDED ACTIONS FROM TRANSLATION SCHEMES
In the bottom-up translation method, we relied upon all translation actions being at the right end of the production, while in the predictive-parsing method we needed to embed actions at various places within the right side. The transformation inserts new marker nonterminals generating ∈ into the base grammar. We replace each embedded action by a distinct marker nonterminal M and attach the action to the end of the production M → ∈.
Example: The scheme
   E → T R
   R → + T { print('+') } R  |  - T { print('-') } R  |  ∈
   T → num { print(num.val) }

is transformed, using marker nonterminals M and N, into
   E → T R
   R → + T M R  |  - T N R  |  ∈
   T → num { print(num.val) }
   M → ∈ { print('+') }
   N → ∈ { print('-') }
The grammars in the two translation schemes accept exactly the same language and, by drawing a parse tree with extra nodes for the actions, we can show that the actions are performed in the same order. Actions in the transformed translation scheme terminate productions, so they can be performed just before the right side is reduced during bottom-up parsing.


Fig 5.0. Dependency graph

RECURSIVE EVALUATION OF ATTRIBUTES

The dependency graph for a parse tree is formed by pasting together smaller graphs corresponding to the semantic rules for each production. The dependency graph Dp for production p is based only on the semantic rules for a single production, i.e., on the semantic rules for the synthesized attributes of the left side and the inherited attributes of the grammar symbols on the right side of the production. That is, graph Dp shows local dependencies only; for example, all edges in the dependency graph for E → E1 E2 in Fig. 5.58 are between instances of the same attribute.

Strongly Noncircular Syntax-Directed Definitions
Recursive evaluators can be constructed for a class of syntax-directed definitions, called "strongly noncircular" definitions. For a definition in this class, the attributes at each node for a nonterminal can be evaluated according to the same (partial) order. When we construct the function for the synthesized attributes of the nonterminal, this order is used to select the inherited attributes that become the parameters of the function.

Several methods have been proposed for evaluating semantic rules:
1. Parse-tree methods. At compile time, these methods obtain an evaluation order from a topological sort of the dependency graph constructed from the parse tree for each input. These methods will fail to find an evaluation order only if the dependency graph for the particular parse tree under consideration has a cycle.
2. Rule-based methods. At compiler-construction time, the semantic rules associated with productions are analyzed, either by hand or by a specialized tool. For each production, the order in which the attributes associated with that production are evaluated is predetermined at compiler-construction time.
3. Oblivious methods. An evaluation order is chosen without considering the semantic rules. For example, if translation takes place during parsing, then the order of evaluation is forced by the parsing method, independent of the semantic rules. An oblivious evaluation order restricts the class of syntax-directed definitions that can be implemented.
Rule-based and oblivious methods need not explicitly construct the dependency graph at compile time, so they can be more efficient in their use of compile time and space.
A syntax-directed definition is said to be circular if the dependency graph for some parse tree generated by its grammar has a cycle. Section 5.10 discusses how to test a syntax-directed definition for circularity.

CONSTRUCTION OF SYNTAX TREES
In this section, we show how syntax-directed definitions can be used to specify the construction of syntax trees and other graphical representations of language constructs. The use of syntax trees as an intermediate representation allows translation to be decoupled from parsing. Translation routines that are invoked during parsing must live with two kinds of restrictions. First, a grammar that is suitable for parsing may not reflect the natural hierarchical structure of the constructs in the language. For example, a grammar for Fortran may view a subroutine as consisting simply of a list of statements; however, analysis of the subroutine may be easier if we use a tree representation that reflects the nesting of DO loops. Second, the parsing method constrains the order in which nodes in a parse tree are considered. This order may not match the order in which information about a construct becomes available. For this reason, compilers for C usually construct syntax trees for declarations.


SYNTAX TREES
An (abstract) syntax tree is a condensed form of parse tree useful for representing language constructs. The production S → if B then S1 else S2 might appear in a syntax tree as

      if-then-else
      /     |     \
     B      S1     S2

In a syntax tree, operators and keywords do not appear as leaves, but rather are associated with the interior node that would be the parent of those leaves in the parse tree. Another simplification found in syntax trees is that chains of single productions may be collapsed; the parse tree of Fig. 5.3 becomes a syntax tree whose interior nodes are the operators + and * and whose leaves are the operands 3, 4, and 5.

Syntax-directed translation can be based on syntax trees as well as parse trees. The approach is the same in each case: we attach attributes to the nodes as in a parse tree.

CONSTRUCTING SYNTAX TREES FOR EXPRESSIONS

The construction of a syntax tree for an expression is similar to the translation of the expression into postfix form. We construct subtrees for the subexpressions by creating a node for each operator and operand. The children of an operator node are the roots of the nodes representing the subexpressions constituting the operands of the operator.
Each node in a syntax tree can be implemented as a record with several fields. In the node for an operator, one field identifies the operator and the remaining fields contain pointers to the nodes for the operands. The operator is often called the label of the node. When used for translation, the nodes in a syntax tree may have additional fields to hold the values (or pointers to values) of attributes attached to the node. In this section, we use the following functions to create the nodes of syntax trees for expressions with binary operators. Each function returns a pointer to a newly created node.
1. mknode(op, left, right) creates an operator node with label op and two fields containing pointers to left and right.
2. mkleaf(id, entry) creates an identifier node with label id and a field containing entry, a pointer to the symbol-table entry for the identifier.
3. mkleaf(num, val) creates a number node with label num and a field containing val, the value of the number.
Example 1. The following sequence of function calls creates the syntax tree for the expression a-4+c in Fig. 5.8. In this sequence, P1, P2, …, P5 are pointers to nodes, and entrya and entryc are pointers to the symbol-table entries for identifiers a and c, respectively.

   (1) P1 := mkleaf(id, entrya);
   (2) P2 := mkleaf(num, 4);
   (3) P3 := mknode('-', P1, P2);
   (4) P4 := mkleaf(id, entryc);
   (5) P5 := mknode('+', P3, P4);


The tree is constructed bottom up. The function calls mkleaf(id, entrya) and mkleaf(num, 4) construct the leaves for a and 4; the pointers to these nodes are saved using P1 and P2.

          +
        /   \
       -     id (entry for c)
     /   \
  id (entry for a)   num 4

   Fig. 5.8. Syntax tree for a-4+c

The call mknode('-', P1, P2) then constructs the interior node with the leaves for a and 4 as children. After two more steps, P5 is left pointing to the root.
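In C, these node-constructing functions might look like the following sketch, which repeats the five steps of Example 1. The record layout, the split of mkleaf into mkleaf_id and mkleaf_num, and the use of strings in place of symbol-table entries are illustrative assumptions, not this manual's code.

   #include <stdio.h>
   #include <stdlib.h>

   typedef struct node {
       char         op;       /* operator label, or 0 for a leaf    */
       int          val;      /* value field for num leaves         */
       const char  *entry;    /* stand-in for an id's table entry   */
       struct node *left, *right;
   } node;

   static node *mkleaf_id(const char *entry)
   {
       node *p = calloc(1, sizeof *p);
       p->entry = entry;
       return p;
   }

   static node *mkleaf_num(int val)
   {
       node *p = calloc(1, sizeof *p);
       p->val = val;
       return p;
   }

   static node *mknode(char op, node *left, node *right)
   {
       node *p = calloc(1, sizeof *p);
       p->op = op; p->left = left; p->right = right;
       return p;
   }

   int main(void)
   {
       node *p1 = mkleaf_id("a");
       node *p2 = mkleaf_num(4);
       node *p3 = mknode('-', p1, p2);
       node *p4 = mkleaf_id("c");
       node *p5 = mknode('+', p3, p4);
       printf("root operator: %c\n", p5->op);   /* prints + */
       return 0;
   }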

DIRECTED ACYCLIC GRAPHS FOR EXPRESSIONS
A directed acyclic graph (hereafter called a dag) for an expression identifies the common subexpressions in the expression. Like a syntax tree, a dag has a node for every subexpression of the expression; an interior node represents an operator and its children represent its operands. The difference is that a node in a dag representing a common subexpression has more than one "parent"; in a syntax tree, the common subexpression would be represented as a duplicated subtree. Figure 5.11 contains a dag for the expression
   a + a * ( b - c ) + ( b - c ) * d
The leaf for a has two parents because a is common to the two subexpressions a and a * (b - c). Likewise, both occurrences of the common subexpression b - c are represented by the same node, which also has two parents.

(Figure 5.11: the root + has an inner + node and a * node as children. The inner + has the leaf a and a * node as children; that * node has the shared leaf a and the shared (b - c) node as children. The right-hand * has the shared (b - c) node and the leaf d as children.)

Fig. 5.11. Dag for the expression a + a * ( b - c ) + ( b - c ) * d. The syntax-directed definition of Fig. 5.9 will construct a dag instead of a syntax tree if we modify the operations for constructing nodes. A dag is obtained if the function constructing a node first checks to see whether an identical node already exists. For example, before constructing a new node with label op and fields with pointers to left and right, mknode (op, left, right) can check whether such a node has already been constructed. If so, mknode (op, left, right) can return a pointer to the previously constructed node. The leaf-constructing functions mkleaf can behave similarly.
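A dag-building version of mknode can be sketched in C as follows. The linear search over previously built nodes stands in for the check described above; a real compiler would typically hash on the label and children. All names here are illustrative assumptions, not this manual's code.

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   typedef struct node {
       char op;                   /* operator, or 0 for leaves   */
       const char *id;            /* identifier name for leaves  */
       struct node *left, *right;
   } node;

   static node *nodes[100];
   static int nnodes = 0;

   static node *remember(node *p) { nodes[nnodes++] = p; return p; }

   static node *mkleaf_id(const char *id)
   {
       for (int i = 0; i < nnodes; i++)
           if (nodes[i]->op == 0 && nodes[i]->id && strcmp(nodes[i]->id, id) == 0)
               return nodes[i];                 /* share the existing leaf */
       node *p = calloc(1, sizeof *p);
       p->id = id;
       return remember(p);
   }

   static node *mknode(char op, node *l, node *r)
   {
       for (int i = 0; i < nnodes; i++)
           if (nodes[i]->op == op && nodes[i]->left == l && nodes[i]->right == r)
               return nodes[i];                 /* identical node already exists */
       node *p = calloc(1, sizeof *p);
       p->op = op; p->left = l; p->right = r;
       return remember(p);
   }

   int main(void)
   {
       /* the two occurrences of b - c end up as one shared node */
       node *bc1 = mknode('-', mkleaf_id("b"), mkleaf_id("c"));
       node *bc2 = mknode('-', mkleaf_id("b"), mkleaf_id("c"));
       printf("b - c shared: %s\n", bc1 == bc2 ? "yes" : "no");   /* yes */
       return 0;
   }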


MODULE VI. SYMBOL TABLES

A symbol table is generally used to store information about various source language constructs. The information is collected by the analysis phases of the compiler and used by the synthesis phases to generate the target code.

THE SYMBOL TABLE INTERFACE
The symbol table routines are concerned primarily with saving and retrieving lexemes. When a lexeme is saved, we also save the token associated with the lexeme. The following operations are performed on the symbol table:
   insert(s, t): returns the index of a new entry for string s, token t.
   lookup(s):    returns the index of the entry for string s, or 0 if s is not found.

HANDLING RESERVED KEYWORDS
The symbol table routines above can handle any collection of reserved keywords. For example, consider tokens div and mod with lexemes div and mod, respectively. We can initialize the symbol table using the calls
   insert("div", div);
   insert("mod", mod);
Any subsequent call lookup("div") returns the token div, so div cannot be used as an identifier. Any collection of reserved keywords can be handled in this way by appropriately initializing the symbol table.

A SYMBOL TABLE IMPLEMENTATION
The data structure for a particular implementation of a symbol table is sketched in Fig. 1. A separate array, lexemes, holds the character strings forming the identifiers; each string is terminated by an end-of-string character, denoted EOS, that may not appear in identifiers. Each entry in the symbol-table array symtable is a record consisting of two fields: lexptr, pointing to the beginning of a lexeme, and token.

   symtable (lexptr, token):  div, mod, id, id
   lexemes array:  d i v EOS m o d EOS c o u n t EOS i EOS
   Fig. 1. Symbol table and array for storing strings.
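A minimal C sketch of this organization follows. The array sizes and token codes are illustrative assumptions, not this manual's code.

   #include <stdio.h>
   #include <string.h>

   #define EOS    '\0'
   #define STRMAX 999
   #define SYMMAX 100

   enum { DIV = 256, MOD, ID };          /* assumed token codes */

   struct entry { int lexptr; int token; };

   static char  lexemes[STRMAX];
   static int   lastchar = -1;           /* last used position in lexemes */
   static struct entry symtable[SYMMAX];
   static int   lastentry = 0;           /* entry 0 is left empty          */

   /* lookup(s): index of the entry for string s, or 0 if s is not found */
   int lookup(const char *s)
   {
       for (int p = lastentry; p > 0; p--)
           if (strcmp(&lexemes[symtable[p].lexptr], s) == 0)
               return p;
       return 0;
   }

   /* insert(s, tok): index of a new entry for string s with token tok */
   int insert(const char *s, int tok)
   {
       int len = strlen(s);
       lastentry++;
       symtable[lastentry].token  = tok;
       symtable[lastentry].lexptr = lastchar + 1;
       strcpy(&lexemes[lastchar + 1], s);
       lastchar += len + 1;              /* keep the terminating EOS */
       return lastentry;
   }

   int main(void)
   {
       insert("div", DIV);               /* reserve the keywords */
       insert("mod", MOD);
       insert("count", ID);
       printf("%d %d\n", lookup("div"), lookup("x"));   /* prints 1 0 */
       return 0;
   }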

ABSTRACT STACK MACHINES
The front end of a compiler constructs an intermediate representation of the source program from which the back end generates the target program. One popular form of intermediate representation is code for an abstract stack machine. In this section, we present an abstract stack machine and show how code can be generated for it. The following pseudo-code shows a lexical analyzer that uses the symbol-table routines lookup and insert:

   function lexan: integer;
   var lexbuf : array [0..100] of char;
       c : char;
   begin
      loop begin
         read a character into c;
         if c is a blank or a tab then
            do nothing
         else if c is a newline then
            lineno := lineno + 1
         else if c is a digit then begin
            set tokenval to the value of this and following digits;
            return NUM
         end
         else if c is a letter then begin
            place c and successive letters and digits into lexbuf;
            p := lookup(lexbuf);
            if p = 0 then
               p := insert(lexbuf, ID);
            tokenval := p;
            return the token field of table entry p
         end
         else begin                       /* token is a single character */
            set tokenval to NONE;         /* there is no attribute */
            return integer encoding of character
         end
      end
   end
   Fig. 2. Pseudo-code for a lexical analyzer

CALL BY NAME
Call by name is traditionally defined by the copy-rule of Algol, which is:
1. The procedure is treated as if it were a macro; that is, its body is substituted for the call in the caller, with the actual parameters literally substituted for the formals. Such a literal substitution is called macro-expansion or in-line expansion.
2. The local names of the called procedure are kept distinct from the names of the calling procedure. We can think of each local of the called procedure being systematically renamed into a distinct new name before the macro-expansion is done.
3. The actual parameters are surrounded by parentheses if necessary to preserve their integrity.

SYMBOL TABLE
A compiler uses a symbol table to keep track of scope and binding information about names. The symbol table is searched every time a name is encountered in the source text. Changes to the table occur if a new name or new information about an existing name is discovered. It is useful for a compiler to be able to grow the symbol table dynamically, if necessary, at compile time. If the size of the symbol table is fixed when the compiler is written, then the size must be chosen large enough to handle any source program that might be presented. Such a fixed size is likely to be too large for most programs, and inadequate for some.

SYMBOL TABLE ENTRIES
Each entry in the symbol table is for the declaration of a name. The format of entries does not have to be uniform, because the information saved about a name depends on the usage of the name. Each entry can be implemented as a record consisting of a sequence of consecutive words of memory.

HASH TABLES
Variations of the searching technique known as hashing have been implemented in many compilers. Here we consider a rather simple variant known as open hashing, where "open" refers to the property that there need be no limit on the number of entries that can be made in the table. The basic hashing scheme is illustrated in Fig. 3. There are two parts to the data structure:
1. A hash table consisting of a fixed array of m pointers to table entries.
2. Table entries organized into m separate linked lists, called buckets (some buckets may be empty). Each record in the symbol table appears on exactly one of these lists. Storage for the records may be drawn from an array of records, as discussed in the next section. Alternatively, the dynamic storage allocation facilities of the implementation language can be used to obtain space for the records, often at some loss of efficiency.

(Figure: an array of list headers, indexed by hash value, with list elements created for the names shown.)

Fig. 3. A hash table of size 211.

CHARACTERS IN A NAME
Strings of characters may be unwieldy to work with, so compilers often use some fixed-length representation of a name rather than the lexeme itself. The lexeme is needed when a symbol-table entry is set up for the first time, and when we look up a lexeme found in the input to determine whether it is a name that has already appeared. A common representation of a name is a pointer to a symbol-table entry for it. If there is a modest upper bound on the length of a name, then the characters in the name can be stored in the symbol-table entry, as in (a). If there is no limit on the length of a name, or if the limit is rarely reached, the indirect scheme of (b) can be used.

(a) In fixed-size space within a record: the names sort, a, readarray, and i are stored directly in the Name field of each entry, alongside the attributes.
(b) In a separate array: each entry holds a pointer into a string array whose contents are  s o r t EOS a EOS r e a d a r r a y EOS i EOS.
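Returning to the open-hashing scheme of Fig. 3, the following C sketch shows how names can be chained into buckets using a hash function like the hashpjw of Fig. 5 below (restated here in ANSI C for self-containment). The bucket-entry layout and find_or_insert are illustrative assumptions, not this manual's code.

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   #define PRIME 211

   unsigned hashpjw(const char *s)
   {
       unsigned h = 0, g;
       for (const char *p = s; *p != '\0'; p++) {
           h = (h << 4) + (unsigned)*p;
           if ((g = h & 0xf0000000)) { h ^= g >> 24; h ^= g; }
       }
       return h % PRIME;
   }

   struct bucket_entry {                 /* one element of a bucket list */
       char name[32];
       struct bucket_entry *next;
   };

   static struct bucket_entry *hashtab[PRIME];   /* array of list headers */

   struct bucket_entry *find_or_insert(const char *name)
   {
       unsigned j = hashpjw(name);
       for (struct bucket_entry *e = hashtab[j]; e; e = e->next)
           if (strcmp(e->name, name) == 0)
               return e;                 /* already in the table */
       struct bucket_entry *e = malloc(sizeof *e);
       strcpy(e->name, name);
       e->next = hashtab[j];             /* link at the front of the bucket */
       hashtab[j] = e;
       return e;
   }

   int main(void)
   {
       struct bucket_entry *a = find_or_insert("readarray");
       struct bucket_entry *b = find_or_insert("readarray");
       printf("%s\n", a == b ? "same entry" : "different");   /* same entry */
       return 0;
   }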

STORAGE ALLOCATION INFORMATION
Information about the storage locations that will be bound to names at run time is kept in the symbol table. Consider names with static storage first. If the target code is assembly language, we can let the assembler take care of storage locations for the various names. If machine code is to be generated by the compiler, however, then the position of each data object relative to a fixed origin, such as the beginning of an activation record, must be ascertained.

THE LIST DATA STRUCTURE FOR SYMBOL TABLES
The simplest and easiest to implement data structure for a symbol table is a linear list of records:

   id1 | info1 | id2 | info2 | … | idn | infon
   Fig. 4. A linear list of records

One suitable approach for computing hash functions is to proceed as follows:
1. Determine a positive integer h from the characters c1, c2, …, ck in string s. The conversion of single characters to integers is usually supported by the implementation language. Pascal provides a function ord for this purpose; C automatically converts a character to an integer if an arithmetic operation is performed on it.
2. Convert the integer h determined above into the number of a list, i.e., an integer between 0 and m-1. Simply dividing by m and taking the remainder is a reasonable policy. Taking the remainder seems to work better if m is a prime, hence the choice of 211 rather than 200.

   #define PRIME 211
   #define EOS   '\0'
   int hashpjw(s)
   char *s;
   {
      char *p;
      unsigned h = 0, g;
      for (p = s; *p != EOS; p = p + 1) {
         h = (h << 4) + (*p);
         if (g = h & 0xf0000000) {
            h = h ^ (g >> 24);
            h = h ^ g;
         }
      }
      return h % PRIME;
   }
   Fig. 5. Hash function hashpjw, written in C.

hashpjw is computed by starting with h = 0. For each character c, shift the bits of h left 4 positions and add in c. If any of the four high-order bits of h is 1, shift the four high-order bits right 24 positions, exclusive-or them into h, and reset to 0 any of the four high-order bits that was 1.
For best results, the size of the hash table and the expected input must be taken into account when a hash function is designed. One way of testing a hash function is to look at the number of strings that fall onto the same list. Given a file F consisting of n strings, suppose bj strings fall onto list j, for 0 ≤ j ≤ m-1. A measure of how uniformly the strings are distributed across lists is obtained by computing
   the sum, for j = 0, …, m-1, of  bj(bj + 1)/2                              (7.2)
The intuitive justification for this term is that we need to look at 1 list element to find the first entry on list j, at 2 to find the second, and so on up to bj to find the last entry; the sum of 1, 2, …, bj is bj(bj + 1)/2. From Exercise 7.14, the value of (7.2) for a hash function that distributes strings randomly across buckets is
   (n / 2m) (n + 2m - 1)                                                     (7.3)
The ratio of the terms (7.2) and (7.3) is plotted for several hash functions applied to nine files. The files are:
1. The 50 most frequently occurring names and keywords in a sample of C programs.
2. Like (1), but with the 100 most frequently occurring names and keywords.
3. Like (1), but with the 500 most frequently occurring names and keywords.
4. 952 external names in the UNIX operating system kernel.
5. 627 names in a C program generated by C++ (Stroustrup [1986]).
6. 915 randomly generated character strings.
7. 614 words from Section 3.1 of this book.
8. 1201 words in English with xxx added as prefix and suffix.
9. The 300 names v100, v101, …, v399.

REPRESENTING SCOPE INFORMATION
The entries in the symbol table are for declarations of names. When an occurrence of a name in the source text is looked up in the symbol table, the entry for the appropriate declaration of that name must be returned. The scope rules of the source language determine which declaration is appropriate.
A simple approach is to maintain a separate symbol table for each scope. In effect, the symbol table for a procedure or scope is the compile-time equivalent of an activation record. Information for the nonlocals of a procedure is found by scanning the symbol tables for the enclosing procedures following the scope rules of the language. Equivalently, information about the locals of a procedure can be attached to the node for the procedure in a syntax tree for the program. With this approach the symbol table is integrated into the intermediate representation of the input.
When we look up a newly scanned name, a match occurs only if the characters of the name match an entry character for character, and the associated number in the symbol-table entry is the number of the procedure being processed. Most closely nested scope rules can be implemented in terms of the following operations on a name:
   lookup: find the most recently created entry.
   insert: make a new entry.
   delete: remove the most recently created entry.
"Deleted" entries must be preserved; they are just removed from the active symbol table. In a one-pass compiler, information in the symbol table about a scope consisting of, say, a procedure body is not needed at compile time after the procedure body is processed. However, it may be needed at run time, particularly if a run-time diagnostic system is implemented. In this case, the information in the symbol table must be added to the generated code for use by the linker or by the run-time diagnostic system.
Each of the data structures discussed in this section, lists and hash tables, can be maintained so as to support the above operations. When a linear list consisting of an array of records was described earlier in this section, we mentioned how lookup can be implemented by inserting entries at one end so that the order of the entries in the array is the same as the order of insertion of the entries. A scan starting from the end and proceeding towards the beginning of the array finds the most recently created entry for a name. The situation is similar in a linked list, as shown in Fig. 6. A pointer front points to the most recently created entry in the list. The implementation of insert takes constant time because a new entry is placed at the front of the list. The implementation of lookup is done by scanning the list starting at the entry pointed to by front and following links until the desired name is found or the end of the list is reached. In Fig. 6, the entry for a declared in a block B2, nested within block B0, appears nearer the front of the list than the entry for a declared in B0.

   front → a (declared in B2) → … → a (declared in B0) → …
   Fig. 6. The most recent entry for a is near the front.

For the delete operation, note that the entries for the declarations in the most deeply nested procedure appear nearest the front of the list. Thus, we do not need to keep the procedure number with every entry; if we keep track of the first entry for each procedure, then all entries up to the first can be deleted from the active symbol table when we finish processing the scope of that procedure.
A hash table consists of m lists accessed through an array. Since a name always hashes to the same list, individual lists are maintained as in Fig. 6. However, for implementing the delete operation we would rather not have to scan the entire hash table looking for lists containing entries to be deleted. The following approach can be used. Suppose each entry has two links:
1. A hash link that chains the entry to other entries whose names hash to the same value, and
2. A scope link that chains all entries in the same scope.
If the scope link is left undisturbed when an entry is deleted from the hash table, then the chain formed by the scope links will constitute a separate (inactive) symbol table for the scope in question.

LANGUAGE FACILITIES FOR DYNAMIC STORAGE ALLOCATION
In this section, we briefly describe facilities provided by some languages for the dynamic allocation of storage for data, under program control. Storage for such data is usually taken from a heap. Allocated data is often retained until it is explicitly deallocated. The allocation itself can be either explicit or implicit. Implicit allocation occurs when evaluation of an expression results in storage being obtained to hold the value of the expression.

Example 1. The Pascal program in Fig. 7 builds the linked list shown in Fig. 8 and prints the integers held in the cells; its output is
   76 3
    4 2
    7 1

When execution of the program begins at line 15, storage for the pointer head is in the activation record for the complete program. Each time control reaches line (11),
   new(p); p^.key := k; p^.info := i;
the call new(p) results in a cell being allocated somewhere within the heap; p^ refers to this cell in the assignments on line 11. Note from the output of the program that the allocated cells are accessible when control returns to the main program from insert. In other words, cells allocated using new during an activation of insert are retained when control returns to the main program from the activation.


   (1)  program table(input, output);
   (2)  type link = ^cell;
   (3)       cell = record
   (4)              key, info : integer;
   (5)              next : link
   (6)       end;
   (7)  var head : link;
   (8)  procedure insert(k, i : integer);
   (9)  var p : link;
   (10) begin
   (11)    new(p); p^.key := k; p^.info := i;
   (12)    p^.next := head; head := p
   (13) end;
   (14) begin
   (15)    head := nil;
   (16)    insert(7,1); insert(4,2); insert(76,3);
   (17)    writeln(head^.key, head^.info);
   (18)    writeln(head^.next^.key, head^.next^.info);
   (19)    writeln(head^.next^.next^.key, head^.next^.next^.info);
   (20) end.
   Fig. 7. Dynamic allocation of cells using new in Pascal.

   head → [ 76 | 3 | • ] → [ 4 | 2 | • ] → [ 7 | 1 | nil ]

Fig. 8. Linked list built by the program in Fig. 7.

GARBAGE
Dynamically allocated storage can become unreachable. Storage that a program allocates but cannot refer to is called garbage. In Fig. 7, suppose nil is assigned to head^.next between lines 16 and 17:
   (16) insert(7,1); insert(4,2); insert(76,3); head^.next := nil;
   (17) writeln(head^.key, head^.info);

The leftmost cell in Fig. 8 will now contain a nil pointer rather than a pointer to the middle cell. When the pointer to the middle cell is lost, the middle and rightmost cells become garbage.

MODULE VII. RUN-TIME STORAGE

The allocation and deallocation of data objects is managed by the run-time support package, consisting of routines loaded with the generated target code. The design of the run-time support package is influenced by the semantics of procedures. Each execution of a procedure is referred to as an activation of the procedure. If the procedure is recursive, several of its activations may be alive at the same time. The type of a name determines the representation of its data object at run time. Elementary data types, such as characters, integers, and reals, can often be represented by equivalent data objects in the target machine; aggregates, such as arrays, strings, and structures, are usually represented by collections of primitive objects.

SOURCE LANGUAGE ISSUES

PROCEDURES
A procedure definition is a declaration that, in its simplest form, associates an identifier with a statement. The identifier is the procedure name, and the statement is the procedure body. Procedures that return values are called functions in many languages; however, it is convenient to refer to them all as procedures. A complete program will also be treated as a procedure.
When a procedure name appears within an executable statement, we say that the procedure is called at that point. The basic idea is that a procedure call executes the procedure body. Some of the identifiers appearing in a procedure definition are special, and are called formal parameters (or just formals) of the procedure. Arguments, known as actual parameters (or actuals), may be passed to a called procedure; they are substituted for the formals in the body.

ACTIVATION TREES
We make the following assumptions about the flow of control among procedures during the execution of a program:


1. Control flows sequentially; that is, the execution of a program consists of a sequence of steps, with control being at some point in the program at each step.
2. Each execution of a procedure starts at the beginning of the procedure body and eventually returns control to the point immediately following the place where the procedure was called. This means the flow of control between procedures can be depicted using trees.
Each execution of a procedure body is referred to as an activation of the procedure. The lifetime of an activation of a procedure p is the sequence of steps between the first and last steps in the execution of the procedure body, including time spent executing procedures called by p, the procedures called by them, and so on. In general, the term "lifetime" refers to a consecutive sequence of steps during the execution of a program. A procedure is recursive if a new activation can begin before an earlier activation of the same procedure has ended. In an activation tree,
1. each node represents an activation of a procedure,
2. the root represents the activation of the main program,
3. the node for a is the parent of the node for b if and only if control flows from activation a to b, and
4. the node for a is to the left of the node for b if and only if the lifetime of a occurs before the lifetime of b.

CONTROL STACKS
We can use a stack, called a control stack, to keep track of live procedure activations. The idea is to push the node for an activation onto the control stack as the activation begins and to pop the node when the activation ends. Then the contents of the control stack are related to paths to the root of the activation tree.

(Figure 7.4: an activation tree with nodes s, r, p(1,9), p(1,3), q(1,9), q(1,3), q(1,0), and q(2,3); the control stack contains the nodes along a path to the root.)
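For instance, in the following C sketch each recursive call creates a new activation of fact, so while fact(1) is executing the control stack holds the activations of main, fact(3), fact(2), and fact(1). This is a hypothetical illustration, not an example from this manual.

   #include <stdio.h>

   int fact(int n)               /* each call creates a new activation */
   {
       if (n <= 1) return 1;
       return n * fact(n - 1);   /* the activation of fact(n) is suspended
                                    while fact(n-1) is live */
   }

   int main(void)
   {
       printf("%d\n", fact(3));  /* prints 6 */
       return 0;
   }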

THE SCOPE OF A DECLARATION
There may be independent declarations of the same name in different parts of a program. The scope rules of a language determine which declaration of a name applies when the name appears in the text of a program. The portion of the program to which a declaration applies is called the scope of that declaration. An occurrence of a name in a procedure is said to be local to the procedure if it is in the scope of a declaration within the procedure; otherwise, the occurrence is said to be nonlocal. The distinction between local and nonlocal names carries over to any syntactic construct that can have declarations within it. While scope is a property of the declaration of a name, it is sometimes convenient to use the abbreviation "the scope of a name x" for "the scope of the declaration of name x that applies to this occurrence of x."

BINDING OF NAMES
Even if each name is declared once in a program, the same name may denote different data objects at run time. The informal term "data object" corresponds to a storage location that can hold values. In programming language semantics, the term environment refers to a function that maps a name to a storage location, and the term state refers to a function that maps a storage location to the value held there. Environments and states are different; an assignment changes the state, but not the environment. For example, suppose that storage address 100, associated with a variable pi, holds 0. After the assignment pi := 3.14, the same storage address is associated with pi, but the value held there is 3.14. When an environment associates storage location s with a name x, we say that x is bound to s; the association itself is referred to as a binding of x. The term storage "location" is to be taken figuratively: if x is not of a basic type, the storage s for x may be a collection of memory words.

STORAGE ORGANIZATION


SUBDIVISION OF RUN-TIME MEMORY
Suppose that the compiler obtains a block of storage from the operating system for the compiled program to run in. From the discussion in the last section, this run-time storage might be subdivided to hold:
1. the generated target code,
2. data objects, and
3. a counterpart of the control stack to keep track of procedure activations.
When a call occurs, execution of an activation is interrupted and information about the status of the machine, such as the value of the program counter and machine registers, is saved on the stack. When control returns from the call, this activation can be restarted after restoring the values of the relevant registers and setting the program counter to the point immediately after the call. Data objects whose lifetimes are contained in that of an activation can be allocated on the stack, along with other information associated with the activation. A separate area of run-time memory, called the heap, holds all the other information.
By convention, stacks grow down. That is, the "top" of the stack is drawn towards the bottom of the page. Since memory addresses increase as we go down a page, "downwards-growing" means toward higher addresses. If top marks the top of the stack, offsets from the top of the stack can be computed by subtracting the offset from top. On many machines this computation can be done efficiently by keeping the value of top in a register. Stack addresses can then be represented as offsets from top.

ACTIVATION RECORDS
Information needed by a single execution of a procedure is managed using a contiguous block of storage called an activation record or frame, consisting of a collection of fields. The purpose of the fields of an activation record is as follows, starting from the field for temporaries.
1. Temporary values, such as those arising in the evaluation of expressions, are stored in the field for temporaries.
2. The field for local data holds data that is local to an execution of a procedure.
3. The field for the saved machine status holds information about the state of the machine just before the procedure is called. This information includes the values of the program counter and machine registers that have to be restored when control returns from the procedure.
4. The optional access link is used to refer to nonlocal data held in other activation records.
5. The optional control link points to the activation record of the caller.
6. The field for actual parameters is used by the calling procedure to supply parameters to the called procedure. We show space for parameters in the activation record, but in practice parameters are often passed in machine registers for greater efficiency.
7. The field for the returned value is used by the called procedure to return a value to the calling procedure. Again, in practice this value is often returned in a register for greater efficiency.
The sizes of each of these fields can be determined at the time a procedure is called; in fact, the sizes of almost all fields can be determined at compile time. An exception occurs if a procedure may have a local array whose size is determined by the value of an actual parameter, available only when the procedure is called at run time.

COMPILE-TIME LAYOUT OF LOCAL DATA
The amount of storage needed for a name is determined from its type. An elementary data type, such as character, integer, or real, can usually be stored in an integral number of bytes. Storage for an aggregate, such as an array or record, must be large enough to hold all its components. For easy access to the components, storage for aggregates is typically allocated in one contiguous block of bytes.
The field for local data is laid out as the declarations in a procedure are examined at compile time. Variable-length data is kept outside this field. We keep a count of the memory locations that have been allocated for previous declarations. From the count we determine the relative address of the storage for a local with respect to some position such as the beginning of the activation record. The relative address, or offset, is the difference between the addresses of the position and the data object.
The storage layout for data objects is strongly influenced by the addressing constraints of the target machine. For example, instructions to add integers may expect integers to be aligned, that is, placed at certain positions in memory, such as an address divisible by 4. Although an array of ten characters needs only enough bytes to hold ten characters, a compiler may therefore allocate 12 bytes, leaving 2 bytes unused. Space left unused due to alignment considerations is referred to as padding. When space is not at a premium, a compiler may pack data so that no padding is left; additional instructions may then need to be executed at run time to position packed data so that it can be operated on as if it were properly aligned.

STORAGE-ALLOCATION STRATEGIES
A different storage-allocation strategy is used in each of the three data areas:
1. Static allocation lays out storage for all data objects at compile time.
2. Stack allocation manages the run-time storage as a stack.
3. Heap allocation allocates and deallocates storage as needed at run time from a data area known as a heap.

STATIC ALLOCATION
In static allocation, names are bound to storage as the program is compiled, so there is no need for a run-time support package. Since the bindings do not change at run time, every time a procedure is activated, its names are bound to the same storage locations. This property allows the values of local names to be retained across activations of a procedure. That is, when control returns to a procedure, the values of the locals are the same as they were when control left the last time. However, some limitations go along with using static allocation alone.
1. The size of a data object and constraints on its position in memory must be known at compile time.
2. Recursive procedures are restricted, because all activations of a procedure use the same bindings for local names.
3. Data structures cannot be created dynamically, since there is no mechanism for storage allocation at run time.

STACK ALLOCATION
Stack allocation is based on the idea of a control stack; activation records are pushed and popped as activations begin and end, respectively. Storage for the locals in each call of a procedure is contained in the activation record for that call. Thus locals are bound to fresh storage in each activation, because the storage for locals disappears when the activation record is popped.

CALLING SEQUENCES
Procedure calls are implemented by generating what are known as calling sequences in the target code. A call sequence allocates an activation record and enters information into its fields. A return sequence restores the state of the machine so the calling procedure can continue execution. A principle that aids the design of calling sequences and activation records is that fields whose sizes are fixed early are placed in the middle. In the general activation record, the control link, access link, and machine-status fields appear in the middle. The decision about whether or not to use control and access links is part of the design of the compiler, so these fields can be fixed at compiler-construction time. Moreover, programs such as debuggers will have an easier time deciphering the stack contents when an error occurs.

VARIABLE-LENGTH DATA
A common strategy for handling variable-length data can be illustrated by a procedure p with three local arrays. The storage for these arrays is not part of the activation record for p; only a pointer to the beginning of each array appears in the activation record. The relative addresses of these pointers are known at compile time, so the target code can access array elements through the pointers.

DANGLING REFERENCES
Whenever storage can be deallocated, the problem of dangling references arises.
DANGLING REFERENCES

Whenever storage can be deallocated, the problem of dangling references arises. A dangling reference occurs when there is a reference to storage that has been deallocated. It is a logical error to use dangling references, since the value of deallocated storage is undefined according to the semantics of most languages. Worse, since that storage may later be allocated to another datum, mysterious bugs can appear in programs with dangling references.
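A minimal C illustration of a dangling reference (a hypothetical example, not from the text): after the storage is deallocated, the pointer still holds the old address, and using it is undefined.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *p = malloc(sizeof(int));   /* storage allocated from the heap */
        *p = 42;
        free(p);                        /* storage deallocated             */
        /* p is now a dangling reference: the value of *p is undefined,
           and the location may later be reused for another datum.        */
        printf("%d\n", *p);             /* logical error: use after free   */
        return 0;
    }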


HEAP ALLOCATION

The stack allocation strategy discussed above cannot be used if either of the following is possible:
1. The values of local names must be retained when an activation ends.
2. A called activation outlives the caller. This possibility cannot occur for those languages where activation trees correctly depict the flow of control between procedures.

In each of the above cases, the deallocation of activation records need not occur in last-in first-out fashion, so storage cannot be organized as a stack. Heap allocation parcels out pieces of contiguous storage, as needed for activation records or other objects. Pieces may be deallocated in any order, so over time the heap will consist of alternating areas that are free and in use.

ACCESS TO NONLOCAL NAMES

The scope rules of a language determine the treatment of references to nonlocal names. A common rule, called the lexical or static-scope rule, determines the declaration that applies to a name by examining the program text alone. An alternative rule, called the dynamic-scope rule, determines the declaration applicable to a name at run-time, by considering the current activation.

BLOCKS

A block is a statement containing its own local data declarations. The concept of a block originated with Algol. In C, a block has the syntax

    { declarations statements }

The scope of a declaration in a block-structured language is given by the most closely nested rule:
1. The scope of a declaration in a block B includes B.
2. If a name x is not declared in a block B, then an occurrence of x in B is in the scope of a declaration of x in an enclosing block B' such that
   i) B' has a declaration of x, and
   ii) B' is more closely nested around B than any other block with a declaration of x.

LEXICAL SCOPE WITHOUT NESTED PROCEDURES

An important benefit of static allocation for nonlocals is that declared procedures can freely be passed as parameters and returned as results (a function is passed in C by passing a pointer to it). With lexical scope and without nested procedures, any name nonlocal to one procedure is nonlocal to all procedures. Its static address can be used by all procedures, regardless of how they are activated. Similarly, if procedures are returned as results, nonlocals in the returned procedure refer to the storage statically allocated for them.

LEXICAL SCOPE WITH NESTED PROCEDURES

A nonlocal occurrence of a name a in a Pascal procedure is in the scope of the most closely nested declaration of a in the static program text. The nesting of procedures in Pascal is indicated by the following indentation:

    sort
        readarray
        exchange
        quicksort
            partition

DYNAMIC SCOPE

Under dynamic scope, a new activation inherits the existing bindings of nonlocal names to storage. A nonlocal name a in the called activation refers to the same storage that it did in the calling activation. New bindings are set up for local names of the called procedure; the names refer to storage in the new activation record. The following two approaches to implementing dynamic scope bear some resemblance to the use of access links and displays, respectively, in the implementation of lexical scope.
1. Deep access. The term deep access comes from the fact that the search may go "deep" into the stack. The depth to which the search may go depends on the input to the program and cannot be determined at compile time.
2. Shallow access. Here the idea is to hold the current value of each name in statically allocated storage. When a new activation of a procedure p occurs, a nonlocal name n in p takes over the storage statically allocated for n.
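A small C example (illustrative only, not from the text) of the most closely nested rule described above: each use of a name is bound to the declaration in the nearest enclosing block that declares it.

    #include <stdio.h>

    int main(void)
    {
        int x = 1;              /* declared in the outermost block B0 */
        {                       /* block B1 */
            int x = 2;          /* hides the outer x within B1 */
            {                   /* block B2: no declaration of x here, */
                printf("%d\n", x);  /* so x refers to the B1 declaration: prints 2 */
            }
        }
        printf("%d\n", x);      /* back in B0: prints 1 */
        return 0;
    }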
PARAMETER PASSING

When one procedure calls another, the usual method of communication between them is through parameters of the called procedure. Several common methods for associating actual and formal parameters are discussed in this section. They are: call-by-value, call-by-reference, copy-restore, call-by-name, and macro expansion. It is important to know the parameter passing method used.

CALL-BY-VALUE

This is, in a sense, the simplest possible method of passing parameters. Call-by-value can be implemented as follows:
1. A formal parameter is treated just like a local name, so the storage for the formals is in the activation record of the called procedure.


2. The caller evaluates the actual parameters and places their values in the storage for the formals.

CALL-BY-REFERENCE

When parameters are passed by reference (also known as call-by-address or call-by-location), the caller passes to the called procedure a pointer to the storage address of each actual parameter.
1. If an actual parameter is a name or an expression having an l-value, then that l-value itself is passed.
2. However, if the actual parameter is an expression, like a+b or 2, that has no l-value, then the expression is evaluated in a new location and the address of that location is passed.

COPY-RESTORE

A hybrid between call-by-value and call-by-reference is copy-restore linkage (also known as copy-in copy-out, or value-result).
1. Before control flows to the called procedure, the actual parameters are evaluated. The r-values of the actuals are passed to the called procedure as in call-by-value. In addition, however, the l-values of those actual parameters having l-values are determined before the call.
2. When control returns, the current r-values of the formal parameters are copied back into the l-values of the actuals, using the l-values computed before the call. Only actuals having l-values are copied, of course.

CALL-BY-NAME

Call-by-name is traditionally defined by the copy-rule of Algol, which is:
1. The procedure is treated as if it were a macro; that is, its body is substituted for the call in the caller, with the actual parameters literally substituted for the formals. Such a literal substitution is called macro-expansion or in-line expansion.
2. The local names of the called procedure are kept distinct from the names of the calling procedure. We can think of each local of the called procedure being systematically renamed into a distinct new name before the macro-expansion is done.
3. The actual parameters are surrounded by parentheses if necessary to preserve their integrity.
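C itself passes parameters by value; passing a pointer gives the effect of call-by-reference. The following sketch (hypothetical function names) contrasts the two:

    #include <stdio.h>

    /* call-by-value: the formal n receives a copy of the actual's r-value,
       so changes to n do not affect the caller's variable */
    void by_value(int n)      { n = n + 1; }

    /* call-by-reference simulated with a pointer: the caller passes the
       l-value (address) of the actual, so the assignment updates it */
    void by_reference(int *n) { *n = *n + 1; }

    int main(void)
    {
        int i = 10;
        by_value(i);        /* i is still 10 */
        by_reference(&i);   /* i is now 11   */
        printf("%d\n", i);
        return 0;
    }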

LANGUAGE FACILITIES FOR DYNAMIC STORAGE ALLOCATION

In this section, we briefly describe facilities provided by some languages for the dynamic allocation of storage for data, under program control. Storage for such data is usually taken from a heap. Allocated data is often retained until it is explicitly deallocated. The allocation itself can be either explicit or implicit. Implicit allocation occurs when evaluation of an expression results in storage being obtained to hold the value of the expression.

GARBAGE

Dynamically allocated storage can become unreachable. Storage that a program allocates but cannot refer to is called garbage.

DANGLING REFERENCES

Dangling references and garbage are related concepts: dangling references occur if deallocation occurs before the last reference, whereas garbage exists if the last reference occurs before deallocation.

DYNAMIC STORAGE ALLOCATION TECHNIQUES

The techniques needed to implement dynamic storage allocation depend on how storage is deallocated. If deallocation is implicit, the run-time support package is responsible for determining when a storage block is no longer needed. There is less a compiler has to do if the programmer does deallocation explicitly.

EXPLICIT ALLOCATION OF VARIABLE-SIZED BLOCKS

When blocks are allocated and deallocated, storage can become fragmented; that is, the heap may consist of alternating blocks that are free and in use. One method for allocating variable-sized blocks is called the first-fit method. When a block of size s is allocated, we search for the first free block that is of size f >= s. This block is then subdivided into a used block of size s and a free block of size f - s. Note that allocation incurs a time overhead because we must search for a free block that is large enough.
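A minimal sketch of the first-fit idea over a singly linked free list (simplified: no alignment, no coalescing; the structure and function names are this sketch's own assumptions):

    #include <stddef.h>

    /* One free block of the heap; free blocks are kept on a linked list. */
    struct free_block {
        size_t size;                    /* f, the size of this free block */
        struct free_block *next;
    };

    static struct free_block *free_list;   /* head of the available list */

    /* First fit: return the first free block with size f >= s, splitting it
       into a used block of size s and a free block of size f - s. */
    void *first_fit_alloc(size_t s)
    {
        struct free_block **pp = &free_list;
        for (struct free_block *b = free_list; b != NULL; pp = &b->next, b = b->next) {
            if (b->size >= s) {
                if (b->size - s >= sizeof(struct free_block)) {
                    /* split: the tail of the block stays on the free list */
                    struct free_block *rest = (struct free_block *)((char *)b + s);
                    rest->size = b->size - s;
                    rest->next = b->next;
                    *pp = rest;
                } else {
                    *pp = b->next;      /* too small to split: use it whole */
                }
                return b;               /* used block of (at least) size s  */
            }
        }
        return NULL;                    /* no free block is large enough    */
    }

A fuller allocator would also record the size of each used block so that deallocation can return it to the free list, which leads directly to the coalescing issue discussed next.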


When a block is deallocated, we check to see if it is next to a free block. If possible, the deallocated block is combined with the free block next to it to create a larger free block. Combining adjacent free blocks into a larger free block prevents further fragmentation from occurring. There are a number of subtle details concerning how free blocks are allocated, deallocated, and maintained in an available list or lists.

IMPLICIT DEALLOCATION

Implicit deallocation requires cooperation between the user program and the run-time package, because the latter needs to know when a storage block is no longer in use. This cooperation is implemented by fixing the format of storage blocks. Two approaches can be used for implicit deallocation. These are:

1. Reference counts. We keep track of the number of blocks that point directly to the present block. If this count ever drops to 0, then the block can be deallocated because it cannot be referred to. Reference counts are best used when pointers between blocks never appear in cycles.

2. Marking techniques. An alternative approach is to suspend temporarily the execution of the user program and then use the frozen pointers to determine which blocks are in use. This approach requires all the pointers into the heap to be known. Any block that is reached by a pointer is in use, and the rest can be deallocated. In more detail, we go through the heap and mark all blocks unused. Then, we follow pointers, marking as used any block that is reached in the process. A final sequential scan of the heap allows all blocks still marked unused to be collected.
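A toy C sketch of the reference-count approach (item 1 above); the block layout and helper names are assumptions of this sketch:

    #include <stdlib.h>

    /* A heap block that carries a count of the pointers referring to it. */
    struct block {
        int refcount;
        /* ... user data ... */
    };

    struct block *new_block(void)
    {
        struct block *b = calloc(1, sizeof(struct block));
        b->refcount = 1;                  /* one pointer refers to it initially */
        return b;
    }

    void add_reference(struct block *b)   { b->refcount++; }

    void drop_reference(struct block *b)
    {
        /* when the count drops to 0 the block can no longer be referred to,
           so it is deallocated immediately */
        if (--b->refcount == 0)
            free(b);
    }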

With variable-sized blocks, we have the additional possibility of moving used storage blocks from their current positions. This process, called compaction, moves all used blocks to one end of the heap so that all the free storage can be collected into one large free block. Compaction also requires information about the pointers in blocks, because when a used block is moved, all pointers to it have to be adjusted to reflect the move. Its advantage is that afterwards fragmentation of available storage is eliminated.

MODULE VIII. ERROR-DETECTION AND RECOVERY

ERROR DETECTION AND REPORTING

The syntax-analysis and semantic-analysis phases usually handle a large fraction of the errors detectable by the compiler.

Lexical Phase – can detect errors where the characters remaining in the input do not form any token of the language.
Syntax Analysis Phase – errors where the token stream violates the structure rules (syntax) of the language are determined.
Semantic Analysis Phase – the compiler tries to detect constructs that have the right syntactic structure but no meaning for the operation involved, e.g., if we try to add two identifiers.

Error Module (error.c)

Manages the error reporting, which is extremely primitive. On encountering a syntax error, the compiler prints a message saying that an error has occurred on the current input line and then halts. A better error recovery technique might skip to the next semicolon and continue parsing; the reader is encouraged to make this modification to the translator.

LEXICAL ERRORS

Other possible error-recovery actions are:
1.) Deleting an extraneous character.
2.) Inserting a missing character.
3.) Replacing an incorrect character by a correct character.
4.) Transposing two adjacent characters.

Error transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by just a single error transformation. This strategy assumes most lexical errors are the result of a single error transformation, an assumption usually, but not always, borne out in practice.

SYNTAX ERROR HANDLING

Programs can contain errors at many different levels. For example, errors can be:
• Lexical, such as misspelling an identifier, keyword, or operator.
• Syntactic, such as an arithmetic expression with unbalanced parentheses.
• Semantic, such as an operator applied to an incompatible operand.

• Logical, such as an infinitely recursive call.

THE ERROR HANDLER IN A PARSER HAS SIMPLE-TO-STATE GOALS:

• It should report the presence of errors clearly and accurately.
• It should recover from each error quickly enough to be able to detect subsequent errors.
• It should not significantly slow down the processing of correct programs.

VIABLE-PREFIX PROPERTY – meaning they detect that an error has occurred as soon as they see a prefix of the input that is not a prefix of any string in the language.

EXAMPLE OF A PASCAL PROGRAM:

(1)  program prmax ( input, output );
(2)  var
(3)      x, y: integer;
(4)  function max ( i: integer; j: integer ): integer;
(5)      { return maximum of integers i and j }
(6)  begin
(7)      if i > j then max := i
(8)      else max := j
(9)  end;
(10) begin
(11)     readln (x, y);
(12)     writeln (max (x, y))
(13) end.

THE ERROR-RECOVERY STRATEGIES:

1. Panic-mode recovery – this is the simplest method to implement and can be used by most parsing methods.
2. Phrase-level recovery – on discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue.
3. Error productions – if we have a good idea of the common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs.
4. Global correction – ideally, we would like a compiler to make as few changes as possible in processing an incorrect input string.

ERROR RECOVERY IN PREDICTIVE PARSING

An error is detected during predictive parsing when the terminal on top of the stack does not match the next input symbol, or when the non-terminal A is on top of the stack, a is the next input symbol, and the parsing table entry M[A,a] is empty.

Panic-mode error recovery – is based on the idea of skipping symbols on the input until a token in a selected set of synchronizing tokens appears. (A small sketch of this idea appears below.)

ERROR RECOVERY IN OPERATOR-PRECEDENCE PARSING

An operator-precedence parser can discover syntactic errors in the parsing process:
1.) If no precedence relation holds between the terminal on top of the stack and the current input.
2.) If a handle has been found, but there is no production with this handle as a right side.

HANDLING ERRORS DURING REDUCTIONS

• For example, suppose abc is popped, and there is no production right side consisting of a, b, and c together with zero or more non-terminals. Then we might consider whether deletion of one of a, b, and c yields a right side; if aEcE were a right side, we might issue the diagnostic missing b on line (line containing c).
• We might also consider changing or inserting a terminal. Thus if abEdc were a right side, we might issue a diagnostic missing d on line (line containing c).
• We may also find that there is a right side with the proper sequence of terminals, but the wrong pattern of non-terminals. For example, if abc is popped off the stack with no intervening or surrounding non-terminals, and abc is not a right side but the same terminals with an added non-terminal are, we might issue a diagnostic missing E on line (line containing b).
• Here E stands for an appropriate syntactic category represented by non-terminal E. For example, if a, b, or c is an operator, we might say "expression"; if a is a keyword like if, we might say "conditional".
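A minimal C sketch of the panic-mode recovery described above (the token codes and helper names are hypothetical): on an error, input symbols are skipped until a synchronizing token, here ';' or end, appears.

    /* Hypothetical token codes and a lookahead supplied by the lexical analyzer. */
    enum token { TOK_SEMI, TOK_END, TOK_ID, TOK_NUM, TOK_EOF /* ... */ };

    extern enum token lookahead;
    extern enum token next_token(void);
    extern void report_error(const char *msg);

    /* Panic mode: skip input symbols until a synchronizing token appears,
       then let the parser continue from a "safe" point. */
    void panic_mode_recover(void)
    {
        report_error("syntax error");
        while (lookahead != TOK_SEMI &&
               lookahead != TOK_END &&
               lookahead != TOK_EOF)
            lookahead = next_token();
    }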

          +     -     *     /     ↑     id    (     )     $
    +     .>    .>    <.    <.    <.    <.    <.    .>    .>
    -     .>    .>    <.    <.    <.    <.    <.    .>    .>
    *     .>    .>    .>    .>    <.    <.    <.    .>    .>
    /     .>    .>    .>    .>    <.    <.    <.    .>    .>
    ↑     .>    .>    .>    .>    <.    <.    <.    .>    .>
    id    .>    .>    .>    .>    .>                .>    .>
    (     <.    <.    <.    <.    <.    <.    <.    =
    )     .>    .>    .>    .>    .>                .>    .>
    $     <.    <.    <.    <.    <.    <.    <.

Operator-precedence relations

Let us consider this grammar:

    E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id

Graph for the precedence matrix: +, -, *, /, id, and ↑ – paths from an initial to a final node; right parenthesis – initial path; left parenthesis – final path.

SPECIFICALLY, THE CHECKER DOES THE FOLLOWING:
1.) If +, -, *, /, or ↑ is reduced, it checks that non-terminals appear on both sides. If not, it issues the diagnostic missing operand.
2.) If id is reduced, it checks that there is no non-terminal to the right or left. If there is, it can warn missing operator.
3.) If ( ) is reduced, it checks that there is a non-terminal between the parentheses. If not, it can say no expression between parentheses. Also, it must check that no non-terminal appears on either side of the parentheses. If one does, it issues the same diagnostic as in (2).

Handling Shift/Reduce Errors

          id    (     )     $
    id    e3    e3    .>    .>
    (     <.    <.    =     e4
    )     e3    e3    .>    .>
    $     <.    <.    e2    e1

Fig. 1. Operator-precedence matrix with error entries

THE SUBSTANCE OF THESE ERROR-HANDLING ROUTINES IS AS FOLLOWS:

e1: /* called when whole expression is missing */
    insert id onto the input
    issue diagnostic: "missing operand"
e2: /* called when expression begins with a right parenthesis */
    delete ) from the input
    issue diagnostic: "unbalanced right parenthesis"
e3: /* called when id or ) is followed by id or ( */
    insert + onto the input
    issue diagnostic: "missing operator"
e4: /* called when expression ends with a left parenthesis */
    pop ( from the stack
    issue diagnostic: "missing right parenthesis"

Let us consider how this error-handling mechanism would treat the erroneous input id +). The first actions taken by the parser are to shift id, reduce it to E (we use E for anonymous non-terminals on the stack), and then to shift the +. We now have the configuration

    STACK    INPUT
    $E+      )$

Since + .> ), a reduction is called for, and the handle is +. The error checker for reductions is required to inspect for E's to the left and right. Finding one missing, it issues the diagnostic missing operand and does the reduction anyway. Our configuration is now

    $E       )$

There is no precedence relation between $ and ), and the entry in Fig. 1 for this pair of symbols is e2. Routine e2 causes the diagnostic unbalanced right parenthesis to be printed and removes the right parenthesis from the input. We are now left with the final configuration for the parser:

    $E       $

ERROR RECOVERY IN LR PARSING

An LR parser will detect an error when it consults the parsing action table and finds an error entry. Errors are never detected by consulting the goto table. Unlike an operator-precedence parser, an LR parser will announce an error as soon as there is no valid continuation for the portion of the input thus far scanned.

LR parsing (canonical LR) – will never make even a single reduction before announcing an error.
SLR and LALR parsers – may make several reductions before announcing an error, but they never shift an erroneous input symbol onto the stack.

ERROR RECOVERY IN YACC

Yacc – stands for "yet another compiler-compiler," reflecting the popularity of parser generators in the early 1970's when the first version of Yacc was created by S.C. Johnson.
     – Is available as a command on the UNIX system, and has been used to help implement hundreds of compilers.

In Yacc, error recovery can be performed using a form of error productions. First, the user decides what "major" non-terminals will have error recovery associated with them. Typical choices are some subset of the non-terminals generating expressions, statements, blocks and procedures. The user then adds to the grammar error productions of the form A → error α, where:
    A – is a major non-terminal
    α – is a string of grammar symbols, perhaps the empty string
    error – is a Yacc reserved word
Yacc will generate a parser from such a specification, treating the error productions as ordinary productions.

EXAMPLE:

%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double   /* double type for Yacc stack */
%}

%token NUMBER
%left '+' '-'
%left '*' '/'
%right UMINUS

%%

lines : lines expr '\n'        { printf("%g\n", $2); }
      | lines '\n'
      | /* empty */
      | error '\n'             { yyerror("reenter last line:"); yyerrok; }
      ;

expr  : expr '+' expr          { $$ = $1 + $3; }
      | expr '-' expr          { $$ = $1 - $3; }
      | expr '*' expr          { $$ = $1 * $3; }
      | expr '/' expr          { $$ = $1 / $3; }
      | '(' expr ')'           { $$ = $2; }
      | '-' expr %prec UMINUS  { $$ = -$2; }
      | NUMBER
      ;

%%

#include "lex.yy.c"

Fig. 1. Desk calculator with error recovery


MODULE IX. INTRODUCTION TO CODE OPTIMIZATION

To create an efficient target language program, a programmer needs more than an optimizing compiler. In this section, we review the options available to a programmer and a compiler for creating efficient target programs. We mention the code-improving transformations that a programmer and a compiler writer can be expected to use to improve the performance of a program. We also consider the representation of programs on which transformations will be applied.

Criteria for Code-Improving Transformations

Simply stated, the best program transformations are those that yield the most benefit for the least effort. The transformations provided by an optimizing compiler should have several properties.

First, a transformation must preserve the meaning of programs. That is, an "optimization" must not change the output produced by a program for a given input, or cause an error, such as division by zero, that was not present in the original version of the source program.

Second, a transformation must, on the average, speed up programs by a measurable amount. Sometimes we are interested in reducing the space taken by compiled code, although the size of code has less importance than it once had. Of course, not every transformation succeeds in improving every program, and occasionally an "optimization" may slow down a program slightly, as long as on the average it improves things.

Third, a transformation must be worth the effort. It does not make sense for a compiler writer to expend the intellectual effort to implement a code-improving transformation and to have the compiler expend the additional time compiling source programs if this effort is not repaid when the target programs are executed.

Getting Better Performance

Dramatic improvements in the running time of a program, such as cutting the running time from a few hours to a few seconds, are usually obtained by improving the program at all levels, from the source level to the target level. At each level, the available options fall between the two extremes of finding a better algorithm and of implementing a given algorithm so that fewer operations are performed.

source code → [front end] → intermediate code → [code generator] → target code

    user can:              compiler can:              compiler can:
    profile program        improve loops              use registers
    change algorithm       procedure calls            select instructions
    transform loops        address calculations       do peephole transformations

Figure 1. Places for improvements by the user and the compiler

Algorithmic transformations occasionally produce spectacular improvements in running time. For example, Bentley [1982] relates that the running time of a program for sorting N elements dropped from 2.02N² microseconds to 12N log₂N microseconds when a carefully coded insertion sort was replaced by quicksort. For N = 100 the replacement speeds up the program by a factor of 2.5. For N = 100,000 the improvement is far more dramatic: the replacement speeds up the program by a factor of more than a thousand.

Unfortunately, no compiler can find the best algorithm for a given program. However, a compiler can replace a sequence of operations by an algebraically equivalent sequence, and thereby reduce the running time of a program significantly. Such savings are more common when algebraic transformations are applied to programs in very high-level languages, e.g., query languages for databases.


In this section and the next, a sorting program called quicksort will be used to illustrate the effect of various code-improving transformations. The C program below is derived from Sedgewick [1978], where hand-optimization of such a program is discussed.

void quicksort(m, n)
int m, n;
{
    int i, j;
    int v, x;
    if (n <= m) return;
    /* fragment begins here */
    i = m - 1; j = n; v = a[n];
    while (1) {
        do i = i + 1; while (a[i] < v);
        do j = j - 1; while (a[j] > v);
        if (i >= j) break;
        x = a[i]; a[i] = a[j]; a[j] = x;
    }
    x = a[i]; a[i] = a[n]; a[n] = x;
    /* fragment ends here */
    quicksort(m, j); quicksort(i + 1, n);
}

Figure 2. C code for quicksort

An Organization for an Optimizing Compiler

Advantages of the organization in figure 3:
1. The operations needed to implement high-level constructs are made explicit in the intermediate code, so it is possible to optimize them. For example, the address calculations for a[i] are explicit in figure 4, so the recomputation of expressions like 4*i can be eliminated as discussed in the next section.

front end → code optimizer → code generator

    code optimizer: control-flow analysis → data-flow analysis → transformations

Figure 3. Organization of the code optimizer

(1)  i := m - 1             (16) t7 := 4 * i
(2)  j := n                 (17) t8 := 4 * j
(3)  t1 := 4 * n            (18) t9 := a[t8]
(4)  v := a[t1]             (19) a[t7] := t9
(5)  i := i + 1             (20) t10 := 4 * j
(6)  t2 := 4 * i            (21) a[t10] := x
(7)  t3 := a[t2]            (22) goto (5)
(8)  if t3 < v goto (5)     (23) t11 := 4 * i
(9)  j := j - 1             (24) x := a[t11]
(10) t4 := 4 * j            (25) t12 := 4 * i
(11) t5 := a[t4]            (26) t13 := 4 * n
(12) if t5 > v goto (9)     (27) t14 := a[t13]
(13) if i >= j goto (23)    (28) a[t12] := t14
(14) t6 := 4 * i            (29) t15 := 4 * n
(15) x := a[t6]             (30) a[t15] := x

Figure 4. Three-address code for the fragment in figure 2

2. The intermediate code can be relatively independent of the target machine, so the optimizer does not have to change much if the code generator is replaced by one for a different machine. The intermediate code in figure 4 assumes that each element of the array a takes four bytes. Some intermediate codes, e.g., P-code for Pascal, leave it to the code generator to fill in the size of a machine word. We could have done the same in our intermediate code if we replaced 4 by a symbolic constant.


In the code optimizer, programs are represented by flow graphs, in which edges indicate the flow of control and nodes represent basic blocks.

Example 1. Figure 5 contains the flow graph for the program in figure 4. B1 is the initial node. All conditional and unconditional jumps to statements in figure 4 have been replaced in figure 5 by jumps to the block of which the statements are leaders. In figure 5, there are three loops. B2 and B3 are loops by themselves. Blocks B2, B3, B4, and B5 together form a loop, with entry B2.

B1:  i := m - 1
     j := n
     t1 := 4 * n
     v := a[t1]

B2:  i := i + 1
     t2 := 4 * i
     t3 := a[t2]
     if t3 < v goto B2

B3:  j := j - 1
     t4 := 4 * j
     t5 := a[t4]
     if t5 > v goto B3

B4:  if i >= j goto B6

B5:  t6 := 4 * i
     x := a[t6]
     t7 := 4 * i
     t8 := 4 * j
     t9 := a[t8]
     a[t7] := t9
     t10 := 4 * j
     a[t10] := x
     goto B2

B6:  t11 := 4 * i
     x := a[t11]
     t12 := 4 * i
     t13 := 4 * n
     t14 := a[t13]
     a[t12] := t14
     t15 := 4 * n
     a[t15] := x

Figure 5. Flow graph

PRINCIPAL SOURCES OF OPTIMIZATION

A transformation of a program is called local if it can be performed by looking only at the statements in a basic block; otherwise, it is called global. Many transformations can be performed at both the local and global levels. Local transformations are usually performed first.

FUNCTION-PRESERVING TRANSFORMATIONS

There are a number of ways in which a compiler can improve a program without changing the function it computes. Common subexpression elimination, copy propagation, dead-code elimination, and constant folding are common examples of such function-preserving transformations. Frequently, a program will include several calculations of the same value, such as an offset in an array. As mentioned, the programmer cannot avoid some of these duplicate calculations because they lie below the level of detail accessible within the source language. For example, block B5 shown below recalculates 4*i and 4*j.

(a) Before:                        (b) After:

B5:  t6 := 4 * i                   B5:  t6 := 4 * i
     x := a[t6]                         x := a[t6]
     t7 := 4 * i                        t8 := 4 * j
     t8 := 4 * j                        t9 := a[t8]
     t9 := a[t8]                        a[t6] := t9
     a[t7] := t9                        a[t8] := x
     t10 := 4 * j
     a[t10] := x

Figure 6. Local common subexpression elimination


COMMON SUBEXPRESSIONS

An occurrence of an expression E is called a common subexpression if E was previously computed, and the values of variables in E have not changed since the previous computation. We can avoid recomputing the expression if we can use the previously computed value. For example, the assignments to t7 and t10 have the common subexpressions 4*i and 4*j, respectively, on the right side in Figure 6(a). They have been eliminated in Figure 6(b) by using t6 instead of t7 and t8 instead of t10; this is what would result if we reconstructed the intermediate code from the dag for the basic block.

Example 2. Figure 7 shows the result of eliminating both global and local common subexpressions from blocks B5 and B6 in the flow graph of Figure 5.

B1:  i := m - 1
     j := n
     t1 := 4 * n
     v := a[t1]

B2:  i := i + 1
     t2 := 4 * i
     t3 := a[t2]
     if t3 < v goto B2

B3:  j := j - 1
     t4 := 4 * j
     t5 := a[t4]
     if t5 > v goto B3

B4:  if i >= j goto B6

B5:  x := t3
     a[t2] := t5
     a[t4] := x
     goto B2

B6:  x := t3
     t14 := a[t1]
     a[t2] := t14
     a[t1] := x

Figure 7. B5 and B6 after common subexpression elimination

After local common subexpressions are eliminated, B5 still evaluates 4*i and 4*j, as shown in figure 6(b). Both are common subexpressions; in particular, the three statements

    t8 := 4*j ; t9 := a[t8] ; a[t8] := x

in B5 can be replaced by

    t9 := a[t4] ; a[t4] := x

using t4 computed in block B3. In figure 7, observe that as control passes from the evaluation of 4*j in B3 to B5, there is no change in j, so t4 can be used if 4*j is needed.

Copy Propagation

Block B5 in figure 7 can be further improved by eliminating x, using two new transformations. One concerns assignments of the form f := g, called copy statements, or copies for short. Had we gone into more detail in Example 2, copies would have arisen much sooner, because the algorithm for eliminating common subexpressions introduces them, as do several other algorithms. For example, when the common subexpression in c := d + e is eliminated in figure 8, the algorithm uses a new variable t to hold the value of d + e. Since control may reach c := d + e either after the assignment to a or after the assignment to b, it would be incorrect to replace c := d + e by either c := a or c := b.

The idea behind the copy-propagation transformation is to use g for f, wherever possible after the copy statement f := g. For example, the assignment x := t3 in block B5 of figure 7 is a copy. Copy propagation applied to B5 yields

    x := t3
    a[t2] := t5
    a[t4] := t3
    goto B2


    a := d + e      b := d + e    |    t := d + e      t := d + e
           \         /            |    a := t          b := t
           c := d + e             |           \         /
                                  |            c := t

Figure 8. Copies introduced during common subexpression elimination

DEAD-CODE ELIMINATION

A variable is live at a point in a program if its value can be used subsequently; otherwise, it is dead at that point. A related idea is dead or useless code, statements that compute values that never get used. While the programmer is unlikely to introduce any dead code intentionally, it may appear as the result of previous transformations.

Three techniques are important for loop optimization (a small example of the third appears below):
1) Code motion, which moves code outside the loop;
2) Induction-variable elimination, which we apply to eliminate i and j from the inner loops B2 and B3 of figure 7; and
3) Reduction in strength, which replaces an expensive operation by a cheaper one, such as a multiplication by an addition.
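An illustrative C fragment (not taken from the quicksort example) showing the effect of reduction in strength: the multiplication 4*i inside the loop is replaced by an addition on a new temporary.

    /* Before: the loop computes 4*i on every iteration. */
    void before(int a[], int n)
    {
        for (int i = 0; i < n; i++) {
            int t = 4 * i;        /* multiplication inside the loop */
            a[i] = t;
        }
    }

    /* After reduction in strength: t is kept in step with i,
       so the multiplication becomes an addition. */
    void after(int a[], int n)
    {
        int t = 0;
        for (int i = 0; i < n; i++) {
            a[i] = t;
            t = t + 4;            /* cheaper operation replaces 4*i */
        }
    }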

OPTIMIZATION OF BASIC BLOCKS

Example 1.5. A dag for the block (10.3)

    a := b + c
    b := a - d
    c := b + c
    d := a - d

Dag for basic block (10.3): the leaves are b0, c0, and d0; the node +(b0, c0) is labeled a; the node -(a, d0) is labeled b and d; and the node +(b, c0) is labeled c.

DEALING WITH ALIASES

If two or more expressions denote the same memory address, we say that the expressions are aliases of one another. In this section we shall consider data-flow analysis in the presence of pointers and procedures, both of which introduce aliases.

A SIMPLE POINTER LANGUAGE

We must also make certain assumptions about which arithmetic operations on pointers are semantically meaningful. First, if pointer p points to a primitive (one-word) data element, then any arithmetic operation on p produces a value that may be an integer, but not a pointer. If p points to an array, then addition or subtraction of an integer leaves p pointing somewhere in the same array, while other arithmetic operations on pointers produce a value that is not a pointer. While not all languages prohibit, say, moving a pointer from one array a to another array b by adding to the pointer, such an action would depend on the particular implementation to make sure that array b followed a in storage.

THE EFFECTS OF POINTER ASSIGNMENTS

Under these assumptions, the only variables that could possibly be used as pointers are those declared to be pointers and temporaries that receive a value that is a pointer plus or minus a constant. We shall refer to all these variables as pointers.

We shall define in[B], for block B, to be the function that gives for each pointer p the set of variables to which p could point at the beginning of B. Formally, in[B] is a set of pairs of the form (p, a), where p is a pointer and a a variable, meaning that p might point to a. In practice, in[B] might be represented as a list for each pointer, the list for p giving the set of a's such that (p, a) is in in[B]. We define out[B] similarly for the end of B.

TRANSFER FUNCTION

transB
    – defines the effect of block B.
    – is a function taking as argument a set of pairs S, each pair of the form (p, a) for p a pointer and a a nonpointer variable, and producing another set T.

The rules for computing trans:
1. If s is p := &a, or p := &a ± c in the case a is an array, then
   transs(S) = (S - {(p, b) | any variable b}) ∪ {(p, a)}
2. If s is p := q ± c for pointer q and nonzero integer c, then
   transs(S) = (S - {(p, b) | any variable b}) ∪ {(p, b) | (q, b) is in S and b is an array variable}
   Note that this rule makes sense even if p = q.
3. If s is p := q, then
   transs(S) = (S - {(p, b) | any variable b}) ∪ {(p, b) | (q, b) is in S}
4. If s assigns to pointer p any other expression, then
   transs(S) = S - {(p, b) | any variable b}
5. If s is not an assignment to a pointer, then transs(S) = S.

COMPUTING ALIASES

Before we can answer the question of what variables might change in a given procedure, we must develop an algorithm for finding aliases. The approach we shall use here is a simple one. We compute a relation ≡ on variables that formalizes the notion "can be an alias of."

Example #1. Simple alias computation.
INPUT. A collection of procedures and global variables.
OUTPUT. An equivalence relation ≡ with the property that whenever there is a position in the program where x and y are aliases of one another, x ≡ y; the converse need not be true.
METHOD.
1. Rename variables, if necessary, so that no two procedures use the same formal parameter or local variable identifier, nor do a local, formal, or global share an identifier.
2. If there is a procedure p(x1, x2, ..., xn) and an invocation p(y1, y2, ..., yn) of that procedure, set xi ≡ yi for all i. That is, each formal parameter can be an alias of any of its corresponding actual parameters.
3. Take the reflexive and transitive closure of the actual-formal correspondences by adding
   a) x ≡ y whenever y ≡ x
   b) x ≡ z whenever x ≡ y and y ≡ z for some y.

Example #2.

    global g, h;
    procedure main();
        local i;
        g := ... ;
        one(h, i)
    end
    procedure one(w, x);
        x := ... ;
        two(w, w);
        two(g, x)
    end
    procedure two(y, z);
        local k;
        h := ... ;
        one(k, y)
    end

DATA-FLOW ANALYSIS IN THE PRESENCE OF PROCEDURE CALLS

We may define, for each procedure p, a set change[p], whose value is to be the set of global variables and formal parameters of p that might be changed during an execution of p. At this point, we do not count a variable as changed if a member of its equivalence class of aliases is changed.


Let def[p] be the set of formal parameters and globals having explicit definitions within p (not including those defined within procedures called by p). To write the equations for change[p], we have only to relate the globals and formals of p that are used as actual parameters in calls made by p to the corresponding formal parameters of the called procedures. We may write:

    change[p] = def[p] ∪ A ∪ G

where
1. A = {a | a is a global variable or formal parameter of p such that, for some procedure q and integer i, p calls q with a as the ith actual parameter and the ith formal parameter of q is in change[q]}
2. G = {g | g is a global in change[q] and p calls q}.

Iterative algorithm to compute change:

(1)  for each procedure p do change[p] := def[p];   /* initialize */
(2)  while changes to any change[p] occur do
(3)      for each procedure p do
(4)          for each procedure q called by p do begin
(5)              add any global variables in change[q] to change[p];
(6)              for each formal parameter x (the jth) of q do
(7)                  if x is in change[q] then
(8)                      for each call of q by p do
(9)                          if a, the jth actual parameter of the call, is a global
                             or formal parameter of p, then add a to change[p]
(10)         end

DATA-FLOW ANALYSIS OF STRUCTURED FLOW GRAPHS

This section covers a variety of flow-graph concepts, such as "interval analysis," that are primarily relevant to structured flow graphs.

Depth-first ordering is a useful ordering of the nodes of a flow graph; it is a generalization of the depth-first traversal of a tree. It can be used to detect loops in any flow graph, and it helps speed up iterative data-flow algorithms. The depth-first ordering of the nodes is the reverse of the order in which we last visit the nodes in the preorder traversal.

Depth-first search algorithm:

    procedure search(n);
    begin
(1)     mark n "visited";
(2)     for each successor s of n do
(3)         if s is "unvisited" then begin
(4)             add edge n → s to T;
(5)             search(s)
            end;
(6)     dfn[n] := i;
(7)     i := i - 1
    end;

    /* main program follows */
(8)  T := empty;   /* set of edges */
(9)  for each node n of G do mark n "unvisited";
(10) i := number of nodes of G;
(11) search(n0)
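The search procedure above can be rendered in C roughly as follows; the adjacency-list arrays and the fixed bound are assumptions of this sketch, not part of the text.

    #define MAXNODES 100

    int nsucc[MAXNODES];              /* number of successors of each node   */
    int succ[MAXNODES][MAXNODES];     /* successor lists of the flow graph   */
    int visited[MAXNODES];
    int dfn[MAXNODES];                /* depth-first number of each node     */
    int counter;                      /* plays the role of i in the text     */

    void search(int n)
    {
        visited[n] = 1;               /* mark n "visited"                    */
        for (int k = 0; k < nsucc[n]; k++) {
            int s = succ[n][k];
            if (!visited[s])          /* n -> s is a tree edge of T
                                         (the tree itself is not stored here) */
                search(s);
        }
        dfn[n] = counter--;           /* number n on the way back out        */
    }

    void depth_first_order(int nnodes, int n0)
    {
        for (int n = 0; n < nnodes; n++) visited[n] = 0;
        counter = nnodes;
        search(n0);
    }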


INTERVALS

An "interval" in a flow graph is a natural loop plus an acyclic structure that dangles from the nodes of that loop. An important property of intervals is that they have header nodes that dominate all the nodes in the interval; that is, every interval is a region.

Interval graph sequence (a)-(d): successive interval graphs in which the node groups {1,2}, {3}, {4,5,6}, and {7,8,9,10} are collapsed step by step until a single node {1,...,10} remains.

MODULE X. CODE GENERATION

INTERMEDIATE LANGUAGE

Syntax trees and postfix notation are introduced in Sections 5.2 and 2.3, respectively; they are two kinds of intermediate representation, and the third, called three-address code, will be used in this topic. The semantic rules for generating three-address code from common programming language constructs are similar to those for constructing syntax trees or for generating postfix notation.

GRAPHICAL REPRESENTATIONS

A syntax tree depicts the natural hierarchical structure of a source program. A dag gives the same information but in a more compact way because common subexpressions are identified. A syntax tree and dag for the assignment statement a := b * - c + b * - c appear in Fig. 8.2.

(a) Syntax tree:

    assign
    ├── a
    └── +
        ├── *
        │   ├── b
        │   └── uminus
        │       └── c
        └── *
            ├── b
            └── uminus
                └── c

(b) Dag: the same, except that the two * subtrees for b * - c are a single shared node.

Fig. 8.2. Graphical representations of a := b * - c + b * - c

Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree in which a node appears immediately after its children. The postfix notation for the syntax tree in Fig. 8.2(a) is

    a b c uminus * b c uminus * + assign

The edges in a syntax tree do not appear explicitly in postfix notation. They can be recovered from the order in which the nodes appear and the number of operands that the operator at a node expects.


Syntax trees for assignment statements are produced by the syntax-directed definition in Fig. 8.3; it is an extension of the one in Section 5.2. Nonterminal S generates an assignment statement. The two binary operators + and * are representative of the full operator set in a typical language. Operator associativities and precedences are the usual ones, even though they have not been put into the grammar. This definition constructs the tree of Fig. 8.2(a) from the input a := b * - c + b * - c.

PRODUCTION        SEMANTIC RULE
S → id := E       S.nptr := mknode('assign', mkleaf(id, id.place), E.nptr)
E → E1 + E2       E.nptr := mknode('+', E1.nptr, E2.nptr)
E → E1 * E2       E.nptr := mknode('*', E1.nptr, E2.nptr)
E → - E1          E.nptr := mkunode('uminus', E1.nptr)
E → ( E1 )        E.nptr := E1.nptr
E → id            E.nptr := mkleaf(id, id.place)

Fig. 8.3. Syntax-directed definition to produce syntax trees for assignment statements

This same syntax-directed definition will produce the dag in Fig. 8.2(b) if the functions mknode(op, child) and mknode(op, left, right) return a pointer to an existing node whenever possible, instead of constructing new nodes. The token id has an attribute place that points to the symbol-table entry for the identifier. Two representations of the syntax tree in Fig. 8.2(a) appear in Fig. 8.4. Each node is represented as a record with a field for its operator and additional fields for pointers to its children. In Fig. 8.4(b), nodes are allocated from an array of records, and the index or position of the node serves as the pointer to the node. All the nodes in the syntax tree can be visited by following pointers, starting from the root at position 10.
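A rough C rendering of such node records and constructor functions (the type and field names are this sketch's own assumptions, not the text's):

    #include <stdlib.h>
    #include <string.h>

    /* A node record: an operator field plus pointers to the children.
       Leaves for identifiers carry a pointer to the symbol-table entry. */
    struct node {
        char op[8];          /* "assign", "+", "*", "uminus", "id", ...  */
        struct node *left;
        struct node *right;
        void *place;         /* symbol-table entry, used only by leaves  */
    };

    static struct node *alloc_node(const char *op)
    {
        struct node *n = calloc(1, sizeof(struct node));
        strncpy(n->op, op, sizeof(n->op) - 1);
        return n;
    }

    struct node *mkleaf(const char *op, void *place)
    {
        struct node *n = alloc_node(op);
        n->place = place;
        return n;
    }

    struct node *mkunode(const char *op, struct node *child)
    {
        struct node *n = alloc_node(op);
        n->left = child;
        return n;
    }

    struct node *mknode(const char *op, struct node *left, struct node *right)
    {
        struct node *n = alloc_node(op);
        n->left = left;
        n->right = right;
        return n;
    }

With these, the rule for E → E1 + E2 would build mknode("+", E1.nptr, E2.nptr), as in Fig. 8.3; a dag-building variant would first look for an existing node with the same operator and children before allocating a new one.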

(a) A linked representation: each node is a record holding its operator and pointers to its children; the id leaves point to the symbol-table entries for a, b, and c.

(b) An array-of-records representation:

    position   op       fields
    0          id       b
    1          id       c
    2          uminus   1
    3          *        0   2
    4          id       b
    5          id       c
    6          uminus   5
    7          *        4   6
    8          +        3   7
    9          id       a
    10         assign   9   8
    11         ...

Fig. 8.4. Two representations of the syntax tree in Fig. 8.2(a)

Three-Address Code

Three-address code is a sequence of statements of the general form

    x := y op z

where x, y, and z are names, constants, or compiler-generated temporaries; op stands for any operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on Boolean-valued data. Thus a source language expression like x + y * z might be translated into a sequence

    t1 := y * z
    t2 := x + t1

where t1 and t2 are compiler-generated temporary names. This unraveling of complicated arithmetic expressions and of nested flow-of-control statements makes three-address code desirable for target code generation and optimization. The use of names for the intermediate values computed by a program allows three-address code to be easily rearranged, unlike postfix notation. Three-address code is a linearized representation of a syntax tree or a dag in which explicit names correspond to the interior nodes of the graph. The syntax tree and dag in Fig. 8.2 are represented by the three-address code sequences in Fig. 8.5.

Variable names can appear directly in three-address statements, so Fig. 8.5(a) has no statements corresponding to the leaves in Fig. 8.4.

(a) Code for the syntax tree          (b) Code for the dag

    t1 := - c                             t1 := - c
    t2 := b * t1                          t2 := b * t1
    t3 := - c                             t5 := t2 + t2
    t4 := b * t3                          a := t5
    t5 := t2 + t4
    a := t5

Fig. 8.5. Three-address code corresponding to the tree and dag in Fig. 8.2

The reason for the term "three-address code" is that each statement usually contains three addresses, two for the operands and one for the result. In the implementations of three-address code given later in this section, a programmer-defined name is replaced by a pointer to a symbol-table entry for that name.

Types of Three-Address Statements

Three-address statements are akin to assembly code. Statements can have symbolic labels and there are statements for flow of control. A symbolic label represents the index of a three-address statement in the array holding intermediate code. Here are the common three-address statements used in the remainder of this book:

1. Assignment statements of the form x := y op z, where op is a binary arithmetic or logical operation.

2. Assignment instructions of the form x := op y, where op is a unary operation. Essential unary operations include unary minus, logical negation, shift operators, and conversion operators that, for example, convert a fixed-point number to a floating-point number.
3. Copy statements of the form x := y, where the value of y is assigned to x.
4. The unconditional jump goto L. The three-address statement with label L is the next to be executed.
5. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator (<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation relop to y.
6. param x and call p, n for procedure calls, and return y, where y representing a returned value is optional. Their typical use is as the sequence of three-address statements

    param x1
    param x2
    . . .
    param xn
    call p, n

generated as part of a call of the procedure p(x1, x2, ..., xn). The integer n indicating the number of actual parameters in "call p, n" is not redundant because calls can be nested.
7. Indexed assignments of the form x := y[i] and x[i] := y. The first of these sets x to the value in the location i memory units beyond location y. The statement x[i] := y sets the contents of the location i units beyond x to the value of y. In both these instructions, x, y, and i refer to data objects.
8. Address and pointer assignments of the form x := &y, x := *y, and *x := y. The first of these sets the value of x to be the location of y. Presumably y is a name, perhaps a temporary, that denotes an expression with an l-value such as A[i,j], and x is a pointer name or temporary. That is, the r-value of x is the l-value (location) of some object. In the statement x := *y, presumably y is a pointer or a temporary whose r-value is a location. The r-value of x is made equal to the contents of that location. Finally, *x := y sets the r-value of the object pointed to by x to the r-value of y.

Syntax-Directed Translation into Three-Address Code

When three-address code is generated, temporary names are made up for the interior nodes of a syntax tree. The value of nonterminal E on the left side of E → E1 + E2 will be computed into a new temporary t. In general, the three-address code for id := E consists of code to evaluate E into some temporary t, followed by the assignment id.place := t.


The S-attributed definition in Fig. 8.6 generates three-address code for assignment statements. Given input a := b*-c + b*-c, it produces the code in Fig. 8.5(a). The synthesized attribute S.code represents the three-address code for the assignment S. The nonterminal E has two attributes:
1. E.place, the name that will hold the value of E, and
2. E.code, the sequence of three-address statements evaluating E.

The function newtemp returns a sequence of distinct names t1, t2, ... in response to successive calls. For convenience, we use the notation gen(x ':=' y '+' z) in Fig. 8.6 to represent the three-address statement x := y + z. Flow-of-control statements can be added to the language of assignments in Fig. 8.6 by productions and semantic rules like the ones for while statements in Fig. 8.7.

PRODUCTION        SEMANTIC RULE
S → id := E       S.code := E.code || gen(id.place ':=' E.place)
E → E1 + E2       E.place := newtemp;
                  E.code := E1.code || E2.code ||
                            gen(E.place ':=' E1.place '+' E2.place)
E → E1 * E2       E.place := newtemp;
                  E.code := E1.code || E2.code ||
                            gen(E.place ':=' E1.place '*' E2.place)
E → - E1          E.place := newtemp;
                  E.code := E1.code || gen(E.place ':=' 'uminus' E1.place)
E → ( E1 )        E.place := E1.place;
                  E.code := E1.code
E → id            E.place := id.place;
                  E.code := ' '

Fig. 8.6. Syntax-directed definition to produce three-address code for assignments

    S.begin:   E.code
               if E.place = 0 goto S.after
               S1.code
               goto S.begin
    S.after:   ...

PRODUCTION           SEMANTIC RULES
S → while E do S1    S.begin := newlabel;
                     S.after := newlabel;
                     S.code := gen(S.begin ':') ||
                               E.code ||
                               gen('if' E.place '=' '0' 'goto' S.after) ||
                               S1.code ||
                               gen('goto' S.begin) ||
                               gen(S.after ':')

Fig. 8.7. Semantic rules generating code for a while statement

In the figure, the code for S → while E do S1 is generated using new attributes S.begin and S.after to mark the first statement in the code for E and the statement following the code for S, respectively. These attributes represent labels created by a function newlabel that returns a new label every time it is called. Note that S.after becomes the label of the statement following the code for the while statement; that is, when the value of E becomes zero, control leaves the while statement.

Implementations of Three-Address Statements

A three-address statement is an abstract form of intermediate code. Three such representations are quadruples, triples, and indirect triples.

Quadruples

A quadruple is a record structure with four fields, which we call op, arg1, arg2, and result. The op field contains an internal code for the operator. The three-address statement x := y op z is represented by placing y in arg1, z in arg2, and x in result. Statements with unary operators like x := -y or x := y do not use arg2.

Triples

To avoid entering temporary names into the symbol table, we might refer to a temporary value by the position of the statement that computes it. If we do so, three-address statements can be represented by records with only three fields: op, arg1, and arg2, as in Fig. 8.8(b). Since three fields are used, this intermediate code format is known as triples.
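One possible C layout for the two record formats (the struct and field names are assumptions of this sketch; real compilers would typically store pointers to symbol-table entries rather than plain integers):

    /* A quadruple: operator, two operands, and an explicit result name. */
    struct quad {
        int op;          /* internal code for the operator            */
        int arg1;        /* index of a symbol-table entry (or unused) */
        int arg2;
        int result;
    };

    /* A triple: no result field; a temporary value is referred to by the
       position (index) of the statement that computes it. */
    struct triple {
        int op;
        int arg1;        /* a symbol-table index or a triple number   */
        int arg2;
    };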

(a) Quadruples

         op       arg1    arg2    result
    (0)  uminus   c               t1
    (1)  *        b       t1      t2
    (2)  uminus   c               t3
    (3)  *        b       t3      t4
    (4)  +        t2      t4      t5
    (5)  :=       t5              a

(b) Triples

         op       arg1    arg2
    (0)  uminus   c
    (1)  *        b       (0)
    (2)  uminus   c
    (3)  *        b       (2)
    (4)  +        (1)     (3)
    (5)  assign   a       (4)

Fig. 8.8. Quadruple and triple representations of three-address statements A ternary operation like x[i] := y requires two entries in the triple structure, as shown in Fig. 8.9(a), while x := y[i] is naturally represented as two operations in Fig. 8.9(b).

(a) x[i] := y

         op       arg1    arg2
    (0)  [ ]=     x       i
    (1)  assign   (0)     y

(b) x := y[i]

         op       arg1    arg2
    (0)  =[ ]     y       i
    (1)  assign   x       (0)

Fig. 8.9. More triple representations

Indirect Triples

Another implementation of three-address code that has been considered is that of listing pointers to triples, rather than listing the triples themselves. This implementation is naturally called indirect triples. For example, let us use an array statement to list pointers to triples in the desired order. Then the triples in Fig. 8.8(b) might be represented as in Fig. 8.10.

         statement                   op       arg1    arg2
    (0)  (14)                  (14)  uminus   c
    (1)  (15)                  (15)  *        b       (14)
    (2)  (16)                  (16)  uminus   c
    (3)  (17)                  (17)  *        b       (16)
    (4)  (18)                  (18)  +        (15)    (17)
    (5)  (19)                  (19)  assign   a       (18)

Fig. 8.10. Indirect triples representation of three-address statements

DECLARATIONS

As the sequence of declarations in a procedure or block is examined, we can lay out storage for names local to the procedure. For each local name, we create a symbol-table entry with information like the type and the relative address of the storage for the name. The relative address consists of an offset from the base of the static data area or the field for local data in an activation record.

Declarations in a Procedure

In the translation scheme of Fig. 8.11, nonterminal P generates a sequence of declarations of the form id : T. Before the first declaration is considered, offset is set to 0.


The procedure enter(name, type, offset) creates a symbol-table entry for name, gives it type type and relative address offset in its data area. We use synthesized attributes type and width for nonterminal T to indicate the type and width. Attribute type represents a type expression constructed from the basic types integer and real by applying the type constructors pointer and array. In Fig. 8.11, integers have width 4 and reals have width 8. The width of an array is obtained by multiplying the width of each element by the number of elements in the array. The width of each pointer is assumed to be 4.

P → { offset := 0 } D
D → D ; D
D → id : T                  { enter(id.name, T.type, offset);
                              offset := offset + T.width }
T → integer                 { T.type := integer; T.width := 4 }
T → real                    { T.type := real; T.width := 8 }
T → array [ num ] of T1     { T.type := array(num.val, T1.type);
                              T.width := num.val x T1.width }
T → ↑ T1                    { T.type := pointer(T1.type); T.width := 4 }

Fig. 8.11. Computing the types and relative addresses of declared names

Keeping Track of Scope Information

The semantic rules are defined in terms of the following operations:

1. mktable(previous) creates a new symbol table and returns a pointer to the new table. The argument previous points to a previously created symbol table, presumably that for the enclosing procedure. The pointer previous is placed in a header for the new symbol table, along with additional information such as the nesting depth of a procedure. We can also number the procedures in the order they are declared and keep this number in the header.
2. enter(table, name, type, offset) creates a new entry for name name in the symbol table pointed to by table. Again, enter places type type and relative address offset in fields within the entry.
3. addwidth(table, width) records the cumulative width of all the entries in table in the header associated with this symbol table.
4. enterproc(table, name, newtable) creates a new entry for procedure name in the symbol table pointed to by table. The argument newtable points to the symbol table for this procedure name.

The translation scheme in Fig. 8.13 shows how data can be laid out in one pass, using a stack tblptr to hold pointers to the symbol tables of the enclosing procedures.

P → M D                    { addwidth(top(tblptr), top(offset));
                             pop(tblptr); pop(offset) }
M → ε                      { t := mktable(nil);
                             push(t, tblptr); push(0, offset) }
D → D1 ; D2
D → proc id ; N D1 ; S     { t := top(tblptr);
                             addwidth(t, top(offset));
                             pop(tblptr); pop(offset);
                             enterproc(top(tblptr), id.name, t) }
D → id : T                 { enter(top(tblptr), id.name, T.type, top(offset));
                             top(offset) := top(offset) + T.width }
N → ε                      { t := mktable(top(tblptr));
                             push(t, tblptr); push(0, offset) }

Fig. 8.13. Processing declarations in nested procedures

Field Names in Records

The following production allows nonterminal T to generate records in addition to basic types, pointers, and arrays:

    T → record D end

The actions in the translation scheme of Fig. 8.14 emphasize the similarity between the layout of records as a language construct and activation records.

T → record L D end    { T.type := record(top(tblptr));
                        T.width := top(offset);
                        pop(tblptr); pop(offset) }
L → ε                 { t := mktable(nil);
                        push(t, tblptr); push(0, offset) }

Fig. 8.14. Setting up a symbol table for field names in a record


ASSIGNMENT STATEMENTS Expressions can be of type integer, real, array, and record in this section. As part of the translation of assignments into three-address code, we show how names can be looked up in the symbol table and how elements of arrays and records can be accessed. Addressing Array Elements Elements of an array can be accessed quickly if the elements are stored in a block of consecutive locations. If the width of each array element w, then the ith element of array A begins in location base + (i – low) X w

(8.4)

Type Conversions within Assignments

In practice, there would be many different types of variables and constants, so the compiler must either reject certain mixed-type operations or generate appropriate coercion (type conversion) instructions. Consider the grammar for assignment statements as above, but suppose there are two types – real and integer, with integers converted to reals when necessary. We introduce another attribute E.type, whose value is either real or integer. The semantic rule for E.type associated with the production E → E1 + E2 is:

E → E1 + E2    { E.type := if E1.type = integer and E2.type = integer
                           then integer else real }

The entire semantic rule for E → E1 + E2 and most of the other productions must be modified to generate, when necessary, three-address statements of the form x := inttoreal y, whose effect is to convert integer y to a real of equal value, called x. For example, for the input

    x := y + i * j

assuming x and y have type real, and i and j have type integer, the output would look like

    t1 := i int* j
    t3 := inttoreal t1
    t2 := y real+ t3
    x  := t2
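The following Python sketch (an illustration under assumed helper names such as newtemp and emit, not the text's code) shows how such a coercion rule might drive code generation for a sum E1 + E2:

temps = 0
def newtemp():
    # return a fresh temporary name t1, t2, ...
    global temps
    temps += 1
    return "t" + str(temps)

def emit(stmt):
    print(stmt)

def gen_add(e1_place, e1_type, e2_place, e2_type):
    # coerce an integer operand to real when the other operand is real
    if e1_type == "integer" and e2_type == "integer":
        t = newtemp()
        emit(t + " := " + e1_place + " int+ " + e2_place)
        return t, "integer"
    if e1_type == "integer":
        u = newtemp()
        emit(u + " := inttoreal " + e1_place)
        e1_place = u
    if e2_type == "integer":
        u = newtemp()
        emit(u + " := inttoreal " + e2_place)
        e2_place = u
    t = newtemp()
    emit(t + " := " + e1_place + " real+ " + e2_place)
    return t, "real"

For instance, gen_add("y", "real", "t1", "integer") emits an inttoreal conversion of t1 followed by a real+ addition, mirroring the example above.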

BOOLEAN EXPRESSIONS

In programming languages, Boolean expressions have two primary purposes. They are used to compute logical values, but more often they are used as conditional expressions in statements that alter the flow of control, such as if-then, if-then-else, or while-do statements. Boolean expressions are composed of the Boolean operators (and, or, and not) applied to elements that are Boolean variables or relational expressions. In turn, relational expressions are of the form E1 relop E2, where E1 and E2 are arithmetic expressions. Some languages, such as PL/I, allow more general expressions, where Boolean, arithmetic, and relational operators can be applied to expressions of any type whatever, with no distinction between Boolean and arithmetic values; coercion is performed when necessary. In this section, we consider Boolean expressions generated by the following grammar:

E → E or E | E and E | not E | ( E ) | id relop id | true | false


We use the attribute op to determine which of the comparison operators <, ≤, =, ≠, >, or ≥ is represented by relop. As is customary, we assume that or and and are left associative, and that or has lowest precedence, then and, then not.

Methods of Translating Boolean Expressions

There are two principal methods of representing the value of a Boolean expression. The first method is to encode true and false numerically and to evaluate a Boolean expression analogously to an arithmetic expression. Often 1 is used to denote true and 0 to denote false, although many other encodings are possible. For example, we could let any nonzero quantity denote true and zero denote false, or any nonnegative quantity denote true and any negative number denote false.

The second principal method of implementing Boolean expressions is by flow of control, that is, representing the value of a Boolean expression by a position reached in a program. This method is particularly convenient in implementing the Boolean expressions in flow-of-control statements, such as the if-then and while-do statements. For example, given the expression E1 or E2, if we determine that E1 is true, then we can conclude that the entire expression is true without having to evaluate E2.

Short-Circuit Code

We can also translate a Boolean expression into three-address code without generating code for any of the Boolean operators and without having the code necessarily evaluate the entire expression. This style of evaluation is sometimes called "short-circuit" or "jumping" code. It is possible to evaluate Boolean expressions without generating code for the Boolean operators and, or, and not if we represent the value of an expression by a position in the code sequence. For example, in Fig. 8.21, we can tell what value t1 will have by whether we reach statement 101 or statement 103, so the value of t1 is redundant. For many Boolean expressions, it is possible to determine the value of the expression without having to evaluate it completely.

Flow-of-Control Statements

We now consider the translation of Boolean expressions into three-address code in the context of if-then, if-then-else, and while-do statements, such as those generated by the following grammar:

S → if E then S1 | if E then S1 else S2 | while E do S1

In each of these productions, E is the Boolean expression to be translated. In the translation, we assume that a three-address statement can be symbolically labeled, and that the function newlabel returns a new symbolic label each time it is called.

PRODUCTION                      SEMANTIC RULES

S → if E then S1                E.true := newlabel;
                                E.false := S.next;
                                S1.next := S.next;
                                S.code := E.code || gen(E.true ':') || S1.code

S → if E then S1 else S2        E.true := newlabel;
                                E.false := newlabel;
                                S1.next := S.next;
                                S2.next := S.next;
                                S.code := E.code || gen(E.true ':') || S1.code ||
                                          gen('goto' S.next) || gen(E.false ':') || S2.code

S → while E do S1               S.begin := newlabel;
                                E.true := newlabel;
                                E.false := S.next;
                                S1.next := S.begin;
                                S.code := gen(S.begin ':') || E.code || gen(E.true ':') ||
                                          S1.code || gen('goto' S.begin)

Fig. 8.23. Syntax-directed definition for flow-of-control statements.
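As a hedged illustration of the rules in Fig. 8.23, the if-then-else case could be coded roughly as follows; the helper names newlabel and gen are assumptions modeled on the description above, and E, S1, and S2 are assumed to be functions that return lists of three-address statements.

labels = 0
def newlabel():
    # return a fresh symbolic label L1, L2, ...
    global labels
    labels += 1
    return "L" + str(labels)

def gen(s):
    # a single generated three-address statement, kept as a string
    return [s]

def translate_if_then_else(E, S1, S2, S_next):
    # E(true_label, false_label) and S1/S2(next_label) return code lists,
    # mirroring E.code, S1.code and S2.code in Fig. 8.23
    E_true = newlabel()
    E_false = newlabel()
    return (E(E_true, E_false)
            + gen(E_true + ":") + S1(S_next)
            + gen("goto " + S_next)
            + gen(E_false + ":") + S2(S_next))

For example, translate_if_then_else(lambda t, f: ["if a < b goto " + t, "goto " + f], lambda n: ["x := 1"], lambda n: ["x := 0"], "Lnext") yields the jump structure prescribed by the second rule of Fig. 8.23.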


We discuss the translation of flow-of-control statements in more detail in Section 8.6, where an alternative method, called "backpatching," emits code for such statements in one pass.

Control-Flow Translation of Boolean Expressions

We now discuss E.code, the code produced for the Boolean expressions E in Fig. 8.23. As we have indicated, E is translated into a sequence of three-address statements that evaluates E as a sequence of conditional and unconditional jumps to one of two locations: E.true, the place control is to reach if E is true, and E.false, the place control is to reach if E is false.

The basic idea behind the translation is the following. Suppose E is of the form a < b. Then the generated code is of the form

    if a < b goto E.true
    goto E.false

Suppose E is of the form E1 or E2. If E1 is true, then we immediately know that E itself is true, so E1.true can be made the same as E.true. If E1 is false, then E2 must be evaluated, so we make E1.false the label of the first statement in the code for E2. The true and false exits of E2 can be made the same as the true and false exits of E. Analogous considerations apply to an expression of the form E1 and E2. No code is needed for an expression E of the form not E1: we just interchange the true and false exits of E1 to get the true and false exits of E.
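A minimal Python sketch of this jumping-code scheme follows; the helper names newlabel and emit are assumptions, and the expression is represented as a small tuple-based tree purely for illustration.

code = []
labels = 0

def newlabel():
    global labels
    labels += 1
    return "L" + str(labels)

def emit(stmt):
    code.append(stmt)

def gen_bool(E, true_label, false_label):
    # E is ("rel", "a < b"), ("or", E1, E2), ("and", E1, E2), or ("not", E1)
    kind = E[0]
    if kind == "rel":
        emit("if " + E[1] + " goto " + true_label)
        emit("goto " + false_label)
    elif kind == "or":
        E2_start = newlabel()              # E1 false -> go evaluate E2
        gen_bool(E[1], true_label, E2_start)
        emit(E2_start + ":")
        gen_bool(E[2], true_label, false_label)
    elif kind == "and":
        E2_start = newlabel()              # E1 true -> go evaluate E2
        gen_bool(E[1], E2_start, false_label)
        emit(E2_start + ":")
        gen_bool(E[2], true_label, false_label)
    elif kind == "not":
        gen_bool(E[1], false_label, true_label)   # swap the exits

For example, gen_bool(("or", ("rel", "a < b"), ("rel", "c < d")), "Ltrue", "Lfalse") emits the conditional jump for a < b, a jump to the code for c < d, and then the jumps for c < d itself.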

A syntax-directed definition that generates three-address code for Boolean expressions in this manner is shown in Fig. 8.24. Note that the true and false attributes are inherited.

Example 8.4. Let us again consider the expression

    a < b or c < d and e < f

Example 8.5. Consider the statement

    while a < b do
        if c < d then
            x := y + z
        else
            x := y – z

The syntax-directed definition above, coupled with schemes for assignment statements and Boolean expressions, would produce the following code:

    L1:    if a < b goto L2
           goto Lnext
    L2:    if c < d goto L3
           goto L4
    L3:    t1 := y + z
           x := t1
           goto L1
    L4:    t2 := y – z
           x := t2
           goto L1
    Lnext:

We note that the first two gotos can be eliminated by changing the directions of the tests. This type of local transformation can be done by the peephole optimization discussed in Chapter 9.

Case Statements

The "switch" or "case" statement is available in a variety of languages; even the Fortran computed and assigned goto's can be regarded as varieties of the switch statement. Our switch-statement syntax is shown in Fig. 8.26.

    switch expression
    begin
        case value: statement
        case value: statement
        ...
        case value: statement
        default: statement
    end

Fig. 8.26. Switch-statement syntax


There is a selector expression, which is to be evaluated, followed by n constant values that the expression might take, perhaps including a default "value," which always matches the expression if no other value does. The intended translation of a switch is code to:

1. Evaluate the expression.
2. Find which value in the list of cases is the same as the value of the expression. Recall that the default value matches the expression if none of the values explicitly mentioned in the cases does.
3. Execute the statement associated with the value found.

Step (2) is an n-way branch, which can be implemented in one of several ways. If the number of cases is not too great, say 10 at most, then it is reasonable to use a sequence of conditional goto's, each of which tests for an individual value and transfers to the code for the corresponding statement.

A more compact way to implement this sequence of conditional goto's is to create a table of pairs, each pair consisting of a value and a label for the code of the corresponding statement. Code is generated to place at the end of this table the value of the expression itself, paired with the label for the default statement. A simple loop can be generated by the compiler to compare the value of the expression with each value in the table, being assured that if no other match is found, the last (default) entry is sure to match.
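As an illustrative sketch only (the helper newlabel and the representation of statements as lists of strings are assumptions), the simpler sequence-of-conditional-goto's scheme described above might be generated like this:

labels_issued = 0
def newlabel():
    # fresh symbolic label L1, L2, ...
    global labels_issued
    labels_issued += 1
    return "L" + str(labels_issued)

def gen_switch(expr_place, cases, default_stmts):
    # cases is a list of (value, statements) pairs; statements are lists of
    # three-address strings. Emit one test per case value, then the labeled
    # branch bodies, each ending with a jump past the whole switch.
    code = []
    end_label = newlabel()
    branch_labels = [newlabel() for _ in cases]
    default_label = newlabel()
    for (value, _), label in zip(cases, branch_labels):
        code.append("if " + expr_place + " = " + str(value) + " goto " + label)
    code.append("goto " + default_label)
    for (_, stmts), label in zip(cases, branch_labels):
        code.append(label + ":")
        code += stmts + ["goto " + end_label]
    code.append(default_label + ":")
    code += default_stmts
    code.append(end_label + ":")
    return code

For instance, gen_switch("t", [(1, ["x := 1"]), (2, ["x := 2"])], ["x := 0"]) produces the test chain followed by the labeled branches and the default code.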
Input to the Code Generator

The input to the code generator consists of the intermediate representation of the source program produced by the front end, together with information in the symbol table that is used to determine the run-time addresses of the data objects denoted by the names in the intermediate representation. As we noted in the previous chapter, there are several choices for the intermediate language, including: linear representations such as postfix notation, three-address representations such as quadruples, virtual machine representations such as stack machine code, and graphical representations such as syntax trees and dags. Although the algorithms in this chapter are couched in terms of three-address code, trees, and dags, many of the techniques also apply to the other intermediate representations.

Target Programs

The output of the code generator is the target program. Like the intermediate code, this output may take on a variety of forms: absolute machine language, relocatable machine language, or assembly language.

Producing an absolute machine-language program as output has the advantage that it can be placed in a fixed location in memory and immediately executed. A small program can be compiled and executed quickly. A number of "student-job" compilers, such as WATFIV and PL/C, produce absolute code.

Producing a relocatable machine-language program (object module) as output allows subprograms to be compiled separately. A set of relocatable object modules can be linked together and loaded for execution by a linking loader. Although we must pay the added expense of linking and loading if we produce relocatable object modules, we gain a great deal of flexibility in being able to compile subroutines separately and to call other previously compiled programs from an object module. If the target machine does not handle relocation automatically, the compiler must provide explicit relocation information to the loader to link the separately compiled program segments.

Register Allocation

Instructions involving register operands are usually shorter and faster than those involving operands in memory. Therefore, efficient utilization of registers is particularly important in generating good code. The use of registers is often subdivided into two subproblems:

1. During register allocation, we select the set of variables that will reside in registers at a point in the program.
2. During a subsequent register assignment phase, we pick the specific register that a variable will reside in.

Finding an optimal assignment of registers to variables is difficult, even with single-register values. Mathematically, the problem is NP-complete.


The problem is further complicated because the hardware and/or the operating system of the target machine may require that certain register-usage conventions be observed. Certain machines require register pairs (an even and next odd-numbered register) for some operands and results. For example, in the IBM System/370 machines, integer multiplication and integer division involve register pairs. The multiplication instruction is of the form

    M    x, y

where x, the multiplicand, is the even register of an even/odd register pair. The multiplicand value is taken from the odd register of the pair. The multiplier y is a single register. The product occupies the entire even/odd register pair.
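As a purely illustrative sketch (not tied to any real instruction set or allocator), a code generator might enforce such an even/odd pair convention as follows:

def alloc_multiply_pair(free_registers):
    # pick an even register r such that both r and r+1 are free; the
    # multiplicand goes in the odd register r+1 and the product will
    # occupy the whole pair (r, r+1)
    for r in sorted(free_registers):
        if r % 2 == 0 and (r + 1) in free_registers:
            return r, r + 1
    return None   # no suitable pair: the caller must spill something

# e.g., with registers 2, 3, and 5 free, the pair (2, 3) is chosen
print(alloc_multiply_pair({2, 3, 5}))   # -> (2, 3)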

REFERENCES:

- Aho, A., Sethi, R., Ullman, J., Compilers: Principles, Techniques and Tools (1988)
- Hopcroft, J., Ullman, J., Introduction to Automata Theory, Languages, and Computation (1979)
- Nyhoff, L., Leestma, S., Data Structures and Program Design in Pascal (1992)
- http://www.cs.rpi.edu/~moorthy/Courses/compiler/Lectures/lectures/lecture1/
