Regular Expression

  • July 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Regular Expression as PDF for free.

More details

  • Words: 1,443
  • Pages: 8
Regular Expression - important notations for specifying patterns - used to define sets of string - example: letter ( letter | digit )* where | = or is a regular expression for identifier Rules for defining regular expression over alphabet ε 1. ∈ is a regular expression that denotes {∈} 2. if a is a symbol in ε , then a is a regular expression that denotes {a} 3. let r and s are regular expressions denoting the language L(r) and L(s). Then a. ( r ) | ( s ) is a regular expression denoting L( r ) | L( s ) b. ( r ) ( s ) is a regular expression denoting L( r ) L( s ) c. ( r )* is a regular expression denoting (L( r ))* d. ( r ) is a regular expression denoting L( r ) Conventions for Regular Expression 1. Unary operator * has highest precedence and is left associative 2. Concatenation has second highest precedence and is lest associative 3. | has lowest precedence and is left associative So ( a ) | ( ( b )* ( c ) ) = a | b*c Either a single a or zero or more b`s followed by a single c Examples - let ε = {a,b} o a | b = {a,b} o (a | b )(a | b ) = {aa,ab,ba,bb} o a* = {∈, a,aa,aaa,…} Algebraic properties for Regular Expression - r | s = s | r → | is commutative - r | ( s | t) = ( r | s ) | t → is associative - ( rs ) t = r ( s t ) → concatenation is associative - r ( s | t ) = rs | rt → concatenation distributes over | - ∈ r = r or r ∈ = r → identity element - r* = ( r | ∈)*

-

r** = r* → * is idem potent

Regular Definition - For notational convenience, we give names to regular expressions - If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of a definitions of the form o D1 → r1 o D2 → r2 o … Dn → rn Where Di is a distinct name and ri is a regular expression over the symbol ∑ U { d1,d2,di-1} i.e. the basic symbols and the previously defined names By restricting each ri to symbols of ∑ and previously defined names, we an construct regular expression over ∑ for any ri by repeatedly replacing regular expression names by the expressions they denote. If ri used dj for some j≥I, then ri must be recursively defined, and the substitution process would not terminate - Example o Letter → A|B|….|Z|a|b|….|z o digit → 0|1|….|9 o Id → letter (letter | digit)* - Example o Regular definition for unsigned numbers o digit → 0|1|2|…|9 o digits → digit digit* o optional-fraction → . digits | ∈ o optional-exponent → ( E ( +| - | ∈ ) digits ) | ∈ o num → digits optional-fraction optional-exponent

Using Notational Shorthand -

(r)?=r|∈

-

r+ = r r*

-

r* = r+ | ∈

- using these shorthand we can rewrite the above definitions as -

digit → 0|1|….|9

-

digits → digit+

-

optional-fraction → ( . digits )?

-

optional-exponent → ( E ( + | - ) ? digits ) ?

-

num → digits optional-fraction optional-exponent

Recognition of Token - Takes place by implementing a stylized flowchart, called Transition Diagram - Transition Diagram o Depicts actions that take place when a Lexical Analyzer is called by the parser to get next token o Positions in a TD are drawn as circles, called States o States are connected by arrows, called Edges o Edges leaving state S have Labels indicating the i/p character that can next appear after the TD has reached state S o Start state is the initial state of the TD where control resides when we begin to recognize a token o E.g. start > 6v = 7

8

other o o o o o o o o o

*

The above TD works as follows Start state is state Ø In state Ø, we read the next i/p character The edge labeled > from state Ø is to be folloed to state 6 if this i/p character is >. The double circle on 7 indicates that it is an accepting state , a state in which the token >= has been found In case of > only, reading of an extra character will lead to state 8 , where token for > will be recognized and the FP will be retracted (sumbol indicates retraction in TD) If failure occurs in one TD , another is activated by pinting FP to start Lexical Error is generated only when all TD’s Fails example start < = Ø 1 2 return (relop, LE)

=

5

>

6

>

3

return (relop, NE)

other

4

* return (relop, LT )

return ( relop, EQ ) =

other

7

8

return (relop, GE ) * return (relop, GT )

- The lexical Analyzer return tokens and attribute values, using the translation table Regular Expression Patterns for Token Regular Exp

Token

Attribute

ws If then else Id

--If Then Else id

num

num

< <= >

relop relop relop

--------Pointer to symbol Table entry Pointer to ST entry LT LE GT

>= <> =

relop relop relop

GE NE EQ

Regular Definition for White Spaces Delim → blank | tab | newline Ws

→ delim +

RD for operators Relop → < | > | <= | >= | <> | =

Recognition of Keywords - there are two approaches for the recognition of keywords\ - Define a separate TD of keywords. If a letter is encountered , first check it in the TD of keywords. If a match is not found then analyze the string in TD of of identifier * Start

S o

B

1

E

G

2

E

3

I

N

7

N

4

D

8

6

ws/{

5

* ws/:

9

1 0 *

L

1 1

S

1 2 1 5

I

T

E

1 8

H

1 9

E

1 3 1 6

F

2N 0

N

2 1

ws

1 4

ws/(

1 7

ws

2 2

Note : All final States (6,10,14,17,22) have * like one shown on state 6. the symbol is called retraction symbol which indicates that pointer moves back to previous sate R.E for Keywords Keyword → BEGIN | END | IF | THEN | ELSE Disadvantage of using this techniques to identify keywords o The number of states will be increased which increases the coding and handling of pgm also o Incase of addition in keywords , TD will have to be changed

- The second approach is to initialize all keywords in symbol table and accept the string in the TD of Identifiers and keywords combined then match the string in symbols table for keywords .if match is found generate token or that keyword otherwise generate token for id TD for Keywords and Identifier combined Start

9

letter

Letter or digit other 1 1 0

1

* Return(gettoken(),installid())

- Keywords are sequence of letter and id is sequence of letter and digits - Keywords are stored in symbol table - 2 tasks are performed o if lexeme is matched against the pattern then 1st task is to look for the keyword in Symbol Table that matches and if so generate corresponding token and if not then generate id token. Gettoken() is used to here for this purpose o if id token is generated then store that lexeme in symbol table and return pointer to that entry or if it is already stored than return pointer to that entry. Return 0 if keyword is identified. Installid () is used for this purpose

Advantage of using this technique - In case of additional keywords, TD will not be changed only symbol Table will be initialized with new keywords TD for unsigned number digit * S

1 2

digit 1 3

.

digit 1 4

digit 1 5

E

dg 1 6

+/-

1 7

digit

1 8

ot

1 9

Install-num() E

digit

digit 2 0

S

digit

.

2 1

digit

2 2

digit digit

2 5

S

2 6

other

2 3

digit other

* 2 7 Install_num()

TD for white space Delim

* S

2 8

delim

2 9

other

3 0

Nothing is returned when accepting condition is reached

* 2 4

Install_num()

Related Documents