http://lambda.uta.edu/cse5317/notes/node9.html
Deterministic finite-state machine From Wikipedia, the free encyclopedia
Jump to: navigation, search In the theory of computation, a deterministic finite state machine (also known as deterministic finite state automaton (DFSA) or deterministic finite automaton (DFA) ) is a finite state machine where for each pair of state and input symbol there is one and only one transition to a next state. DFAs recognize the set of regular languages and no other languages. A DFA will take in a string of input symbols. For each input symbol it will then transition to a state given by following a transition function. When the last input symbol has been received it will either accept or reject the string depending on whether the DFA is in an accepting state or a non-accepting state.
Advantages and disadvantages DFAs are one of the most practical models of computation, since there is a trivial linear time, constant-space, online algorithm to simulate a DFA on a stream of input. Given two DFAs there are efficient algorithms to find a DFA recognizing: • • •
the union of the two DFAs the intersection of the two DFAs complements of the languages the DFAs recognize
Because DFAs can be reduced to a canonical form (minimal DFAs), there are also efficient algorithms to determine: • • • •
whether a DFA accepts any strings whether a DFA accepts all strings whether two DFAs recognize the same language the DFA with a minimum number of states for a particular regular language
DFAs are equivalent in computing power to nondeterministic finite automata. On the other hand, Finite State Automata are of strictly limited power in the languages they can recognize — many simple languages, including any problem that requires more than constant space to solve, cannot be recognized by a DFA. The classical example of a simply described language that no DFA can recognize is bracket language, that is, language that consists of properly paired brackets, such as (()()). More formally the language consisting of strings of the form anbn — some finite number of a's, followed by an equal number of b's. If there is no limit to recursion (i.e., you can always embed another pair of brackets inside) it would require an infinite amount of states to recognize
http://en.wikipedia.org/wiki/Deterministic_finitestate_machine#References
Converting a Regular Expression into a Deterministic Finite Automaton The task of a scanner generator, such as flex, is to generate the transition tables or to synthesize the scanner program given a scanner specification (in the form of a set of REs). So it needs to convert a RE into a DFA. This is accomplished in two steps: first it converts a RE into a non-deterministic finite automaton (NFA) and then it converts the NFA into a DFA. A NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over . The first type indicates that, when reading the common character associated with these transitions, we have more than one choice; the NFA succeeds if at least one of these choices succeeds. The transition doesn't consume any input characters, so you may jump to another state for free. Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same expressive power. The problem is that when converting a NFA to a DFA we may get an exponential blowup in the number of states. We will first learn how to convert a RE into a NFA. This is the easy part. There are only 5 rules, one for each type of RE:
As it can been shown inductively, the above rules construct NFAs with only one final state. For example, the third rule indicates that, to construct the NFA for the RE AB, we construct the NFAs for A and B, which are represented as two boxes with one start state
and one final state for each box. Then the NFA for AB is constructed by connecting the final state of A to the start state of B using an empty transition. For example, the RE (a| b)c is mapped to the following NFA:
The next step is to convert a NFA to a DFA (called subset construction). Suppose that you assign a number to each NFA state. The DFA states generated by subset construction have sets of numbers, instead of just one number. For example, a DFA state may have been assigned the set {5, 6, 8}. This indicates that arriving to the state labeled {5, 6, 8} in the DFA is the same as arriving to the state 5, the state 6, or the state 8 in the NFA when parsing the same input. (Recall that a particular input sequence when parsed by a DFA, leads to a unique state, while when parsed by a NFA it may lead to multiple states.) First we need to handle transitions that lead to other states for free (without consuming any input). These are the transitions. We define the closure of a NFA node as the set of all the nodes reachable by this node using zero, one, or more transitions. For example, The closure of node 1 in the left figure below
is the set {1, 2}. The start state of the constructed DFA is labeled by the closure of the NFA start state. For every DFA state labeled by some set {s1,..., sn} and for every character c in the language alphabet, you find all the states reachable by s1, s2, ..., or sn using c arrows and you union together the closures of these nodes. If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label. For example, node {1, 2} in the DFA above has an arrow to a {3, 4, 5} for the character a since the NFA node 3 can be reached by 1 on a and nodes 4 and 5 can be reached by 2. The b arrow for node {1, 2} goes to the error node which is associated with an empty set of NFA nodes. The following NFA recognizes (a| b)*(abb | a+b), even though it wasn't constructed with the above RE-to-NFA rules. It has the following DFA:
2.2 Deterministic Finite Automata (DFAs) A DFA represents a finite state machine that recognizes a RE. For example, the following DFA:
recognizes (abc+)+. A finite automaton consists of a finite set of states, a set of transitions (moves), one start state, and a set of final states (accepting states). In addition, a DFA has a unique transition for every state-character combination. For example, the previous figure has 4 states, state 1 is the start state, and state 4 is the only final state. A DFA accepts a string if starting from the start state and moving from state to state, each time following the arrow that corresponds the current input character, it reaches a final state when the entire input string is consumed. Otherwise, it rejects the string. The previous figure represents a DFA even though it is not complete (ie, not all statecharacter transitions have been drawn). The complete DFA is:
but it is very common to ignore state 0 (called the error state) since it is implied. (The arrows with two or more characters indicate transitions in case of any of these characters.) The error state serves as a black hole, which doesn't let you escape.
A DFA is represented by a transition table T, which gives the next state T[s, c] for a state s and a character c. For example, the T for the DFA above is: a b c 0 0 0 0 1 2 0 0 2 0 3 0 3 0 0 4 4 2 0 4 Suppose that we want to build a scanner for the REs: for - keyword = for identifier = [a - z][a - z0 - 9]* The corresponding DFA has 4 final states: one to accept the for-keyword and 3 to accept an identifier:
(the error state is omitted again). Notice that for each state and for each character, there is a single transition
Converting Deterministic Finite Automata to Regular Expressions Christoph Neumann Mar 16, 2005 Abstract This paper explores three techniques for converting deterministic _nite automata (DFA) to regular expressions and compares the usefulness of each technique. The techniques examined are the transitive closure method, the state removal method, and the Brzozowski algebraic method.
1 Background Kleene's seminal article de_nes regular expressions and their relationship to _nite automata [7]. Kleene proves the equivalence of _nite automata and regular expressions thereby providing us with the _rst technique, the transitive closure method, for converting DFAs to regular expressions. Later Brzozowski expanded on Kleene's method by introducing the notion of derivatives of regular expressions [3], but his paper passed into obscurity until G. Berry and R. Sethi brought Brzozowski's paper to the forefront in [2]1. The state removal method appears in [4] but Linz presents a more straightforward method in [8].
2 De_nitions We will using the Moore model [9] for _nite automata. Given an automaton M with an input alphabet Ak = f0; 1; _ _ _ ; k 1g, M has m states Ms = fq1; q2; _ _ _ ; qmg where q_ is the starting state of M and Mf = fqf1 ; qf2 ; _ _ _ ; qfng are the n _nal states of M where n _ m. For clarity, we will use letters to represent each value in the alphabet A instead of using the numeric representation (a = 0; b = 1; etc), and for convenience we will assume q1 is the starting state unless otherwise noted. We will use Kleene's de_nition [7] of regular expressions. Regular expressions are de_ned recursively as: 1. The symbols 0; 1; _ _ _ ; k 1; _; and _ are regular expressions, where _ is the empty string and _ is the empty set. 1However,
e_ciently converting regular expressions to automata is the focus of the G. Berry and R. Sethi paper, not converting DFAs to regular expressions
1
2. Set Union: Given the regular expressions x and y, the union of x and y, expressed as x + y, is a regular expression. 3. Concatenation: Given the regular expressions x and y, the concatenation (or product) of x and y, expressed as xy is a regular expression. 4. Iteration: Given the regular expression x, the iteration (or star) of x, expressed as x_, is a regular expression. 5. Given a regular expression x, (x) is a regular expression. 6. All regular expressions can be constructed by a _nite application of rules 1-5. The regular expressions in rule 1 are terminals. The concatenation of terminals is a string. For a given regular expression x, there is a set X which contains all the strings represented by x. We will use x and X interchangeably. A single string is simply represented by the set containing that single string. We will also refer to and use the following identities: (ab)c = a(bc) = abc _x = x_ = x _x = x_ = _ _+x=x _ + x_ = x_ (_ + x)_ = x_ Since + is commutative, all the commutative versions apply.
3 Transitive Closure Method Suppose the given DFA M is to be represented as a regular expression.
q1 q2 q3 q4 bca Figure 1: A very simple automaton.
Consider the automaton in _gure 1. The input for edge in the automaton is a regular expression. Quite simply, the regular expression for the transition from q1 to q2 is b, the transition from q2 to q3 is c and so on. Furthermore, the regular expression representing the transition from q1 to q3 is the concatenation of the regular expressions thus forming bc. Thusly, we can _nd the regular expression for the automaton to be bca since that expression is the concatenation of all of the transitions from the starting state q1 to the _nal state q4. More generally, for a path from q_ to qf , the concatenation of the regular
expression for each transition in the path forms a regular expression that represents the same string as the path from q_ to qf in the automaton. Supposing there exists only one unique path in automaton M from
2
q_ to qf , there exists only one regular expression R such that R represents the same string as the DFA M. However, this is a trivial automaton, let us examine how to expand this to a more general case.
q1 q2 b ab Figure 2: A more complicated automaton.
Now consider the DFA in _gure 2. It is clear that multiple paths exist from q1 to q2. We cannot derive a simple regular expression to represent the DFA, however using the other operators (union and iteration) we can build on our previous approach to create a construction that works for all types of DFA. We will forgo the proof and leave it to Kleene [7]. The proof is also demonstrated by Hopcroft and Ullman [5] and explained quite clearly by Salomaa [10]. Suppose regular expression Rij represents the set of all strings that transition the automaton M from qi to qj . Furthermore, suppose Rk ij
represents the set of all strings that transition the automaton M from qi to qj without passing through any state higher than qk. We can construct Rij by successively constructing R1 ij ;R2 ij ; _ _ _ ;Rm ij . Rk ij is recursively de_ned as: Rk ij = Rk 1 ik (Rk 1 kk )_Rk 1 kj + Rk 1 ij
assuming we have initialized R0 ij to be: R0 ij =8< : r if i 6= j and r transitions M from qi to qj r + _ if i = j and r transitions M from qi to qj _ otherwise As we can see, this successive construction builds up regular expressions until we have Rij . We can then construct a regular expression representing M as the union of all R_f where q_ is the starting state and f 2 Mf (the _nal states for M). This technique is similar in nature to the all-pairs shortest path problem. The only di_erence being that we are taking the union and concatenation of regular expressions instead of summing up distances. This solution is of the same form as transitive closure and belongs to the constellation of problems associated with closed semirings. The chief problem of the transitive closure approach is that it creates very large regular expressions. Examining the formula for an Rk ij , it is clear the signi_cant length is due to the repeated union of concatenated terms. Even by using the previous identities, we still have long expressions.
4 State Removal Method The state removal approach identi_es patterns within the graph and removes states, building up regular expressions along each transition. The
3
advantage of this technique over the transitive closure method is that it is easier to visualize. This technique is described by Du and Ko [4], but
we examine a much simpler approach is given by Linz [8]. First, any multi-edges are uni_ed into a single edge that contains the union of inputs. Suppose from q2 to q5 there is an edge a and an edge b, those would be uni_ed into one edge from q2 to q5 that has the value a + b. Now, consider a subgraph of the automaton M which is of the form given in _gure 3. State q may be removed and the automaton may be reduced to the form in _gure 4. The pattern may still be applied if edges are missing. For an edge that is missing, leave out the corresponding edge in _gure 4. This process repeats until the automaton is of the form in _gure 5. Then by direct calculation, the regular expression is: r = r1 _ r2(r4 + r3r1 _ r2) _ :
qi q qj ab dc e Figure 3: Desired pattern for state removal. qi qj ae_b ce_d ae_d ce_d Figure 4: Results after state removal. q1 qf r2 r3 r1 r4 Figure 5: Final form. 4
5 Brzozowski Algebraic Method Brzozowski method [3]2 takes a unique approach to generating regular expressions. We create a system of regular expressions with one regular expression unknown for each state in M, and then we solve the system for R_ where R_ is the regular expression associated with starting state q_. These equations are the characteristic equations of M. Constructing the characteristic equations is straightforward. For each state qi in M, the equation for Ri is a union of terms. Each term can be constructed like so: for a transition a from qi to qj , the term is aRj . If Ri is a _nal state, _ is also one of the terms. This leads to a system of equations in the form: R1 = a1R1 + a2R2 + _ _ _ R2 = a1R1 + a2R2 + _ _ _ R3 = a1R1 + a2R2 + _ _ _ + _ ... = ... Rm = a1R1 + a2R2 + _ _ _ + _ where ax = _ if there is no transition from Ri to Rj . The system can be solved via straightforward substitution, except when an unknown appears on both the right and left hand side of the equation. This situation occurs when there is a self loop for state qi. Arden's theorem [1] is the key to solving these situations. The theorem is as follows: Given an equation of the form X = AX + B where _ =2 A, the equation has the solution X = A_B. We use this equation to isolate Ri on the left hand size and successively substitute Ri into the another equation. We repeat the process until we have found R_ with no unknowns on the right hand side. For example, consider again the automaton in _gure 2. The characteristic equations are as follows (where R_ = R1): R1 = aR1 + bR2
R2 = bR2 + _ We solve for R2 using Arden's theorem and the previously mentioned identities: R2 = bR2 + _ = b__ = b_ We substitute into R1 and solve: R1 = aR1 + b(b_) = aR1 + bb_ = a_(bb_) = a_bb_ Thus, the regular expression for the automaton in _gure 2 is a_bb_. 2Kain
5
[6] explains this method in more detail and gives illustrative examples.
6 Conclusions The state removal approach seems useful for determining regular expressions via manual inspection, but is not as straightforward to implement as the transitive closure approach and the algebraic approach. The transitive closure approach gives a clear and simple implementation, but tends to create very long regular expressions. The algebraic approach is elegant, leans toward a recursive approach, and generates reasonably compact regular expressions. Brzozowski's method is particularly suited for recursion oriented languages, such as functional languages, where the transitive closure approach would be cumbersome to implement.