Hidden Markov Model (HMM)
Pengyu Hong, COSI 123a Statistical Machine Learning
Markov Model

[Chain diagram: q_{t-k} → ••• → q_{t-2} → q_{t-1} → q_t]

1st-order Markov model: q_t represents the state at time t.
P(q_t | q_{t-1}, q_{t-2}, …) = P(q_t | q_{t-1})
P(q_t, q_{t-1}, q_{t-2}, …) = P(q_t | q_{t-1}) P(q_{t-1} | q_{t-2}) ⋯
Markov Model

[Chain diagram: q_{t-k} → ••• → q_{t-2} → q_{t-1} → q_t, with q_t depending on its two predecessors]

2nd-order Markov model:
P(q_t | q_{t-1}, q_{t-2}, …) = P(q_t | q_{t-1}, q_{t-2})
Markov Model

A 1st-order Markov model for a weather predictor.

[State diagram: three states (rainy, cloudy, sunny) whose arrows carry the transition probabilities below]

State-transition probability P(q_t | q_{t-1}):

  q_{t-1} \ q_t   rainy   cloudy   sunny
  rainy            0.4     0.3      0.3
  cloudy           0.2     0.6      0.2
  sunny            0.1     0.1      0.8
Markov Model

[Same three-state weather diagram as the previous slide]

P(rainy → rainy → sunny → cloudy → sunny)?  (the first rainy state is today)
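A worked answer, assuming "today" means we condition on already being in the rainy state (so no initial-state probability is needed):

P = P(rainy | rainy) · P(sunny | rainy) · P(cloudy | sunny) · P(sunny | cloudy)
  = 0.4 × 0.3 × 0.1 × 0.2 = 0.0024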
What is an HMM?

[Diagram: hidden chain q_{t-k} → ••• → q_{t-1} → q_t (hidden states); each q_t emits an observation, giving o_{t-k}, …, o_{t-1}, o_t (observed states)]

• Graphical model
• Circles indicate states
• Arrows indicate probabilistic dependencies between states
An Example

[Figure: a "grammar" over the vowels a, e, i, o, u as hidden states; the observed signal is each vowel plus noise variation]
An Example

[Figure: the vowel grammar generating several observed vowel sequences]

Variations: temporal, volume, noise, …
Applications of HMM

• Speech recognition – hidden: phonemes/words; observed: acoustic signals
• Handwriting recognition – hidden: words; observed: images
• Part-of-speech tagging – hidden: part-of-speech tags; observed: words
• Machine translation – hidden: words in the target language; observed: words in the source language
• ……

[Slide illustrations: an acoustic "i", a handwritten "3", the word "change" tagged as a noun, and the translation pair "learn" / "学习" (Chinese for "learn")]
Formulate Discrete HMM

[Diagram: hidden chain q_{t-k} → ••• → q_{t-1} → q_t emitting observations o_{t-k}, …, o_{t-1}, o_t]

λ = {A, B, π}  (time invariant)

q_t – the hidden state at time t; S = {1, 2, …, N} – all possible values of hidden states.
o_t – the observation at time t; V = {1, 2, …, M} – all possible values of observed states.
A = [a_ij] – state transition probabilities: a_ij = P(q_{t+1} = j | q_t = i), a_ij ≥ 0, 1 ≤ i, j ≤ N, Σ_j a_ij = 1.
B = [b_ik] – emission probabilities: b_ik = P(o_t = k | q_t = i), b_ik ≥ 0, 1 ≤ i ≤ N, 1 ≤ k ≤ M, Σ_k b_ik = 1.
π = {π_i} – initial state probabilities: π_i = P(q_1 = i), Σ_i π_i = 1.
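A minimal sketch of this parameterization in Python/NumPy, reusing the weather states from the earlier slide. The emission matrix B, the two observation symbols, and the uniform π are made-up assumptions for illustration; only A comes from the weather slide.

import numpy as np

# Hidden states S = {rainy, cloudy, sunny}; observed symbols V = {umbrella, no umbrella}.
A = np.array([[0.4, 0.3, 0.3],   # a_ij = P(q_{t+1} = j | q_t = i)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.9, 0.1],        # b_ik = P(o_t = k | q_t = i); invented for illustration
              [0.4, 0.6],
              [0.1, 0.9]])
pi = np.array([1/3, 1/3, 1/3])   # pi_i = P(q_1 = i); assumed uniform here

# Every row of A and B, and pi itself, must sum to 1.
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)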
Three Basic Problems for HMMs

[Diagram: q_1 → ••• → q_{T-1} → q_T emitting o_1, …, o_{T-1}, o_T; λ = {A, B, π}]

1. Given the observation sequence O = o_1 o_2 … o_T and the model λ, how to efficiently compute P(O | λ)?
Three Basic Problems for HMMs

[Same diagram as above]

2. Given the observation sequence O = o_1 o_2 … o_T and the model λ, how to compute the hidden state sequence Q = q_1 q_2 … q_T that best "explains" the observations (i.e., the most likely hidden state sequence)?
Three Basic Problems for HMMs

[Same diagram as above]

3. How to adjust the model parameters λ = {A, B, π} to maximize P(O | λ), i.e., to fit the model to the data as well as possible?
Problem 1: Probability of an Observation Sequence

[Diagram: q_1 → ••• → q_{T-1} → q_T emitting o_1, …, o_{T-1}, o_T; λ = {A, B, π}]

Given O = o_1 o_2 … o_T and λ, compute P(O | λ).

P(O | λ) = Σ_Q P(O, Q | λ) = Σ_Q P(O | Q, λ) P(Q | λ)
P(O | Q, λ) = ∏_{t=1}^{T} P(o_t | q_t, λ)
P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ⋯ a_{q_{T-1} q_T}

However, there are N^T possible hidden state sequences! There is an efficient way – dynamic programming.
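For intuition, here is a brute-force sketch of this sum (assuming the A, B, pi arrays defined earlier, with observations given as 0-based column indices into B). It is exponential in T and only meant to make the N^T blow-up concrete:

from itertools import product
import numpy as np

def prob_obs_brute_force(obs, A, B, pi):
    """P(O | lambda) by summing P(O, Q | lambda) over all N^T hidden sequences."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for Q in product(range(N), repeat=T):           # every hidden state sequence
        p = pi[Q[0]] * B[Q[0], obs[0]]              # pi_{q1} * b_{q1, o1}
        for t in range(1, T):
            p *= A[Q[t-1], Q[t]] * B[Q[t], obs[t]]  # a_{q_{t-1} q_t} * b_{q_t, o_t}
        total += p
    return total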
The Forward Procedure

[Diagram: q_1 → ••• → q_{T-1} → q_T emitting o_1, …, o_{T-1}, o_T; λ = {A, B, π}]

P(o_1 … o_T | λ) = Σ_{q_T = S_1}^{S_N} P(o_1 … o_T, q_T | λ)

P(o_1 … o_T, q_T | λ)
  = P(o_1 … o_T | q_T, λ) P(q_T | λ)
  = P(o_1 … o_{T-1} | q_T, λ) P(o_T | q_T, λ) P(q_T | λ)
  = P(o_1 … o_{T-1}, q_T | λ) P(o_T | q_T, λ)
  = Σ_{q_{T-1} = S_1}^{S_N} P(o_1 … o_{T-1}, q_{T-1}, q_T | λ) P(o_T | q_T, λ)
  = Σ_{q_{T-1} = S_1}^{S_N} P(o_1 … o_{T-1}, q_T | q_{T-1}, λ) P(q_{T-1} | λ) P(o_T | q_T, λ)
  = Σ_{q_{T-1} = S_1}^{S_N} P(o_1 … o_{T-1} | q_{T-1}, q_T, λ) P(q_T | q_{T-1}, λ) P(q_{T-1} | λ) P(o_T | q_T, λ)
  = Σ_{q_{T-1} = S_1}^{S_N} P(o_1 … o_{T-1} | q_{T-1}, λ) P(q_{T-1} | λ) P(q_T | q_{T-1}, λ) P(o_T | q_T, λ)
  = Σ_{q_{T-1} = S_1}^{S_N} P(o_1 … o_{T-1}, q_{T-1} | λ) P(q_T | q_{T-1}, λ) P(o_T | q_T, λ)
The Forward Procedure

[Same diagram; λ = {A, B, π}]

P(o_1 … o_T | λ) = Σ_{q_T = S_1}^{S_N} P(o_1 … o_T, q_T | λ)

P(o_1 … o_T, q_T | λ) = Σ_{q_{T-1} = S_1}^{S_N} P(o_1 … o_{T-1}, q_{T-1} | λ) P(q_T | q_{T-1}, λ) P(o_T | q_T, λ)

Define the forward variable
  α_t(i) = P(o_1 … o_t, q_t = i | λ) = [Σ_{j=1}^{N} α_{t-1}(j) a_ji] b_{i,o_t}

Recall: A = [a_ij] – state transition probabilities, a_ij = P(q_{t+1} = j | q_t = i); B = [b_ik] – emission probabilities, b_ik = P(o_t = k | q_t = i).
The Forward Procedure

[Same diagram; λ = {A, B, π}]

P(o_1 … o_T | λ) = Σ_{q_T = S_1}^{S_N} P(o_1 … o_T, q_T | λ)
α_t(i) = P(o_1 … o_t, q_t = i | λ) = [Σ_{j=1}^{N} α_{t-1}(j) a_ji] b_{i,o_t}

(1) Initialization:  α_1(i) = π_i b_{i,o_1},  1 ≤ i ≤ N
(2) Induction:  α_t(i) = [Σ_{j=1}^{N} α_{t-1}(j) a_ji] b_{i,o_t},  2 ≤ t ≤ T,  1 ≤ i ≤ N
(3) Termination:  P(O | λ) = Σ_{i=1}^{N} α_T(i)
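A minimal implementation sketch of the forward procedure (assuming the A, B, pi arrays and 0-based observation encoding from the earlier snippets):

import numpy as np

def forward(obs, A, B, pi):
    """Forward procedure: returns alpha (T x N) and P(O | lambda) in O(N^2 T) time."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # (1) alpha_1(i) = pi_i * b_{i,o_1}
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]  # (2) [sum_j alpha_{t-1}(j) a_ji] * b_{i,o_t}
    return alpha, alpha[-1].sum()                   # (3) P(O | lambda) = sum_i alpha_T(i)

On tiny inputs this should agree with prob_obs_brute_force above, at O(N^2 T) cost instead of exponential time.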
The Forward Procedure

P(O | λ) = Σ_{i=1}^{N} α_T(i),  α_t(i) = [Σ_{j=1}^{N} α_{t-1}(j) a_ji] b_{i,o_t},  λ = {A, B, π}

[Trellis figure: columns α_1(i), α_2(i), …, α_T(i) over states S_1, S_2, …, S_N; π feeds the first column, A = [a_ij] connects adjacent columns, B = [b_ik] ties each column to its observation o_1, o_2, …, o_T; summing (Σ) the final column gives P(O | λ)]
The Backward Procedure

Analogous to the forward variable, define a backward variable
  β_t(i) = P(o_{t+1} … o_T | q_t = i, λ)
– the probability of the partial observation sequence o_{t+1} … o_T given state i at time t.

[Diagram: q_1 → ••• → q_t → q_{t+1} → ••• → q_T emitting o_1, …, o_t, o_{t+1}, …, o_T]

(1) Initialization:  β_T(i) = 1,  1 ≤ i ≤ N
(2) Induction:  β_t(i) = Σ_{j=1}^{N} a_ij b_{j,o_{t+1}} β_{t+1}(j),  1 ≤ t ≤ T−1,  1 ≤ i ≤ N
(3) Termination:  P(O | λ) = Σ_{i=1}^{N} π_i b_{i,o_1} β_1(i)
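A matching sketch of the backward procedure (same assumed arrays and observation encoding):

import numpy as np

def backward(obs, A, B, pi):
    """Backward procedure: returns beta (T x N) and P(O | lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # (1) beta_T(i) = 1
    for t in range(T - 2, -1, -1):                   # (2) beta_t(i) = sum_j a_ij b_{j,o_{t+1}} beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    return beta, (pi * B[:, obs[0]] * beta[0]).sum() # (3) sum_i pi_i b_{i,o_1} beta_1(i)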
The Forward-Backward Procedure

P(O | λ) = Σ_{i=1}^{N} α_T(i)        P(O | λ) = Σ_{i=1}^{N} π_i b_{i,o_1} β_1(i)

[Diagram: q_1 → ••• → q_{t-1} → q_t → q_{t+1} → ••• → q_T emitting o_1, …, o_{t-1}, o_t, o_{t+1}, …, o_T]

α_t(i) = P(o_1 … o_t, q_t = i | λ)
β_t(i) = P(o_{t+1} … o_T | q_t = i, λ)

P(O | λ) = Σ_{i=1}^{N} α_t(i) β_t(i)   for any 1 ≤ t ≤ T
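A quick sanity check of this identity (a sketch assuming the forward/backward functions above and some observation sequence, e.g. obs = [0, 1, 1]):

alpha, p_fwd = forward(obs, A, B, pi)
beta, p_bwd = backward(obs, A, B, pi)
# alpha_t(i) * beta_t(i) = P(o_1 ... o_T, q_t = i | lambda), so each row sums to P(O | lambda).
for t in range(len(obs)):
    assert np.isclose((alpha[t] * beta[t]).sum(), p_fwd)
assert np.isclose(p_fwd, p_bwd)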
Problem 2: Optimal State Sequence

[Diagram: unknown hidden states ? → ••• → ? emitting o_1, …, o_{t-1}, o_t, o_{t+1}, …, o_T]

Q* = argmax_Q P(Q | O, λ) = argmax_Q P(Q, O | λ)

O = o_1 o_2 … o_T,  Q = q_1 q_2 … q_T
The Viterbi Algorithm

Q* = argmax_Q P(Q | O, λ) = argmax_Q P(Q, O | λ)

[Trellis figure: states S_1, …, S_j, …, S_N at each time step; π feeds the first column, A = [a_ij] connects adjacent columns, B = [b_ik] ties each column to its observation o_1, …, o_{T-1}, o_T. The goal is the best path: max over q_1, …, q_{T-1} of P(o_1, …, o_{T-1}, q_1, …, q_{T-1}, q_T = j, o_T | λ)]
The Viterbi Algorithm

Q* = argmax_Q P(Q | O, λ) = argmax_Q P(Q, O | λ)

[Same trellis figure]

The best-path probability decomposes recursively:
max_{q_1,…,q_{T-1}} P(o_1, …, o_{T-1}, q_1, …, q_{T-1}, q_T = j, o_T | λ)
  = [max_i ( max_{q_1,…,q_{T-2}} P(o_1, …, o_{T-1}, q_1, …, q_{T-2}, q_{T-1} = i | λ) ) a_ij] b_{j,o_T}
The Viterbi Algorithm

Q* = argmax_Q P(Q | O, λ) = argmax_Q P(Q, O | λ)

[Same trellis figure]

δ_{t+1}(j) – the highest probability of landing in state j at time t+1 after seeing the observations up to time t+1:

Define  δ_{t+1}(j) = max_{q_1,…,q_t} P(o_1, …, o_{t+1}, q_1, …, q_t, q_{t+1} = j | λ)

Then  δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_{j,o_{t+1}}
The Viterbi Algorithm

Q* = argmax_Q P(Q | O, λ) = argmax_Q P(Q, O | λ)

[Same trellis figure]

δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_{j,o_{t+1}}
ψ_{t+1}(j) = argmax_i δ_t(i) a_ij  – the predecessor of j on the best path (b_{j,o_{t+1}} does not depend on i, so it drops out of the argmax)
The Viterbi Algorithm

1. Initialization:
   δ_1(i) = π_i b_{i,o_1} and ψ_1(i) = 0,  1 ≤ i ≤ N
2. Recursion: for t = 1 to T−1
   δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_{j,o_{t+1}}
   ψ_{t+1}(j) = argmax_i δ_t(i) a_ij
   endfor
3. Termination:
   P* = max_{1 ≤ i ≤ N} δ_T(i)
   q*_T = argmax_{1 ≤ i ≤ N} δ_T(i)
4. State sequence backtracking: for t = T−1 down to 1
   q*_t = ψ_{t+1}(q*_{t+1})
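A minimal implementation sketch of the full algorithm (same assumed arrays and observation encoding as the earlier snippets):

import numpy as np

def viterbi(obs, A, B, pi):
    """Viterbi algorithm: the most likely hidden state sequence and its joint probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                  # 1. delta_1(i) = pi_i b_{i,o_1}
    for t in range(1, T):                         # 2. recursion
        scores = delta[t-1][:, None] * A          #    scores[i, j] = delta_t(i) a_ij
        psi[t] = scores.argmax(axis=0)            #    psi_{t+1}(j) = argmax_i delta_t(i) a_ij
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]              # 3. q*_T = argmax_i delta_T(i)
    for t in range(T - 1, 0, -1):                 # 4. backtracking: q*_t = psi_{t+1}(q*_{t+1})
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

In practice log probabilities are used to avoid underflow on long sequences; this sketch keeps the slide's raw-probability form.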
Problem 3: Learning Parameters of HMMs

[Diagram: hidden chain ••• emitting o_1, …, o_{t-1}, o_t, o_{t+1}, …, o_T]

λ* = argmax_λ P(O | λ)

Easy if the hidden states are known. Since they are not, this is a hidden-variable problem, solved with the EM algorithm.
The EM Algorithm – the forward-backward (or Baum-Welch) algorithm

Expectation step:  compute P(Q | O, λ^(n))
Maximization step:  λ^(n+1) = argmax_λ E[log p(O, Q | λ) | O, λ^(n)]

O – the observations {O_1, O_2, …, O_D}
Q – the corresponding hidden states {Q_1, Q_2, …, Q_D}
O_d = o_{d1} o_{d2} … o_{dT_d},  Q_d = q_{d1} q_{d2} … q_{dT_d}
If d ≠ e, T_d need not equal T_e (sequences may have different lengths).
Transition Probabilities

Given the observations and the current parameters λ^(n):
(a) The probability of traversing an arc from state i at time t to state j at time t+1:

  ξ_t^(n)(i, j) = P(q_t = i, q_{t+1} = j | O, λ^(n))
    = α_t^(n)(i) a_ij^(n) b_{j,o_{t+1}}^(n) β_{t+1}^(n)(j) / P(O | λ^(n))
    = α_t^(n)(i) a_ij^(n) b_{j,o_{t+1}}^(n) β_{t+1}^(n)(j) / Σ_{x=1}^{N} Σ_{y=1}^{N} α_t^(n)(x) a_xy^(n) b_{y,o_{t+1}}^(n) β_{t+1}^(n)(y)

[Trellis figure: α_t^(n)(i) flowing into state S_i at time t, the arc a_ij to S_j at time t+1, and β_{t+1}^(n)(j) flowing out, over observations o_{t-1}, o_t, o_{t+1}, o_{t+2}]
Transition Probabilities

Given the observations and the current parameters λ^(n):
(a) The probability of traversing an arc from state i at time t to state j at time t+1:
  ξ_t^(n)(i, j) = P(q_t = i, q_{t+1} = j | O, λ^(n))
(b) The probability of being in state i at time t:
  γ_t^(n)(i) = Σ_{j=1}^{N} ξ_t^(n)(i, j)

[Trellis figure: state S_i at time t with arcs ξ_t^(n)(i, j) to all states at time t+1]
Re-estimate Parameters

Given the observations and the current parameters λ^(n):
(a) the probability of traversing an arc: ξ_t^(n)(i, j)
(b) the probability of being in state i at time t: γ_t^(n)(i)

• The probability that i is a start state:
    π_i^(n+1) = γ_1^(n)(i)
• Transition probabilities:
    a_ij^(n+1) = Σ_{t=1}^{T-1} ξ_t^(n)(i, j) / Σ_{t=1}^{T-1} γ_t^(n)(i)
• Emission probabilities:
    b_ik^(n+1) = Σ_{t : o_t = k} γ_t^(n)(i) / Σ_{t'=1}^{T} γ_{t'}^(n)(i)
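A minimal sketch of one Baum-Welch re-estimation step for a single observation sequence, assuming the forward and backward functions sketched earlier (multiple sequences would sum the statistics over d = 1, …, D):

import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM (Baum-Welch) step: returns re-estimated (A, B, pi)."""
    N, T = A.shape[0], len(obs)
    alpha, p_obs = forward(obs, A, B, pi)
    beta, _ = backward(obs, A, B, pi)
    # (a) xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda^(n))
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t+1]] * beta[t+1] / p_obs
    # (b) gamma_t(i) = sum_j xi_t(i, j) for t < T; the last row comes from alpha_T * beta_T
    gamma = np.vstack([xi.sum(axis=2), alpha[-1:] * beta[-1:] / p_obs])
    pi_new = gamma[0]                                          # pi_i = gamma_1(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # sum_t xi / sum_t gamma
    obs = np.asarray(obs)
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):                                # numerator: sum over {t : o_t = k}
        B_new[:, k] = gamma[obs == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]                        # normalize by sum_t gamma_t(i)
    return A_new, B_new, pi_new

Iterating this step until P(O | λ) stops improving converges to a local maximum of the likelihood.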