EM (expectation maximization): semi-supervised learning for noisy channels / HMMs
1 hmm

[figure: an HMM trellis. hidden states s1, s2, s3 (time 1), s4, s5, s6 (time 2), and s7, s8, s9 (time 3) sit above the observed symbols o1, o2, o3; the arcs of one path through the trellis are labeled p(s1|start), p(s4|s1), p(s7|s4), and p(end|s7).]
What is the probability of a certain observation given a certain source state (emission, p(oi|si)), or of one source state given another (transition, p(sj|si))?[1] If we know those, we can make an educated guess about which sources generated what we've observed: find the max p(src|obs) via Bayes by calculating p(src), p(obs), and p(obs|src), or p(obs, src) directly.
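for a single symbol, that Bayesian guess is tiny; here's a minimal Python sketch, where the `p_src` and `p_obs_given_src` tables are hypothetical toy numbers, not anything from these notes:

```python
# pick argmax p(src|obs) via Bayes; the toy tables are hypothetical
p_src = {"s1": 0.5, "s2": 0.3, "s3": 0.2}     # prior p(src)
p_obs_given_src = {                            # emission p(obs|src)
    "s1": {"a": 0.9, "b": 0.1},
    "s2": {"a": 0.5, "b": 0.5},
    "s3": {"a": 0.1, "b": 0.9},
}

def best_source(obs):
    # p(src|obs) is proportional to p(obs, src) = p(src) * p(obs|src),
    # so the normalizer p(obs) can be skipped when we only need the argmax
    return max(p_src, key=lambda s: p_src[s] * p_obs_given_src[s][obs])

print(best_source("b"))   # -> "s3"
```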
2 generative model
With models that estimate the probability of the data, we can prefer those that make the data more likely:

p(~o, ~s) = p(s1|start) ... p(sn|sn−1) p(end|sn) p(o1|s1) ... p(on|sn)   (1)

e.g. p(o1, o2, o3, s1, s4, s7) = p(s1|start) p(s4|s1) p(s7|s4) p(end|s7) p(o1|s1) p(o2|s4) p(o3|s7)   (2)
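equation (2) is just a product of lookups; a sketch in Python, with hypothetical toy tables:

```python
# joint probability p(~o, ~s) of one hypothesized path (equation 2);
# the transition/emission numbers are hypothetical toy values
trans = {("start", "s1"): 0.4, ("s1", "s4"): 0.5,
         ("s4", "s7"): 0.5, ("s7", "end"): 0.6}
emit = {("s1", "o1"): 0.7, ("s4", "o2"): 0.6, ("s7", "o3"): 0.8}

def joint(path, obs):
    p = trans[("start", path[0])]          # p(s1|start)
    for prev, cur in zip(path, path[1:]):
        p *= trans[(prev, cur)]            # p(s_i|s_{i-1})
    p *= trans[(path[-1], "end")]          # p(end|sn)
    for s, o in zip(path, obs):
        p *= emit[(s, o)]                  # p(o_i|s_i)
    return p

print(joint(["s1", "s4", "s7"], ["o1", "o2", "o3"]))
```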
Summing out the hidden states gives the probability of the observations alone:

p(~o) = Σ_{~s ∈ S} p(s1|start) ... p(sn|sn−1) p(end|sn) p(o1|s1) ... p(on|sn)   (3)
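taken literally, equation (3) enumerates every possible state sequence; a brute-force sketch over a hypothetical two-state toy model:

```python
from itertools import product

# equation (3) by brute force: sum p(~o, ~s) over all |S|^n state paths;
# the toy model is hypothetical
states = ["a", "b"]
trans = {("start", "a"): 0.5, ("start", "b"): 0.5,
         ("a", "a"): 0.4, ("a", "b"): 0.4, ("a", "end"): 0.2,
         ("b", "a"): 0.3, ("b", "b"): 0.5, ("b", "end"): 0.2}
emit = {("a", "x"): 0.9, ("a", "y"): 0.1, ("b", "x"): 0.2, ("b", "y"): 0.8}

def p_obs(obs):
    total = 0.0
    for path in product(states, repeat=len(obs)):    # |S|^n paths!
        p = trans[("start", path[0])]
        for prev, cur in zip(path, path[1:]):
            p *= trans[(prev, cur)]
        p *= trans[(path[-1], "end")]
        for s, o in zip(path, obs):
            p *= emit[(s, o)]
        total += p
    return total

print(p_obs(["x", "y", "x"]))
```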
yikes, exponential paths to count! better do something dynamic! but first, what about estimating our parameters?
[1] if you want to trigram words, the preceding source state must encode both of the two preceding words; in HMMs, decisions are severely local.
3 fractional counting
Let's count what we can observe and estimate unobserved counts from our prior beliefs. Then we can update those beliefs with our new counts, count again, and repeat until there's not much more improvement (in fact, that often takes just a few iterations). Though I won't prove it here, this method is guaranteed to improve the likelihood of the data. Be sure to normalize wherever necessary!

• count(o1, o2, o3, s1, s4, s7) = p(s1|start) p(s4|s1) p(s7|s4) p(end|s7) p(o1|s1) p(o2|s4) p(o3|s7)
• count(s1) = count(s1) + p(s1|start) p(s4|s1) p(s7|s4) p(end|s7) p(o1|s1) p(o2|s4) p(o3|s7)

As we explore all possible explanations for the data, we increment the individual members of a joint observation by as much as we believe in that hypothesized explanation. Initially:

• count(o1, o2, o3, s1, s4, s7) = 1/3 ∗ 1/3 ∗ 1/3 ∗ 1 ∗ 1 ∗ 1 ∗ 1 = 1/27 (but if s4 = s1, we must normalize to p(o1|s1) = .5, p(o2|s1) = .5)
• count(s1, o1) = count(s1, o1) + 1/27
• count(s1) = count(s1) + 1/27
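putting the whole loop together: here's a minimal sketch of fractional counting by brute-force enumeration over a hypothetical two-state toy model (end-of-sequence transitions omitted for brevity; a real implementation would use the dynamic tricks of the next section):

```python
from itertools import product
from collections import defaultdict

# one EM round: believe in each explanation in proportion to its probability,
# count fractionally, then normalize the counts into new beliefs;
# the two-state model and observation sequence are hypothetical toys
states = ["a", "b"]
vocab = ["x", "y"]
obs = ["x", "y"]
trans = {(p, c): 1.0 / len(states) for p in ["start"] + states for c in states}
emit = {("a", "x"): 0.6, ("a", "y"): 0.4,     # slightly non-uniform start,
        ("b", "x"): 0.4, ("b", "y"): 0.6}     # so the updates have a direction

def path_prob(path):
    p = trans[("start", path[0])]
    for prev, cur in zip(path, path[1:]):
        p *= trans[(prev, cur)]
    for s, o in zip(path, obs):
        p *= emit[(s, o)]
    return p

def em_step():
    paths = list(product(states, repeat=len(obs)))
    total = sum(path_prob(p) for p in paths)      # p(~o) under current beliefs
    tc, ec = defaultdict(float), defaultdict(float)
    for path in paths:
        w = path_prob(path) / total               # belief in this explanation
        tc[("start", path[0])] += w
        for prev, cur in zip(path, path[1:]):
            tc[(prev, cur)] += w
        for s, o in zip(path, obs):
            ec[(s, o)] += w
    for prev in ["start"] + states:               # normalize wherever necessary!
        z = sum(tc[(prev, c)] for c in states) or 1.0
        for c in states:
            trans[(prev, c)] = tc[(prev, c)] / z
    for s in states:
        z = sum(ec[(s, o)] for o in vocab) or 1.0
        for o in vocab:
            emit[(s, o)] = ec[(s, o)] / z

for _ in range(5):
    em_step()          # each round can only improve (or hold) p(~o)
```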
4 dynamic efficiency
we need an efficient way to count all these paths (see Baum, Baker, Viterbi, etc. for the major developers of these methods).
4.1 forward-backward
Calculate the probability up to any given point as the sum of probabilities of all possible paths to that point, times the probability that point contributes:

f_i = p_i Σ_{f ∈ F_i} f   (4)

[figure: states s1, s2, s3 carry forward values f = p1, f = p2, f = p3; their arcs into s4 contribute f = p1 p(s4|s1), f = p2 p(s4|s2), f = p3 p(s4|s3), so at s4, which emits o4, f = p(o4|s4) Σ_{i=1..3} p_i p(s4|si).]

The probability of the observation is f_end.
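in code, equation (4) fills one cell per state per time step; a minimal sketch on the same hypothetical toy model as the enumeration sketch above, so the answers can be compared:

```python
# forward pass (equation 4): f[t][s] sums every path into state s at time t,
# so only Ns * No cells are ever filled; the toy model is hypothetical
states = ["a", "b"]
obs = ["x", "y", "x"]
trans = {("start", "a"): 0.5, ("start", "b"): 0.5,
         ("a", "a"): 0.4, ("a", "b"): 0.4, ("a", "end"): 0.2,
         ("b", "a"): 0.3, ("b", "b"): 0.5, ("b", "end"): 0.2}
emit = {("a", "x"): 0.9, ("a", "y"): 0.1, ("b", "x"): 0.2, ("b", "y"): 0.8}

f = [{s: trans[("start", s)] * emit[(s, obs[0])] for s in states}]
for t in range(1, len(obs)):
    f.append({s: emit[(s, obs[t])] *                               # p_i times
                 sum(f[t - 1][r] * trans[(r, s)] for r in states)  # the sum over F_i
              for s in states})
f_end = sum(f[-1][s] * trans[(s, "end")] for s in states)
print(f_end)     # p(~o): the same number the exponential enumeration gives
```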
Do the same thing backwards:

b_j = p_j Σ_{b ∈ B_j} b   (5)
b_start better equal f_end!

[figure: node s_i with forward value f_i connects via p(sj|si) to node s_j with backward value b_j; F_i collects the paths into s_i, B_j the paths out of s_j.]

count(si, sj) += f_i ∗ p(sj|si) ∗ b_j / f_end   (6)
rather than N_s^(N_o) paths, more like N_s ∗ N_o!
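the backward pass and the counts of equation (6), continuing the same hypothetical toy model (repeated here so the block runs on its own); following the convention above, both f and b include the emission probability at their own position:

```python
from collections import defaultdict

# backward pass (equation 5) and fractional counts (equation 6);
# the same hypothetical toy model as the forward sketch
states = ["a", "b"]
obs = ["x", "y", "x"]
trans = {("start", "a"): 0.5, ("start", "b"): 0.5,
         ("a", "a"): 0.4, ("a", "b"): 0.4, ("a", "end"): 0.2,
         ("b", "a"): 0.3, ("b", "b"): 0.5, ("b", "end"): 0.2}
emit = {("a", "x"): 0.9, ("a", "y"): 0.1, ("b", "x"): 0.2, ("b", "y"): 0.8}

def forward(obs):
    f = [{s: trans[("start", s)] * emit[(s, obs[0])] for s in states}]
    for t in range(1, len(obs)):
        f.append({s: emit[(s, obs[t])] *
                     sum(f[t - 1][r] * trans[(r, s)] for r in states)
                  for s in states})
    return f, sum(f[-1][s] * trans[(s, "end")] for s in states)

def backward(obs):
    b = [None] * len(obs)
    b[-1] = {s: emit[(s, obs[-1])] * trans[(s, "end")] for s in states}
    for t in range(len(obs) - 2, -1, -1):
        b[t] = {s: emit[(s, obs[t])] *
                   sum(trans[(s, r)] * b[t + 1][r] for r in states)
                for s in states}
    return b

f, f_end = forward(obs)
b = backward(obs)
b_start = sum(trans[("start", s)] * b[0][s] for s in states)
assert abs(b_start - f_end) < 1e-12        # b_start better equal f_end!

count = defaultdict(float)                 # equation (6), summed over time
for t in range(len(obs) - 1):
    for si in states:
        for sj in states:
            count[(si, sj)] += f[t][si] * trans[(si, sj)] * b[t + 1][sj] / f_end
print(dict(count))
```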
4.2 inside-outside
messier, but possible with trees. we can store a chart of all possible parses of a string of words:

[figure: a pyramid chart over w1 w2 w3. the bottom row holds the words w1, w2, w3; the middle row holds the two-word cells ca(w1, w2), ..., cz(w1, w2) and ca(w2, w3), ..., cz(w2, w3); the top cell holds ca(w1, w2, w3), ..., cz(w1, w2, w3), including analyses built from the cells below, e.g. cz(ca(w1, w2), w3).]
the chart is essentially a pyramid, with any particular block containing all possible classifications of the substring it sits on. possible classifications of (w1, w2, w3) are presumably affected by possible classifications of (w1, w2) and (w2, w3).

p(w1, ..., wn) = Σ p of all possible parse generations of (w1, ..., wn)   (7)
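filling the pyramid bottom-up computes equation (7) without enumerating parses one by one (these are the inside values of equation (9) below); a minimal sketch, where the tiny binary grammar is hypothetical:

```python
from collections import defaultdict

# inside chart: inside[(i, j)][c] = total probability of all ways of
# classifying wi..wj as c; grammar and sentence are hypothetical toys
binary = {("S", ("NP", "VP")): 1.0,          # mother -> daughters, with
          ("NP", ("DT", "NN")): 1.0,         # p(daughters|mother)
          ("VP", ("VB", "NP")): 0.5}
lexical = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "cat"): 0.5,
           ("VB", "saw"): 1.0, ("VP", "barked"): 0.5}
words = ["the", "dog", "barked"]
n = len(words)

inside = defaultdict(lambda: defaultdict(float))
for i, w in enumerate(words):                # bottom row: the words
    for (c, word), p in lexical.items():
        if word == w:
            inside[(i, i)][c] += p
for span in range(2, n + 1):                 # build the pyramid upward
    for i in range(n - span + 1):
        j = i + span - 1
        for k in range(i, j):                # every split point
            for (m, (d1, d2)), p in binary.items():
                inside[(i, j)][m] += (p * inside[(i, k)][d1]
                                        * inside[(k + 1, j)][d2])

print(inside[(0, n - 1)]["S"])               # equation (7): p(w1..wn)
```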
Fractional counts then work just as before:

count(daughters, mother) += I ∗ p(daughters|mother) ∗ O   (8)

I_{cij} = Σ p of ways of classifying wi ... wj as c   (9)

O_{cij} = Σ p of ways to the root which classify wi ... wj as c   (10)
as above, the sums are entirely local to the node they operate on: the probability of the path up to that point is just read off the relevant inside or outside value, the probability of whatever hypothesized operation is multiplied in, and that value is pushed along whatever paths said operation connects.
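and a sketch of that downward sweep: outside values start at the root and get pushed along each hypothesized rule application, collecting the counts of equation (8); the grammar is the same hypothetical toy, repeated so the block runs on its own (counts are normalized by p(w1..wn), as equation (6) was by f_end):

```python
from collections import defaultdict

# inside-outside on a hypothetical toy grammar: inside values climb the
# pyramid, outside values flow back down, rule applications collect counts
binary = {("S", ("NP", "VP")): 1.0,
          ("NP", ("DT", "NN")): 1.0,
          ("VP", ("VB", "NP")): 0.5}
lexical = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "cat"): 0.5,
           ("VB", "saw"): 1.0, ("VP", "barked"): 0.5}
words = ["the", "dog", "barked"]
n = len(words)

inside = defaultdict(lambda: defaultdict(float))       # equation (9)
for i, w in enumerate(words):
    for (c, word), p in lexical.items():
        if word == w:
            inside[(i, i)][c] += p
for span in range(2, n + 1):
    for i in range(n - span + 1):
        j = i + span - 1
        for k in range(i, j):
            for (m, (d1, d2)), p in binary.items():
                inside[(i, j)][m] += p * inside[(i, k)][d1] * inside[(k + 1, j)][d2]

outside = defaultdict(lambda: defaultdict(float))      # equation (10)
outside[(0, n - 1)]["S"] = 1.0                         # the root spans w1..wn
total = inside[(0, n - 1)]["S"]                        # p(w1..wn)
counts = defaultdict(float)
for span in range(n, 1, -1):                           # sweep down from the root
    for i in range(n - span + 1):
        j = i + span - 1
        for k in range(i, j):
            for (m, (d1, d2)), p in binary.items():
                o = outside[(i, j)][m]
                left, right = inside[(i, k)][d1], inside[(k + 1, j)][d2]
                if o * left * right == 0.0:
                    continue
                counts[(m, (d1, d2))] += o * p * left * right / total  # equation (8)
                outside[(i, k)][d1] += o * p * right   # push outside to daughters
                outside[(k + 1, j)][d2] += o * p * left

print(dict(counts))
```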