Maximum Likelihood
USC Linguistics
August 8, 2007
Ratnaparkhi (1998), p. 8 (slightly modified):

p(a|b) = \frac{\prod_j w_j^{f_j(a,b)}}{\sum_a \prod_j w_j^{f_j(a,b)}}    (1)
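A quick illustration of (1) as a minimal Python sketch; this is not part of the handout, and the function and argument names are invented. It computes p(a|b) for one fixed context b from a weight vector and the per-outcome feature vectors for that context:

from typing import Dict, List

def maxent_prob(weights: List[float],
                features: Dict[str, List[int]],
                a: str) -> float:
    """Equation (1): p(a|b) for one fixed context b.

    features[a'] holds the feature values f_j(a', b) for each candidate
    outcome a'; the context b is implicit in those values.
    """
    def score(outcome: str) -> float:
        s = 1.0
        for w, f in zip(weights, features[outcome]):
            s *= w ** f                     # product over j of w_j ^ f_j(a, b)
        return s
    z = sum(score(o) for o in features)     # normalizer: sum over outcomes a
    return score(a) / z

# Two outcomes, two features: uniform weights give the uniform distribution.
print(maxent_prob([1.0, 1.0], {"T": [1, 0], "F": [0, 1]}, "T"))   # 0.5

With every weight at 1 the outcomes tie, which is exactly the p_0 that the worked example below starts from.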
training: L(p) = \sum_{a,b} \tilde{p}(a,b) \log p(a|b)    (2)
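Equation (2) is the empirical average of log p(a|b). A small hedged illustration (the p-tilde and p values below are invented; only the formula comes from the handout):

import math

# Invented empirical distribution p~(a, b) and invented model values p(a|b).
p_tilde = {("T", "TT"): 0.25, ("F", "TF"): 0.25, ("F", "FT"): 0.25, ("F", "FF"): 0.25}
p_model = {("T", "TT"): 0.5,  ("F", "TF"): 0.5,  ("F", "FT"): 0.5,  ("F", "FF"): 0.5}

L = sum(p_tilde[ab] * math.log(p_model[ab]) for ab in p_tilde)   # equation (2)
print(L)   # -0.693..., i.e. -log 2 for this toy assignment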
Generalized Iterative Scaling (GIS) (Ratnaparkhi (1998), p. 14, citing Darroch and Ratcliff (1972)):

\sum_j f_j(a,b) = C    (3)

if necessary: C = \max_{a,b} \sum_j f_j(a,b) ;  f_{n+1}(a,b) = C - \sum_j f_j(a,b)    (4)
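The correction feature of (3)-(4) in code, a sketch with invented names (add_correction_feature is mine). The example feature vectors are the f1-f4 columns of the toy table further below, so C comes out as 2 and the appended column matches the f_c row there:

from typing import List

def add_correction_feature(feature_vectors: List[List[int]]) -> List[List[int]]:
    """Append f_{n+1}(a,b) = C - sum_j f_j(a,b) to every event's feature vector."""
    C = max(sum(fv) for fv in feature_vectors)      # C = max_{a,b} sum_j f_j(a,b)
    return [fv + [C - sum(fv)] for fv in feature_vectors]

# f1..f4 for the eight (b, a) events of the toy example below
events = [[1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1],
          [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
print(add_correction_feature(events))   # appended column: 0 2 1 1 1 1 2 0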
w_{j:0} = 1 ;  w_{j:i+1} = w_{j:i} \left[ \frac{E_{\tilde{p}} f_j}{E_{p_i} f_j} \right]^{\frac{1}{C}}    (5)

E_{p_i} f_j = \sum_{a,b} \tilde{p}(b) f_j(a,b) p_i(a|b)    (6)

p_i(a|b) = \frac{\prod_j w_{j:i}^{f_j(a,b)}}{\sum_a \prod_j w_{j:i}^{f_j(a,b)}}    (7)

E_{\tilde{p}} f_j = \sum_{a,b} \tilde{p}(a,b) f_j(a,b) = \frac{1}{N} \sum_n f_j(a_n, b_n)    (8)
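One GIS sweep, (5)-(8), as a hedged Python sketch; the data layout and all names are mine, only the update rule itself is the handout's. It is exercised on the toy problem further below:

from math import prod
from typing import Dict, List, Tuple

Event = Tuple[str, str]   # (outcome a, context b)

def model_probs(weights: List[float], feats: Dict[Event, List[int]],
                outcomes: List[str], contexts: List[str]) -> Dict[Event, float]:
    """Equation (7): p_i(a|b) under the current weights."""
    p = {}
    for b in contexts:
        score = {a: prod(w ** f for w, f in zip(weights, feats[(a, b)]))
                 for a in outcomes}
        z = sum(score.values())
        for a in outcomes:
            p[(a, b)] = score[a] / z
    return p

def gis_step(weights: List[float], feats: Dict[Event, List[int]],
             counts: Dict[Event, int], outcomes: List[str], contexts: List[str],
             C: float) -> List[float]:
    """Equations (5), (6) and (8): one multiplicative update of every weight."""
    N = sum(counts.values())
    p_tilde_b = {b: sum(counts[(a, b)] for a in outcomes) / N for b in contexts}
    p = model_probs(weights, feats, outcomes, contexts)
    updated = []
    for j, w in enumerate(weights):
        emp = sum(counts[e] * feats[e][j] for e in counts) / N            # eq (8)
        model = sum(p_tilde_b[b] * feats[(a, b)][j] * p[(a, b)]
                    for b in contexts for a in outcomes)                  # eq (6)
        updated.append(w * (emp / model) ** (1.0 / C))                    # eq (5)
    return updated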
c(b,a)     20    0    0   20    0   20    0   20      80
cp1(b)      T    T    T    T    F    F    F    F
cp2(b)      T    T    F    F    T    T    F    F
a           T    F    T    F    T    F    T    F
f1          1    0    1    0    0    0    0    0      20    .25
f2          1    0    0    0    1    0    0    0      20    .25
f3          0    0    0    0    0    1    0    1      40    .5
f4          0    0    0    1    0    0    0    1      40    .5
fc          0    2    1    1    1    1    2    0      40    .5

(80 = total count N; for each feature row the last two columns give \sum_{a,b} c(b,a) f_j(a,b) and the empirical expectation E_{\tilde{p}} f_j.)
C = 2    (9)

w_{j:i} = w_{j:i-1} \left[ \frac{\frac{1}{N} \sum_n f_j(a_n, b_n)}{\sum_{a,b} \tilde{p}(b) f_j(a,b) \, p_{i-1}(a|b)} \right]^{\frac{1}{2}}    (10)

p_i(a|b) = \frac{\prod_j w_{j:i}^{f_j(a,b)}}{\sum_a \prod_j w_{j:i}^{f_j(a,b)}}    (11)

E_{\tilde{p}} f_{1,2} = 20/80 = .25 ;  E_{\tilde{p}} f_{3,4,c} = 40/80 = .5    (12)
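The toy problem encoded in the same layout as the sketches above (dictionary keys and names are mine; the counts and feature values are those of the table). The last line checks (12):

counts = {  # c(b, a): the context b = (cp1 cp2) is written as a two-letter string
    ("T", "TT"): 20, ("F", "TT"): 0,
    ("T", "TF"): 0,  ("F", "TF"): 20,
    ("T", "FT"): 0,  ("F", "FT"): 20,
    ("T", "FF"): 0,  ("F", "FF"): 20,
}
feats = {   # [f1, f2, f3, f4, fc] for each (a, b)
    ("T", "TT"): [1, 1, 0, 0, 0], ("F", "TT"): [0, 0, 0, 0, 2],
    ("T", "TF"): [1, 0, 0, 0, 1], ("F", "TF"): [0, 0, 0, 1, 1],
    ("T", "FT"): [0, 1, 0, 0, 1], ("F", "FT"): [0, 0, 1, 0, 1],
    ("T", "FF"): [0, 0, 0, 0, 2], ("F", "FF"): [0, 0, 1, 1, 0],
}
N = sum(counts.values())   # 80
emp = [sum(counts[e] * feats[e][j] for e in counts) / N for j in range(5)]
print(emp)   # [0.25, 0.25, 0.5, 0.5, 0.5], matching equation (12)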
Weights after each iteration:

        w_{x:0}   w_{x:1}   w_{x:2}   w_{x:3}
f1      1         1         .96       .91
f2      1         1         .96       .91
f3      1         1.41      1.71      1.94
f4      1         1.41      1.71      1.94
fc      1         .71       .57       .47

Conditional probabilities after each iteration:

          p0    p1    p2    p3
T|TT      .5    .66   .74   .79
F|TT      .5    .33   .26   .21
T|TF      .5    .41   .36   .32
F|TF      .5    .59   .64   .68
T|FT      .5    .41   .36   .32
F|FT      .5    .59   .64   .68
T|FF      .5    .2    .1    .06
F|FF      .5    .8    .9    .94

Model expectations after each iteration:

         f1    f2    f3    f4    fc
E_p0     .25   .25   .25   .25   1
E_p1     .27   .27   .34   .34   .77
E_p2     .28   .28   .39   .39   .73
E_p3     .28   .28   .41   .41   .73
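Putting the pieces together: a compact, self-contained GIS run on the toy data (all names are mine). It reproduces the three tables above up to rounding; the handout's hand calculation rounds the weights to two decimals at every step, so last digits can differ by one:

from math import prod

outcomes, contexts = ["T", "F"], ["TT", "TF", "FT", "FF"]
counts = {("T", "TT"): 20, ("F", "TT"): 0, ("T", "TF"): 0, ("F", "TF"): 20,
          ("T", "FT"): 0, ("F", "FT"): 20, ("T", "FF"): 0, ("F", "FF"): 20}
feats = {("T", "TT"): [1, 1, 0, 0, 0], ("F", "TT"): [0, 0, 0, 0, 2],
         ("T", "TF"): [1, 0, 0, 0, 1], ("F", "TF"): [0, 0, 0, 1, 1],
         ("T", "FT"): [0, 1, 0, 0, 1], ("F", "FT"): [0, 0, 1, 0, 1],
         ("T", "FF"): [0, 0, 0, 0, 2], ("F", "FF"): [0, 0, 1, 1, 0]}
N, C = sum(counts.values()), 2
p_b = {b: sum(counts[(a, b)] for a in outcomes) / N for b in contexts}      # p~(b)
emp = [sum(counts[e] * feats[e][j] for e in counts) / N for j in range(5)]  # eq (8)

def probs(w):
    """Equation (11): p_i(a|b) under the current weights."""
    p = {}
    for b in contexts:
        score = {a: prod(wj ** fj for wj, fj in zip(w, feats[(a, b)])) for a in outcomes}
        z = sum(score.values())
        p.update({(a, b): score[a] / z for a in outcomes})
    return p

w = [1.0] * 5                                      # w_{j:0} = 1
for i in range(3):
    p = probs(w)
    exp = [sum(p_b[b] * feats[(a, b)][j] * p[(a, b)]
               for a in outcomes for b in contexts) for j in range(5)]
    print(f"E_p{i}:", [round(x, 2) for x in exp])
    w = [wj * (emp[j] / exp[j]) ** (1 / C) for j, wj in enumerate(w)]   # eq (10)
    print(f"w:{i+1} :", [round(wj, 2) for wj in w])
print("p_3  :", {f"{a}|{b}": round(v, 2) for (a, b), v in probs(w).items()})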
p_0(T|(T,T)) = \frac{1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0}{(1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0) + (1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2)} = .5    (13)

p_0(F|(T,T)) = \frac{1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2}{(1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0) + (1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2)} = .5    (14)
E_{p_0} f_1 = (.25 \cdot 1 \cdot p_0(T|(T,T))) + (.25 \cdot 1 \cdot p_0(T|(T,F))) = .25    (15)

E_{p_0} f_3 = (.25 \cdot 1 \cdot p_0(F|(F,T))) + (.25 \cdot 1 \cdot p_0(F|(F,F))) = .25    (16)

E_{p_0} f_c = (.25 \cdot 2 \cdot p_0(F|(T,T))) + (.25 \cdot 1 \cdot p_0(T|(T,F))) + (.25 \cdot 1 \cdot p_0(F|(T,F))) + (.25 \cdot 1 \cdot p_0(T|(F,T))) + (.25 \cdot 1 \cdot p_0(F|(F,T))) + (.25 \cdot 2 \cdot p_0(T|(F,F))) = 1    (17)

w_{1:1} = w_{2:1} = 1 ;  w_{3:1} = w_{4:1} = \sqrt{2} ;  w_{c:1} = \sqrt{.5}    (18)
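The arithmetic behind (13)-(18), written out literally in Python; only the handout's numbers are used, the variable names are mine:

# Equations (13)-(14): with all weights equal to 1, both outcomes tie.
p0_T_TT = (1**1 * 1**1 * 1**0 * 1**0 * 1**0) / \
          ((1**1 * 1**1 * 1**0 * 1**0 * 1**0) + (1**0 * 1**0 * 1**0 * 1**0 * 1**2))
print(p0_T_TT)                                 # 0.5, and p0(F|(T,T)) is the complementary 0.5

# Equations (15)-(17): expectations under p0 (every p0 value is .5)
E_p0_f1 = .25 * 1 * .5 + .25 * 1 * .5          # 0.25
E_p0_f3 = .25 * 1 * .5 + .25 * 1 * .5          # 0.25
E_p0_fc = .25 * 2 * .5 + 4 * (.25 * 1 * .5) + .25 * 2 * .5   # 1.0
print(E_p0_f1, E_p0_f3, E_p0_fc)

# Equation (18): first GIS update with C = 2
w1_1 = 1 * (0.25 / E_p0_f1) ** 0.5             # 1
w3_1 = 1 * (0.5 / E_p0_f3) ** 0.5              # sqrt(2) = 1.41...
wc_1 = 1 * (0.5 / E_p0_fc) ** 0.5              # sqrt(.5) = 0.71...
print(w1_1, w3_1, wc_1)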
p_1(T|(T,T)) = \frac{1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0}{(1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2)} = .66    (19)

p_1(F|(T,T)) = \frac{1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2}{(1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2)} = .33    (20)

p_1(T|(T,F)) = \frac{1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1}{(1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1)} = .41    (21)

p_1(F|(T,F)) = \frac{1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1}{(1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1)} = .59    (22)

p_1(T|(F,F)) = \frac{.71^2}{(.71^2) + (1.41 \cdot 1.41)} = .2    (23)

p_1(F|(F,F)) = \frac{1.41 \cdot 1.41}{(.71^2) + (1.41 \cdot 1.41)} = .8    (24)
E_{p_1} f_1 = (.25 \cdot 1 \cdot .66) + (.25 \cdot 1 \cdot .41) = .27    (25)

E_{p_1} f_3 = (.25 \cdot 1 \cdot .59) + (.25 \cdot 1 \cdot .8) = .34    (26)

w_{1:2} = w_{2:2} = .96 ;  w_{3:2} = w_{4:2} = 1.71 ;  w_{c:2} = .57    (27)
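The same literal check for (19)-(27), plugging in the rounded weights w1 = w2 = 1, w3 = w4 = 1.41, wc = .71 as the handout does (names are mine; last digits can differ from the handout by one depending on where the rounding happens):

w1, w2, w3, w4, wc = 1.0, 1.0, 1.41, 1.41, 0.71

# Equations (19)-(20): context (T, T); f1, f2 fire for a = T, fc fires twice for a = F
num_T, num_F = w1**1 * w2**1, wc**2
p1_T_TT, p1_F_TT = num_T / (num_T + num_F), num_F / (num_T + num_F)

# Equations (21)-(22): context (T, F); f1 + fc vs. f4 + fc
num_T, num_F = w1**1 * wc**1, w4**1 * wc**1
p1_T_TF, p1_F_TF = num_T / (num_T + num_F), num_F / (num_T + num_F)

# Context (F, T) mirrors (T, F): f2 + fc vs. f3 + fc, with w2 = w1 and w3 = w4
num_T, num_F = w2**1 * wc**1, w3**1 * wc**1
p1_T_FT, p1_F_FT = num_T / (num_T + num_F), num_F / (num_T + num_F)

# Equations (23)-(24): context (F, F); fc fires twice for a = T, f3 + f4 for a = F
num_T, num_F = wc**2, w3**1 * w4**1
p1_T_FF, p1_F_FF = num_T / (num_T + num_F), num_F / (num_T + num_F)

print(round(p1_T_TT, 2), round(p1_F_TT, 2))    # 0.66 0.34
print(round(p1_T_TF, 2), round(p1_F_TF, 2))    # 0.41 0.59
print(round(p1_T_FF, 2), round(p1_F_FF, 2))    # 0.2  0.8

# Equations (25)-(26): expectations under p1
E_p1_f1 = .25 * 1 * p1_T_TT + .25 * 1 * p1_T_TF           # about .27
E_p1_f3 = .25 * 1 * p1_F_FT + .25 * 1 * p1_F_FF           # about .35 (the handout rounds to .34)
E_p1_fc = (.25 * 2 * p1_F_TT
           + .25 * (p1_T_TF + p1_F_TF + p1_T_FT + p1_F_FT)
           + .25 * 2 * p1_T_FF)                            # about .77

# Equation (27): the second update, again with C = 2
print(round(w1 * (.25 / E_p1_f1) ** 0.5, 2),               # about .96
      round(w3 * (.5 / E_p1_f3) ** 0.5, 2),                # about 1.7 (the handout: 1.71)
      round(wc * (.5 / E_p1_fc) ** 0.5, 2))                # about .57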
References

Berger, Adam L., Della Pietra, Stephen, and Della Pietra, Vincent J. (1996) "A Maximum Entropy Approach to Natural Language Processing," Computational Linguistics, 22(1), 39–71.

Darroch, J. N. and Ratcliff, D. (1972) "Generalized iterative scaling for log-linear models," Annals of Mathematical Statistics, 43, 1470–1480.

Ratnaparkhi, A. (1998) "Maximum Entropy Models for Natural Language Ambiguity Resolution," Ph.D. thesis, University of Pennsylvania.