Decision Trees
USC Linguistics
July 26, 2007

Suppose we wish to classify events according to their temporal orientations.[1]
                     -ed   -s   will   be   [sing]   t(e)
He read.              F    F    F      F    T        t(e)<ST
He was hungry.        F    F    F      T    T        t(e)<ST
He ran.               F    F    F      F    T        t(e)<ST
They liked it.        T    F    F      F    F        t(e)<ST
He reads.             F    T    F      F    T        t(e)∩ST
He is happy.          F    F    F      T    T        t(e)∩ST
They need it.         T    F    F      F    F        t(e)∩ST
They want it.         F    F    F      F    F        t(e)∩ST
He will read.         F    F    T      F    T        t(e)>ST
They will get it.     F    F    T      F    F        t(e)>ST

I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)    (1)

I(4/10, 4/10, 2/10) = .5288 + .5288 + .4644 = 1.522    (2)

Remainder(A) = \sum_i \frac{c(A_i)}{N} \, I\!\left(\frac{c(v_1)}{c(A_i)}, \ldots, \frac{c(v_n)}{c(A_i)}\right)    (3)

Gain(A) = I(P(v_1), \ldots, P(v_n)) - Remainder(A)    (4)
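Equations (1) and (2) can be checked with a minimal entropy helper (Python is my choice of language here, not the handout's; the function name `I` mirrors the notation above):

```python
import math

def I(*probs):
    """Entropy in bits (eq. 1); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The ten examples split 4 past, 4 present, 2 future:
print(round(I(4/10, 4/10, 2/10), 3))  # 1.522
```

A pure class (a single value with probability 1) has entropy 0, which is why homogeneous branches below contribute nothing to the remainder.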
R(-ed) = (2/10) I(1/2, 1/2, 0) + (8/10) I(3/8, 3/8, 2/8) = .2 * 1 + .8 * 1.56 = 1.45    (5)

R(-s) = (1/10) I(0, 1, 0) + (9/10) I(4/9, 3/9, 2/9) = .1 * 0 + .9 * 1.53 = 1.38    (6)
[1] Notice immediately that there is no way to disambiguate the written representation "They read"; unless, perhaps, the past is more likely to be eventive, and more likely to want a direct object.
R(will) = (2/10) I(0, 0, 1) + (8/10) I(4/8, 4/8, 0) = .2 * 0 + .8 * 1 = .8    (7)

R(be) = (2/10) I(1/2, 1/2, 0) + (8/10) I(3/8, 3/8, 2/8) = .2 * 1 + .8 * 1.56 = 1.45    (8)

R([sing]) = (6/10) I(3/6, 2/6, 1/6) + (4/10) I(1/4, 2/4, 1/4) = .6 * 1.45 + .4 * 1.5 = 1.475    (9)
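The remainders (5)-(9) can be recomputed mechanically. A short sketch (the feature vectors are my transcription of the table above; 'o' stands for t(e)∩ST):

```python
import math
from collections import defaultdict

def entropy(counts):
    """I(...) over raw counts (eq. 1)."""
    n = sum(counts)
    return -sum(c/n * math.log2(c/n) for c in counts if c > 0)

def remainder(values, labels):
    """Weighted entropy of the subsets a feature induces (eq. 3)."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[v].append(lab)
    n = len(labels)
    return sum(len(g)/n * entropy([g.count(c) for c in set(g)])
               for g in groups.values())

# tense labels: '<' past, 'o' present, '>' future
labels = ['<','<','<','<','o','o','o','o','>','>']
feats = {
    '-ed':    [0,0,0,1,0,0,1,0,0,0],
    '-s':     [0,0,0,0,1,0,0,0,0,0],
    'will':   [0,0,0,0,0,0,0,0,1,1],
    'be':     [0,1,0,0,0,1,0,0,0,0],
    '[sing]': [1,1,1,0,1,1,0,0,1,0],
}
for name, vals in feats.items():
    print(name, round(remainder(vals, labels), 3))
# will has the lowest remainder (.8), hence the highest gain
```

This reproduces the ranking above: will wins the first split.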
[Tree diagram: root split on will. will:T → {>, >}; will:F → {<, <, <, <, ∩, ∩, ∩, ∩}]
Within the will:F node (four past and four present events):

I(4/8, 4/8) = 1    (10)

R(-ed) = (2/8) I(1/2, 1/2) + (6/8) I(3/6, 3/6) = .25 + .75 = 1    (11)

R(-s) = (1/8) I(0, 1) + (7/8) I(4/7, 3/7) = 0 + .86 = .86    (12)

R(be) = (2/8) I(1/2, 1/2) + (6/8) I(3/6, 3/6) = .25 + .75 = 1    (13)

R([sing]) = (5/8) I(3/5, 2/5) + (3/8) I(1/3, 2/3) = .61 + .34 = .95    (14)
[Tree diagram: root split on will, then the will:F node split on -s. will:T → {>, >}; -s:T → {∩}; -s:F → {<, <, <, <, ∩, ∩, ∩}]
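The greedy procedure behind these diagrams can be sketched as a minimal ID3-style recursion (my own implementation, not code from the handout; 'o' again stands for t(e)∩ST):

```python
import math
from collections import Counter, defaultdict

FEATURES = ['-ed', '-s', 'will', 'be', '[sing]']

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def remainder(rows, labels, f):
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[row[f]].append(lab)
    return sum(len(g)/len(labels) * entropy(g) for g in groups.values())

def build(rows, labels, features):
    # Leaf: all labels agree, or no features left (take the majority class).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: lowest remainder = highest gain (eq. 4).
    best = min(features, key=lambda f: remainder(rows, labels, f))
    branches = defaultdict(lambda: ([], []))
    for row, lab in zip(rows, labels):
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(lab)
    rest = [f for f in features if f != best]
    return (best, {v: build(r, l, rest) for v, (r, l) in branches.items()})

# Feature vectors transcribed from the table above.
vectors = [(0,0,0,0,1), (0,0,0,1,1), (0,0,0,0,1), (1,0,0,0,0), (0,1,0,0,1),
           (0,0,0,1,1), (1,0,0,0,0), (0,0,0,0,0), (0,0,1,0,1), (0,0,1,0,0)]
rows = [dict(zip(FEATURES, v)) for v in vectors]
labels = ['<','<','<','<','o','o','o','o','>','>']

tree = build(rows, labels, FEATURES)
print(tree[0])  # the root split is on will, as in the diagram
```

The recursion reproduces the hand-built tree: will at the root, then -s inside the will:F node.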
The same examples, now with the previous word (w-1) as a feature, recorded only when it is was, is, or will:

                     -ed   -s   w-1    [sing]   t(e)
He read.              F    F    -      T        t(e)<ST
He was hungry.        F    F    was    T        t(e)<ST
He ran.               F    F    -      T        t(e)<ST
They liked it.        T    F    -      F        t(e)<ST
He reads.             F    T    -      T        t(e)∩ST
He is happy.          F    F    is     T        t(e)∩ST
They need it.         T    F    -      F        t(e)∩ST
They want it.         F    F    -      F        t(e)∩ST
He will read.         F    F    will   T        t(e)>ST
They will get it.     F    F    will   F        t(e)>ST

R(w-1) = (6/10) I(3/6, 3/6, 0) + (1/10) I(1, 0, 0) + (1/10) I(0, 1, 0) + (2/10) I(0, 0, 1) = .6    (15)
[Tree diagram: root split on w-1. w-1:- → {<, <, <, ∩, ∩, ∩}; w-1:was → {<}; w-1:is → {∩}; w-1:will → {>, >}]
With w-1 kept as the full previous word:

                     -ed   -s   w-1    [sing]   t(e)
He read.              F    F    he     T        t(e)<ST
He was hungry.        F    F    was    T        t(e)<ST
He ran.               F    F    he     T        t(e)<ST
They liked it.        T    F    they   F        t(e)<ST
He reads.             F    T    he     T        t(e)∩ST
He is happy.          F    F    is     T        t(e)∩ST
They need it.         T    F    they   F        t(e)∩ST
They want it.         F    F    they   F        t(e)∩ST
He will read.         F    F    will   T        t(e)>ST
They will get it.     F    F    will   F        t(e)>ST

R(w-1) = (3/10) I(2/3, 1/3, 0) + (1/10) I(1, 0, 0) + (3/10) I(1/3, 2/3, 0) + (1/10) I(0, 1, 0) + (2/10) I(0, 0, 1)    (16)

R(w-1) = .55    (17)
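Both w-1 encodings can be checked with the same remainder computation (my transcription of the two tables; 'o' for t(e)∩ST, '-' for an unrecorded previous word):

```python
import math
from collections import defaultdict

def entropy(counts):
    n = sum(counts)
    return -sum(c/n * math.log2(c/n) for c in counts if c > 0)

def remainder(values, labels):
    """Eq. (3) for a multi-valued feature."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[v].append(lab)
    return sum(len(g)/len(labels) * entropy([g.count(c) for c in set(g)])
               for g in groups.values())

labels = ['<','<','<','<','o','o','o','o','>','>']
# w-1 recorded only for was/is/will, vs. the full previous word
w1_collapsed = ['-','was','-','-','-','is','-','-','will','will']
w1_full      = ['he','was','he','they','he','is','they','they','will','will']
print(round(remainder(w1_collapsed, labels), 2))  # 0.6
print(round(remainder(w1_full, labels), 2))       # 0.55
```

The full-word encoding looks better by raw remainder (.55 < .6), which motivates the GainRatio correction below: it achieves that score partly by splitting the data into more, smaller cells.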
Venkataraman[2] credits Quinlan with the concept of GainRatio:

GainRatio(A) = \frac{Gain(A)}{\sum_{v \in A} -P(v) \log_2 P(v)}    (18)
GainRatio(w-1) = (1.522 - .55) / I(3/10, 1/10, 3/10, 1/10, 2/10) = .972 / 2.17 = .45    (19)

GainRatio(w-1) = (1.522 - .6) / I(6/10, 1/10, 1/10, 2/10) = .922 / 1.57 = .59    (20)

GainRatio(will) = (1.522 - .8) / I(8/10, 2/10) = .722 / .722 = 1    (21)
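Equations (19)-(21) in code, a sketch assuming the same transcription of the feature vectors as before:

```python
import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain_ratio(values, labels):
    n = len(labels)
    base = entropy([c/n for c in Counter(labels).values()])
    rem = 0.0
    for v, cnt in Counter(values).items():
        sub = Counter(lab for w, lab in zip(values, labels) if w == v)
        rem += cnt/n * entropy([c/cnt for c in sub.values()])
    # denominator of eq. (18): entropy of the feature's own distribution
    split_info = entropy([c/n for c in Counter(values).values()])
    return (base - rem) / split_info

labels = ['<','<','<','<','o','o','o','o','>','>']
will         = [0,0,0,0,0,0,0,0,1,1]
w1_full      = ['he','was','he','they','he','is','they','they','will','will']
w1_collapsed = ['-','was','-','-','-','is','-','-','will','will']
print(round(gain_ratio(will, labels), 2))          # 1.0
print(round(gain_ratio(w1_full, labels), 2))       # 0.45
print(round(gain_ratio(w1_collapsed, labels), 2))  # 0.59
```

Under GainRatio the ranking reverses: will now beats both w-1 encodings, since its gain is exactly the cost of asking the question.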
χ²

• "Don't use low numbers, especially not zero."
E = \frac{Total(row) \cdot Total(col)}{N}    (22)
'Expected' under the null (independence) hypothesis[3]:
Observed      <     ∩     >    total
-            30    30     0       60
was          10     0     0       10
is            0    10     0       10
will          0     0    20       20
total        40    40    20      100

Expected      <     ∩     >    total
-            24    24    12       60
was           4     4     2       10
is            4     4     2       10
will          8     8     4       20
total        40    40    20      100

Table 1: y: ind. var.; x: dep. var.
χ² = \sum \frac{(O - E)^2}{E}    (23)

χ² = (30-24)²/24 + (30-24)²/24 + (0-12)²/12 + (10-4)²/4 + (0-4)²/4 + (0-2)²/2 +    (24)

     (0-4)²/4 + (10-4)²/4 + (0-2)²/2 + (0-8)²/8 + (0-8)²/8 + (20-4)²/4 = 125    (25)

[2] http://www.speech.sri.com/people/anand/771/html/node29.html
[3] Null in the sense of 'to be nullified', not in the sense of maximum entropy.
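Equations (22)-(25) mechanized (a sketch; the observed matrix is Table 1, all-zero columns are skipped so the later -s tables also work):

```python
def expected(observed):
    """E = Total(row) * Total(col) / N for each cell (eq. 22)."""
    row_t = [sum(r) for r in observed]
    col_t = [sum(c) for c in zip(*observed)]
    N = sum(row_t)
    return [[rt * ct / N for ct in col_t] for rt in row_t]

def chi_square(observed):
    """Sum of (O-E)^2/E over all cells (eq. 23); cells with E = 0 contribute 0."""
    return sum((o - e)**2 / e
               for row_o, row_e in zip(observed, expected(observed))
               for o, e in zip(row_o, row_e) if e > 0)

# Table 1, observed counts (rows: -, was, is, will; cols: <, ∩, >)
O = [[30, 30, 0], [10, 0, 0], [0, 10, 0], [0, 0, 20]]
print(chi_square(O))  # 125.0
```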
f° = (rows - 1)(cols - 1)    (26)

Here f° = (4 - 1)(3 - 1) = 6.
gamma function:

\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x} \, dx    (27)

for integers: \Gamma(n) = (n - 1)!    (28)

gamma pdf:

p(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}    (29)

For χ²: \alpha = \frac{f^\circ}{2}, \quad \lambda = \frac{1}{2}    (30)

p(\chi^2) = \frac{(1/2)^{f^\circ/2}}{\Gamma(f^\circ/2)} \, x^{f^\circ/2 - 1} e^{-x/2}    (31)
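A quick numeric sanity check on (28)-(31), using the standard-library gamma function (the [0, 60] integration range and step size are arbitrary choices of mine):

```python
import math

def chi2_pdf(x, dof):
    """Eq. (31): gamma pdf with alpha = f/2 and lambda = 1/2."""
    lam, alpha = 0.5, dof / 2
    return lam ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-lam * x)

# Gamma(n) = (n-1)! for integer n (eq. 28)
print(math.gamma(5))  # 24.0

# Crude Riemann sum: the f = 6 density should integrate to ~1 over [0, 60]
step = 0.001
area = sum(chi2_pdf(i * step, 6) * step for i in range(1, 60001))
print(round(area, 3))
```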
[Figure: Gamma PDF (χ²: f° = 2α; λ = .5); curves for α = 10, λ = 2; α = 5, λ = 2; α = 3, λ = .5; α = 1, λ = .5; α = .5, λ = .5]

p(χ² > 125, f° = 6) = 9.25 * 10^-14    (32)
Observed      <     ∩     >    total
-s:T          0    10     0       10
-s:F         40    30     0       70
total        40    40     0       80

Expected      <     ∩     >    total
-s:T          5     5     0       10
-s:F         35    35     0       70
total        40    40     0       80

χ² = 5 + 5 + 0 + .71 + .71 + 0 = 11.42,  f° = 2    (33)

p(χ² > 11.42, f° = 2) = .003    (34)
The same distribution with counts ten times smaller:

Observed      <     ∩     >    total
-s:T          0     1     0        1
-s:F          4     3     0        7
total         4     4     0        8

Expected      <     ∩     >    total
-s:T         .5    .5     0        1
-s:F        3.5   3.5     0        7
total         4     4     0        8

χ² = .5 + .5 + 0 + .071 + .071 + 0 = 1.142,  f° = 2    (35)

p(χ² > 1.142, f° = 2) = .56    (36)
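The p-values in (34) and (36) can be reproduced with the closed-form survival function of the χ² distribution, which holds only for even degrees of freedom (a standard identity, not derived in the handout):

```python
import math

def chi2_sf(x, dof):
    """P(chi^2 > x) = exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!, even dof only."""
    assert dof % 2 == 0
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(dof // 2))

print(round(chi2_sf(11.42, 2), 3))  # 0.003 -- eq. (34): significant
print(round(chi2_sf(1.142, 2), 2))  # 0.56  -- eq. (36): not significant
```

Dividing every count by ten divides χ² by ten as well, and the apparent significance evaporates, which is exactly the "don't use low numbers" caveat above.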