Basic Bayes∗

USC Linguistics

December 20, 2007

         β     ¬β
  α     n11   n10
 ¬α     n01   n00

Table 1: n = counts
N = n11 + n01 + n10 + n00                                (1)

p(α, β) = n11 / N                                        (2)

p(α) = (n11 + n10) / N                                   (3)

p(β|α) = p(α, β) / p(α) = n11 / (n11 + n10)              (4)

p(β) = (n11 + n01) / N                                   (5)

p(α|β) = p(α, β) / p(β) = n11 / (n11 + n01)              (6)

p(α, β) = p(α|β)p(β) = p(α)p(β|α)                        (7)

“Bayes’ Theorem”:

p(α|β) = p(α)p(β|α) / p(β)                               (8)
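Everything above can be checked mechanically from the four cells of Table 1. A minimal Python sketch, with assumed counts (the specific numbers are illustrative only); the final assert verifies Bayes’ theorem, eq. (8), exactly:

    from fractions import Fraction

    # Assumed counts for Table 1's four cells (illustrative values only).
    n11, n10, n01, n00 = 30, 10, 20, 40

    N = n11 + n01 + n10 + n00                 # eq. (1)
    p_ab = Fraction(n11, N)                   # p(α, β), eq. (2)
    p_a  = Fraction(n11 + n10, N)             # p(α),    eq. (3)
    p_b  = Fraction(n11 + n01, N)             # p(β),    eq. (5)
    p_b_given_a = p_ab / p_a                  # p(β|α),  eq. (4)
    p_a_given_b = p_ab / p_b                  # p(α|β),  eq. (6)

    # Bayes' theorem, eq. (8): p(α|β) = p(α) p(β|α) / p(β)
    assert p_a_given_b == p_a * p_b_given_a / p_b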
∗Thanks to David Wilczynski and USC’s CSCI 561 slides for the general gist of this brief introduction. Also to Grenager’s Stanford lecture notes (http://www-nlp.stanford.edu/~grenager/cs121/handouts/cs121_lecture06_4pp.pdf), and particularly John A. Carroll’s Sussex notes (http://www.informatics.susx.ac.uk/courses/nlp/lecturenotes/corpus2.pdf) for the tip on feature products; also Wikipedia for its clear presentation of multiple variables.
Extending to more variables:

p(α|β, γ) = p(α, β, γ) / p(β, γ)
          = p(α, β)p(γ|α, β) / (p(β)p(γ|β))
          = p(α)p(β|α)p(γ|α, β) / (p(β)p(γ|β))           (9)
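Eq. (9) is an identity, so it can be verified numerically; a minimal sketch by enumeration over two fair dice (the three events a, b, c are arbitrary assumed examples, not anything from the text):

    from fractions import Fraction
    from itertools import product

    # All 36 equally likely (green, red) outcomes.
    outcomes = list(product(range(1, 7), repeat=2))

    def p(*events):
        """Joint probability that every event in `events` holds."""
        hits = [o for o in outcomes if all(e(*o) for e in events)]
        return Fraction(len(hits), len(outcomes))

    a = lambda g, r: g + r >= 8    # assumed example event
    b = lambda g, r: g >= 3        # assumed example event
    c = lambda g, r: r % 2 == 0    # assumed example event

    lhs = p(a, b, c) / p(b, c)                               # p(α|β, γ)
    rhs = (p(a) * (p(a, b) / p(a)) * (p(a, b, c) / p(a, b))
           / (p(b) * (p(b, c) / p(b))))                      # chain-rule form
    assert lhs == rhs                                        # eq. (9)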
1  The Naive Approach
For ℓ, a label, and f, features of the event¹:

p(fi|ℓ) = c(fi, ℓ) / Σj c(fj, ℓ)                         (10)

p(ℓ) = c(ℓ) / Σi c(ℓi)                                   (11)
A new event is assigned the label which maximizes the following product:

p(ℓ) ∏i p(fi|ℓ)                                          (12)
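A minimal naive-Bayes sketch of eqs. (10)–(12) in Python; the toy events, labels, and feature names are assumptions for illustration only:

    from collections import Counter, defaultdict

    # Assumed labeled events: (label, features).
    events = [("sport", ["ball", "goal"]),
              ("sport", ["ball", "team"]),
              ("news",  ["vote", "team"])]

    label_counts = Counter(label for label, _ in events)
    feat_counts = defaultdict(Counter)       # feat_counts[label][f] = c(f, ℓ)
    for label, feats in events:
        for f in feats:
            feat_counts[label][f] += 1

    def p_label(label):                      # eq. (11)
        return label_counts[label] / sum(label_counts.values())

    def p_feat_given_label(f, label):        # eq. (10)
        return feat_counts[label][f] / sum(feat_counts[label].values())

    def classify(feats):                     # eq. (12): argmax of the product
        def score(label):
            s = p_label(label)
            for f in feats:
                s *= p_feat_given_label(f, label)
            return s
        return max(label_counts, key=score)

    print(classify(["ball", "goal"]))        # -> "sport"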
1.1  Problems
If α and β are independent:

p(α|β) = p(α)                                            (13)

p(α, β) = p(α)p(β)                                       (14)

these DO NOT imply:

p(α, β|γ) = p(α|γ)p(β|γ)                                 (15)

¹cf. John A. Carroll
Rather:

(p(α, β|γ) = p(α|γ)p(β|γ)) ↔ (p(α|β, γ) = p(α|γ))        (16)

Suppose:

p(α, β|γ) = p(α|γ)p(β|γ)                                 (17)

∴ (multiplying both sides by p(γ)):

p(α, β, γ) = p(α|γ)p(β, γ)                               (18)

∴ (dividing both sides by p(β, γ)):

p(α|β, γ) = p(α|γ)                                       (19)
Much thanks to Greg Lawler, of the University of Chicago, who, in a fortuitous flight meeting, provided this elegant counterexample (two fair dice, one green and one red; see the sketch after eq. 22):

• α: green die + red die = 7
• β: green die = 1
• γ: red die = 6
p(α|β) = p(α|γ) = p(α) = 1/6                             (20)

but

p(α|β, γ) = 1                                            (21)

∴ p(α|β, γ) ≠ p(α|γ)   ∴   p(α, β|γ) ≠ p(α|γ)p(β|γ)      (22)
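The counterexample is easy to verify by brute-force enumeration; a minimal Python sketch:

    from fractions import Fraction
    from itertools import product

    # All 36 equally likely (green, red) outcomes of the two dice.
    outcomes = list(product(range(1, 7), repeat=2))

    def prob(event, given=lambda g, r: True):
        """p(event | given), by direct counting."""
        kept = [o for o in outcomes if given(*o)]
        return Fraction(sum(1 for o in kept if event(*o)), len(kept))

    alpha = lambda g, r: g + r == 7    # α: green die + red die = 7
    beta  = lambda g, r: g == 1        # β: green die = 1
    gamma = lambda g, r: r == 6        # γ: red die = 6

    assert prob(alpha) == Fraction(1, 6)                     # p(α)
    assert prob(alpha, given=beta) == Fraction(1, 6)         # p(α|β), eq. (20)
    assert prob(alpha, given=gamma) == Fraction(1, 6)        # p(α|γ), eq. (20)
    assert prob(alpha,
                given=lambda g, r: beta(g, r) and gamma(g, r)) == 1  # eq. (21)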
But anyway:

p(ℓ|f0, ..., fn) = p(ℓ) ∏i=0..n p(fi|ℓ) / ∏i=0..n p(fi)  (23)
Also notice how this equation can give p > 1. Assume 3 events: (α, β), (α, γ), (δ, δ), so:

• p(α) = 2/3
• p(β) = 1/3
• p(γ) = 1/3
• p(β|α) = 1/2
• p(γ|α) = 1/2

p(α|β, γ) = p(α)p(β|α)p(γ|α) / (p(β)p(γ)) = (2/3 · 1/2 · 1/2) / (1/3 · 1/3) = 3/2    (24)
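A two-line check of eq. (24) with exact rationals:

    from fractions import Fraction

    # The naive estimate of eq. (23) on the three events above, exactly:
    p = (Fraction(2, 3) * Fraction(1, 2) * Fraction(1, 2)) / (Fraction(1, 3) * Fraction(1, 3))
    assert p == Fraction(3, 2)   # a "probability" of 3/2, greater than 1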
Also, be sure that c(L) = c(F), where c(L) = Σi c(ℓi) and c(F) = Σi c(fi) are the total label and feature counts:

p(ℓ) = c(ℓ) / c(L);   p(f) = c(f) / c(F)                 (25)

p(ℓ|f) = [c(ℓ)/c(L)] [c(ℓ, f)/c(ℓ)] / [c(f)/c(F)] = [c(ℓ, f)/c(L)] / [c(f)/c(F)] = c(ℓ, f)/c(f)    (26)

(the last step uses c(L) = c(F))
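A quick check of eq. (26) on assumed toy data where each event contributes exactly one label token and one feature token, so c(L) = c(F) holds by construction:

    from fractions import Fraction
    from collections import Counter

    # Assumed toy events, one (label, feature) pair each, so c(L) = c(F) = 3.
    data = [("sport", "ball"), ("sport", "goal"), ("news", "ball")]
    c_l  = Counter(l for l, _ in data)      # c(ℓ)
    c_f  = Counter(f for _, f in data)      # c(f)
    c_lf = Counter(data)                    # c(ℓ, f)
    cL = cF = len(data)

    for (l, f), n in c_lf.items():
        # eq. (26): p(ℓ)p(f|ℓ)/p(f) reduces to c(ℓ, f)/c(f) when c(L) = c(F)
        post = (Fraction(c_l[l], cL) * Fraction(n, c_l[l])) / Fraction(c_f[f], cF)
        assert post == Fraction(n, c_f[f])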
2  Smoothing

2.1  Linear Interpolation
The weight α controls how much the non-conditioned estimate contributes; tune α on reserved data:

p(x|y) = α p̂(x|y) + (1 − α) p̂(x)                         (27)
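A minimal sketch of eq. (27); the α value here is an assumption, since in practice it is tuned on reserved data:

    def interpolate(p_cond, p_marg, alpha=0.7):
        """Eq. (27): alpha * p^(x|y) + (1 - alpha) * p^(x).
        alpha is an assumed value; tune it on reserved (held-out) data."""
        return alpha * p_cond + (1 - alpha) * p_marg

    # An unseen conditional (p^(x|y) = 0) backs off toward the marginal:
    print(interpolate(0.0, 0.05))   # ~0.015 rather than a hard zero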
2.2  Laplace
Add k, “the strength of the prior”, to every count; tune k on reserved data:

p(x) = (c(x) + k) / Σx [c(x) + k] = (c(x) + k) / (N + k|X|)    (28)

p(x|y) = (c(x, y) + k) / (c(y) + k|X|)                   (29)

If k = 1, we are pretending we saw everything once more than we actually did, even things that we never saw!
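A minimal add-k sketch of eq. (28); the toy counts and vocabulary size are assumptions:

    from collections import Counter

    def add_k(counts, x, vocab_size, k=1.0):
        """Eq. (28): (c(x) + k) / (N + k|X|); vocab_size plays the role of |X|."""
        N = sum(counts.values())
        return (counts[x] + k) / (N + k * vocab_size)

    counts = Counter({"a": 3, "b": 1})            # assumed counts; "c" never seen
    print(add_k(counts, "a", vocab_size=3))       # (3 + 1) / (4 + 3) ~ 0.571
    print(add_k(counts, "c", vocab_size=3))       # (0 + 1) / (4 + 3) ~ 0.143, not zero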
2.3  Another caveat?
p(ℓ) = (c(ℓ) + |F|) / (N + |FL|)                         (30)

p(f|ℓ) = (c(f, ℓ) + 1) / (c(ℓ) + |F|)                    (31)

p(f) = (c(f) + |L|) / (N + |FL|)                         (32)

|F| is the number of feature types, |L| is the number of label types, and |FL| is their product.

∴ p(ℓ|f) = (c(ℓ, f) + 1) / (c(f) + |L|)                  (33)

(the (c(ℓ) + |F|) and (N + |FL|) factors cancel, so the smoothed posterior is again a proper probability)
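A quick numerical check (Python, assumed toy data) that these consistently smoothed estimates keep the posterior a proper probability, unlike eq. (24):

    from fractions import Fraction
    from collections import Counter

    # Assumed toy (label, feature) observations.
    data = [("sport", "ball"), ("sport", "goal"), ("news", "vote")]
    labels   = sorted({l for l, _ in data})
    features = sorted({f for _, f in data})
    N, F, L  = len(data), len(features), len(labels)

    c_l, c_f, c_lf = (Counter(l for l, _ in data),
                      Counter(f for _, f in data),
                      Counter(data))

    for l in labels:
        for f in features:
            p_l  = Fraction(c_l[l] + F, N + F * L)            # eq. (30)
            p_fl = Fraction(c_lf[(l, f)] + 1, c_l[l] + F)     # eq. (31)
            p_f  = Fraction(c_f[f] + L, N + F * L)            # eq. (32)
            post = p_l * p_fl / p_f
            assert post == Fraction(c_lf[(l, f)] + 1, c_f[f] + L)  # eq. (33)
            assert post <= 1                                  # no p > 1 this time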