Introduction to Probability for Electrical Engineering Prapun Suksompong School of Electrical and Computer Engineering Cornell University, Ithaca, NY 14853
[email protected] January 23, 2008
Contents
1 Mathematical Background
  1.1 Set Theory
  1.2 Enumeration / Combinatorics / Counting
  1.3 Dirac Delta Function
2 Classical Probability
  2.1 Examples
3 Probability Foundations
  3.1 Algebra and σ-algebra
  3.2 Kolmogorov's Axioms for Probability
  3.3 Properties of Probability Measure
  3.4 Countable Ω
  3.5 Independence
4 Random Element
  4.1 Random Variable
  4.2 Distribution Function
  4.3 Discrete random variable
  4.4 Continuous random variable
  4.5 Mixed/hybrid Distribution
  4.6 Independence
  4.7 Misc
5 PMF Examples
  5.1 Random/Uniform
  5.2 Bernoulli and Binary distributions
  5.3 Binomial: B(n, p)
  5.4 Geometric: G(β)
  5.5 Poisson Distribution: P(λ)
  5.6 Compound Poisson
  5.7 Hypergeometric
  5.8 Negative Binomial Distribution (Pascal / Pólya distribution)
  5.9 Beta-binomial distribution
  5.10 Zipf or zeta random variable
6 PDF Examples
  6.1 Uniform Distribution
  6.2 Gaussian Distribution
  6.3 Exponential Distribution
  6.4 Pareto: Par(α), heavy-tailed model/density
  6.5 Laplacian: L(α)
  6.6 Rayleigh
  6.7 Cauchy
  6.8 Weibull
7 Expectation
8 Inequalities
9 Random Vectors
  9.1 Random Sequence
10 Transform Methods
  10.1 Probability Generating Function
  10.2 Moment Generating Function
  10.3 One-Sided Laplace Transform
  10.4 Characteristic Function
11 Functions of random variables
  11.1 SISO case
  11.2 MISO case
  11.3 MIMO case
  11.4 Order Statistics
12 Convergences
  12.1 Summation of random variables
  12.2 Summation of independent random variables
  12.3 Summation of i.i.d. random variable
  12.4 Central Limit Theorem (CLT)
13 Conditional Probability and Expectation
  13.1 Conditional Probability
  13.2 Conditional Expectation
  13.3 Conditional Independence
14 Real-valued Jointly Gaussian
15 Bayesian Detection and Estimation
A More Math
  A.1 Inequalities
  A.2 Summations
  A.3 Derivatives
  A.4 Integration
  A.5 Gamma and Beta functions
1 Mathematical Background

1.1 Set Theory
1.1. Basic Set Identities:
• Double complement (involution): $(A^c)^c = A$
• Commutativity (symmetry): $A \cup B = B \cup A$, $A \cap B = B \cap A$
• Associativity:
  ◦ $A \cap (B \cap C) = (A \cap B) \cap C$
  ◦ $A \cup (B \cup C) = (A \cup B) \cup C$
• Distributivity
  ◦ $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
  ◦ $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
• de Morgan laws
  ◦ $(A \cup B)^c = A^c \cap B^c$
  ◦ $(A \cap B)^c = A^c \cup B^c$
1.2. Basic Terminology:
• $A \cap B$ is sometimes written simply as $AB$.
• Sets $A$ and $B$ are said to be disjoint ($A \perp B$) if and only if $A \cap B = \emptyset$.
• A collection of sets $(A_i : i \in I)$ is said to be pairwise disjoint or mutually exclusive [9, p 9] if and only if $A_i \cap A_j = \emptyset$ when $i \neq j$.
• A collection $\Pi = (A_\alpha : \alpha \in I)$ of subsets of $\Omega$ (in this case, indexed or labeled by $\alpha$ taking values in an index or label set $I$) is said to be a partition of $\Omega$ if
  (a) $\Omega = \bigcup_{\alpha \in I} A_\alpha$ and
  (b) for all $i \neq j$, $A_i \perp A_j$ (pairwise disjoint).
  In which case, the collection $(B \cap A_\alpha : \alpha \in I)$ is a partition of $B$. In other words, any set $B$ can be expressed as $B = \bigcup_{\alpha} (B \cap A_\alpha)$ where the union is a disjoint union.
• The cardinality (or size) of a collection or set $A$, denoted $|A|$, is the number of elements of the collection. This number may be finite or infinite.
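name | rule
Commutative laws | $A \cap B = B \cap A$, $A \cup B = B \cup A$
Associative laws | $A \cap (B \cap C) = (A \cap B) \cap C$, $A \cup (B \cup C) = (A \cup B) \cup C$
Distributive laws | $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$, $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
DeMorgan's laws | $\overline{A \cap B} = \overline{A} \cup \overline{B}$, $\overline{A \cup B} = \overline{A} \cap \overline{B}$
Complement laws | $A \cap \overline{A} = \emptyset$, $A \cup \overline{A} = U$
Double complement law | $\overline{\overline{A}} = A$
Idempotent laws | $A \cap A = A$, $A \cup A = A$
Absorption laws | $A \cap (A \cup B) = A$, $A \cup (A \cap B) = A$
Dominance laws | $A \cap \emptyset = \emptyset$, $A \cup U = U$
Identity laws | $A \cup \emptyset = A$, $A \cap U = A$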
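Figure 1: Set Identities [17]

  ◦ Inclusion-Exclusion Principle:
$$\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{\emptyset \neq I \subset \{1,\dots,n\}} (-1)^{|I|+1} \left|\bigcap_{i \in I} A_i\right|.$$
  ◦ $$\left|\bigcap_{i=1}^{n} A_i^c\right| = |\Omega| + \sum_{\emptyset \neq I \subset \{1,\dots,n\}} (-1)^{|I|} \left|\bigcap_{i \in I} A_i\right|.$$
  ◦ If $\forall i$, $A_i \subset B$ (or equivalently, $\bigcup_{i=1}^{n} A_i \subset B$), then
$$\left|\bigcap_{i=1}^{n} (B \setminus A_i)\right| = |B| + \sum_{\emptyset \neq I \subset \{1,\dots,n\}} (-1)^{|I|} \left|\bigcap_{i \in I} A_i\right|.$$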
• An infinite set $A$ is said to be countable if the elements of $A$ can be enumerated or listed in a sequence: $a_1, a_2, \dots$. The empty set and finite sets are also said to be countable. By a countably infinite set, we mean a countable set that is not finite.
• A singleton is a set with exactly one element.
• $\mathbb{N} = \{1, 2, 3, \dots\}$, $\mathbb{R} = (-\infty, \infty)$.
• For a set of sets, to avoid the repeated use of the word "set", we will call it a collection/class/family of sets.
Definition 1.3. Monotone sequence of sets
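• The sequence of events $(A_1, A_2, A_3, \dots)$ is a monotone-increasing sequence of events if and only if $A_1 \subset A_2 \subset A_3 \subset \dots$. In which case,
  ◦ $\bigcup_{i=1}^{n} A_i = A_n$
  ◦ $\lim_{n \to \infty} A_n = \bigcup_{i=1}^{\infty} A_i$.
  Put $A = \lim_{n \to \infty} A_n$. We then write $A_n \nearrow A$; that is, $A_n \nearrow A$ if and only if $\forall n$, $A_n \subset A_{n+1}$ and $\bigcup_{n=1}^{\infty} A_n = A$.
• The sequence of events $(B_1, B_2, B_3, \dots)$ is a monotone-decreasing sequence of events if and only if $B_1 \supset B_2 \supset B_3 \supset \dots$. In which case,
  ◦ $\bigcap_{i=1}^{n} B_i = B_n$
  ◦ $\lim_{n \to \infty} B_n = \bigcap_{i=1}^{\infty} B_i$.
  Put $B = \lim_{n \to \infty} B_n$. We then write $B_n \searrow B$; that is, $B_n \searrow B$ if and only if $\forall n$, $B_{n+1} \subset B_n$ and $\bigcap_{n=1}^{\infty} B_n = B$.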
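Note that $A_n \nearrow A \Leftrightarrow A_n^c \searrow A^c$.
1.4. An (event-)indicator function $I_A : \Omega \to \{0, 1\}$ is defined by
$$I_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A \\ 0, & \text{otherwise.} \end{cases}$$
• Alternative notation: $1_A$.
• $A = \{\omega : I_A(\omega) = 1\}$
• $A = B$ if and only if $I_A = I_B$
• $I_{A^c}(\omega) = 1 - I_A(\omega)$
• $A \subset B \Leftrightarrow \{\forall \omega,\ I_A(\omega) \leq I_B(\omega)\} \Leftrightarrow \{\forall \omega,\ I_A(\omega) = 1 \Rightarrow I_B(\omega) = 1\}$
• $I_{A \cap B}(\omega) = \min(I_A(\omega), I_B(\omega)) = I_A(\omega) \cdot I_B(\omega)$
• $I_{A \cup B}(\omega) = \max(I_A(\omega), I_B(\omega)) = I_A(\omega) + I_B(\omega) - I_A(\omega) \cdot I_B(\omega)$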
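The indicator identities in 1.4 can be spot-checked on a small finite sample space. The following is a minimal sketch; the sample space, the two events and the helper name `indicator` are illustrative assumptions, not part of the notes.

```python
def indicator(event):
    """Return the indicator function I_A of the set `event` as a Python function."""
    return lambda w: 1 if w in event else 0

omega = set(range(6))        # illustrative sample space (assumption)
A, B = {0, 1, 2}, {2, 3}     # illustrative events (assumption)

I_A, I_B = indicator(A), indicator(B)
I_and, I_or, I_Ac = indicator(A & B), indicator(A | B), indicator(omega - A)

def identities_hold():
    """Check the complement, intersection and union identities at every outcome."""
    return all(
        I_Ac(w) == 1 - I_A(w)
        and I_and(w) == min(I_A(w), I_B(w)) == I_A(w) * I_B(w)
        and I_or(w) == max(I_A(w), I_B(w)) == I_A(w) + I_B(w) - I_A(w) * I_B(w)
        for w in omega
    )
```

Calling `identities_hold()` evaluates every identity at every outcome of the chosen Ω, so it is a check on one example rather than a proof.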
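1.5. Suppose $\bigcup_{i=1}^{N} A_i = \bigcup_{i=1}^{N} B_i$ for all finite $N \in \mathbb{N}$. Then, $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} B_i$.
Proof. To show "$\subset$", suppose $x \in \bigcup_{i=1}^{\infty} A_i$. Then, $\exists N_0$ such that $x \in A_{N_0}$. Now, $A_{N_0} \subset \bigcup_{i=1}^{N_0} A_i = \bigcup_{i=1}^{N_0} B_i \subset \bigcup_{i=1}^{\infty} B_i$. Therefore, $x \in \bigcup_{i=1}^{\infty} B_i$. To show "$\supset$", use symmetry.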
1.6. Let $A_1, A_2, \dots$ be a sequence of disjoint sets. Define $B_n = \bigcup_{i > n} A_i = \bigcup_{i \geq n+1} A_i$. Then, (1) $B_{n+1} \subset B_n$ and (2) $\bigcap_{n=1}^{\infty} B_n = \emptyset$.
Proof. $B_n = \bigcup_{i \geq n+1} A_i = \left(\bigcup_{i \geq n+2} A_i\right) \cup A_{n+1} = B_{n+1} \cup A_{n+1}$. So, (1) is true. For (2), consider two cases. (2.1) For an element $x \notin \bigcup_i A_i$, we know that $x \notin B_n$ and hence $x \notin \bigcap_{n=1}^{\infty} B_n$. (2.2) For $x \in \bigcup_i A_i$, we know that $\exists i_0$ such that $x \in A_{i_0}$. Note that $x$ can't be in any other $A_i$ because the $A_i$'s are disjoint. So, $x \notin B_{i_0}$ and therefore $x \notin \bigcap_{n=1}^{\infty} B_n$.
1.7. Any countable union can be written as a union of pairwise disjoint sets: Given any sequence of sets $F_n$, define a new sequence by $A_1 = F_1$, and
$$A_n = F_n \cap F_{n-1}^c \cap \dots \cap F_1^c = F_n \cap \bigcap_{i \in [n-1]} F_i^c = F_n \setminus \left(\bigcup_{i \in [n-1]} F_i\right)$$
for $n \geq 2$. Then, $\bigcup_{i=1}^{\infty} F_i = \bigcup_{i=1}^{\infty} A_i$ where the union on the RHS is a disjoint union.
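The construction in 1.7 is easy to carry out for a finite list of sets. The sketch below is illustrative only: the function name `disjointify` and the example sets are assumptions made for this demonstration.

```python
def disjointify(sets):
    """A_1 = F_1 and A_n = F_n minus (F_1 ∪ ... ∪ F_{n-1}), following 1.7."""
    seen = set()      # running union F_1 ∪ ... ∪ F_{n-1}
    pieces = []
    for f in sets:
        pieces.append(set(f) - seen)
        seen |= set(f)
    return pieces

F = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 5}]   # illustrative overlapping sets (assumption)
A = disjointify(F)
```

Note that the last piece is empty here: every element of the fourth set already appears in an earlier one.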
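Proof. Note that the $A_n$ are pairwise disjoint. To see this, consider $A_{n_1}, A_{n_2}$ where $n_1 \neq n_2$. WLOG, assume $n_1 < n_2$. This implies that $F_{n_1}^c$ gets intersected in the definition of $A_{n_2}$. So,
$$A_{n_1} \cap A_{n_2} = \underbrace{F_{n_1} \cap F_{n_1}^c}_{\emptyset} \cap \left(\bigcap_{i=1}^{n_1 - 1} F_i^c \cap F_{n_2} \cap \bigcap_{\substack{i=1 \\ i \neq n_1}}^{n_2 - 1} F_i^c\right) = \emptyset.$$
Also, for finite $N \geq 1$, we have $\bigcup_{n \in [N]} F_n = \bigcup_{n \in [N]} A_n$ (*). To see this, note that (*) is true for $N = 1$. Suppose (*) is true for $N = m$ and let $B = \bigcup_{n \in [m]} A_n = \bigcup_{n \in [m]} F_n$. Now, for $N = m + 1$, by definition, $A_{m+1} = F_{m+1} \cap B^c$, so we have
$$\bigcup_{n \in [m+1]} A_n = \left(\bigcup_{n \in [m]} A_n\right) \cup A_{m+1} = B \cup (F_{m+1} \cap B^c) = (B \cup F_{m+1}) \cap \Omega = B \cup F_{m+1} = \bigcup_{i \in [m+1]} F_i.$$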
So, (*) is true for $N = m + 1$. By induction, (*) is true for all finite $N$. The extension to $\infty$ is done via (1.5).
For a finite union, we can modify the above statement by setting $F_n = \emptyset$ for $n > N$. Then, $A_n = \emptyset$ for $n > N$.
By construction, $\{A_n : n \in \mathbb{N}\} \subset \sigma(\{F_n : n \in \mathbb{N}\})$. However, in general, it is not true that $\{F_n : n \in \mathbb{N}\} \subset \sigma(\{A_n : n \in \mathbb{N}\})$. For example, for a finite union with $N = 2$, we can't get $F_2$ back from set operations on $A_1, A_2$ because we lost information about $F_1 \cap F_2$. To create a disjoint union which preserves the information about the overlapping parts of the $F_n$'s, we can define the $A$'s by $\bigcap_{n \in \mathbb{N}} B_n$ where $B_n$ is $F_n$ or $F_n^c$. This is done in (1.8). However, this leads to uncountably many $A_\alpha$, which is why we used the index $\alpha$ above instead of $n$. The uncountability problem does not occur if we start with a finite union. This is shown in the next result.
1.8. Decomposition:
• Fix sets $A_1, A_2, \dots, A_n$, not necessarily disjoint. Let $\Pi$ be the collection of all sets of the form $B = B_1 \cap B_2 \cap \dots \cap B_n$ where each $B_j$ is either $A_j$ or its complement. There are $2^n$ of these, say $B^{(1)}, B^{(2)}, \dots, B^{(2^n)}$. Then,
  (a) $\Pi$ is a partition of $\Omega$ and
  (b) $\Pi \setminus \left\{\bigcap_{j \in [n]} A_j^c\right\}$ is a partition of $\bigcup_{j \in [n]} A_j$.
  Moreover, any $A_j$ can be expressed as $A_j = \bigcup_{i \in S_j} B^{(i)}$ for some $S_j \subset [2^n]$. More specifically, $A_j$ is the union of all $B^{(i)}$ which are constructed with $B_j = A_j$.
• Fix sets $A_1, A_2, \dots$, not necessarily disjoint. Let $\Pi$ be the collection of all sets of the form $B = \bigcap_{n \in \mathbb{N}} B_n$ where each $B_n$ is either $A_n$ or its complement. There are uncountably many of these, hence we will index the $B$'s by $\alpha$; that is, we write $B^{(\alpha)}$. Let $I$ be the set of all $\alpha$ after we eliminate all repeated $B^{(\alpha)}$; note that $I$ can still be uncountable. Then,
  (a) $\Pi = \{B^{(\alpha)} : \alpha \in I\}$ is a partition of $\Omega$ and
  (b) $\Pi \setminus \left\{\bigcap_{n \in \mathbb{N}} A_n^c\right\}$ is a partition of $\bigcup_{n \in \mathbb{N}} A_n$.
  Moreover, any $A_j$ can be expressed as $A_j = \bigcup_{\alpha \in S_j} B^{(\alpha)}$ for some $S_j \subset I$. Because $I$ is uncountable, in general, $S_j$ can be uncountable. More specifically, $A_j$ is the (possibly uncountable) union of all $B^{(\alpha)}$ which are constructed with $B_j = A_j$. The uncountability of $S_j$ can be problematic because it implies that we need an uncountable union to get $A_j$ back.
1.9. Let $\{A_\alpha : \alpha \in I\}$ be a collection of disjoint sets where $I$ is a nonempty index set. For any set $S \subset I$, define a mapping $g$ on $2^I$ by $g(S) = \bigcup_{\alpha \in S} A_\alpha$. Then, $g$ is a 1:1 function if and only if none of the $A_\alpha$'s is empty.
1.10. Let
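$$A = \left\{(x, y) \in \mathbb{R}^2 : (x + a_1, x + b_1) \cap (y + a_2, y + b_2) \neq \emptyset\right\},$$
where $a_i < b_i$. Then,
$$A = \left\{(x, y) \in \mathbb{R}^2 : x + (a_1 - b_2) < y < x + (b_1 - a_2)\right\} = \left\{(x, y) \in \mathbb{R}^2 : a_1 - b_2 < y - x < b_1 - a_2\right\}.$$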
Figure 2: The region $A = \left\{(x, y) \in \mathbb{R}^2 : (x + a_1, x + b_1) \cap (y + a_2, y + b_2) \neq \emptyset\right\}$, the strip between the lines $y = x + (a_1 - b_2)$ and $y = x + (b_1 - a_2)$ around the line $y = x$.
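The equivalence in 1.10 can be checked on a grid of rational points. This is a minimal sketch under assumed values of $a_1, b_1, a_2, b_2$; the function names are hypothetical, and exact rational arithmetic is used so that boundary points are classified without rounding error.

```python
from fractions import Fraction
from itertools import product

def intervals_overlap(x, y, a1, b1, a2, b2):
    """True when the open intervals (x+a1, x+b1) and (y+a2, y+b2) intersect."""
    return max(x + a1, y + a2) < min(x + b1, y + b2)

def in_strip(x, y, a1, b1, a2, b2):
    """True when a1 - b2 < y - x < b1 - a2."""
    return a1 - b2 < y - x < b1 - a2

# Illustrative parameters with a_i < b_i (assumption).
a1, b1, a2, b2 = Fraction(-1), Fraction(2), Fraction(0), Fraction(3, 2)
grid = [Fraction(k, 4) for k in range(-20, 21)]
mismatches = [
    (x, y) for x, y in product(grid, grid)
    if intervals_overlap(x, y, a1, b1, a2, b2) != in_strip(x, y, a1, b1, a2, b2)
]
```

An empty `mismatches` list means the two descriptions of the region agree at every grid point tried.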
1.2 Enumeration / Combinatorics / Counting
1.11. The four kinds of counting problems are:
(a) ordered sampling of $r$ out of $n$ items with replacement: $n^r$;
(b) ordered sampling of $r \leq n$ out of $n$ items without replacement: $(n)_r$;
(c) unordered sampling of $r \leq n$ out of $n$ items without replacement: $\binom{n}{r}$;
(d) unordered sampling of $r$ out of $n$ items with replacement: $\binom{n+r-1}{r}$.
1.12. Given a set of $n$ distinct items, select a distinct ordered sequence (word) of length $r$ drawn from this set.
• Sampling with replacement: $\mu_{n,r} = n^r$
  ◦ Ordered sampling of $r$ out of $n$ items with replacement.
  ◦ $\mu_{n,1} = n$
  ◦ $\mu_{1,r} = 1$
  ◦ $\mu_{n,r} = n\,\mu_{n,r-1}$ for $r > 1$
  ◦ Examples:
    ∗ Suppose $A$ is a finite set, then the cardinality of its power set is $\left|2^A\right| = 2^{|A|}$.
    ∗ There are $2^r$ binary strings/sequences of length $r$.
• Sampling without replacement:
$$(n)_r = \prod_{i=0}^{r-1} (n - i) = \frac{n!}{(n-r)!} = \underbrace{n \cdot (n-1) \cdots (n - (r-1))}_{r \text{ terms}}; \quad r \leq n$$
  ◦ Ordered sampling of $r \leq n$ out of $n$ items without replacement.
  ◦ For integers $r, n$ such that $r > n$, we have $(n)_r = 0$.
  ◦ The definition in product form
$$(n)_r = \prod_{i=0}^{r-1} (n - i) = \underbrace{n \cdot (n-1) \cdots (n - (r-1))}_{r \text{ terms}}$$
    can be extended to any real number $n$ and a non-negative integer $r$. We define $(n)_0 = 1$. (This makes sense because we usually take the empty product to be 1.)
  ◦ $(n)_1 = n$
  ◦ $(n)_r = (n - (r-1))(n)_{r-1}$. For example, $(7)_5 = (7 - 4)(7)_4$.
  ◦ $(1)_r = \begin{cases} 1, & \text{if } r = 1 \\ 0, & \text{if } r > 1 \end{cases}$
• Ratio:
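$$\frac{(n)_r}{n^r} = \frac{\prod_{i=0}^{r-1} (n - i)}{\prod_{i=0}^{r-1} n} = \prod_{i=0}^{r-1} \left(1 - \frac{i}{n}\right) \approx \prod_{i=0}^{r-1} e^{-\frac{i}{n}} = e^{-\frac{1}{n} \sum_{i=0}^{r-1} i} = e^{-\frac{r(r-1)}{2n}} \approx e^{-\frac{r^2}{2n}}$$
1.13. Factorial and Permutation: The number of arrangements (permutations) of $n \geq 0$ distinct items is $(n)_n = n!$.
• $0! = 1! = 1$
• $n! = n(n-1)!$
• $n! = \int_0^{\infty} e^{-t} t^n \, dt$
• Stirling's Formula:
$$n! \approx \sqrt{2\pi n}\, n^n e^{-n} = \sqrt{2\pi e}\; e^{\left(n + \frac{1}{2}\right) \ln\left(\frac{n}{e}\right)}.$$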
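The falling factorial, the ratio approximation of 1.12 and Stirling's formula can be compared numerically. The sketch below uses only the Python standard library; the function names are illustrative assumptions.

```python
import math

def falling_factorial(n, r):
    """(n)_r = n (n-1) ... (n-(r-1)); the empty product (r = 0) is 1."""
    out = 1
    for i in range(r):
        out *= n - i
    return out

def ratio_exact(n, r):
    """(n)_r / n^r, the probability of no repetition in r ordered draws."""
    return falling_factorial(n, r) / n**r

def ratio_approx(n, r):
    """The approximation exp(-r(r-1)/(2n))."""
    return math.exp(-r * (r - 1) / (2 * n))

def stirling(n):
    """Stirling's approximation sqrt(2 pi n) n^n e^{-n}."""
    return math.sqrt(2 * math.pi * n) * n**n * math.exp(-n)
```

For example, `ratio_exact(365, 23)` and `ratio_approx(365, 23)` differ by less than 0.01, and `stirling(10)` is within one percent of `math.factorial(10)`.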
1.14. Binomial coefficient:
$$\binom{n}{r} = \frac{(n)_r}{r!} = \frac{n!}{(n-r)!\,r!}$$
This gives the number of unordered sets of size $r$ drawn from an alphabet of size $n$ without replacement; this is unordered sampling of $r \leq n$ out of $n$ items without replacement. It is also the number of subsets of size $r$ that can be formed from a set of $n$ elements. Some properties are listed below:
(a) Use nchoosek(n,r) in MATLAB.
(b) Use combin(n,r) in Mathcad. However, to do symbolic manipulation, use the factorial definition directly.
(c) Reflection property: $\binom{n}{r} = \binom{n}{n-r}$.
(d) $\binom{n}{n} = \binom{n}{0} = 1$.
(e) $\binom{n}{1} = \binom{n}{n-1} = n$.
(f) $\binom{n}{r} = 0$ if $n < r$ or $r$ is a negative integer.
(g) $\max_r \binom{n}{r} = \binom{n}{\lfloor \frac{n+1}{2} \rfloor}$.
(h) Pascal's "triangle" rule: $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$. This property divides the process of choosing $k$ items into two steps where the first step is to decide whether to choose the first item or not.

Figure 3: Pascal Triangle.

(i) $\sum_{\substack{0 \leq k \leq n \\ k \text{ even}}} \binom{n}{k} = \sum_{\substack{0 \leq k \leq n \\ k \text{ odd}}} \binom{n}{k} = 2^{n-1}$.
There are many ways to show this identity.
  (i) Consider the number of subsets of $S = \{a_1, a_2, \dots, a_n\}$ of $n$ distinct elements. First choose a subset $A$ of the first $n - 1$ elements of $S$. There are $2^{n-1}$ distinct $A$. Then, for each $A$, to get a set with an even number of elements, add element $a_n$ to $A$ if and only if $|A|$ is odd.
  (ii) Look at the binomial expansion of $(x + y)^n$ with $x = 1$ and $y = -1$.
  (iii) For odd $n$, use the fact that $\binom{n}{r} = \binom{n}{n-r}$.
(j) $\sum_{k=0}^{\min(n_1, n_2)} \binom{n_1}{k} \binom{n_2}{k} = \binom{n_1 + n_2}{n_1} = \binom{n_1 + n_2}{n_2}$.
  • This property divides the process of choosing items from $n_1 + n_2$ items into two steps where the first step is to choose $k$ items from the first $n_1$ items.
  • Can replace the $\min(n_1, n_2)$ in the first sum by $n_1$ or $n_2$ if we define $\binom{n}{k} = 0$ for $k > n$.
  • $\sum_{r=0}^{n} \binom{n}{r}^2 = \binom{2n}{n}$.
(k) Parallel summation:
$$\sum_{m=k}^{n} \binom{m}{k} = \binom{k}{k} + \binom{k+1}{k} + \dots + \binom{n}{k} = \binom{n+1}{k+1}.$$
To see this, suppose we try to choose $k + 1$ items from $n + 1$ items $a_1, a_2, \dots, a_{n+1}$. First, we choose whether to choose $a_1$. If so, then we need to choose the remaining $k$ items from the $n$ items $a_2, \dots, a_{n+1}$. Hence, we have the $\binom{n}{k}$ term. Now, suppose we didn't choose $a_1$. Then, we still need to choose $k + 1$ items from the $n$ items $a_2, \dots, a_{n+1}$. We then repeat the same argument on $a_2$ instead of $a_1$. Equivalently,
$$\sum_{k=0}^{r} \binom{n+k}{k} = \sum_{k=0}^{r} \binom{n+k}{n} = \binom{n+r+1}{r} = \binom{n+r+1}{n+1}. \quad (1)$$
To prove the middle equality in (1), use induction on $r$.
1.15. Binomial theorem:
$$(x + y)^n = \sum_{r=0}^{n} \binom{n}{r} x^r y^{n-r}$$
• Let $x = y = 1$, then $\sum_{r=0}^{n} \binom{n}{r} = 2^n$.
• Entropy function: $H(p) = -p \log_b(p) - (1 - p) \log_b(1 - p)$
  ◦ Binary: $b = 2 \Rightarrow H_2(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)$. In which case,
$$\frac{1}{n+1} 2^{n H_2\left(\frac{r}{n}\right)} \leq \binom{n}{r} \leq 2^{n H_2\left(\frac{r}{n}\right)}.$$
    Hence, $\binom{n}{r} \approx 2^{n H_2\left(\frac{r}{n}\right)}$.
• By repeatedly differentiating with respect to $x$ followed by multiplication by $x$, we have
  ◦ $\sum_{r=0}^{n} r \binom{n}{r} x^r y^{n-r} = nx(x + y)^{n-1}$ and
  ◦ $\sum_{r=0}^{n} r^2 \binom{n}{r} x^r y^{n-r} = nx\left(x(n-1)(x + y)^{n-2} + (x + y)^{n-1}\right)$.
  For $x + y = 1$, we have
  ◦ $\sum_{r=0}^{n} r \binom{n}{r} x^r (1 - x)^{n-r} = nx$ and
  ◦ $\sum_{r=0}^{n} r^2 \binom{n}{r} x^r (1 - x)^{n-r} = nx(nx + 1 - x)$.
All identities above can be verified easily via Mathcad.
1.16. Multinomial Counting: The multinomial coefficient $\binom{n}{n_1\ n_2\ \cdots\ n_r}$ is defined as
$$\prod_{i=1}^{r} \binom{n - \sum_{k=0}^{i-1} n_k}{n_i} = \binom{n}{n_1} \cdot \binom{n - n_1}{n_2} \cdot \binom{n - n_1 - n_2}{n_3} \cdots \binom{n_r}{n_r} = \frac{n!}{\prod_{i=1}^{r} n_i!},$$
with the convention $n_0 = 0$. It is the number of ways that we can arrange $n = \sum_{i=1}^{r} n_i$ tokens when having $r$ types of symbols and $n_i$ indistinguishable copies/tokens of a type $i$ symbol.
1.17. Multinomial Theorem
$$(x_1 + \dots + x_r)^n = \sum_{i_1=0}^{n} \sum_{i_2=0}^{n-i_1} \cdots \sum_{i_{r-1}=0}^{n - \sum_{j<r-1} i_j} \frac{n!}{\left(n - \sum_{k<r} i_k\right)! \prod_{k<r} i_k!}\; x_r^{\,n - \sum_{j<r} i_j} \prod_{k=1}^{r-1} x_k^{i_k}$$
• r-ary entropy function: Consider any vector $p = (p_1, p_2, \dots, p_r)$ such that $p_i \geq 0$ and $\sum_{i=1}^{r} p_i = 1$. We define
$$H(p) = -\sum_{i=1}^{r} p_i \log_b p_i.$$
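name | rule
Factorial expansion | $\binom{n}{k} = \frac{n!}{k!(n-k)!}$, $k = 0, 1, 2, \dots, n$
Symmetry | $\binom{n}{k} = \binom{n}{n-k}$, $k = 0, 1, 2, \dots, n$
Monotonicity | $\binom{n}{0} < \binom{n}{1} < \dots < \binom{n}{\lfloor n/2 \rfloor}$, $n \geq 0$
Pascal's identity | $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$, $k = 0, 1, 2, \dots, n$
Binomial theorem | $(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$, $n \geq 0$
Counting all subsets | $\sum_{k=0}^{n} \binom{n}{k} = 2^n$, $n \geq 0$
Even and odd subsets | $\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0$, $n > 0$
Sum of squares | $\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}$, $n \geq 0$
Square of row sums | $\left(\sum_{k=0}^{n} \binom{n}{k}\right)^2 = \sum_{k=0}^{2n} \binom{2n}{k}$, $n \geq 0$
Absorption/extraction | $\binom{n}{k} = \frac{n}{k} \binom{n-1}{k-1}$, $k \neq 0$
Trinomial revision | $\binom{n}{m} \binom{m}{k} = \binom{n}{k} \binom{n-k}{m-k}$, $0 \leq k \leq m \leq n$
Parallel summation | $\sum_{k=0}^{m} \binom{n+k}{k} = \binom{n+m+1}{m}$, $m, n \geq 0$
Diagonal summation | $\sum_{k=0}^{n-m} \binom{m+k}{m} = \binom{n+1}{m+1}$, $n \geq m \geq 0$
Vandermonde convolution | $\sum_{k=0}^{r} \binom{m}{k} \binom{n}{r-k} = \binom{m+n}{r}$, $m, n, r \geq 0$
Diagonal sums in Pascal's triangle (§2.3.2) | $\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n-k}{k} = F_{n+1}$ (Fibonacci numbers), $n \geq 0$
Other Common Identities | $\sum_{k=0}^{n} k \binom{n}{k} = n 2^{n-1}$, $n \geq 0$
 | $\sum_{k=0}^{n} k^2 \binom{n}{k} = n(n+1) 2^{n-2}$, $n \geq 0$
 | $\sum_{k=0}^{n} (-1)^k k \binom{n}{k} = 0$, $n \geq 2$
 | $\sum_{k=0}^{n} \frac{\binom{n}{k}}{k+1} = \frac{2^{n+1} - 1}{n+1}$, $n \geq 0$
 | $\sum_{k=0}^{n} (-1)^k \frac{\binom{n}{k}}{k+1} = \frac{1}{n+1}$, $n \geq 0$
 | $\sum_{k=1}^{n} (-1)^{k-1} \frac{\binom{n}{k}}{k} = 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n}$, $n > 0$
 | $\sum_{k=0}^{n-1} \binom{n}{k} \binom{n}{k+1} = \binom{2n}{n-1}$, $n > 0$
 | $\sum_{k=0}^{m} \binom{m}{k} \binom{n}{p+k} = \binom{m+n}{m+p}$, $m, n, p \geq 0$, $n \geq p + m$

Figure 4: Binomial coefficient identities [17]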
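Several of the identities tabulated in Figure 4 can be spot-checked by direct computation for small $n$. The sketch below is one way to do so with `math.comb` (Python 3.8+) and exact fractions; the function names and the range of $n$ tested are assumptions of this example, and passing the check is evidence, not a proof.

```python
from fractions import Fraction
from math import comb

def fib(n):
    """Fibonacci numbers with F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def failed_identities(n_max=8):
    """Return (name, n) for every checked identity that fails for some 1 <= n <= n_max."""
    failures = []
    for n in range(1, n_max + 1):
        row = [comb(n, k) for k in range(n + 1)]
        checks = {
            "counting all subsets": sum(row) == 2**n,
            "even and odd subsets": sum((-1)**k * c for k, c in enumerate(row)) == 0,
            "sum of squares": sum(c * c for c in row) == comb(2 * n, n),
            "sum of k C(n,k)": 2 * sum(k * c for k, c in enumerate(row)) == n * 2**n,
            "sum of k^2 C(n,k)":
                4 * sum(k * k * c for k, c in enumerate(row)) == n * (n + 1) * 2**n,
            "sum of C(n,k)/(k+1)":
                sum(Fraction(c, k + 1) for k, c in enumerate(row))
                == Fraction(2**(n + 1) - 1, n + 1),
            "harmonic sum":
                sum(Fraction((-1)**(k - 1) * row[k], k) for k in range(1, n + 1))
                == sum(Fraction(1, k) for k in range(1, n + 1)),
            "parallel summation": all(
                sum(comb(n + k, k) for k in range(m + 1)) == comb(n + m + 1, m)
                for m in range(n_max + 1)),
            "Vandermonde convolution": all(
                sum(comb(m, k) * comb(n, r - k) for k in range(r + 1)) == comb(m + n, r)
                for m in range(n_max + 1) for r in range(n_max + 1)),
            "diagonal sums": sum(comb(n - k, k) for k in range(n // 2 + 1)) == fib(n + 1),
        }
        failures += [(name, n) for name, ok in checks.items() if not ok]
    return failures
```

The integer identities are rearranged to avoid fractional powers of 2 (for instance, $2 \sum k \binom{n}{k} = n 2^n$), so all comparisons are exact.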
As a special case, let $p_i = \frac{n_i}{n}$, then
$$\binom{n}{n_1\ n_2\ \cdots\ n_r} = \frac{n!}{\prod_{i=1}^{r} n_i!} \approx 2^{n H_2(p)}.$$
1.18. The number of solutions to $x_1 + x_2 + \dots + x_n = k$ when the $x_i$'s are nonnegative integers is $\binom{k+n-1}{k} = \binom{k+n-1}{n-1}$.
(a) Suppose we further require that the $x_i$ are strictly positive ($x_i \geq 1$), then there are $\binom{k-1}{n-1}$ solutions.
(b) Extra Lower-bound Requirement: Suppose we further require that $x_i \geq a_i$ where the $a_i$ are some given nonnegative integers, then the number of solutions is
$$\binom{k - (a_1 + a_2 + \dots + a_n) + n - 1}{n-1}.$$
Note that here we work with the equivalent problem: $y_1 + y_2 + \dots + y_n = k - \sum_{i=1}^{n} a_i$ where $y_i \geq 0$.
(c) Extra Upper-bound Requirement: Suppose we further require that $0 \leq x_i < b_i$. Let $A_i$ be the set of solutions such that $x_i \geq b_i$ and $x_j \geq 0$ for $j \neq i$. The number of solutions is $\binom{k+n-1}{n-1} - \left|\bigcup_{i=1}^{n} A_i\right|$ where the second term can be found via the inclusion/exclusion principle
$$\left|\bigcup_{i \in [n]} A_i\right| = \sum_{\substack{I \subset [n] \\ I \neq \emptyset}} (-1)^{|I|+1} \left|\bigcap_{i \in I} A_i\right|$$
and the fact that for any index set $I \subset [n]$, we have
$$\left|\bigcap_{i \in I} A_i\right| = \binom{k - \left(\sum_{i \in I} b_i\right) + n - 1}{n-1}.$$
(d) Extra Range Requirement: Suppose we further require that $a_i \leq x_i < b_i$ where $0 \leq a_i < b_i$, then we work instead with $y_i = x_i - a_i$. The number of solutions is
$$\binom{k - \sum_{i=1}^{n} a_i + n - 1}{n-1} + \sum_{\substack{I \subset [n] \\ I \neq \emptyset}} (-1)^{|I|} \binom{k - \sum_{i=1}^{n} a_i - \sum_{i \in I} (b_i - a_i) + n - 1}{n-1}.$$
1.19. The bars and stars argument:
• Consider the distribution of $r = 10$ indistinguishable balls into $n = 5$ distinguishable cells. Then, we are only concerned with the number of balls in each cell. Using $n - 1 = 4$ bars, we can divide $r = 10$ stars into $n = 5$ groups. For example, ****|***||**|* would mean (4,3,0,2,1). In general, there are $\binom{n+r-1}{r}$ ways of arranging the bars and stars.
• There are $\binom{n+r-1}{r}$ distinct vectors $x = x_1^n$ of nonnegative integers such that $x_1 + x_2 + \dots + x_n = r$. We use $n - 1$ bars to separate $r$ 1's.
• Suppose $r$ letters are drawn with replacement from a set $\{a_1, a_2, \dots, a_n\}$. Given a drawn sequence, let $x_i$ be the number of $a_i$ in the drawn sequence. Then, there are $\binom{n+r-1}{r}$ possible $x = x_1^n$.
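The counts in 1.18 and 1.19 can be compared against brute-force enumeration. The sketch below implements the unbounded count and the upper-bound case (c) by inclusion/exclusion; the helper names are hypothetical, and the binomial coefficient is taken to be 0 when its top entry is negative or smaller than the bottom entry, which is an assumption consistent with property (f) of 1.14.

```python
from itertools import combinations, product
from math import comb

def binom(top, bottom):
    """Binomial coefficient, taken to be 0 when top < bottom or bottom < 0."""
    return comb(top, bottom) if top >= bottom >= 0 else 0

def count_unbounded(n, k):
    """Nonnegative integer solutions of x_1 + ... + x_n = k (stars and bars)."""
    return binom(k + n - 1, n - 1)

def count_with_upper_bounds(k, b):
    """Solutions of x_1 + ... + x_n = k with 0 <= x_i < b_i, via inclusion/exclusion."""
    n = len(b)
    total = 0
    for size in range(n + 1):                     # size 0 is the unrestricted count
        for I in combinations(range(n), size):
            total += (-1)**size * binom(k - sum(b[i] for i in I) + n - 1, n - 1)
    return total

def brute_force(k, b):
    """Enumerate every vector with 0 <= x_i < b_i and count those summing to k."""
    return sum(1 for x in product(*(range(bi) for bi in b)) if sum(x) == k)
```

With $n = 5$ cells and $r = 10$ balls, as in the stars-and-bars example, `count_unbounded(5, 10)` returns 1001.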
objects | number of objects | reference
Arranging objects in a row: | |
n distinct objects | $n! = P(n, n) = n(n-1) \cdots 2 \cdot 1$ | §2.3.1
k out of n distinct objects | $n^{\underline{k}} = P(n, k) = n(n-1) \cdots (n-k+1)$ | §2.3.1
some of the n objects are identical: $k_1$ of a first kind, $k_2$ of a second kind, ..., $k_j$ of a jth kind, and where $k_1 + k_2 + \dots + k_j = n$ | $\binom{n}{k_1\ k_2\ \dots\ k_j} = \frac{n!}{k_1!\, k_2! \cdots k_j!}$ | §2.3.2
none of the n objects remains in its original place (derangements) | $D_n = n!\left(1 - \frac{1}{1!} + \dots + (-1)^n \frac{1}{n!}\right)$ | §2.4.2
Arranging objects in a circle (where rotations, but not reflections, are equivalent): | |
n distinct objects | $(n-1)!$ | §2.2.1
k out of n distinct objects | $\frac{P(n,k)}{k}$ | §2.2.1
Choosing k objects from n distinct objects: | |
order matters, no repetitions | $P(n, k) = \frac{n!}{(n-k)!} = n^{\underline{k}}$ | §2.3.1
order matters, repetitions allowed | $P^R(n, k) = n^k$ | §2.3.3
order does not matter, no repetitions | $C(n, k) = \binom{n}{k} = \frac{n!}{k!(n-k)!}$ | §2.3.2
order does not matter, repetitions allowed | $C^R(n, k) = \binom{k+n-1}{k}$ | §2.3.3

Figure 5: Counting problems and corresponding sections in [17].
objects | number of objects | reference
Subsets: | |
of size k from a set of size n | $\binom{n}{k}$ | §2.3.2
of all sizes from a set of size n | $2^n$ | §2.3.4
of {1, ..., n}, without consecutive elements | $F_{n+2}$ | §3.1.2
Placing n objects into k cells: | |
distinct objects into distinct cells | $k^n$ | §2.2.1
distinct objects into distinct cells, no cell empty | $\left\{ {n \atop k} \right\} k!$ | §2.5.2
distinct objects into identical cells | $\left\{ {n \atop 1} \right\} + \left\{ {n \atop 2} \right\} + \dots + \left\{ {n \atop k} \right\} = B_n$ | §2.5.2
distinct objects into identical cells, no cell empty | $\left\{ {n \atop k} \right\}$ | §2.5.2
distinct objects into distinct cells, with $k_i$ in cell i (i = 1, ..., n), and where $k_1 + k_2 + \dots + k_j = n$ | $\binom{n}{k_1\ k_2\ \dots\ k_j}$ | §2.3.2
identical objects into distinct cells | $\binom{n+k-1}{n}$ | §2.3.3
identical objects into distinct cells, no cell empty | $\binom{n-1}{k-1}$ | §2.3.3
identical objects into identical cells | $p_k(n)$ | §2.5.1
identical objects into identical cells, no cell empty | $p_k(n) - p_{k-1}(n)$ | §2.5.1
Placing n distinct objects into k nonempty cycles | $\left[ {n \atop k} \right]$ | §2.5.2
Solutions to $x_1 + \dots + x_n = k$: | |
nonnegative integers | $\binom{k+n-1}{k} = \binom{k+n-1}{n-1}$ | §2.3.3
positive integers | $\binom{k-1}{n-1}$ | §2.3.3
integers where $0 \leq a_i \leq x_i$ for all i | $\binom{k - (a_1 + \dots + a_n) + n - 1}{n-1}$ | §2.3.3
integers where $0 \leq x_i \leq a_i$ for one or more i | inclusion/exclusion principle | §2.4.2

Figure 6: Counting problems (con't) and corresponding sections in [17].
objects | number of objects | reference
Functions from a k-element set to an n-element set: | |
all functions | $n^k$ | §2.2.1
one-to-one functions ($n \geq k$) | $n^{\underline{k}} = \frac{n!}{(n-k)!} = P(n, k)$ | §2.2.1
onto functions ($n \leq k$) | inclusion/exclusion | §2.4.2
partial functions | $\binom{k}{0} + \binom{k}{1} n + \binom{k}{2} n^2 + \dots + \binom{k}{k} n^k = (n+1)^k$ | §2.3.2
Bit strings of length n: | |
all strings | $2^n$ | §2.2.1
with given entries in k positions | $2^{n-k}$ | §2.2.1
with exactly k 0s | $\binom{n}{k}$ | §2.3.2
with at least k 0s | $\binom{n}{k} + \binom{n}{k+1} + \dots + \binom{n}{n}$ | §2.3.2
with equal numbers of 0s and 1s | $\binom{n}{n/2}$ | §2.3.2
palindromes | $2^{\lceil n/2 \rceil}$ | §2.2.1
with an even number of 0s | $2^{n-1}$ | §2.3.4
without consecutive 0s | $F_{n+2}$ | §3.1.2

Figure 7: Counting problems (con't) and corresponding sections in [17].
$B(n)$ or $B_n$: Bell number
$b(n, k)$: associated Stirling number of the second kind
$C_n = \frac{1}{n+1} \binom{2n}{n}$: Catalan number
$C(n, k) = \binom{n}{k} = \frac{n!}{k!(n-k)!}$: binomial coefficient
$d(n, k)$: associated Stirling number of the first kind
$E_n$: Euler number
$\varphi$: Euler phi-function
$E(n, k)$: Eulerian number
$F_n$: Fibonacci number
$n^{\underline{k}} = n(n-1) \cdots (n-k+1) = P(n, k)$: falling power
$P(n, k) = \frac{n!}{(n-k)!}$: k-permutation
$p(n)$: number of partitions of n
$p_k(n)$: number of partitions of n into at most k summands
$p_k^*(n)$: number of partitions of n into exactly k summands
$\left[ {n \atop k} \right]$: Stirling cycle number
$\left\{ {n \atop k} \right\}$: Stirling subset number
$T_n$: tangent number

Figure 8: Notation from [17]
1.3 Dirac Delta Function
The (Dirac) delta function or (unit) impulse function is denoted by $\delta(t)$. It is usually depicted as a vertical arrow at the origin. Note that $\delta(t)$ is not a true function; it is undefined at $t = 0$. We define $\delta(t)$ as a generalized function which satisfies the sampling property (or sifting property)
$$\int \phi(t) \delta(t)\, dt = \phi(0)$$
for any function $\phi(t)$ which is continuous at $t = 0$. From this definition, it follows that
$$(\delta * \phi)(t) = (\phi * \delta)(t) = \int \phi(\tau) \delta(t - \tau)\, d\tau = \phi(t)$$
where we assume that $\phi$ is continuous at $t$. Intuitively, we may visualize $\delta(t)$ as an infinitely tall, infinitely narrow rectangular pulse of unit area: $\lim_{\varepsilon \to 0} \frac{1}{\varepsilon} 1\left[|t| \leq \frac{\varepsilon}{2}\right]$.
We list some interesting properties of $\delta(t)$ here.
• $\delta(t) = 0$ when $t \neq 0$. $\delta(t - T) = 0$ for $t \neq T$.
• $\int_A \delta(t)\, dt = 1_A(0)$.
  (a) $\int \delta(t)\, dt = 1$.
  (b) $\int_{\{0\}} \delta(t)\, dt = 1$.
  (c) $\int_{-\infty}^{x} \delta(t)\, dt = 1_{[0,\infty)}(x)$. Hence, we may think of $\delta(t)$ as the "derivative" of the unit step function $U(t) = 1_{[0,\infty)}(t)$.
• $\int \phi(t) \delta(t)\, dt = \phi(0)$ for $\phi$ continuous at 0.
• $\int \phi(t) \delta(t - T)\, dt = \phi(T)$ for $\phi$ continuous at $T$. In fact, for any $\varepsilon > 0$,
$$\int_{T - \varepsilon}^{T + \varepsilon} \phi(t) \delta(t - T)\, dt = \phi(T).$$
• $\delta(at) = \frac{1}{|a|} \delta(t)$ for $a \neq 0$.
• $\delta(t - t_1) * \delta(t - t_2) = \delta(t - (t_1 + t_2))$.
• Fourier properties:
  ◦ Fourier series: $\delta(x - a) = \frac{1}{2\pi} + \frac{1}{\pi} \sum_{k=1}^{\infty} \cos(k(x - a))$ on $[-\pi, \pi]$.
  ◦ Fourier transform: $\delta(t) = \int e^{j 2\pi f t}\, df$
• For a function $g$ whose real-valued roots are $t_i$,
$$\delta(g(t)) = \sum_{i=1}^{n} \frac{\delta(t - t_i)}{|g'(t_i)|} \quad (2)$$
[3, p 387]. Hence,
$$\int f(t) \delta(g(t))\, dt = \sum_{x : g(x) = 0} \frac{f(x)}{|g'(x)|}. \quad (3)$$
Note that the (Dirac) delta function is to be distinguished from the discrete-time Kronecker delta function.
As a finite measure, $\delta$ is a unit mass at 0; that is, for any set $A$, we have $\delta(A) = 1[0 \in A]$. In which case, we have again $\int g\, d\delta = \int g(x) \delta(dx) = g(0)$ for any measurable $g$.
For a function $g : D \to \mathbb{R}^n$ where $D \subset \mathbb{R}^n$,
$$\delta(g(x)) = \sum_{z : g(z) = 0} \frac{\delta(x - z)}{|\det dg(z)|} \quad (4)$$
[3, p 387].
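The sifting property and formula (3) can be illustrated numerically by replacing $\delta$ with the narrow rectangular pulse described above. This is only a rough sketch: the pulse width, the test functions $\cos t$ and $e^t$, the choice $g(t) = t^2 - 1$, and the midpoint Riemann sum are all assumptions made for illustration, and the results are approximate.

```python
import math

def delta_eps(t, eps):
    """Rectangular pulse of width eps and unit area, standing in for delta(t)."""
    return 1.0 / eps if abs(t) <= eps / 2 else 0.0

def integrate(f, lo, hi, steps=200_000):
    """Midpoint Riemann sum of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

eps = 1e-2

# Sifting: the integral of cos(t) * delta(t - 1) should be close to cos(1).
sift = integrate(lambda t: math.cos(t) * delta_eps(t - 1.0, eps), 0.0, 2.0)

# Composition: g(t) = t^2 - 1 has roots +1 and -1 with |g'(+-1)| = 2,
# so formula (3) predicts (f(1) + f(-1)) / 2 for f(t) = exp(t).
composed = integrate(lambda t: math.exp(t) * delta_eps(t * t - 1.0, eps), -2.0, 2.0)
predicted = (math.exp(1.0) + math.exp(-1.0)) / 2
```

Shrinking `eps` (with a correspondingly finer integration grid) should bring `sift` closer to `math.cos(1.0)` and `composed` closer to `predicted`.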
2 Classical Probability
Classical probability, which is based upon the ratio of the number of outcomes favorable to the occurrence of the event of interest to the total number of possible outcomes, provided most of the probability models used prior to the 20th century. Classical probability remains of importance today and provides the most accessible introduction to the more general theory of probability.
Given a finite sample space $\Omega$, the classical probability of an event $A$ is
$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{the number of cases favorable to the outcome of the event } A}{\text{the total number of possible cases}}.$$
• In this section, we are more apt to refer to equipossible cases as ones selected at random. Probabilities can be evaluated for events whose elements are chosen at random by enumerating the number of elements in the event.
• The bases for identifying equipossibility were often
  ◦ physical symmetry (e.g. a well-balanced die, made of homogeneous material in a cubical shape) or
  ◦ a balance of information or knowledge concerning the various possible outcomes.
• Equipossibility is meaningful only for a finite sample space, and, in this case, the evaluation of probability is accomplished through the definition of classical probability.
2.1. Basic properties of classical probability:
• $P(A) \geq 0$
• $P(\Omega) = 1$
• $P(\emptyset) = 0$
• $P(A^c) = 1 - P(A)$
• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, which comes directly from $|A \cup B| = |A| + |B| - |A \cap B|$.
• $A \perp B$ is equivalent to $P(A \cap B) = 0$.
• $A \perp B \Rightarrow P(A \cup B) = P(A) + P(B)$
• Suppose $\Omega = \{\omega_1, \dots, \omega_n\}$ and $P(\omega_i) = \frac{1}{n}$. Then $P(A) = \sum_{\omega \in A} P(\omega)$.
  ◦ The probability of an event is equal to the sum of the probabilities of its component outcomes because outcomes are mutually exclusive.
2.2. Classical Conditional Probability: The conditional classical probability $P(A|B)$ of event $A$, given that event $B \neq \emptyset$ occurred, is given by
$$P(A|B) = \frac{|A \cap B|}{|B|} = \frac{P(A \cap B)}{P(B)}. \quad (5)$$
• It is the updated probability of the event $A$ given that we now know that $B$ occurred.
• Read "conditional probability of $A$ given $B$".
• $P(A|B) = P(A \cap B|B) \geq 0$
• For any $A$ such that $B \subset A$, we have $P(A|B) = 1$. This implies $P(\Omega|B) = P(B|B) = 1$.
• If $A \perp C$, $P(A \cup C|B) = P(A|B) + P(C|B)$
• $P(A \cap B) = P(B) P(A|B)$
• $P(A \cap B) \leq P(A|B)$
• $P(A \cap B \cap C) = P(A) \times P(B|A) \times P(C|A \cap B)$
• $P(A \cap B) = P(A) \times P(B|A)$
• $P(A \cap B \cap C) = P(A \cap B) \times P(C|A \cap B)$
• $P(A, B|C) = P(A|C) P(B|A, C) = P(B|C) P(A|B, C)$
2.3. Total Probability and Bayes Theorem: If $\{B_1, \dots, B_n\}$ is a partition of $\Omega$, then for any set $A$,
• Total Probability Theorem: $P(A) = \sum_{i=1}^{n} P(A|B_i) P(B_i)$.
• Bayes Theorem: Suppose $P(A) > 0$, we have
$$P(B_k|A) = \frac{P(A|B_k) P(B_k)}{\sum_{i=1}^{n} P(A|B_i) P(B_i)}.$$
2.4. Independent Events: $A$ and $B$ are independent ($A \perp\!\!\!\perp B$) if and only if
$$P(A \cap B) = P(A) P(B) \quad (6)$$
In classical probability, this is equivalent to $|A \cap B||\Omega| = |A||B|$.
• Sometimes the definition for independence above does not agree with the everyday-language use of the word "independence". Hence, many authors use the term "statistical independence" for the definition above to distinguish it from other definitions.
2.5. Having three pairwise independent events does not imply that the three events are jointly independent. In other words, $A \perp\!\!\!\perp B$, $B \perp\!\!\!\perp C$, $A \perp\!\!\!\perp C$ does not imply $A \perp\!\!\!\perp B \perp\!\!\!\perp C$.
Example: Experiment of flipping a fair coin twice. $\Omega = \{HH, HT, TH, TT\}$. Define event $A$ to be the event that the first flip gives an H; that is, $A = \{HH, HT\}$. Event $B$ is the event that the second flip gives an H; that is, $B = \{HH, TH\}$. Let $C = \{HH, TT\}$. Note also that even though the events $A$ and $B$ are not disjoint, they are independent.
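The two-coin example can be verified by direct counting with the classical definition $P(A) = |A|/|\Omega|$. A minimal sketch; the variable and function names are illustrative assumptions.

```python
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]
A = {"HH", "HT"}   # first flip gives H
B = {"HH", "TH"}   # second flip gives H
C = {"HH", "TT"}   # both flips agree

def prob(event):
    """Classical probability |event| / |omega|."""
    return Fraction(len(event), len(omega))

# Every pair satisfies P(X ∩ Y) = P(X) P(Y) ...
pairwise = all(prob(X & Y) == prob(X) * prob(Y) for X, Y in [(A, B), (B, C), (A, C)])
# ... but the triple does not: P(A ∩ B ∩ C) = 1/4 while P(A) P(B) P(C) = 1/8.
joint = prob(A & B & C) == prob(A) * prob(B) * prob(C)
```

Here `pairwise` is `True` and `joint` is `False`, which is exactly the situation described in 2.5.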
2.6. Consider $\Omega$ of size $2n$. We are given a set $A \subset \Omega$ of size $n$. Then, $P(A) = \frac{1}{2}$. We want to find all sets $B \subset \Omega$ such that $A \perp\!\!\!\perp B$. (Note that without the required independence, there are $2^{2n}$ possible $B$.) For independence, we need $P(A \cap B) = P(A) P(B)$. Let $r = |A \cap B|$. Then, $r$ can be any integer from 0 to $n$. Also, let $k = |B \setminus A|$. Then, the condition for independence becomes
$$\frac{r}{2n} = \frac{1}{2} \cdot \frac{r + k}{2n}$$
which is equivalent to $r = k$. So, the construction of the set $B$ is given by choosing $r$ elements from set $A$, then choosing $r = k$ elements from set $\Omega \setminus A$. There are $\binom{n}{r}$ choices for the first part and $\binom{n}{k} = \binom{n}{r}$ choices for the second part. Therefore, the total number of possible $B$ such that $A \perp\!\!\!\perp B$ is
$$\sum_{r=0}^{n} \binom{n}{r}^2 = \binom{2n}{n}.$$
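The count in 2.6 can be confirmed by brute force for small $n$, using the classical criterion $|A \cap B||\Omega| = |A||B|$ from 2.4. The sketch below assumes, without loss of generality, that $A$ consists of the first $n$ elements of $\Omega = \{0, \dots, 2n - 1\}$; the function name is hypothetical.

```python
from itertools import combinations
from math import comb

def count_independent_sets(n):
    """Count subsets B of a 2n-element sample space that are independent of A."""
    omega = range(2 * n)
    A = set(range(n))                      # any n-element subset would do
    count = 0
    for size in range(2 * n + 1):
        for B in combinations(omega, size):
            # |A ∩ B| |Ω| == |A| |B| is the classical independence condition.
            if len(A.intersection(B)) * 2 * n == n * size:
                count += 1
    return count
```

The empty set ($r = 0$) is counted, since it is independent of every event.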
2.1 Examples
2.7. Chevalier de Mere's Scandal of Arithmetic: Which is more likely, obtaining at least one six in 4 tosses of a fair die (event $A$), or obtaining at least one double six in 24 tosses of a pair of dice (event $B$)? We have
$$P(A) = 1 - \left(\frac{5}{6}\right)^4 = .518$$
and
$$P(B) = 1 - \left(\frac{35}{36}\right)^{24} = .491.$$
Therefore, the first case is more probable.
2.8. A random sample of size r with replacement is taken from a population of n elements. The probability of the event that in the sample no element appears twice (that is, no repetition in our sample) is
$\frac{(n)_r}{n^r}.$
The probability that at least one element appears twice is
$p_u(n, r) = 1 - \prod_{i=1}^{r-1} \left(1 - \frac{i}{n}\right) \approx 1 - e^{-\frac{r(r-1)}{2n}}.$
In fact, when r − 1 < n/2, (A.2) gives
$e^{-\frac{1}{2} \frac{r(r-1)}{n} \frac{3n+2r-1}{3n}} \le \prod_{i=1}^{r-1} \left(1 - \frac{i}{n}\right) \le e^{-\frac{1}{2} \frac{r(r-1)}{n}}.$
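The exact probability and the exponential bounds above can be compared numerically. The following is a minimal sketch; the function names `p_repeat`, `p_repeat_approx`, and `product_bounds` are illustrative choices, not notation from the notes.

```python
import math

def p_repeat(n, r):
    """Exact p_u(n, r): probability of at least one repetition in a sample of size r."""
    if r > n:
        return 1.0
    no_repeat = 1.0
    for i in range(1, r):
        no_repeat *= 1 - i / n
    return 1 - no_repeat

def p_repeat_approx(n, r):
    """Approximation 1 - exp(-r(r-1)/(2n))."""
    return 1 - math.exp(-r * (r - 1) / (2 * n))

def product_bounds(n, r):
    """Lower and upper bounds on prod_{i=1}^{r-1} (1 - i/n), stated for r - 1 < n/2."""
    half = 0.5 * r * (r - 1) / n
    lower = math.exp(-half * (3 * n + 2 * r - 1) / (3 * n))
    upper = math.exp(-half)
    return lower, upper

n, r = 365, 23
lower, upper = product_bounds(n, r)
no_repeat = 1 - p_repeat(n, r)
assert lower <= no_repeat <= upper
```

For n = 365 the exact value and the approximation stay close over the range of r where r − 1 < n/2, which is the regime in which the two-sided bound is stated.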
• From the approximation, to have p_u(n, r) = p, we need
$r \approx \frac{1}{2} + \frac{1}{2} \sqrt{1 - 8n \ln(1 - p)}.$
• Probability of coincidence birthday: The probability that there are at least two people who have the same birthday in a class of r students is
$\begin{cases} 1, & \text{if } r > 365, \\ 1 - \underbrace{\frac{365}{365} \cdot \frac{364}{365} \cdots \frac{365 - (r-1)}{365}}_{r \text{ terms}}, & \text{if } 0 \le r \le 365. \end{cases}$
◦ Birthday Paradox: In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the year) is about 0.5. See also (3).
2.9. Monty Hall’s Game: The game started by showing a contestant 3 closed doors, behind one of which was a prize. The contestant selected a door, but before the door was opened, Monty Hall, who knew which door hid the prize, opened one of the remaining doors that did not hide the prize. The contestant was then allowed either to stay with his original guess or to change to the other closed door. Question: Is it better to stay or to switch? Answer: Switch. The original guess is correct with probability 1/3, and switching wins exactly when the original guess was wrong, so a contestant who switches wins the prize with probability 2/3.
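The answer to the Monty Hall question can be confirmed by exact enumeration. The sketch below assumes that the prize location and the initial pick are uniform and independent, and that the host chooses uniformly among the doors he is allowed to open; the second assumption is added here for concreteness and does not affect the result.

```python
from fractions import Fraction
from itertools import product

def win_probabilities():
    """Return (P(win | stay), P(win | switch)) for the 3-door game by enumeration."""
    doors = range(3)
    stay = Fraction(0)
    switch = Fraction(0)
    for prize, pick in product(doors, repeat=2):
        weight = Fraction(1, 9)  # prize and pick are uniform and independent
        # The host opens a door that is neither the contestant's pick nor the prize.
        openable = [d for d in doors if d != pick and d != prize]
        for opened in openable:
            w = weight / len(openable)  # assumed: host picks uniformly when he has a choice
            other = [d for d in doors if d != pick and d != opened][0]
            stay += w * (pick == prize)
            switch += w * (other == prize)
    return stay, switch

stay, switch = win_probabilities()
assert stay + switch == 1
```

Exact fractions are used instead of a random simulation so that the result does not depend on a seed or sample size.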
[Plot omitted. First panel, titled "pu(n,r) for n = 365": the curves p_u(n, r) and 1 − e^{−r(r−1)/(2n)} against r, with the point r = 23, n = 365 marked. Second panel: r/n against n for p = 0.9, 0.7, 0.5, 0.3, 0.1, with the point r/n = 23/365 marked.]
Figure 9: p_u(n, r)
2.10. False Positives on Diagnostic Tests: Let D be the event that the testee has the disease. Let + be the event that the test returns a positive result. Denote the probability of having the disease by p_D. Now, assume that the test always returns a positive result when the testee has the disease. This is equivalent to P(+|D) = 1 and P(+^c|D) = 0. Also, suppose that even when the testee does not have the disease, the test will still return a positive result with probability p_+; that is, P(+|D^c) = p_+. If the test returns a positive result, then the probability that the testee has the disease is
$P(D \mid +) = \frac{P(D \cap +)}{P(+)} = \frac{p_D}{p_D + p_+ (1 - p_D)} = \frac{1}{1 + \frac{p_+}{p_D} (1 - p_D)} \approx \frac{p_D}{p_+}$; for rare disease (p_D ≪ 1).
The joint probabilities are summarized below.
Column D:
  P(+ ∩ D) = P(+|D)P(D) = 1(p_D) = p_D
  P(+^c ∩ D) = P(+^c|D)P(D) = 0(p_D) = 0
  P(D) = P(+ ∩ D) + P(+^c ∩ D) = p_D
Column D^c:
  P(+ ∩ D^c) = P(+|D^c)P(D^c) = p_+(1 − p_D)
  P(+^c ∩ D^c) = P(+^c|D^c)P(D^c) = (1 − p_+)(1 − p_D)
  P(D^c) = P(+ ∩ D^c) + P(+^c ∩ D^c) = 1 − p_D
Totals:
  P(+) = P(+ ∩ D) + P(+ ∩ D^c) = p_D + p_+(1 − p_D)
  P(+^c) = P(+^c ∩ D) + P(+^c ∩ D^c) = (1 − p_+)(1 − p_D)
  P(Ω) = 1
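A short numerical illustration of 2.10 may help show how small P(D|+) can be. The values p_D = 0.001 and p_+ = 0.05 below are hypothetical numbers chosen only for the example, and `posterior_disease` is an illustrative name.

```python
def posterior_disease(p_d, p_plus):
    """P(D | +) under the assumptions of 2.10: P(+|D) = 1 and P(+|D^c) = p_plus."""
    return p_d / (p_d + p_plus * (1 - p_d))

# Hypothetical values: a disease with prevalence 0.1% and a 5% false-positive rate.
p_d, p_plus = 0.001, 0.05
exact = posterior_disease(p_d, p_plus)
rare_disease_approx = p_d / p_plus  # the approximation used for p_D << 1
```

With these assumed numbers, a positive result leaves the probability of disease at roughly two percent, even though the test never misses a diseased testee.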
3
Probability Foundations
To study the formal definition of probability, we start with the probability space (Ω, A, P). Let Ω be an arbitrary space or set of points ω. Viewed probabilistically, a subset of Ω is an event and an element ω of Ω is a sample point. Each event is a collection of outcomes, which are elements of the sample space Ω. The theory of probability focuses on collections of events, called event algebras and typically denoted A (or F), that contain all the events of interest to us (regarding the random experiment E) and are such that we have knowledge of their likelihood of occurrence.
The probability P itself is defined as a number in the range [0, 1] associated with each event in A.
3.1
Algebra and σ-algebra
The class 2^Ω of all subsets can be too large¹ for us to define probability measures consistently across all members of the class. In this section, we present smaller classes which have several “nice” properties.
Definition 3.1. [7, Def 1.6.1 p38] An event algebra A is a collection of subsets of the sample space Ω such that it is (1) nonempty (this is equivalent to Ω ∈ A); (2) closed under complementation (if A ∈ A then A^c ∈ A); (3) and closed under finite unions (if A, B ∈ A then A ∪ B ∈ A). In other words, “A class is called an algebra on Ω if it contains Ω itself and is closed under the formation of complements and finite unions.”
3.2. Examples of algebras
• Ω = any fixed interval of R. A = {finite unions of intervals contained in Ω}
• Ω = (0, 1]. B0 = the collection of all finite unions of intervals of the form (a, b] ⊂ (0, 1]
◦ Not a σ-field. Consider the set $\bigcup_{i=1}^{\infty} \left( \frac{1}{2i+1}, \frac{1}{2i} \right]$, which is a countable union of members of B0 but not a finite union of such intervals.
3.3. Properties of an algebra A:
(a) Nonempty: ∅ ∈ A, Ω ∈ A
(b) A ⊂ 2^Ω
(c) An algebra is closed under finite set-theoretic operations.
• A ∈ A ⇒ A^c ∈ A
• A, B ∈ A ⇒ A ∪ B ∈ A, A ∩ B ∈ A, A\B ∈ A, AΔB = (A\B) ∪ (B\A) ∈ A
• A1, A2, . . . , An ∈ A ⇒ $\bigcup_{i=1}^{n} A_i \in \mathcal{A}$ and $\bigcap_{i=1}^{n} A_i \in \mathcal{A}$
(d) The collection of algebras in Ω is closed under arbitrary intersection. In particular, let A1, A2 be algebras of Ω and let A = A1 ∩ A2 be the collection of sets common to both algebras. Then A is an algebra.
(e) The smallest A is {∅, Ω}.
(f) The largest A is the set of all subsets of Ω, known as the power set and denoted by 2^Ω.
(g) Cardinality of algebras: An algebra of subsets of a finite set of n elements will always have a cardinality of the form 2^k, k ≤ n.
3.4. There is a smallest (in the sense of inclusion) algebra containing any given family C of subsets of Ω. Let C ⊂ 2^Ω; the algebra generated by C is
$\bigcap_{\mathcal{G} \text{ is an algebra},\ \mathcal{C} \subset \mathcal{G}} \mathcal{G},$
i.e., the intersection of all algebras containing C. It is the smallest algebra containing C.
Definition 3.5. A σ-algebra A is an event algebra that is also closed under countable unions: (∀i ∈ N) A_i ∈ A ⟹ ∪_{i∈N} A_i ∈ A.
Remarks:
• A σ-algebra is also an algebra.
• A finite algebra is also a σ-algebra.
3.6. Because every σ-algebra is also an algebra, it has all the properties listed in (3.3). Extra properties of a σ-algebra A are as follows:
(a) A1, A2, . . . ∈ A ⇒ $\bigcup_{j=1}^{\infty} A_j \in \mathcal{A}$ and $\bigcap_{j=1}^{\infty} A_j \in \mathcal{A}$
¹ There is no problem when Ω is countable.
(b) A σ-field is closed under countable set-theoretic operations.
(c) The collection of σ-algebras in Ω is closed under arbitrary intersection; i.e., let A_α be a σ-algebra ∀α ∈ A, where A is some index set, potentially uncountable. Then $\bigcap_{\alpha \in A} \mathcal{A}_\alpha$ is still a σ-algebra.
(d) An infinite σ-algebra F on X is uncountable; i.e., a σ-algebra is either finite or uncountable.
(e) If A and B are σ-algebras in X, then A ∪ B is not necessarily a σ-algebra. For example, let E1, E2 ⊂ X be such that neither set contains the other and E1 ∪ E2 ≠ X. Let A_i = {∅, E_i, E_i^c, X}. Then, E_i ∈ A1 ∪ A2 but E1 ∪ E2 ∉ A1 ∪ A2.
Definition 3.7. Generation of σ-algebra: Let C ⊂ 2^Ω; the σ-algebra generated by C, σ(C), is
$\bigcap_{\mathcal{G} \text{ is a } \sigma\text{-field},\ \mathcal{C} \subset \mathcal{G}} \mathcal{G},$
i.e., the intersection of all σ-fields containing C. It is the smallest σ-field containing C.
• If the set Ω is not implicit, we will explicitly write σ_X(C)
• We will say that a set A can be generated by elements of C if A ∈ σ(C)
3.8. Properties of σ(C):
(a) σ(C) is a σ-field
(b) σ(C) is the smallest σ-field containing C in the sense that if H is a σ-field and C ⊂ H, then σ(C) ⊂ H
(c) C ⊂ σ(C)
(d) σ(σ(C)) = σ(C)
(e) If H is a σ-field and C ⊂ H, then σ(C) ⊂ H
(f) σ({∅}) = σ({X}) = {∅, X}
(g) σ({A}) = {∅, A, A^c, X} for A ⊂ Ω.
(h) σ({A, B}) has at most 16 elements. They are ∅, A, B, A ∩ B, A ∪ B, A \ B, B \ A, AΔB, and their corresponding complements. See also (3.11). Some of these 16 sets can be the same, hence the use of “at most” in the statement.
(i) σ(A) = σ(A ∪ {∅}) = σ(A ∪ {X}) = σ(A ∪ {∅, X})
(j) A ⊂ B ⇒ σ(A) ⊂ σ(B)
(k) σ(A), σ(B) ⊂ σ(A) ∪ σ(B) ⊂ σ(A ∪ B)
(l) σ(A) = σ(A ∪ {∅}) = σ(A ∪ {Ω}) = σ(A ∪ {∅, Ω})
3.9. For the decomposition described in (1.8), let the starting collection be C1 and the decomposed collection be C2.
• If C1 is finite, then σ(C2) = σ(C1).
• If C1 is countable, then σ(C2) ⊂ σ(C1).
3.10. Construction of σ-algebra from countable partition: An intermediate-sized σ-algebra (a σ-algebra smaller than 2^Ω) can be constructed by first partitioning Ω into subsets and then forming the power set of these subsets, with an individual subset now playing the role of an individual outcome ω. Given a countable² partition Π = {A_i, i ∈ I} of Ω, we can form a σ-algebra A by including all unions of subcollections:
$\mathcal{A} = \left\{ \bigcup_{\alpha \in S} A_\alpha : S \subset I \right\}$ (7)
[7, Ex 1.5 p 39] where we define $\bigcup_{i \in \emptyset} A_i = \emptyset$. It turns out that A = σ(Π). Of course, a σ-algebra is also an algebra. Hence, (7) is also a way to construct an algebra. Note that, from (1.9), the necessary and sufficient condition for distinct S to produce distinct elements in (7) is that none of the A_α’s are empty. In particular, consider a countable Ω = {x_i : i ∈ N}, where the x_i’s are distinct. If we want a σ-algebra which contains {x_i} ∀i, then the smallest one is 2^Ω, which happens to be the biggest one. So, 2^Ω is the only σ-algebra which is “reasonable” to use.
² In this case, Π is countable if and only if I is countable.
3.11. Generation of σ-algebra from finite partition: If a finite collection Π = {A_i, i ∈ I} of non-empty sets forms a partition of Ω, then the algebra generated by Π is the same as the σ-algebra generated by Π, and it is given by (7). Moreover, |σ(Π)| is 2^{|I|} = 2^{|Π|}; that is, distinct sets S in (7) produce distinct members of the (σ-)algebra. Now consider a finite collection of sets C = {C1, C2, . . . , Cn}. To find the algebra or σ-algebra generated by C, the first step is to use the C_i’s to create a partition of Ω. Using (1.8), the partition is given by
$\Pi = \left\{ \bigcap_{i=1}^{n} B_i : B_i = C_i \text{ or } C_i^c \right\}.$ (8)
By (3.9), we know that σ(Π) = σ(C). Note that there are seemingly 2^n sets in Π; however, some of them can be ∅. We can eliminate the empty set(s) from Π and it is still a partition. So the cardinality of Π in (8) (after empty-set elimination) is k, where k is at most 2^n. The partition Π is then a collection of k sets which can be renamed as A1, . . . , Ak, all of which are non-empty. Applying the construction in (7) (with I = [k]), we then have σ(C), whose cardinality is 2^k, which is ≤ 2^{2^n}. See also properties (3.8.7) and (3.8.8).
Definition 3.12. In general, the Borel σ-algebra or Borel algebra B is the σ-algebra generated by the open subsets of Ω.
• Call B_Ω the σ-algebra of Borel subsets of Ω or σ-algebra of Borel sets on Ω
• Call a set B ∈ B_Ω a Borel set of Ω
(a) On Ω = R, the σ-algebra generated by any of the following is the Borel σ-algebra:
(i) Open sets
(ii) Closed sets
(iii) Intervals
(iv) Open intervals
(v) Closed intervals
(vi) Intervals of the form (−∞, a], where a ∈ R
i. Can replace R by Q
ii. Can replace (−∞, a] by (−∞, a), [a, +∞), or (a, +∞)
iii. Can replace (−∞, a] by a combination of those in item ii above.
(b) For Ω ⊂ R, B_Ω = {A ∈ B_R : A ⊂ Ω} = B_R ∩ Ω, where B_R ∩ Ω = {A ∩ Ω : A ∈ B_R}
(c) The Borel σ-algebra on the extended real line is the extended σ-algebra
$\mathcal{B}_{\overline{\mathbb{R}}} = \{ A \cup B : A \in \mathcal{B}_{\mathbb{R}},\ B \in \{\emptyset, \{-\infty\}, \{\infty\}, \{-\infty, \infty\}\} \}$
It is generated by, for example,
(i) A ∪ {{−∞}, {∞}} where σ(A) = B_R
(ii) {[â, b̂)}, {[â, b̂]}, {(â, b̂]}
(iii) {[−∞, b]}, {[−∞, b)}, {(a, +∞]}, {[a, +∞]}
(iv) {[−∞, b̂]}, {[−∞, b̂)}, {(â, +∞]}, {[â, +∞]}
Here, a, b ∈ R and â, b̂ ∈ R ∪ {±∞}
(d) The Borel σ-algebra in R^k is generated by
(i) the class of open sets in R^k
(ii) the class of closed sets in R^k
(iii) the class of bounded semi-open rectangles (cells) I of the form I = {x ∈ R^k : a_i < x_i ≤ b_i, i = 1, . . . , k}. Note that $I = \bigotimes_{i=1}^{k} (a_i, b_i]$, where ⊗ denotes the Cartesian product ×
(iv) the class of “southwest regions” S_x of points “southwest” to x ∈ R^k, i.e. S_x = {y ∈ R^k : y_i ≤ x_i, i = 1, . . . , k}
3.13. The Borel σ-algebra B of subsets of the reals is the usual algebra when we deal with real- or vector-valued quantities. Our needs will not require us to delve into these issues beyond being assured that the events we discuss are constructed out of intervals and repeated set operations on these intervals, and that these constructions will not lead us out of B. In particular, countable unions of intervals, their complements, and much, much more are in B.
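For a finite Ω, the two-step construction of 3.11 (first form the partition (8), then take all unions as in (7)) can be sketched in a few lines. The function names `atoms` and `generated_algebra` and the example sets are assumptions made for this illustration only.

```python
from itertools import combinations, product

def atoms(omega, collection):
    """Non-empty cells of the partition (8): intersections of C_i or its complement."""
    omega = frozenset(omega)
    cells = set()
    for choice in product([True, False], repeat=len(collection)):
        cell = omega
        for c, keep in zip(collection, choice):
            cell = cell & frozenset(c) if keep else cell - frozenset(c)
        if cell:  # drop empty cells, as described in 3.11
            cells.add(cell)
    return cells

def generated_algebra(omega, collection):
    """All unions of subcollections of the atoms, as in (7)."""
    cells = list(atoms(omega, collection))
    algebra = set()
    for k in range(len(cells) + 1):
        for sub in combinations(cells, k):
            algebra.add(frozenset().union(*sub))  # empty union gives the empty set
    return algebra

# Hypothetical example: two overlapping sets on a four-point space.
sigma_c = generated_algebra({1, 2, 3, 4}, [{1, 2}, {2, 3}])
```

In this example all four cells of the partition are non-empty, so the generated (σ-)algebra has 2^4 = 16 members, which matches the bound in property (h) of 3.8.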
3.2
Kolmogorov’s Axioms for Probability
Definition 3.14. Kolmogorov’s Axioms for Probability [12]: A set function satisfying K0–K4 is called a probability measure.
K0 Setup: The random experiment E is described by a probability space (Ω, A, P) consisting of an event σ-algebra A and a real-valued function P : A → R.
K1 Nonnegativity: ∀A ∈ A, P(A) ≥ 0.
K2 Unit normalization: P(Ω) = 1.
K3 Finite additivity: If A, B are disjoint, then P(A ∪ B) = P(A) + P(B).
K4 Monotone continuity: If (∀i ≥ 1) A_{i+1} ⊂ A_i and ∩_{i∈N} A_i = ∅ (a nested sequence of sets shrinking to the empty set), then
$\lim_{i \to \infty} P(A_i) = 0.$
K4′ Countable or σ-additivity: If (A_i) is a countable collection of pairwise disjoint (no overlap) events, then
$P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i).$
• Note that there is never a problem with the convergence of the infinite sum; all partial sums of these non-negative summands are bounded above by 1.
• K4 is neither a property of limits of sequences of relative frequencies nor meaningful in the finite sample space context of classical probability; it is offered by Kolmogorov to ensure a degree of mathematical closure under limiting operations [7, p 111].
• K4 is an idealization that is rejected by some accounts (usually subjectivist) of probability.
• If P satisfies K0–K3, then it satisfies K4 (monotone continuity) if and only if it satisfies K4′ (countable additivity) [7, Theorem 3.5.1 p 111].
Proof. To show “⇒”, consider disjoint A1, A2, . . .. Define B_n = ∪_{i>n} A_i. Then, by (1.6), B_{n+1} ⊂ B_n and $\bigcap_{n=1}^{\infty} B_n = \emptyset$. So, by K4, $\lim_{n \to \infty} P(B_n) = 0$. Furthermore,
$\bigcup_{i=1}^{\infty} A_i = B_n \cup \left( \bigcup_{i=1}^{n} A_i \right)$
where all the sets on the RHS are disjoint. Hence, by finite additivity,
$P\left( \bigcup_{i=1}^{\infty} A_i \right) = P(B_n) + \sum_{i=1}^{n} P(A_i).$
Taking the limit as n → ∞ gives us K4′. To show “⇐”, see (3.17).
Equivalently, instead of K0–K4, we can define a probability measure using P0–P2 below.
Definition 3.15. A probability measure defined on a σ-algebra A of Ω is a (set) function (P0) P : A → [0, 1] that satisfies:
(P1,K2) P(Ω) = 1
(P2,K4′) Countable additivity: For every countable sequence $(A_n)_{n=1}^{\infty}$ of disjoint elements of A, one has
$P\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} P(A_n)$
• P(∅) = 0
• The number P(A) is called the probability of the event A
• The triple (Ω, A, P) is called a probability measure space, or simply a probability space
• A support of P is any A-set A for which P(A) = 1
3.3
Properties of Probability Measure
3.16. Properties of probability measures:
(a) P(∅) = 0
(b) 0 ≤ P ≤ 1: For any A ∈ A, 0 ≤ P(A) ≤ 1
(c) If P(A) = 1, A is not necessarily Ω.
(d) Additivity: A, B ∈ A, A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B)
(e) Monotonicity: A, B ∈ A, A ⊂ B ⇒ P(A) ≤ P(B) and P(B − A) = P(B) − P(A)
(f) P(A^c) = 1 − P(A)
(g) P(A) + P(B) = P(A ∪ B) + P(A ∩ B) = P(A − B) + 2P(A ∩ B) + P(B − A). P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
(h) P(A ∪ B) ≥ max(P(A), P(B)) ≥ min(P(A), P(B)) ≥ P(A ∩ B)
(i) Inclusion-exclusion principle:
$P\left( \bigcup_{k=1}^{n} A_k \right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) + \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$
In a more compact form,
$P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{\emptyset \ne I \subset \{1, \ldots, n\}} (-1)^{|I|+1} P\left( \bigcap_{i \in I} A_i \right).$
For example, P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − P(A1 ∩ A2) − P(A1 ∩ A3) − P(A2 ∩ A3) + P(A1 ∩ A2 ∩ A3). See also (8.2). Moreover, for any event B, we have
$P\left( \bigcap_{k=1}^{n} A_k^c \cap B \right) = P(B) + \sum_{\emptyset \ne I \subset [n]} (-1)^{|I|} P\left( \bigcap_{i \in I} A_i \cap B \right).$ (9)
(j) Finite additivity: If $A = \bigcup_{j=1}^{n} A_j$ with A_j ∈ A disjoint, then $P(A) = \sum_{j=1}^{n} P(A_j)$
• If A and B are disjoint sets in A, then P(A ∪ B) = P(A) + P(B)
(k) Subadditivity or Boole’s Inequality: If A1, . . . , An are events, not necessarily disjoint, then $P\left( \bigcup_{i=1}^{n} A_i \right) \le \sum_{i=1}^{n} P(A_i)$
(l) σ-subadditivity: If A1, A2, . . . is a sequence of measurable sets, not necessarily disjoint, then $P\left( \bigcup_{i=1}^{\infty} A_i \right) \le \sum_{i=1}^{\infty} P(A_i)$
• This formula is known as the union bound in engineering.
• If A1, A2, . . . is a sequence of events, not necessarily disjoint, then ∀α ∈ (0, 1], $P\left( \bigcup_{i=1}^{\infty} A_i \right) \le \left( \sum_{i=1}^{\infty} P(A_i) \right)^{\alpha}$
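The inclusion-exclusion principle in (i) and the union bound in (k) can be checked on a small equiprobable sample space. The sketch below uses exact fractions; the sample space, the three events, and the function names are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import combinations

def prob(event, omega):
    """Classical probability |event| / |omega| on an equiprobable space."""
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events, omega):
    """Right-hand side of the compact inclusion-exclusion formula."""
    total = Fraction(0)
    for size in range(1, len(events) + 1):
        for idx in combinations(range(len(events)), size):
            inter = set(omega)
            for i in idx:
                inter &= events[i]
            total += (-1) ** (size + 1) * prob(inter, omega)
    return total

# Hypothetical events on a 12-point space.
omega = set(range(1, 13))
events = [{1, 2, 3, 4, 5, 6}, {4, 5, 6, 7, 8}, {1, 6, 8, 9, 10}]
union = set().union(*events)

assert inclusion_exclusion(events, omega) == prob(union, omega)
assert prob(union, omega) <= sum(prob(a, omega) for a in events)  # union bound
```

The alternating sum visits all 2^n − 1 non-empty index sets, which is why the union bound is often preferred in practice when n is large.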
3.17. Conditional continuity from above. If B1 ⊃ B2 ⊃ B3 ⊃ · · · is a decreasing sequence of measurable sets, then $P\left( \bigcap_{i=1}^{\infty} B_i \right) = \lim_{j \to \infty} P(B_j)$. In a more compact notation, if B_i ↘ B, then $P(B) = \lim_{j \to \infty} P(B_j)$.
Proof. Let $B = \bigcap_{i=1}^{\infty} B_i$. Let A_k = B_k \ B_{k+1}, i.e., the new part. We consider two partitions of B1:
(1) $B_1 = B \cup \bigcup_{j=1}^{\infty} A_j$ and (2) $B_1 = B_n \cup \bigcup_{j=1}^{n-1} A_j$.
(1) implies $P(B_1) - P(B) = \sum_{j=1}^{\infty} P(A_j)$.
(2) implies $P(B_1) - P(B_n) = \sum_{j=1}^{n-1} P(A_j)$.
We then have
$\lim_{n \to \infty} \left( P(B_1) - P(B_n) \right) \overset{(2)}{=} \sum_{j=1}^{\infty} P(A_j) \overset{(1)}{=} P(B_1) - P(B),$
and hence $\lim_{n \to \infty} P(B_n) = P(B)$.
3.18. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies (P1) and is finitely additive. Then, the following are equivalent:
• (P2): If A_n ∈ A, disjoint, then $P\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} P(A_n)$
• (K4) If A_n ∈ A, and A_n ↘ ∅, then P(A_n) ↘ 0
• (Continuity from above) If A_n ∈ A, and A_n ↘ A, then P(A_n) ↘ P(A)
• If A_n ∈ A, and A_n ↗ Ω, then P(A_n) ↗ P(Ω) = 1
• (Continuity from below) If A_n ∈ A, and A_n ↗ A, then P(A_n) ↗ P(A)
Hence, a probability measure satisfies all five properties above. Continuity from above and continuity from below are collectively called sequential continuity properties. In fact, for a probability measure P, let A_n be a sequence of events in A which converges to A (i.e. A_n → A). Then A ∈ A and $\lim_{n \to \infty} P(A_n) = P(A)$. Of course, both A_n ↗ A and A_n ↘ A imply A_n → A. Note also that
$P\left( \bigcup_{n=1}^{\infty} A_n \right) = \lim_{N \to \infty} P\left( \bigcup_{n=1}^{N} A_n \right),$
and
$P\left( \bigcap_{n=1}^{\infty} A_n \right) = \lim_{N \to \infty} P\left( \bigcap_{n=1}^{N} A_n \right).$
Alternative forms of the sequential continuity properties are
$P\left( \bigcup_{n=1}^{\infty} A_n \right) = \lim_{N \to \infty} P(A_N)$, if A_n ⊂ A_{n+1},
and
$P\left( \bigcap_{n=1}^{\infty} A_n \right) = \lim_{N \to \infty} P(A_N)$, if A_{n+1} ⊂ A_n.
3.19. Given a common event algebra A, probability measures P1, . . ., Pm, and numbers λ1, . . ., λm with λ_i ≥ 0 and $\sum_{i=1}^{m} \lambda_i = 1$, the convex combination $P = \sum_{i=1}^{m} \lambda_i P_i$ of probability measures is a probability measure.
3.20. A cannot contain an uncountable, disjoint collection of sets of positive probability.
Definition 3.21. Discrete probability measure: P is a discrete probability measure if ∃ finitely or countably many points ω_k and nonnegative masses m_k such that ∀A ∈ A
$P(A) = \sum_{k : \omega_k \in A} m_k = \sum_{k} m_k I_A(\omega_k)$
If there is just one of these points, say ω0, with mass m0 = 1, then P is a unit mass at ω0. In this case, ∀A ∈ A, P(A) = I_A(ω0). Notation: P = δ_{ω0}
• Here, Ω can be uncountable.
3.4
Countable Ω
A sample space Ω is countable if it is either finite or countably infinite. It is countably infinite if it has as many elements as there are integers. In either case, the elements of Ω can be enumerated as, say, ω1, ω2, . . . . If the event algebra A contains each singleton set {ω_k} (from which it follows that A is the power set of Ω), then we specify probabilities satisfying the Kolmogorov axioms through a restriction to the set S = {{ω_k}} of singleton events.
Definition 3.22. When Ω is countable, a probability mass function (pmf) is any function p : Ω → [0, 1] such that
$\sum_{\omega \in \Omega} p(\omega) = 1.$
When the elements of Ω are enumerated, then it is common to abbreviate p(ω_i) = p_i.
3.23. Every pmf p defines a probability measure P and conversely. Their relationship is given by
$p(\omega) = P(\{\omega\}),$ (10)
$P(A) = \sum_{\omega \in A} p(\omega).$ (11)
The convenience of a specification by pmf becomes clear when Ω is a finite set of, say, n elements. Specifying P requires specifying 2n values, one for each event in A, and doing so in a manner that is consistent with the Kolmogorov axioms. However, specifying p requires only providing n values, one for each element of Ω, satisfying the simple constraints of nonnegativity and addition to 1. The probability measure P satisfying (11) automatically satisfies the Kolmogorov axioms.
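The relationship (10)-(11) is easy to exercise on a small finite space: n pmf values determine all 2^n event probabilities. The pmf values and the names `measure`, `power_set`, and `satisfies_axioms` below are hypothetical, chosen only for this sketch.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical pmf on a three-point sample space (n = 3 values that sum to 1).
pmf = {"w1": Fraction(1, 2), "w2": Fraction(1, 3), "w3": Fraction(1, 6)}

def measure(event):
    """P(A) as the sum of p(w) over w in A, as in (11)."""
    return sum((pmf[w] for w in event), Fraction(0))

def power_set(points):
    items = list(points)
    return [frozenset(c) for k in range(len(items) + 1) for c in combinations(items, k)]

events = power_set(pmf)  # all 2^n events

def satisfies_axioms():
    """Check nonnegativity, normalization, and additivity over disjoint pairs."""
    nonneg = all(measure(a) >= 0 for a in events)
    normalized = measure(frozenset(pmf)) == 1
    additive = all(
        measure(a | b) == measure(a) + measure(b)
        for a in events for b in events if not (a & b)
    )
    return nonneg and normalized and additive
```

On a finite space, countable additivity reduces to finite additivity, which is why checking disjoint pairs suffices in this sketch.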
3.5
Independence
Definition 3.24. Independence between events and collections of events.
(a) Two events A, B are called independent if P(A ∩ B) = P(A)P(B)
(i) An event with probability 0 or 1 is independent of any event (including itself). In particular, ∅ and Ω are independent of any events.
(ii) Two events A, B with positive probabilities are independent if and only if P(B|A) = P(B), which is equivalent to P(A|B) = P(A). When A and/or B has zero probability, A and B are automatically independent.
(iii) An event A is independent of itself if and only if P(A) is 0 or 1.
(iv) If A and B are independent, then the two classes σ({A}) = {∅, A, A^c, Ω} and σ({B}) = {∅, B, B^c, Ω} are independent.
(v) Suppose A and B are disjoint. A and B are independent if and only if P(A) = 0 or P(B) = 0.
(vi) Suppose A ⊂ B. Then P(A ∩ B) = P(A), so A and B are independent if and only if P(A) = P(A)P(B), that is, if and only if P(A) = 0 or P(B) = 1.
(b) Independence for finite collection {A1, . . . , An} of sets:
≡ $P\left( \bigcap_{j \in J} A_j \right) = \prod_{j \in J} P(A_j)$ ∀J ⊂ [n] and |J| ≥ 2
◦ Note that the case when |J| = 1 automatically holds. The case when |J| = 0 can be regarded as the J = ∅ case, which is also trivially true.
◦ There are $\sum_{j=2}^{n} \binom{n}{j} = 2^n - 1 - n$ constraints.
◦ Example: A1, A2, A3 are independent if and only if
P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3)
P(A1 ∩ A2) = P(A1)P(A2)
P(A1 ∩ A3) = P(A1)P(A3)
P(A2 ∩ A3) = P(A2)P(A3)
Remark: The first equality alone is not enough for independence. See a counterexample below. In fact, it is possible for the first equation to hold while the last three fail, as shown in (3.26.b). It is also possible to construct events such that the last three equations hold (pairwise independence) but the first one does not, as demonstrated in (3.26.a).
≡ P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1)P(B2) · · · P(Bn) where B_i = A_i or B_i = Ω
(c) Independence for collection {A_α : α ∈ I} of sets:
≡ ∀ finite J ⊂ I, $P\left( \bigcap_{\alpha \in J} A_\alpha \right) = \prod_{\alpha \in J} P(A_\alpha)$
≡ Each of the finite subcollections is independent.
(d) Independence for finite collection {𝒜1, . . . , 𝒜n} of classes:
≡ the finite collection of sets A1, . . . , An is independent where A_i ∈ 𝒜_i.
≡ P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1)P(B2) · · · P(Bn) where B_i ∈ 𝒜_i or B_i = Ω
≡ P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1)P(B2) · · · P(Bn) where B_i ∈ 𝒜_i ∪ {Ω}
≡ ∀i ∀ℬ_i ⊂ 𝒜_i, ℬ1, . . . , ℬn are independent.
≡ 𝒜1 ∪ {Ω}, . . . , 𝒜n ∪ {Ω} are independent.
≡ 𝒜1 ∪ {∅}, . . . , 𝒜n ∪ {∅} are independent.
(e) Independence for collection {𝒜_θ : θ ∈ Θ} of classes:
≡ Any collection {A_θ : θ ∈ Θ} of sets is independent where A_θ ∈ 𝒜_θ
≡ Any finite subcollection of classes is independent.
≡ ∀ finite Λ ⊂ Θ, $P\left( \bigcap_{\theta \in \Lambda} A_\theta \right) = \prod_{\theta \in \Lambda} P(A_\theta)$
• By definition, a subcollection of independent events is also independent.
• The class {∅, Ω} is independent from any class.
Definition 3.25. A collection of events {A_α} is called pairwise independent if for every pair of distinct events A_{α1}, A_{α2}, we have P(A_{α1} ∩ A_{α2}) = P(A_{α1})P(A_{α2})
• If a collection of events {A_α : α ∈ I} is independent, then it is pairwise independent. The converse is false. See (a) in example (3.26).
• For K ⊂ J, $P\left( \bigcap_{\alpha \in J} A_\alpha \right) = \prod_{\alpha \in J} P(A_\alpha)$ does not imply $P\left( \bigcap_{\alpha \in K} A_\alpha \right) = \prod_{\alpha \in K} P(A_\alpha)$
Example 3.26.
(a) Let Ω = {1, 2, 3, 4}, A = 2^Ω, P(i) = 1/4, A1 = {1, 2}, A2 = {1, 3}, A3 = {2, 3}. Then P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j but P(A1 ∩ A2 ∩ A3) ≠ P(A1)P(A2)P(A3)
(b) Let Ω = {1, 2, 3, 4, 5, 6}, A = 2^Ω, P(i) = 1/6, A1 = {1, 2, 3, 4}, A2 = A3 = {4, 5, 6}. Then, P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3) but P(A_i ∩ A_j) ≠ P(A_i)P(A_j) for all i ≠ j
(c) The paradox of “almost sure” events: Consider two random events with probabilities of 99% and 99.99%, respectively. One could say that the two probabilities are nearly the same and that both events are almost sure to occur. Nevertheless, the difference may become significant in certain cases. Consider, for instance, independent events, one for each day of the year, each of which occurs with probability p = 99%; then the probability P that the event occurs on every day of the year is p^365, which is less than 3%, while if p = 99.99% then P ≈ 96.4%.
4
Random Element
4.1. A function X : (Ω, A) → (E, B_E) is said to be a random element of E if and only if X is measurable, which is equivalent to each of the following statements.
≡ X^{−1}(B_E) ⊂ A
≡ σ(X) ⊂ A
≡ (reduced form) ∃C ⊂ B_E such that σ(C) = B_E and X^{−1}(C) ⊂ A
• When E ⊂ R, X is called a random variable.
• When E ⊂ R^d, then X is called a random vector.
Definition 4.2. X = Y almost surely (a.s.) if P[X = Y] = 1.
• The a.s. equality is an equivalence relation.
4.3. Law of X or Distribution of X: P^X = μ_X = P X^{−1} = L(X) : ℰ → [0, 1], where ℰ is the σ-algebra of the measurable space (E, ℰ).
μ_X(A) = P^X(A) = P(X^{−1}(A)) = P ◦ X^{−1}(A) = P({ω : X(ω) ∈ A}) = P({X ∈ A})
4.4. For X ∈ L^p, $\lim_{t \to \infty} t^p P[|X| \ge t] = 0$
4.5. A Borel set S is called a support of X if P^X(S^c) = 0 (or equivalently P^X(S) = 1)
4.1
Random Variable
Definition 4.6. A real-valued function X(ω) defined for points ω in a sample space Ω is called a random variable.
• Random variables are important because they provide a compact way of referring to events via their numerical attributes.
• The abbreviation r.v. will be used for “real-valued random variables” [10, p 1].
• Technically, a random variable must be measurable.
4.7. At a certain point in most probability courses, the sample space is rarely mentioned anymore and we work directly with random variables. The sample space often “disappears” but it is really there in the background.
4.8. For B ⊂ R, we use the shorthand
• [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B} and
• P[X ∈ B] = P([X ∈ B]) = P({ω ∈ Ω : X(ω) ∈ B}).
• In particular, P[X < x] is a shorthand for P({ω ∈ Ω : X(ω) < x}).
4.9. If X and Y are random variables, we use the shorthand
• [X ∈ B, Y ∈ C] = {ω ∈ Ω : X(ω) ∈ B and Y(ω) ∈ C} = [X ∈ B] ∩ [Y ∈ C].
• P[X ∈ B, Y ∈ C] = P([X ∈ B] ∩ [Y ∈ C]).
4.10. Every random variable can be written as a sum of a discrete random variable and a continuous random variable.
4.11. A random variable can have at most countably many points x such that P[X = x] > 0.
4.12. Point mass probability measures / Dirac measures, usually written ε_α or δ_α, are used to denote a point mass of size one at the point α. In this case,
• P^X{α} = 1
• P^X({α}^c) = 0
• F_X(x) = 1_{[α,∞)}(x)
4.13. There exist distributions that are neither discrete nor continuous. Let μ(A) = (1/2)μ1(A) + (1/2)μ2(A) for μ1 discrete and μ2 coming from a density.
4.14. When X and Y take finitely many values, say x1, . . . , xm and y1, . . . , yn, respectively, we can arrange the probabilities p_{X,Y}(x_i, y_j) in the m × n matrix
$\begin{bmatrix} p_{X,Y}(x_1, y_1) & p_{X,Y}(x_1, y_2) & \cdots & p_{X,Y}(x_1, y_n) \\ p_{X,Y}(x_2, y_1) & p_{X,Y}(x_2, y_2) & \cdots & p_{X,Y}(x_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ p_{X,Y}(x_m, y_1) & p_{X,Y}(x_m, y_2) & \cdots & p_{X,Y}(x_m, y_n) \end{bmatrix}.$
• The sum of the entries in the ith row is p_X(x_i), and the sum of the entries in the jth column is p_Y(y_j).
• The sum of all the entries in the matrix is one.
4.2
Distribution Function
4.15. The (cumulative) distribution function (cdf) induced by a probability P on (R, B) (that is, with Ω = R) is the function F(x) = P(−∞, x].
The (cumulative) distribution function (cdf) of the random variable X is the function F_X(x) = P^X(−∞, x] = P[X ≤ x].
• The distribution P^X can be obtained from the distribution function by setting P^X(−∞, x] = F_X(x); that is, F_X uniquely determines P^X
• 0 ≤ F_X ≤ 1
C1 F_X is non-decreasing
C2 F_X is right continuous: ∀x $F_X(x^+) \equiv \lim_{y \to x,\ y > x} F_X(y) \equiv \lim_{y \searrow x} F_X(y) = F_X(x) = P[X \le x]$
◦ The function g(x) = P[X < x] is left-continuous in x.
C3 $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.
• ∀x $F_X(x^-) \equiv \lim_{y \to x,\ y < x} F_X(y) \equiv \lim_{y \nearrow x} F_X(y) = P^X(-\infty, x) = P[X < x]$
• P[X = x] = P^X{x} = F(x) − F(x^−) = the jump or saltus in F at x
Figure 10: Right-continuous function at jump point
• ∀x < y
P((x, y]) = F(y) − F(x)
P([x, y]) = F(y) − F(x^−)
P([x, y)) = F(y^−) − F(x^−)
P((x, y)) = F(y^−) − F(x)
P({x}) = F(x) − F(x^−)
• A function F is the distribution function of some probability measure on (R, B) if and only if one has (C1), (C2), and (C3).
• If F satisfies (C1), (C2), and (C3), then ∃ a unique probability measure P on (R, B) that has P(a, b] = F(b) − F(a) ∀a, b ∈ R
• F_X is continuous if and only if P[X = x] = 0 for all x
• F_X is continuous if and only if P^X is continuous.
• F_X has at most countably many points of discontinuity.
Definition 4.16. It is traditional to write X ∼ F to indicate that “X has distribution F” [21, p 25].
Definition 4.17. F_{X,A}(x) = P([X ≤ x] ∩ A)
4.18. Left-continuous inverse: g^{−1}(y) = inf{x ∈ R : g(x) ≥ y}, y ∈ (0, 1)
• Trick: Just flip the graph along the line x = y, then make the graph left-continuous.
• If g is a cdf, then only consider y ∈ (0, 1). It is called the inverse CDF [7, def 8.4.1 p. 238] or quantile function.
◦ In [21, Def 2.16 p 25], the inverse CDF is defined using strict inequality “>” rather than “≥”.
• See table 1 for examples.
[Plot omitted: the graph of g(x) shown next to the graph of its left-continuous inverse g^{−1}(y) = inf{x ∈ R : g(x) ≥ y}.]
Figure 11: Left-continuous inverse on (0,1)
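Distribution     F(x)                         F^{−1}(u)
Exponential      1 − e^{−λx}                  −(1/λ) ln(u)
Extreme value    1 − e^{−e^{(x−a)/b}}         a + b ln(−ln u)
Geometric        1 − (1 − p)^i                ⌈ln u / ln(1 − p)⌉
Logistic         1 − 1/(1 + e^{(x−μ)/b})      μ − b ln(1/u − 1)
Pareto           1 − x^{−a}                   u^{−1/a}
Weibull          1 − e^{−(x/a)^b}             a (−ln u)^{1/b}
Table 1: Left-continuous inverse
Except for the logistic row, the entries in the last column are written with u in place of 1 − u; the exact left-continuous inverse is obtained by replacing u with 1 − u. For generating random variates the two forms are interchangeable, because 1 − U ∼ U(0, 1) whenever U ∼ U(0, 1).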
Definition 4.19. Let X be a random variable with distribution function F. Suppose that p ∈ (0, 1). A value of x such that F(x^−) = P[X < x] ≤ p and F(x) = P[X ≤ x] ≥ p is called a quantile of order p for the distribution. Roughly speaking, a quantile of order p is a value where the cumulative distribution crosses p. Note that it is not unique. Suppose F(x) = p on an interval [a, b]; then all x ∈ [a, b] are quantiles of order p. A quantile of order 1/2 is called a median of the distribution. When there is only one median, it is frequently used as a measure of the center of the distribution. A quantile of order 1/4 is called a first quartile and a quantile of order 3/4 is called a third quartile. A median is a second quartile. Assuming uniqueness, let q1, q2, and q3 denote the first, second, and third quartiles of X. The interquartile range is defined to be q3 − q1, and is sometimes used as a measure of the spread of the distribution with respect to the median. The five parameters max X, q1, q2, q3, min X are often referred to as the five-number summary. Graphically, the five numbers are often displayed as a boxplot.
4.20. If F is non-decreasing, right continuous, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$, then F is the CDF of some probability measure on (R, B). In particular, let U ∼ U(0, 1) and X = F^{−1}(U); then F_X = F. Here, F^{−1} is the left-continuous inverse of F. Note that we have just explicitly defined a random variable X(ω) with distribution function F on Ω = (0, 1).
• To generate X ∼ E(λ), set X = −(1/λ) ln(U)
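The construction X = F^{−1}(U) in 4.20 is the basis of inverse-transform sampling. Below is a minimal sketch for the exponential case using Python's standard `random` module; the function name `sample_exponential`, the seed, and the sample size are arbitrary choices for illustration.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw X ~ E(lam) as X = -(1/lam) ln(U) with U uniform on (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() lies in [0, 1), so u lies in (0, 1] and log(u) is finite
    return -math.log(u) / lam

rng = random.Random(12345)  # fixed seed so the sketch is reproducible
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)                          # should be near 1/lam
empirical_cdf_at_half = sum(x <= 0.5 for x in samples) / len(samples)
exact_cdf_at_half = 1 - math.exp(-lam * 0.5)                       # F(0.5) = 1 - e^{-1}
```

Using ln(U) instead of ln(1 − U) is legitimate because U and 1 − U have the same distribution; the empirical cdf of the samples can be compared with 1 − e^{−λx} as a sanity check.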
4.3
Discrete random variable
Definition 4.21. A random variable X is said to be a discrete random variable if there exist countably many distinct real numbers x_k such that
$\sum_{k} P[X = x_k] = 1.$
≡ ∃ countable set {x1, x2, . . .} such that μ_X({x1, x2, . . .}) = 1
≡ X has a countable support {x1, x2, . . .}
⇒ X is completely determined by the values μ_X({x1}), μ_X({x2}), . . .
• p_i = p_X(x_i) = P[X = x_i]
Definition 4.22. When X is a discrete random variable taking distinct values x_k, we define its probability mass function (pmf) by p_X(x_k) = P[X = x_k].
• We can use a stem plot to visualize p_X.
• If Ω is countable, then there can be only countably many values of X(ω). So, any random variable defined on a countable Ω is discrete.
• Sometimes, we write p(x_k) or p_{x_k} instead of p_X(x_k).
• $P[X \in B] = \sum_{x_k \in B} P[X = x_k]$.
• $F_X(x) = \sum_{x_k} p_X(x_k) U(x - x_k)$.
Definition 4.23 (Discrete CDF). A cdf which can be written in the form $F_d(x) = \sum_{k} p_k U(x - x_k)$ is called a discrete cdf [7, Def. 5.4.1 p 163]. Here, U is the unit step function, {x_k} is an arbitrary countable set of real numbers, and {p_k} is a countable set of positive numbers that sum to 1.
Definition 4.24. An integer-valued random variable is a discrete random variable whose distinct values are x_k = k. For integer-valued random variables,
$P[X \in B] = \sum_{k \in B} P[X = k].$
4.25. Properties of pmf
• p : Ω → [0, 1].
• 0 ≤ p_X ≤ 1.
• $\sum_{k} p_X(x_k) = 1$.
Definition 4.26. Sometimes, it is convenient to work with the “pdf” of a discrete r.v. Suppose that X is a discrete random variable defined as in Definition 4.22. Then, the “pdf” of X is
$f_X(x) = \sum_{x_k} p_X(x_k) \delta(x - x_k), \quad x \in \mathbb{R}.$ (12)
Although the delta function is not a well-defined function³, this technique does allow easy manipulation of mixed distributions. The definitions of quantities involving discrete random variables and the corresponding properties can then be derived from the pdf, and hence there is no need to talk about the pmf at all!
4.4 Continuous random variable
Definition 4.27. A random variable $X$ is said to be a continuous random variable if and only if any one of the following equivalent conditions holds.
≡ $\forall x$, $P[X = x] = 0$
≡ ∀ countable set $C$, $P^X(C) = 0$
≡ $F_X$ is continuous
Footnote 3: Rigorously, the delta function is a unit measure at 0.
4.28. $f$ is the (probability) density function (with respect to Lebesgue measure) of a random variable $X$ (or of the distribution $P^X$)
≡ $P^X$ has density $f$ with respect to Lebesgue measure.
≡ $P^X$ is absolutely continuous w.r.t. the Lebesgue measure ($P^X \ll \lambda$) with $f = \frac{dP^X}{d\lambda}$, the Radon-Nikodym derivative.
≡ $f$ is a nonnegative Borel function on $R$ such that $\forall B \in \mathcal{B}_R$, $P^X(B) = \int_B f(x)\,dx = \int_B f\,d\lambda$, where $\lambda$ is the Lebesgue measure. (This extends nicely to the random vector case.)
≡ $X$ is absolutely continuous
≡ $X$ (or $F_X$) comes from the density $f$
≡ $\forall x \in R$, $F_X(x) = \int_{-\infty}^{x} f(t)\,dt$
≡ $\forall a, b$, $F_X(b) - F_X(a) = \int_a^b f(x)\,dx$
4.29. If $F$ does differentiate to $f$ and $f$ is continuous, it follows by the fundamental theorem of calculus that $f$ is indeed a density for $F$. That is, if $F$ has a continuous derivative, this derivative can serve as the density $f$.
4.30. Suppose a random variable $X$ has a density $f$.
• $F$ need not differentiate to $f$ everywhere.
  ◦ When $X \sim U(a, b)$, $F_X$ is not differentiable at $a$ nor $b$.
• $\int_\Omega f(x)\,dx = 1$
• $f$ is determined only Lebesgue-a.e. That is, if $g = f$ Lebesgue-a.e., then $g$ can also serve as a density for $X$ and $P^X$.
• $f$ is nonnegative a.e. [9, stated on p 138]
• $X$ is a continuous random variable.
• $f$ at its continuity points must be the derivative of $F$.
• $P[X \in [a, b]] = P[X \in [a, b)] = P[X \in (a, b]] = P[X \in (a, b)]$ because the corresponding integrals over an interval are not affected by whether the endpoints are included or excluded. In other words, $P[X = a] = P[X = b] = 0$.
• $P[f_X(X) = 0] = 0$
4.31. $f_X(x) = E[\delta(X - x)]$
Definition 4.32 (Absolutely Continuous CDF). An absolutely continuous cdf $F_{ac}$ can be written in the form
$F_{ac}(x) = \int_{-\infty}^{x} f(z)\,dz,$
where the integrand, $f(x) = \frac{d}{dx} F_{ac}(x)$, is defined a.e., and is a nonnegative, integrable function (possibly having discontinuities) satisfying $\int f(x)\,dx = 1$.
4.33. Any nonnegative function that integrates to one is a probability density function (pdf) [9, p 139].
4.34. Remarks: Some useful intuitions
(a) Approximately, for a small $\Delta x$, $P[X \in [x, x + \Delta x]] = \int_x^{x+\Delta x} f_X(t)\,dt \approx f_X(x)\,\Delta x$. This is why we call $f_X$ the density function.
(b) In fact, $f_X(x) = \lim_{\Delta x \to 0} \frac{P[x < X \le x + \Delta x]}{\Delta x}$.
4.35. Let $T$ be an absolutely continuous nonnegative random variable with cumulative distribution function $F$ and density $f$ on the interval $[0, \infty)$. The following terms are often used when $T$ denotes the lifetime of a device or system.
(a) Its survival-, survivor-, or reliability-function is
$R(t) = P[T > t] = \int_t^{\infty} f(x)\,dx = 1 - F(t).$
• $R(0) = P[T > 0] = P[T \ge 0] = 1$.
(b) The mean time to failure (MTTF) $= E[T] = \int_0^{\infty} R(t)\,dt$.
(c) The (age-specific) failure rate or hazard function of a device or system with lifetime $T$ is
$r(t) = \lim_{\delta \to 0} \frac{P[T \le t + \delta \mid T > t]}{\delta} = -\frac{R'(t)}{R(t)} = \frac{f(t)}{R(t)} = -\frac{d}{dt} \ln R(t).$
(i) $r(t)\,\delta \approx P[T \in (t, t + \delta] \mid T > t]$
(ii) $R(t) = e^{-\int_0^t r(\tau)\,d\tau}$
(iii) $f(t) = r(t)\, e^{-\int_0^t r(\tau)\,d\tau}$.
• For $T \sim E(\lambda)$, $r(t) = \lambda$. See also [9, section 5.7].
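The relations in 4.35 can be checked numerically. The following Python sketch is not part of the original notes; the rate 0.5, the integration cutoff, and all function names are illustrative assumptions. For $T \sim E(\lambda)$ it evaluates the hazard rate $f(t)/R(t)$ at two ages and approximates $\int_0^\infty R(t)\,dt$ with the trapezoidal rule.

```python
import math

def survival(t, lam):
    """R(t) = P[T > t] for T ~ E(lam)."""
    return math.exp(-lam * t) if t >= 0 else 1.0

def density(t, lam):
    """f(t) = lam * exp(-lam * t) for t >= 0."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0

def hazard(t, lam):
    """r(t) = f(t) / R(t)."""
    return density(t, lam) / survival(t, lam)

def mttf(lam, upper=200.0, steps=200_000):
    """Trapezoidal approximation of int_0^upper R(t) dt (upper is an assumed cutoff)."""
    h = upper / steps
    total = 0.5 * (survival(0.0, lam) + survival(upper, lam))
    total += sum(survival(i * h, lam) for i in range(1, steps))
    return total * h

lam = 0.5  # hypothetical rate
print(hazard(1.0, lam), hazard(7.0, lam))  # the hazard does not depend on the age t
print(mttf(lam), 1 / lam)                  # the integral of R(t) against E[T] = 1/lam
```

The constant hazard is another way of stating that the exponential lifetime does not age.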
Definition 4.36. A random variable whose cdf is continuous but whose derivative is zero almost everywhere is said to be singular.
• See Cantor-type distribution in [5, p 35–36].
• It has no density. (Otherwise, the cdf is the zero function.) So, ∃ continuous random variable $X$ with no density. Hence, ∃ random variable $X$ with no density.
• Even when we allow the use of the delta function for the density, as in the case of a mixed r.v., it still has no density because there is no jump in the cdf.
• There exist singular r.v.'s whose cdf is strictly increasing.
Definition 4.37. $f_{X,A}(x) = \frac{d}{dx} F_{X,A}(x)$. See also definition 4.17.
4.5 Mixed/hybrid Distribution
There are many occasions where we have a random variable whose distribution is a normalized linear combination of discrete and absolutely continuous distributions. For convenience, we use the Dirac delta function to link the pmf to a pdf as in definition 4.26. Then, we only have to be concerned with the pdf of a mixed distribution/r.v.
4.38. By allowing density functions to contain impulses, the cdfs of mixed random variables can be expressed in the form $F(x) = \int_{(-\infty, x]} f(t)\,dt$.
4.39. Given a cdf $F_X$ of a mixed random variable $X$, the density $f_X$ is given by
$f_X(x) = \tilde{f}_X(x) + \sum_k P[X = x_k]\, \delta(x - x_k),$
where
• the $x_k$ are the distinct points at which $F_X$ has jump discontinuities, and
• $\tilde{f}_X(x) = F_X'(x)$ if $F_X$ is differentiable at $x$, and $\tilde{f}_X(x) = 0$ otherwise.
In which case,
$E[g(X)] = \int g(x)\, \tilde{f}_X(x)\,dx + \sum_k g(x_k)\, P[X = x_k].$
Note also that $P[X = x_k] = F(x_k) - F(x_k^-)$.
4.40. Suppose the cdf $F$ can be expressed in the form $F(x) = G(x)\, U(x - x_0)$ for some function $G$. Then, the density is $f(x) = G'(x)\, U(x - x_0) + G(x_0)\, \delta(x - x_0)$. Note that $G(x_0) = F(x_0) = P[X = x_0]$ is the jump of the cdf at $x_0$. When the random variable is continuous, $G(x_0) = 0$ and thus $f(x) = G'(x)\, U(x - x_0)$.
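As an illustration of the expectation formula in 4.39, consider the mixed random variable $X = \min(T, c)$ with $T \sim E(\lambda)$; this example is an assumption made here, not taken from the notes. Its cdf has a jump of size $e^{-\lambda c}$ at $c$, and the closed form $E[X] = (1 - e^{-\lambda c})/\lambda$ gives something to compare against. The Python sketch below evaluates the continuous part by the midpoint rule and adds the point mass; all names and parameter values are hypothetical.

```python
import math

lam, c = 2.0, 1.5  # hypothetical rate and truncation point

def f_tilde(x):
    """Continuous part of the density: lam * exp(-lam * x) on (0, c), zero elsewhere."""
    return lam * math.exp(-lam * x) if 0 < x < c else 0.0

# Jump points x_k of the cdf with their probabilities P[X = x_k].
jump_points = {c: math.exp(-lam * c)}

def expect(g, lo=0.0, hi=c, steps=100_000):
    """E[g(X)] = int g(x) f_tilde(x) dx + sum_k g(x_k) P[X = x_k] (midpoint rule)."""
    h = (hi - lo) / steps
    cont = sum(g(lo + (i + 0.5) * h) * f_tilde(lo + (i + 0.5) * h)
               for i in range(steps)) * h
    disc = sum(g(xk) * pk for xk, pk in jump_points.items())
    return cont + disc

total_mass = expect(lambda x: 1.0)          # should be close to 1
mean = expect(lambda x: x)                  # E[min(T, c)]
closed_form = (1 - math.exp(-lam * c)) / lam
print(total_mass, mean, closed_form)
```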
4.6 Independence
Definition 4.41. A family of random variables $\{X_i : i \in I\}$ is independent if ∀ finite $J \subset I$, the family of random variables $\{X_i : i \in J\}$ is independent. In words, “an infinite collection of random elements is by definition independent if each finite subcollection is.” Hence, we only need to know how to test independence for a finite collection.
(a) The $(E_i, \mathcal{E}_i)$'s are not required to be the same.
(b) The collection of random variables $\{1_{A_i} : i \in I\}$ is independent iff the collection of events (sets) $\{A_i : i \in I\}$ is independent.
Definition 4.42. Independence among a finite collection of random variables: For finite $I$, the following statements are equivalent.
≡ $(X_i)_{i \in I}$ are independent (or mutually independent [2, p 182]).
≡ $P[X_i \in H_i\ \forall i \in I] = \prod_{i \in I} P[X_i \in H_i]$ where $H_i \in \mathcal{E}_i$
≡ $P\left[(X_i : i \in I) \in \times_{i \in I} H_i\right] = \prod_{i \in I} P[X_i \in H_i]$ where $H_i \in \mathcal{E}_i$
≡ $P^{(X_i : i \in I)}\left(\times_{i \in I} H_i\right) = \prod_{i \in I} P^{X_i}(H_i)$ where $H_i \in \mathcal{E}_i$
≡ $P[X_i \le x_i\ \forall i \in I] = \prod_{i \in I} P[X_i \le x_i]$
≡ [Factorization Criterion] $F_{(X_i : i \in I)}((x_i : i \in I)) = \prod_{i \in I} F_{X_i}(x_i)$
≡ $X_i$ and $X_1^{i-1}$ are independent $\forall i \ge 2$
≡ $\sigma(X_i)$ and $\sigma(X_1^{i-1})$ are independent $\forall i \ge 2$
≡ Discrete random variables $X_i$'s with countable range $E$: $P[X_i = x_i\ \forall i \in I] = \prod_{i \in I} P[X_i = x_i]$ $\forall x_i \in E$, $\forall i \in I$
≡ Absolutely continuous $X_i$ with density $f_{X_i}$: $f_{(X_i : i \in I)}((x_i : i \in I)) = \prod_{i \in I} f_{X_i}(x_i)$
Definition 4.43. If the $X_\alpha$, $\alpha \in I$, are independent and each has the same marginal distribution $Q$, we say that the $X_\alpha$'s are iid (independent and identically distributed) and we write $X_\alpha \overset{iid}{\sim} Q$.
• The abbreviation can be IID [21, p 39].
Definition 4.44. A pairwise independent collection of random variables is a set of random variables any two of which are independent.
(a) Any collection of (mutually) independent random variables is pairwise independent.
(b) Some pairwise independent collections are not independent. See example (4.45).
Example 4.45. Suppose $X$, $Y$, and $Z$ have the following joint probability distribution: $p_{X,Y,Z}(x, y, z) = \frac{1}{4}$ for $(x, y, z) \in \{(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)\}$. This, for example, can be constructed by starting with independent $X$ and $Y$ that are Bernoulli($\frac{1}{2}$). Then set $Z = X \oplus Y = X + Y \bmod 2$.
(a) $X, Y, Z$ are pairwise independent.
(b) The combination of $X$ ⫫ $Z$ and $Y$ ⫫ $Z$ does not imply $(X, Y)$ ⫫ $Z$.
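Example 4.45 is small enough to verify by enumeration. The Python sketch below (an illustration added here; the helper names are hypothetical) stores the joint pmf with exact fractions and tests the product rule for each pair, and then for the vector $(X, Y)$ against $Z$.

```python
from fractions import Fraction
from itertools import product

# Joint pmf of (X, Y, Z) from Example 4.45, with Z = X xor Y.
support = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
pmf = {s: Fraction(1, 4) for s in support}

def marginal(indices):
    """Marginal pmf of the sub-vector selected by the given coordinate indices."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in indices)
        out[key] = out.get(key, 0) + p
    return out

def independent(idx_a, idx_b):
    """Check P[A = a, B = b] = P[A = a] P[B = b] for all values a, b."""
    pa, pb, pab = marginal(idx_a), marginal(idx_b), marginal(idx_a + idx_b)
    return all(pab.get(a + b, 0) == pa[a] * pb[b] for a, b in product(pa, pb))

pairwise = all(independent((i,), (j,)) for i, j in [(0, 1), (0, 2), (1, 2)])
xy_vs_z = independent((0, 1), (2,))
print(pairwise, xy_vs_z)
```

The second check fails because $Z$ is a deterministic function of $(X, Y)$.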
Definition 4.46. The convolution of probability measures $\mu_1$ and $\mu_2$ on $(R, \mathcal{B})$ is the measure $\mu_1 * \mu_2$ defined by
$(\mu_1 * \mu_2)(H) = \int \mu_2(H - x)\, \mu_1(dx), \quad H \in \mathcal{B}_R.$
(a) $\mu_X * \mu_Y = \mu_Y * \mu_X$ and $\mu_X * (\mu_Y * \mu_Z) = (\mu_X * \mu_Y) * \mu_Z$.
(b) If $F_X$ and $F_Y$ are distribution functions corresponding to $\mu_X$ and $\mu_Y$, the distribution function corresponding to $\mu_X * \mu_Y$ is
$(\mu_X * \mu_Y)(-\infty, z] = \int F_Y(z - x)\, \mu_X(dx).$
In this case, it is notationally convenient to replace $\mu_X(dx)$ by $dF_X(x)$ (Stieltjes integral). Then, $(\mu_X * \mu_Y)(-\infty, z] = \int F_Y(z - x)\, dF_X(x)$. This is denoted by $(F_X * F_Y)(z)$. That is,
$F_Z(z) = (F_X * F_Y)(z) = \int F_Y(z - x)\, dF_X(x).$
(c) If the density $f_Y$ exists, $F_X * F_Y$ has density $F_X * f_Y$, where
$(F_X * f_Y)(z) = \int f_Y(z - x)\, dF_X(x).$
(i) If $Y$ (or $F_Y$) is absolutely continuous with density $f_Y$, then for any $X$ (or $F_X$), $X + Y$ (or $F_X * F_Y$) is absolutely continuous with density $(F_X * f_Y)(z) = \int f_Y(z - x)\, dF_X(x)$.
If, in addition, $F_X$ has density $f_X$, then
$\int f_Y(z - x)\, dF_X(x) = \int f_Y(z - x)\, f_X(x)\,dx.$
This is denoted by $f_X * f_Y$. In other words, if densities $f_X$, $f_Y$ exist, then $F_X * F_Y$ has density $f_X * f_Y$, where
$(f_X * f_Y)(z) = \int f_Y(z - x)\, f_X(x)\,dx.$
4.47. If random variables $X$ and $Y$ are independent and have distributions $\mu_X$ and $\mu_Y$, then $X + Y$ has distribution $\mu_X * \mu_Y$.
4.48. Expectation and independence
(a) Let $X$ and $Y$ be nonnegative independent random variables on $(\Omega, \mathcal{A}, P)$; then $E[XY] = EX\, EY$.
(b) If $X_1, X_2, \ldots, X_n$ are independent and the $g_k$'s are complex-valued measurable functions, then the $g_k(X_k)$'s are independent. Moreover, if each $g_k(X_k)$ is integrable, then
$E\left[\prod_{k=1}^{n} g_k(X_k)\right] = \prod_{k=1}^{n} E[g_k(X_k)].$
(c) If $X_1, \ldots, X_n$ are independent and $\sum_{k=1}^{n} X_k$ has a finite second moment, then all $X_k$ have finite second moments as well. Moreover, $\mathrm{Var}\left[\sum_{k=1}^{n} X_k\right] = \sum_{k=1}^{n} \mathrm{Var}\, X_k$.
(d) If pairwise independent $X_i \in L^2$, then $\mathrm{Var}\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} \mathrm{Var}[X_i]$.
4.7 Misc
4.49. The mode of a discrete probability distribution is the value at which its probability mass function takes its maximum value. The mode of a continuous probability distribution is the value at which its probability density function attains its maximum value. • the mode is not necessarily unique, since the probability mass function or probability density function may achieve its maximum value at several points.
5 PMF Examples
The following pmf will be defined on its support S. For Ω larger than S, we will simply put the pmf to be 0.
5.1 Random/Uniform
5.1. $R_n$, $U_n$: When an experiment results in a finite number of “equally likely” or “totally random” outcomes, we model it with a uniform random variable. We say that $X$ is uniformly distributed on $[n]$ if
$P[X = k] = \frac{1}{n}, \quad k \in [n].$
We write $X \sim U_n$.
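X ∼ | Support set | $p_X(k)$ | $\varphi_X(u)$
Uniform $U_n$ | $\{1, 2, \ldots, n\}$ | $\frac{1}{n}$ |
$U_{\{0,1,\ldots,n-1\}}$ | $\{0, 1, \ldots, n-1\}$ | $\frac{1}{n}$ | $\frac{1 - e^{iun}}{n(1 - e^{iu})}$
Bernoulli $B(1, p)$ | $\{0, 1\}$ | $1 - p$ for $k = 0$; $p$ for $k = 1$ |
Binomial $B(n, p)$ | $\{0, 1, \ldots, n\}$ | $\binom{n}{k} p^k (1 - p)^{n-k}$ | $(1 - p + p e^{ju})^n$
Geometric $G(p)$ | $N \cup \{0\}$ | $(1 - p)\, p^k$ | $\frac{1 - \beta}{1 - \beta e^{iu}}$
Geometric $G'(p)$ | $N$ | $(1 - p)^{k-1} p$ |
Poisson $P(\lambda)$ | $N \cup \{0\}$ | $e^{-\lambda} \frac{\lambda^k}{k!}$ | $e^{\lambda(e^{iu} - 1)}$
Table 2: Examples of probability mass functions. Here, p, β ∈ (0, 1). λ > 0.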
• $p_i = \frac{1}{n}$ for $i \in S = \{1, 2, \ldots, n\}$.
• Examples
  ◦ classical game of chance / classical probability drawing at random
  ◦ fair gaming devices (well-balanced coins and dice, well shuffled decks of cards)
  ◦ experiment where
    ∗ there are only n possible outcomes and they are all equally probable
    ∗ there is a balance of information about outcomes
5.2. Uniform on a finite set: U(S). Suppose $|S| = n$; then $p(x) = \frac{1}{n}$ for all $x \in S$.
Example 5.3. For $X$ uniform on [-M:1:M], we have $EX = 0$ and $\mathrm{Var}\, X = \frac{M(M+1)}{3}$. For $X$ uniform on [N:1:M], we have $EX = \frac{M+N}{2}$ and $\mathrm{Var}\, X = \frac{1}{12}(M - N)(M - N + 2)$.
Example 5.4. Set $S = \{0, 1, 2, \ldots, M\}$; then the sum of two independent U(S) has pmf
$p(k) = \frac{(M + 1) - |k - M|}{(M + 1)^2}$
for $k = 0, 1, \ldots, 2M$. Note its triangular shape with maximum value $p(M) = \frac{1}{M+1}$. To visualize the pmf in MATLAB (with M already defined), try
  k = 0:2*M; P = (1/(M+1))*ones(1,M+1); P = conv(P,P); stem(k,P)
5.2 Bernoulli and Binary distributions
5.5. Bernoulli: B(1, p) or Bernoulli(p)
• S = {0, 1}
• $p_0 = q = 1 - p$, $p_1 = p$
• $EX = E[X^2] = p$. $\mathrm{Var}[X] = p - p^2 = p(1 - p)$. Note that the variance is maximized at $p = 0.5$.
[Figure 12: The plot for p(1 − p).]
5.6. Binary: Suppose $X$ takes only two values $a$ and $b$ with $b > a$, and $P[X = b] = p$. Then, $X$ can be expressed as $X = (b - a) I + a$, where $I$ is a Bernoulli random variable with $P[I = 1] = p$.
• $\mathrm{Var}\, X = (b - a)^2\, \mathrm{Var}\, I = (b - a)^2 p(1 - p)$. Note that it is still maximized at $p = 1/2$.
• Suppose $a = -b$. Then, $X = 2bI + a = 2bI - b$. In which case, $\mathrm{Var}\, X = 4b^2 p(1 - p)$.
5.3 Binomial: B(n, p)
• Binomial distribution with size $n$ and parameter $p$. $p \in [0, 1]$.
• $p_i = \binom{n}{i} p^i (1 - p)^{n-i}$ for $i \in S = \{0, 1, 2, \ldots, n\}$
• $X$ is the number of successes in $n$ independent Bernoulli trials and hence the sum of $n$ independent, identically distributed Bernoulli r.v.'s.
• $\varphi_X(u) = (1 - p + p e^{ju})^n$
• $EX = np$
• $EX^2 = (np)^2 + np(1 - p)$
• $\mathrm{Var}[X] = np(1 - p)$
• Tail probability: $\sum_{r=k}^{n} \binom{n}{r} p^r (1 - p)^{n-r} = I_p(k, n - k + 1)$
• Maximum probability value happens at $k_{max} = \text{mode } X = \lfloor (n + 1) p \rfloor \approx np$
  ◦ When $(n+1)p$ is an integer, the maximum is achieved at $k_{max}$ and $k_{max} - 1$.
• If we have $E_1, \ldots, E_n$, $n$ unlinked repetitions of an experiment $E$, and an event $A$ for $E$, then the distribution B(n, p) describes the probability that $A$ occurs $k$ times in $E_1, \ldots, E_n$.
• Gaussian Approximation for Binomial Probabilities: When $n$ is large, the binomial distribution becomes difficult to compute directly because of the need to calculate factorial terms. We can use
$P[X = k] \simeq \frac{1}{\sqrt{2\pi np(1 - p)}}\, e^{-\frac{(k - np)^2}{2np(1 - p)}},$ (13)
which comes from approximating $X$ by a Gaussian $Y$ with the same mean and variance and the relation $P[X = k] \simeq P[X \le k] - P[X \le k - 1] \simeq P[Y \le k] - P[Y \le k - 1] \simeq f_Y(k)$. See also (12.22).
• Approximation: $\binom{n}{k} p^k (1 - p)^{n-k} = \frac{(np)^k}{k!} e^{-np} \left(1 + O\left(np^2, \frac{k^2}{n}\right)\right)$
[Figure 13: Gaussian approximation to Binomial, Poisson distribution, and Gamma distribution. Two panels: p = 0.05, n = 100, λ = 5 and p = 0.05, n = 800, λ = 40; horizontal axis x.]
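A quick way to see how well (13) works is to compare it with the exact binomial pmf for the two parameter settings shown in Figure 13. The sketch below is an added illustration in Python, using only the standard library; the function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial pmf: C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def gauss_approx(k, n, p):
    """Right-hand side of (13): Gaussian density with mean np and variance np(1 - p)."""
    var = n * p * (1 - p)
    return math.exp(-(k - n * p)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

for n, p in [(100, 0.05), (800, 0.05)]:
    k = round(n * p)  # evaluate near the mode
    exact, approx = binom_pmf(k, n, p), gauss_approx(k, n, p)
    print(n, p, k, exact, approx, abs(approx - exact) / exact)
```

In this comparison the relative error at the mode shrinks as $np(1-p)$ grows, which is consistent with the remark that the approximation is meant for large $n$.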
5.4 Geometric: G(β)
A geometric distribution is defined by the fact that for some $\beta \in [0, 1)$, $p_{k+1} = \beta p_k$ for all $k \in S$, where $S$ can be either $N$ or $N \cup \{0\}$.
• When its support is $N$, $p_k = (1 - \beta)\, \beta^{k-1}$. This is referred to as $G_1(\beta)$ or geometric$_1(\beta)$. In MATLAB, use geopdf(k-1,1-beta).
• When its support is $N \cup \{0\}$, $p_k = (1 - \beta)\, \beta^k$. This is referred to as $G_0(\beta)$ or geometric$_0(\beta)$. In MATLAB, use geopdf(k,1-beta).
5.7. Consider $X \sim G_0(\beta)$.
• $p_i = (1 - \beta)\, \beta^i$, for $S = N \cup \{0\}$, $0 \le \beta < 1$
• $\beta = \frac{m}{m+1}$ where $m$ = average waiting time / lifetime
• $P[X = k] = P[k \text{ failures followed by a success}] = (P[\text{failure}])^k\, P[\text{success}]$
  $P[X \ge k] = \beta^k$ = the probability of having at least $k$ initial failures = the probability of having to perform at least $k + 1$ trials.
  $P[X > k] = \beta^{k+1}$ = the probability of having at least $k + 1$ initial failures.
• Memoryless property:
  ◦ $P[X \ge k + c \mid X \ge k] = P[X \ge c]$, $k, c > 0$.
  ◦ $P[X > k + c \mid X \ge k] = P[X > c]$, $k, c > 0$.
  ◦ If a success has not occurred in the first $k$ trials (already failed $k$ times), then the probability of having to perform at least $j$ more trials is the same as the probability of initially having to perform at least $j$ trials.
  ◦ Each time a failure occurs, the system “forgets” and begins anew as if it were performing the first trial.
  ◦ Geometric r.v. is the only discrete r.v. that satisfies the memoryless property.
• Ex.
  ◦ lifetimes of components, measured in discrete time units, when they fail catastrophically (without degradation due to aging)
  ◦ waiting times
    ∗ for next customer in a queue
    ∗ between radioactive disintegrations
    ∗ between photon emission
  ◦ number of repeated, unlinked random experiments that must be performed prior to the first occurrence of a given event $A$
    ∗ number of coin tosses prior to the first appearance of a ‘head’
    ∗ number of trials required to observe the first success
• The sum of independent $G_0(p)$ and $G_0(q)$ has pmf
  $(1 - p)(1 - q)\, \frac{q^{k+1} - p^{k+1}}{q - p}$ if $p \ne q$, and $(k + 1)(1 - p)^2 p^k$ if $p = q$,
  for $k \in N \cup \{0\}$.
5.8. Consider $X \sim G_1(\beta)$.
• $P[X > k] = \beta^k$
• Suppose independent $X_i \sim G_1(\beta_i)$. Then $\min(X_1, X_2, \ldots, X_n) \sim G_1\left(\prod_{i=1}^{n} \beta_i\right)$.
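The tail formula, the memoryless property, and the pmf of a sum of two $G_0$ variables in 5.7 can be checked numerically. The Python sketch below is an added illustration; the parameter values, the truncation length of the tail sum, and the function names are assumptions.

```python
def g0_pmf(k, beta):
    """pmf of G0(beta): (1 - beta) * beta^k on {0, 1, 2, ...}."""
    return (1 - beta) * beta**k

def tail_ge(k, beta, terms=5000):
    """P[X >= k] by direct summation, truncated after `terms` terms."""
    return sum(g0_pmf(j, beta) for j in range(k, k + terms))

def sum_pmf_by_convolution(k, p, q):
    """pmf at k of the sum of independent G0(p) and G0(q), by discrete convolution."""
    return sum(g0_pmf(j, p) * g0_pmf(k - j, q) for j in range(k + 1))

def sum_pmf_closed_form(k, p, q):
    """Closed form quoted in 5.7."""
    if p == q:
        return (k + 1) * (1 - p)**2 * p**k
    return (1 - p) * (1 - q) * (q**(k + 1) - p**(k + 1)) / (q - p)

beta = 0.7
print(tail_ge(3, beta), beta**3)                        # P[X >= k] against beta^k
print(tail_ge(5, beta) / tail_ge(3, beta), tail_ge(2, beta))  # memoryless property
print(sum_pmf_by_convolution(4, 0.3, 0.6), sum_pmf_closed_form(4, 0.3, 0.6))
```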
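5.5 Poisson Distribution: P(λ)
5.9. Characterized by
• $p_X(k) = P[X = k] = e^{-\lambda} \frac{\lambda^k}{k!}$; or equivalently,
• $\varphi_X(u) = e^{\lambda(e^{iu} - 1)}$,
where $\lambda \in (0, \infty)$ is called the parameter or intensity parameter of the distribution. In MATLAB, use poisspdf(k,lambda).
5.10. Denoted by $P(\lambda)$.
5.11. Instead of $X$, a Poisson random variable is usually denoted by $\Lambda$.
5.12. $EX = \mathrm{Var}\, X = \lambda$.
5.13. Successive probabilities are connected via the relation $k\, p_X(k) = \lambda\, p_X(k - 1)$.
5.14. mode $X = \lfloor \lambda \rfloor$.
• Note that when $\lambda \in N$, there are two maxima, at $\lambda - 1$ and $\lambda$.
• When $\lambda \gg 1$, $p_X(\lfloor \lambda \rfloor) \approx \frac{1}{\sqrt{2\pi\lambda}}$ via Stirling's formula (1.13).

λ | Most probable value ($i_{max}$) | Associated max probability
$0 < \lambda < 1$ | $0$ | $e^{-\lambda}$
$\lambda \in N$ | $\lambda - 1$, $\lambda$ | $\frac{\lambda^{\lambda}}{\lambda!} e^{-\lambda}$
$\lambda \ge 1$, $\lambda \notin N$ | $\lfloor \lambda \rfloor$ | $\frac{\lambda^{\lfloor \lambda \rfloor}}{\lfloor \lambda \rfloor !} e^{-\lambda}$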
5.15. $P[X \ge 2] = 1 - e^{-\lambda} - \lambda e^{-\lambda} = O(\lambda^2)$. The cumulative probabilities can be found by
$P[X \le k] \overset{(*)}{=} P\left[\sum_{i=1}^{k+1} X_i > 1\right] = \frac{1}{\Gamma(k + 1)} \int_{\lambda}^{\infty} e^{-t} t^k\,dt,$
$P[X > k] = P[X \ge k + 1] \overset{(*)}{=} P\left[\sum_{i=1}^{k+1} X_i \le 1\right] = \frac{1}{\Gamma(k + 1)} \int_{0}^{\lambda} e^{-t} t^k\,dt,$
where the $X_i$'s are i.i.d. $E(\lambda)$. The equalities given by (*) are easily obtained via counting the number of events from a rate-λ Poisson process on the interval [0, 1].
5.16. Fano factor (index of dispersion): $\frac{\mathrm{Var}\, X}{EX} = 1$
An important property of the Poisson and Compound Poisson laws is that their classes are closed under convolution (independent summation). In particular, we have divisibility properties (5.21) and (5.31), which are straightforward to prove from their characteristic functions.
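The gamma-integral expressions for the Poisson cumulative probabilities in 5.15 can be checked by numerical integration. The Python sketch below is an added illustration; the values of k and λ, the upper cutoff 60 that stands in for infinity, and the function names are assumptions.

```python
import math

def poisson_cdf(k, lam):
    """P[X <= k] by direct summation of the pmf."""
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k + 1))

def gamma_integral(k, lo, hi, steps=200_000):
    """(1 / Gamma(k + 1)) * int_lo^hi e^{-t} t^k dt, by the midpoint rule."""
    h = (hi - lo) / steps
    total = sum(math.exp(-(lo + (i + 0.5) * h)) * (lo + (i + 0.5) * h)**k
                for i in range(steps))
    return total * h / math.gamma(k + 1)

k, lam = 4, 3.5  # hypothetical values
upper_tail = gamma_integral(k, 0.0, lam)    # compare with P[X > k]
lower_part = gamma_integral(k, lam, 60.0)   # compare with P[X <= k]
print(upper_tail, 1 - poisson_cdf(k, lam))
print(lower_part, poisson_cdf(k, lam))
```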
5.17 (Recursion equations). Suppose $X \sim P(\lambda)$. Let $m_k(\lambda) = E[X^k]$ and $\mu_k(\lambda) = E[(X - EX)^k]$. Then
$m_{k+1}(\lambda) = \lambda\left(m_k(\lambda) + m_k'(\lambda)\right)$ (14)
$\mu_{k+1}(\lambda) = \lambda\left(k\, \mu_{k-1}(\lambda) + \mu_k'(\lambda)\right)$ (15)
[14, p 112]. Starting with $m_1 = \lambda = \mu_2$ and $\mu_1 = 0$, the above equations lead to recursive determination of the moments $m_k$ and $\mu_k$.
5.18. $E\left[\frac{1}{X+1}\right] = \frac{1}{\lambda}\left(1 - e^{-\lambda}\right)$. Because, for $d \in N$, $Y = \frac{1}{X+1} \sum_{n=0}^{d} a_n X^n$ can be expressed as $\sum_{n=0}^{d-1} b_n X^n + \frac{c}{X+1}$, the value of $EY$ is easy to find if we know $EX^n$.
5.19. Mixed Poisson distribution: Let $X$ be Poisson with mean $\lambda$. Suppose that the mean $\lambda$ is chosen in accord with a probability distribution whose characteristic function is $\varphi_\Lambda$. Then,
$\varphi_X(u) = E\left[E\left[e^{iuX} \mid \Lambda\right]\right] = E\left[e^{\Lambda(e^{iu} - 1)}\right] = E\left[e^{i(-i(e^{iu} - 1))\Lambda}\right] = \varphi_\Lambda\left(-i\left(e^{iu} - 1\right)\right).$
• $EX = E\Lambda$.
• $\mathrm{Var}\, X = \mathrm{Var}\, \Lambda + E\Lambda$.
• $E[X^2] = E[\Lambda^2] + E\Lambda$.
• $\mathrm{Var}[X \mid \Lambda] = E[X \mid \Lambda] = \Lambda$.
• When $\Lambda$ is a nonnegative integer-valued random variable, we have $G_X(z) = G_\Lambda(e^{z-1})$ and $P[X = 0] = G_\Lambda\left(\frac{1}{e}\right)$.
• $E[X\Lambda] = E[\Lambda^2]$
• $\mathrm{Cov}[X, \Lambda] = \mathrm{Var}\, \Lambda$
5.20. Thinned Poisson: Suppose we have $X \to [s] \to Y$ where $X \sim P(\lambda)$. The box $[s]$ is a binomial channel with success probability $s$. (Each 1 in $X$ gets through the channel with success probability $s$.)
• Note that $Y$ is in fact a random sum $\sum_{i=1}^{X} I_i$ where the i.i.d. $I_i$ have a Bernoulli distribution with parameter $s$.
• $Y \sim P(s\lambda)$;
• $p(x \mid y) = e^{-\lambda(1-s)} \frac{(\lambda(1-s))^{x-y}}{(x-y)!}$; $x \ge y$ (shifted Poisson);
[Levy and Baxter, 2002]
5.21. Finite additivity: Suppose we have independent $\Lambda_i \sim P(\lambda_i)$; then $\sum_{i=1}^{n} \Lambda_i \sim P\left(\sum_{i=1}^{n} \lambda_i\right)$.
5.22. Raikov's theorem: independent random variables can have their sum Poisson-distributed only if every component of the sum is Poisson-distributed.
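The thinning property in 5.20 can be confirmed by summing out $X$ numerically. In the Python sketch below (an added illustration; the values λ = 4 and s = 0.3, the truncation point, and the function names are assumptions), the marginal pmf of $Y$ is compared with $P(s\lambda)$, and the conditional pmf of $X$ given $Y = y$ with the shifted Poisson form.

```python
import math

def poisson_pmf(k, lam):
    """e^{-lam} lam^k / k!, evaluated in log space to avoid overflow for large k."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def thinned_pmf(y, lam, s, x_max=150):
    """P[Y = y] = sum_x P[X = x] P[Y = y | X = x], truncated at x_max."""
    return sum(poisson_pmf(x, lam) * binom_pmf(y, x, s) for x in range(y, x_max + 1))

def posterior(x, y, lam, s):
    """P[X = x | Y = y] by Bayes' rule."""
    return poisson_pmf(x, lam) * binom_pmf(y, x, s) / poisson_pmf(y, s * lam)

lam, s = 4.0, 0.3  # hypothetical rate and channel success probability
for y in range(4):
    print(y, thinned_pmf(y, lam, s), poisson_pmf(y, s * lam))
print(posterior(5, 2, lam, s), poisson_pmf(5 - 2, lam * (1 - s)))
```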
5.23. Countable Additivity Theorem [11, p 5]: Let $(X_j : j \in N)$ be independent random variables, and assume that $X_j$ has the distribution $P(\mu_j)$ for each $j$. If
$\sum_{j=1}^{\infty} \mu_j$ (16)
converges to $\mu$, then $S = \sum_{j=1}^{\infty} X_j$ converges with probability 1, and $S$ has distribution $P(\mu)$. If on the other hand (16) diverges, then $S$ diverges with probability 1.
5.24. Let $X_1, X_2, \ldots, X_n$ be independent, and let $X_j$ have distribution $P(\mu_j)$ for all $j$. Then $S_n = \sum_{j=1}^{n} X_j$ has distribution $P(\mu)$, with $\mu = \sum_{j=1}^{n} \mu_j$; and so, whenever $\sum_{j=1}^{n} r_j = s$,
$P[X_j = r_j\ \forall j \mid S_n = s] = \frac{s!}{r_1!\, r_2! \cdots r_n!} \prod_{j=1}^{n} \left(\frac{\mu_j}{\mu}\right)^{r_j},$
which follows the multinomial distribution [11, p 6–7].
• If $X$ and $Y$ are independent Poisson random variables with respective parameters $\lambda$ and $\mu$, then (1) $Z = X + Y$ is $P(\lambda + \mu)$ and (2) conditioned on $Z = z$, $X$ is $B\left(z, \frac{\lambda}{\lambda + \mu}\right)$. So, $E[X \mid Z] = \frac{\lambda}{\lambda + \mu} Z$, $\mathrm{Var}[X \mid Z] = Z \frac{\lambda\mu}{(\lambda + \mu)^2}$, and $E[\mathrm{Var}[X \mid Z]] = \frac{\lambda\mu}{\lambda + \mu}$.
5.25. One of the reasons why the Poisson distribution is important is that many natural phenomena can be modeled by Poisson processes. For example, if we consider the number of occurrences $\Lambda$ during a time interval of length $\tau$ in a rate-λ homogeneous Poisson process, then $\Lambda \sim P(\lambda\tau)$.
Example 5.26.
• The first use of the Poisson model is said to have been by a Prussian physician, von Bortkiewicz, who found that the annual number of late-19th-century Prussian soldiers kicked to death by horses followed a Poisson distribution [7, p 150].
• #photons emitted by a light source of intensity λ [photons/second] in time τ
• #atoms of radioactive material undergoing decay in time τ
• #clicks in a Geiger counter in τ seconds when the average number of clicks in 1 second is λ
• #dopant atoms deposited to make a small device such as an FET
• #customers arriving in a queue or workstations requesting service from a file server in time τ
• Counts of demands for telephone connections
• number of occurrences of rare events in time τ
• #soldiers kicked to death by horses
• Counts of defects in a semiconductor chip.
5.27. Normal Approximation to Poisson Distribution with large λ: Let $X \sim P(\lambda)$. $X$ can be thought of as a sum of i.i.d. $X_i \sim P(\lambda_n)$, i.e., $X = \sum_{i=1}^{n} X_i$, where $n\lambda_n = \lambda$. Hence $X$ is approximately normal $N(\lambda, \lambda)$ for λ large. Some say that the normal approximation is good when λ > 5.
5.28. The Poisson distribution can be obtained as a limit of negative binomial distributions. Thus, the negative binomial distribution with parameters $r$ and $p$ can be approximated by the Poisson distribution with parameter $\lambda = \frac{rq}{p}$ (mean-matching), provided that $p$ is “sufficiently” close to 1 and $r$ is “sufficiently” large.
5.29. Convergence of sums of Bernoulli random variables to the Poisson law: Suppose that for each $n \in N$, $X_{n,1}, X_{n,2}, \ldots, X_{n,r_n}$ are independent; the probability space for the sequence may change with $n$. Such a collection is called a triangular array [1] or double sequence [8], which captures the nature of the collection when it is arranged as
  $X_{1,1}, X_{1,2}, \ldots, X_{1,r_1},$
  $X_{2,1}, X_{2,2}, \ldots, X_{2,r_2},$
  $\vdots$
  $X_{n,1}, X_{n,2}, \ldots, X_{n,r_n},$
  $\vdots$
where the random variables in each row are independent. Let $S_n = X_{n,1} + X_{n,2} + \cdots + X_{n,r_n}$ be the sum of the random variables in the $n$th row.
Consider a triangular array of Bernoulli random variables $X_{n,k}$ with $P[X_{n,k} = 1] = p_{n,k}$. If $\max_{1 \le k \le r_n} p_{n,k} \to 0$ and $\sum_{k=1}^{r_n} p_{n,k} \to \lambda$ as $n \to \infty$, then the sums $S_n$ converge in distribution to the Poisson law. In other words, the Poisson distribution is the rare-events limit of the binomial (large $n$, small $p$).
As a simple special case, consider a triangular array of Bernoulli random variables $X_{n,k}$ with $P[X_{n,k} = 1] = p_n$. If $np_n \to \lambda$ as $n \to \infty$, then the sums $S_n$ converge in distribution to the Poisson law.
To show this special case directly, we bound the first $i$ terms of $n!$ to get $\frac{(n-i)^i}{i!} \le \binom{n}{i} \le \frac{n^i}{i!}$. Using the upper bound,
$\binom{n}{i} p_n^i (1 - p_n)^{n-i} \le \frac{1}{i!}\, (np_n)^i\, (1 - p_n)^{-i} \left(1 - \frac{np_n}{n}\right)^n,$
where $(np_n)^i \to \lambda^i$, $(1 - p_n)^{-i} \to 1$, and $\left(1 - \frac{np_n}{n}\right)^n \to e^{-\lambda}$. The lower bound gives the same limit because $(n - i)^i = \left(\frac{n-i}{n}\right)^i n^i$, where the first term → 1.
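The special case at the end of 5.29 can be watched numerically: with $p_n = \lambda/n$, the binomial pmf approaches the Poisson pmf as $n$ grows. The Python sketch below is an added illustration; λ = 2, the choice of $n$ values, and the cap on $k$ are assumptions.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(n, lam, k_cap=50):
    """Largest |B(n, lam/n) pmf - P(lam) pmf| over k = 0, ..., min(n, k_cap)."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(min(n, k_cap) + 1))

lam = 2.0  # hypothetical limit of n * p_n
gaps = [max_pmf_gap(n, lam) for n in (10, 100, 1000)]
print(gaps)
```

The gap is measured only pointwise on the pmf; it is not the total variation distance.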
5.6 Compound Poisson
Given an arbitrary probability measure $\mu$ and a positive real number $\lambda$, the compound Poisson distribution $CP(\lambda, \mu)$ is the distribution of the sum $\sum_{j=1}^{\Lambda} V_j$, where the $V_j$ are i.i.d. with distribution $\mu$ and $\Lambda$ is a $P(\lambda)$ random variable, independent of the $V_j$. Sometimes, it is written as $POIS(\lambda\mu)$. The parameter $\lambda$ is called the rate of $CP(\lambda, \mu)$ and $\mu$ is called the base distribution.
5.30. The mean and variance of $CP(\lambda, \mathcal{L}(V))$ are $\lambda EV$ and $\lambda EV^2$ respectively.
5.31. If $Z \sim CP(\lambda, q)$, then $\varphi_Z(t) = e^{\lambda(\varphi_q(t) - 1)}$.
5.32. Divisibility property of the compound Poisson law: Suppose we have independent $\Lambda_i \sim CP\left(\lambda_i, \mu^{(i)}\right)$; then $\sum_{i=1}^{n} \Lambda_i \sim CP\left(\lambda, \frac{1}{\lambda} \sum_{i=1}^{n} \lambda_i \mu^{(i)}\right)$ where $\lambda = \sum_{i=1}^{n} \lambda_i$.
Proof. Writing $Z_i$ for $\Lambda_i$ and $q_i$ for $\mu^{(i)}$,
$\varphi_{\sum_{i=1}^{n} Z_i}(t) = \prod_{i=1}^{n} e^{\lambda_i(\varphi_{q_i}(t) - 1)} = \exp\left(\sum_{i=1}^{n} \lambda_i\left(\varphi_{q_i}(t) - 1\right)\right) = \exp\left(\sum_{i=1}^{n} \lambda_i \varphi_{q_i}(t) - \lambda\right) = \exp\left(\lambda\left(\frac{1}{\lambda} \sum_{i=1}^{n} \lambda_i \varphi_{q_i}(t) - 1\right)\right).$
We usually focus on the case when $\mu$ is a discrete probability measure on $N = \{1, 2, \ldots\}$. In which case, we usually refer to $\mu$ by the pmf $q$ on $N$; $q$ is called the base pmf. Equivalently, $CP(\lambda, q)$ is also the distribution of the sum $\sum_{i \in N} i\Lambda_i$ where $(\Lambda_i : i \in N)$ are independent with $\Lambda_i \sim P(\lambda q_i)$. Note that $\sum_{i \in N} \lambda q_i = \lambda$. The Poisson distribution is a special case of the compound Poisson distribution where we set $q$ to be the point mass at 1.
5.33. The compound negative binomial [Bower, Gerber, Hickman, Jones, and Nesbitt, 1982, Ch 11] can be approximated by the compound Poisson distribution.
5.7 Hypergeometric
An urn contains $N$ white balls and $M$ black balls. One draws $n$ balls without replacement, so $n \le N + M$. One gets $X$ white balls and $n - X$ black balls.
$P[X = x] = \frac{\binom{N}{x}\binom{M}{n-x}}{\binom{N+M}{n}}$ if $0 \le x \le N$ and $0 \le n - x \le M$, and $P[X = x] = 0$ otherwise.
5.34. The hypergeometric distributions “converge” to the binomial distribution: Assume that $n$ is fixed, while $N$ and $M$ increase to $+\infty$ with $\lim_{N \to \infty, M \to \infty} \frac{N}{N+M} = p$; then
$p(x) \to \binom{n}{x} p^x (1 - p)^{n-x}$ (binomial).
Note that binomial is just drawing balls with replacement:
$p(x) = \binom{n}{x} \frac{N^x M^{n-x}}{(N + M)^n} = \binom{n}{x} \left(\frac{N}{N+M}\right)^x \left(\frac{M}{N+M}\right)^{n-x}.$
Intuitively, when $N$ and $M$ are large, there is not much difference between drawing $n$ balls with or without replacement.
5.35. Extension: Suppose we have $m$ colors and $N_i$ balls of color $i$. The urn contains $N = N_1 + \cdots + N_m$ balls. One draws $n$ balls without replacement. Call $X_i$ the number of balls of color $i$ drawn among the $n$ balls. (Of course, $X_1 + \cdots + X_m = n$.)
$P[X_1 = x_1, \ldots, X_m = x_m] = \frac{\binom{N_1}{x_1}\binom{N_2}{x_2} \cdots \binom{N_m}{x_m}}{\binom{N}{n}}$ if $x_1 + \cdots + x_m = n$ and $x_i \ge 0$, and $0$ otherwise.
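The convergence in 5.34 can be illustrated by scaling the urn while keeping the proportion of white balls fixed. The Python sketch below is an added illustration; n = 5, p = 0.3, the scale factors, and the function names are assumptions.

```python
import math

def hypergeom_pmf(x, N, M, n):
    """P[X = x] when n balls are drawn without replacement from N white and M black."""
    if 0 <= x <= N and 0 <= n - x <= M:
        return math.comb(N, x) * math.comb(M, n - x) / math.comb(N + M, n)
    return 0.0

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 0.3
gaps = []
for scale in (1, 100, 10_000):
    N, M = 3 * scale, 7 * scale  # N / (N + M) = 0.3 at every scale
    gaps.append(max(abs(hypergeom_pmf(x, N, M, n) - binom_pmf(x, n, p))
                    for x in range(n + 1)))
print(gaps)
```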
5.8 Negative Binomial Distribution (Pascal / Pólya distribution)
5.36. The probability that the $r$th success occurs on the $(x + r)$th trial is
$p \binom{x+r-1}{r-1} p^{r-1} (1 - p)^x = \binom{x+r-1}{r-1} p^r (1 - p)^x = \binom{x+r-1}{x} p^r (1 - p)^x,$
i.e., among the first $(x + r - 1)$ trials, there are $r - 1$ successes and $x$ failures.
• Fix $r$.
• $\varphi_X(u) = p^r \frac{1}{\left(1 - (1 - p) e^{iu}\right)^r}$
• $EX = \frac{rq}{p}$, $\mathrm{Var}(X) = \frac{rq}{p^2}$, $q = 1 - p$
• Note that if we define $\binom{n}{x} \equiv \frac{n(n-1) \cdots (n - (x-1))}{x(x-1) \cdots 1}$, then
$\binom{-r}{x} = (-1)^x\, \frac{r(r+1) \cdots (r + (x-1))}{x(x-1) \cdots 1} = (-1)^x \binom{r+x-1}{x}.$
• If independent $X_i \sim \mathrm{NegBin}(r_i, p)$, then $\sum_i X_i \sim \mathrm{NegBin}\left(\sum_i r_i, p\right)$. This is easy to see from the characteristic function.
• $p(x) = \frac{\Gamma(r + x)}{\Gamma(r)\, x!} p^r (1 - p)^x$
5.37. A negative binomial distribution can arise as a mixture of Poisson distributions with mean distributed as a gamma distribution $\Gamma\left(q = r, \lambda = \frac{p}{1-p}\right)$.
Let $X$ be Poisson with mean $\lambda$. Suppose that the mean $\lambda$ is chosen in accord with a probability distribution $F_\Lambda(\lambda)$. Then, $\varphi_X(u) = \varphi_\Lambda\left(-i\left(e^{iu} - 1\right)\right)$ [see the mixed Poisson distribution in 5.19]. Here, $\Lambda \sim \Gamma(q, \lambda_0)$; hence, $\varphi_\Lambda(u) = \frac{1}{\left(1 - i\frac{u}{\lambda_0}\right)^q}$.
So, $\varphi_X(u) = \left(1 - \frac{e^{iu} - 1}{\lambda_0}\right)^{-q} = \left(\frac{\frac{\lambda_0}{\lambda_0 + 1}}{1 - \frac{1}{\lambda_0 + 1} e^{iu}}\right)^q$, which is negative binomial with $p = \frac{\lambda_0}{\lambda_0 + 1}$.
So, it is also known as the Poisson-gamma distribution, or simply compound Poisson distribution.
5.9 Beta-binomial distribution
A variable with a beta-binomial distribution is distributed as a binomial distribution with parameter $p$, where $p$ is distributed according to a beta distribution with parameters α and β (written $q_1$ and $q_2$ below).
• $P(k \mid p) = \binom{n}{k} p^k (1 - p)^{n-k}$.
• $f(p) = f_{\beta_{q_1, q_2}}(p) = \frac{\Gamma(q_1 + q_2)}{\Gamma(q_1)\Gamma(q_2)}\, p^{q_1 - 1} (1 - p)^{q_2 - 1}\, 1_{(0,1)}(p)$
• pmf: $P(k) = \binom{n}{k} \frac{\Gamma(q_1 + q_2)}{\Gamma(q_1)\Gamma(q_2)} \frac{\Gamma(k + q_1)\Gamma(n - k + q_2)}{\Gamma(q_1 + q_2 + n)} = \binom{n}{k} \frac{\beta(k + q_1, n - k + q_2)}{\beta(q_1, q_2)}$
• $EX = \frac{n q_1}{q_1 + q_2}$
• $\mathrm{Var}\, X = \frac{n q_1 q_2 (n + q_1 + q_2)}{(q_1 + q_2)^2 (1 + q_1 + q_2)}$
5.10 Zipf or zeta random variable
5.38. $P[X = k] = \frac{1}{\xi(p)} \frac{1}{k^p}$ where $k \in N$, $p > 1$, and $\xi$ is the zeta function defined in (A.11).
5.39. $E[X^n] = \frac{\xi(p - n)}{\xi(p)}$ is finite for $n < p - 1$, and $E[X^n] = \infty$ for $n \ge p - 1$.
6 PDF Examples
6.1 Uniform Distribution
6.1. Characterization for uniform[a, b]:
(a) $f(x) = \frac{1}{b-a}\, U(x - a)\, U(b - x)$, i.e., $f(x) = 0$ for $x < a$ or $x > b$, and $f(x) = \frac{1}{b-a}$ for $a \le x \le b$.
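X ∼ | $f_X(x)$ | $\varphi_X(u)$
Uniform $U(a, b)$ | $\frac{1}{b-a}\, 1_{[a,b]}(x)$ | $e^{iu\frac{b+a}{2}}\, \frac{\sin\left(u\frac{b-a}{2}\right)}{u\frac{b-a}{2}}$
Exponential $E(\lambda)$ | $\lambda e^{-\lambda x}\, 1_{[0,\infty)}(x)$ | $\frac{\lambda}{\lambda - iu}$
Shifted Exponential $(\mu, s_0)$ | $\frac{1}{\mu - s_0}\, e^{-\frac{x - s_0}{\mu - s_0}}\, 1_{[s_0,\infty)}(x)$ |
Truncated Exp. | $\frac{\alpha}{e^{-\alpha a} - e^{-\alpha b}}\, e^{-\alpha x}\, 1_{[a,b]}(x)$ |
Laplacian $L(\alpha)$ | $\frac{\alpha}{2} e^{-\alpha |x|}$ | $\frac{\alpha^2}{\alpha^2 + u^2}$
Normal $N(m, \sigma^2)$ | $\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}$ | $e^{ium - \frac{1}{2}\sigma^2 u^2}$
Normal $N(m, \Lambda)$ | $\frac{1}{(2\pi)^{n/2}\sqrt{\det(\Lambda)}}\, e^{-\frac{1}{2}(x-m)^T \Lambda^{-1} (x-m)}$ | $e^{ju^T m - \frac{1}{2} u^T \Lambda u}$
Gamma $\Gamma(q, \lambda)$ | $\frac{\lambda^q x^{q-1} e^{-\lambda x}}{\Gamma(q)}\, 1_{(0,\infty)}(x)$ | $\frac{1}{\left(1 - i\frac{u}{\lambda}\right)^q}$
Pareto $Par(\alpha)$ | $\alpha x^{-(\alpha+1)}\, 1_{[1,\infty)}(x)$ |
$Par(\alpha, c) = c\, Par(\alpha)$ | $\frac{\alpha}{c}\left(\frac{c}{x}\right)^{\alpha+1} 1_{(c,\infty)}(x)$ |
Beta $\beta(q_1, q_2)$ | $\frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}\, x^{q_1-1}(1-x)^{q_2-1}\, 1_{(0,1)}(x)$ |
Beta prime | $\frac{\Gamma(q_1+q_2)}{\Gamma(q_1)\Gamma(q_2)}\, \frac{x^{q_1-1}}{(x+1)^{q_1+q_2}}\, 1_{(0,\infty)}(x)$ |
Rayleigh | $2\alpha x e^{-\alpha x^2}\, 1_{[0,\infty)}(x)$ |
Standard Cauchy | $\frac{1}{\pi} \frac{1}{1 + x^2}$ |
$Cau(\alpha)$ | $\frac{1}{\pi} \frac{\alpha}{\alpha^2 + x^2}$ |
$Cau(\alpha, d)$ | $\frac{\Gamma(d)}{\sqrt{\pi}\, \alpha\, \Gamma\left(d - \frac{1}{2}\right)} \frac{1}{\left(1 + \left(\frac{x}{\alpha}\right)^2\right)^d}$ |
Log Normal $e^{N(\mu, \sigma^2)}$ | $\frac{1}{\sigma x \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2} 1_{(0,\infty)}(x)$ |
Table 3: Examples of probability density functions. Here, $c, \alpha, q, q_1, q_2, \sigma, \lambda$ are all strictly positive and $d > \frac{1}{2}$. $\gamma = -\psi(1) \approx .5772$ is the Euler constant. $\psi(z) = \frac{d}{dz} \log \Gamma(z) = (\log e) \frac{\Gamma'(z)}{\Gamma(z)}$ is the digamma function. $B(q_1, q_2) = \frac{\Gamma(q_1)\Gamma(q_2)}{\Gamma(q_1 + q_2)}$ is the beta function.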
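(b) $F(x) = 0$ for $x < a$; $F(x) = \frac{x-a}{b-a}$ for $a \le x \le b$; $F(x) = 1$ for $x > b$.
(c) $\varphi_X(u) = e^{iu\frac{b+a}{2}}\, \frac{\sin\left(u\frac{b-a}{2}\right)}{u\frac{b-a}{2}}$
(d) $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$.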
6.2. For most purposes, it does not matter whether the value of the density $f$ at the endpoints is 0 or $\frac{1}{b-a}$.
Example 6.3.
• Phase of oscillators ⇒ [−π, π] or [0, 2π]
• Phase of received signals in incoherent communications → usual broadcast carrier phase $\phi \sim U(-\pi, \pi)$
• Mobile cellular communication: multipath → path phases $\phi_c \sim U(-\pi, \pi)$
• Use with caution to represent ignorance about a parameter taking value in [a, b].
6.4. $EX = \frac{a+b}{2}$, $\mathrm{Var}\, X = \frac{(b-a)^2}{12}$, $E[X^2] = \frac{1}{3}\left(b^2 + ab + a^2\right)$.
6.5. The product $X$ of two independent U[0, 1] has $f_X(x) = -(\ln x)\, 1_{[0,1]}(x)$ and $F_X(x) = x - x \ln x$ on [0, 1]. This comes from
$P[X > x] = \int_0^1 \left(1 - F_U\left(\frac{x}{t}\right)\right) dt = \int_x^1 \left(1 - \frac{x}{t}\right) dt.$
6.2 Gaussian Distribution
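6.6. Gaussian distribution:
(a) Denoted by $N(m, \sigma^2)$. $N(0, 1)$ is the standard Gaussian (normal) distribution.
(b) $f_X(x) = \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2}$.
(c) $F_X(x)$ = normcdf(x,m,sigma) in MATLAB.
• The standard normal cdf is sometimes denoted by $\Phi(x)$. It inherits all properties of a cdf. Moreover, note that $\Phi(-x) = 1 - \Phi(x)$.
(d) $\varphi_X(v) = E\left[e^{jvX}\right] = e^{jmv - \frac{1}{2} v^2 \sigma^2}$.
(e) $M_X(s) = e^{sm + \frac{1}{2} s^2 \sigma^2}$
[Figure 14: Probability density function of X ∼ N(m, σ²), with the 68% region within one σ of the mean and the 95% region within two σ marked.]
(f) Fourier transform: $\mathcal{F}\{f_X\} = \int_{-\infty}^{\infty} f_X(x)\, e^{-j\omega x}\,dx = e^{-j\omega m - \frac{1}{2}\omega^2 \sigma^2}$.
(g) $P[X > x] = P[X \ge x] = Q\left(\frac{x-m}{\sigma}\right) = 1 - \Phi\left(\frac{x-m}{\sigma}\right) = \Phi\left(-\frac{x-m}{\sigma}\right)$.
$P[X < x] = P[X \le x] = 1 - Q\left(\frac{x-m}{\sigma}\right) = Q\left(-\frac{x-m}{\sigma}\right) = \Phi\left(\frac{x-m}{\sigma}\right)$.
6.7. Properties
(a) $P[|X - \mu| < \sigma] = 0.6827$; $P[|X - \mu| > \sigma] = 0.3173$
$P[|X - \mu| > 2\sigma] = 0.0455$; $P[|X - \mu| < 2\sigma] = 0.9545$
(b) Moments and central moments:
(i) $E\left[(X - \mu)^k\right] = (k - 1)\sigma^2 E\left[(X - \mu)^{k-2}\right]$, which equals $0$ for $k$ odd and $1 \cdot 3 \cdot 5 \cdots (k - 1)\, \sigma^k$ for $k$ even.
(ii) $E\left[|X - \mu|^k\right] = 2 \cdot 4 \cdot 6 \cdots (k - 1)\, \sigma^k \sqrt{\frac{2}{\pi}}$ for $k$ odd, and $1 \cdot 3 \cdot 5 \cdots (k - 1)\, \sigma^k$ for $k$ even.
(iii) $\mathrm{Var}[X^2] = 4\mu^2\sigma^2 + 2\sigma^4$.

$n$ | $EX^n$ | $E[(X - \mu)^n]$
0 | $1$ | $1$
1 | $\mu$ | $0$
2 | $\mu^2 + \sigma^2$ | $\sigma^2$
3 | $\mu(\mu^2 + 3\sigma^2)$ | $0$
4 | $\mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$ | $3\sigma^4$

(c) For $N(0, 1)$ and $k \ge 1$, $E[X^k] = (k - 1) E[X^{k-2}]$, which equals $0$ for $k$ odd and $1 \cdot 3 \cdot 5 \cdots (k - 1)$ for $k$ even. The first equality comes from integration by parts. Observe also that $E[X^{2m}] = \frac{(2m)!}{2^m m!}$.
(d) Lévy–Cramér theorem: If the sum of two independent non-constant random variables is normally distributed, then each of the summands is normally distributed.
• Note that $\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}$.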
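The numbers in 6.7 are easy to reproduce with the error function from Python's standard library. The sketch below is an added illustration; the function names are hypothetical. It recomputes the one- and two-sigma probabilities and checks the central-moment recursion against the closed form $(2m)!/(2^m m!)$.

```python
import math

def Phi(x):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def central_moment(k, sigma):
    """E[(X - mu)^k] via the recursion (k - 1) sigma^2 E[(X - mu)^(k - 2)]."""
    if k == 0:
        return 1.0
    if k == 1:
        return 0.0
    return (k - 1) * sigma**2 * central_moment(k - 2, sigma)

def std_even_moment(m):
    """E[X^(2m)] for X ~ N(0, 1): (2m)! / (2^m m!)."""
    return math.factorial(2 * m) // (2**m * math.factorial(m))

p_one_sigma = Phi(1) - Phi(-1)  # P[|X - mu| < sigma]
p_two_sigma = Phi(2) - Phi(-2)  # P[|X - mu| < 2 sigma]
print(round(p_one_sigma, 4), round(p_two_sigma, 4))
print(central_moment(4, 2.0), std_even_moment(3), central_moment(6, 1.0))
```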
6.8 (Length bound). For $X \sim N(0,1)$ and any (Borel) set $B$,
$P[X \in B] \le \int_{-|B|/2}^{|B|/2} f_X(x)\,dx = 1 - 2Q\left(\frac{|B|}{2}\right)$,
where $|B|$ is the length (Lebesgue measure) of the set $B$. This is because the probability is concentrated around 0. More generally, for $X \sim N(m, \sigma^2)$,
$P[X \in B] \le 1 - 2Q\left(\frac{|B|}{2\sigma}\right)$.
6.9 (Stein's Lemma). Let $X \sim N(\mu, \sigma^2)$, and let $g$ be a differentiable function satisfying $E|g'(X)| < \infty$. Then $E[g(X)(X - \mu)] = \sigma^2 E[g'(X)]$ [2, Lemma 3.6.5 p 124]. Note that this is simply integration by parts with $u = g(x)$ and $dv = (x - \mu) f_X(x)\,dx$.
• $E\left[(X-\mu)^k\right] = E\left[(X-\mu)^{k-1}(X-\mu)\right] = \sigma^2 (k-1) E\left[(X-\mu)^{k-2}\right]$.
6.10. Q-function: $Q(z) = \int_z^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx$ corresponds to $P[X > z]$ where $X \sim N(0,1)$; that is, $Q(z)$ is the probability of the "tail" of $N(0,1)$. The Q-function is then a complementary cdf (ccdf).
Figure 15: Q-function (plot of $Q(z)$ against $z$, together with the $N(0,1)$ density).
(a) $Q$ is a decreasing function with $Q(0) = \frac{1}{2}$.
(b) $Q(-z) = 1 - Q(z) = \Phi(z)$
(c) $Q^{-1}(1 - Q(z)) = -z$
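(d) Craig's formula: $Q(x) = \frac{1}{\pi}\int_0^{\pi/2} e^{-\frac{x^2}{2\sin^2\theta}}\,d\theta = \frac{1}{\pi}\int_0^{\pi/2} e^{-\frac{x^2}{2\cos^2\theta}}\,d\theta$, $x \ge 0$.
To see this, consider $X, Y \overset{\text{i.i.d.}}{\sim} N(0,1)$. Then,
$Q(z) = \iint_{(x,y)\in(z,\infty)\times\mathbb{R}} f_{X,Y}(x,y)\,dx\,dy = 2\int_0^{\pi/2}\int_{z/\cos\theta}^{\infty} f_{X,Y}(r\cos\theta, r\sin\theta)\, r\,dr\,d\theta$,
where we evaluate the double integral using polar coordinates [9, Q7.22 p 322]. The inner integral equals $\frac{1}{2\pi} e^{-\frac{z^2}{2\cos^2\theta}}$, which gives the second form of the formula.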
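(e) $Q^2(x) = \frac{1}{\pi}\int_0^{\pi/4} e^{-\frac{x^2}{2\sin^2\theta}}\,d\theta$
(f) $\frac{d}{dx}Q(x) = -\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$
(g) $\frac{d}{dx}Q(f(x)) = -\frac{1}{\sqrt{2\pi}} e^{-\frac{(f(x))^2}{2}} \frac{d}{dx}f(x)$
(h) $\int Q(f(x))\,g(x)\,dx = Q(f(x))\int_a^x g(t)\,dt + \int \frac{1}{\sqrt{2\pi}} e^{-\frac{(f(x))^2}{2}} \left(\frac{d}{dx}f(x)\right)\left(\int_a^x g(t)\,dt\right) dx$
(i) $P[X > x] = Q\left(\frac{x-m}{\sigma}\right)$; $P[X < x] = 1 - Q\left(\frac{x-m}{\sigma}\right) = Q\left(-\frac{x-m}{\sigma}\right)$.
(j) Approximation:
(i) $Q(z) \approx \left[\frac{1}{(1-a)z + a\sqrt{z^2+b}}\right]\frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$; $a = \frac{1}{\pi}$, $b = 2\pi$
(ii) $\left(1 - \frac{1}{x^2}\right)\frac{e^{-\frac{x^2}{2}}}{x\sqrt{2\pi}} \le Q(x) \le \frac{1}{2} e^{-\frac{x^2}{2}}$
(iii) $Q(z) \approx \frac{1}{z\sqrt{2\pi}}\left(1 - \frac{0.7}{z^2}\right) e^{-\frac{z^2}{2}}$; $z > 2$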
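6.11. Error function (MATLAB): $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-x^2}\,dx = 1 - 2Q\left(\sqrt{2}\,z\right)$
(a) It is an odd function of $z$.
(b) For $z \ge 0$, it corresponds to $P[|X| < z]$ where $X \sim N\left(0, \frac{1}{2}\right)$.
(c) $\lim_{z\to\infty} \mathrm{erf}(z) = 1$
(d) $\mathrm{erf}(-z) = -\mathrm{erf}(z)$
(e) $Q(z) = \frac{1}{2}\mathrm{erfc}\left(\frac{z}{\sqrt{2}}\right) = \frac{1}{2}\left(1 - \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$
(f) $\Phi(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right) = \frac{1}{2}\mathrm{erfc}\left(-\frac{x}{\sqrt{2}}\right)$
(g) $Q^{-1}(q) = \sqrt{2}\,\mathrm{erfc}^{-1}(2q)$
(h) The complementary error function: $\mathrm{erfc}(z) = 1 - \mathrm{erf}(z) = 2Q\left(\sqrt{2}\,z\right) = \frac{2}{\sqrt{\pi}}\int_z^{\infty} e^{-x^2}\,dx$
Figure 16: erf-function and Q-function (the $N\left(0, \frac{1}{2}\right)$ density, with $\mathrm{erf}(z)$ as the central area and $Q(\sqrt{2}z)$ as the tail area).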
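Several of the Q-function identities above can be checked numerically. The sketch below uses only Python's standard `math.erfc`; the helper names `q_func`, `q_craig`, and `q_approx` are illustrative, and the midpoint rule is just one convenient way to integrate Craig's formula.

```python
import math

def q_func(z):
    """Gaussian tail probability via Q(z) = 0.5 * erfc(z / sqrt(2))."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def q_craig(x, steps=2000):
    """Craig's formula integrated with the midpoint rule on (0, pi/2); x >= 0."""
    h = (math.pi / 2.0) / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += math.exp(-x * x / (2.0 * math.sin(theta) ** 2))
    return total * h / math.pi

def q_approx(z):
    """Approximation (j)(i) with a = 1/pi and b = 2*pi."""
    a, b = 1.0 / math.pi, 2.0 * math.pi
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    return phi / ((1.0 - a) * z + a * math.sqrt(z * z + b))

def q_bounds(x):
    """Lower and upper bounds from (j)(ii), for x > 0."""
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return (1.0 - 1.0 / (x * x)) * phi / x, 0.5 * math.exp(-x * x / 2.0)
```

For moderate arguments such as $z \in \{1, 2, 3\}$, `q_approx` should agree with `q_func` to within a few percent, and `q_craig` should agree to many digits.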
6.3 Exponential Distribution
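6.12. Denoted by $\mathcal{E}(\lambda)$. It is in fact $\Gamma(1, \lambda)$. $\lambda > 0$ is a parameter of the distribution, often called the rate parameter.
6.13. Characterized by
• $f_X(x) = \lambda e^{-\lambda x} U(x)$;
• $F_X(x) = \left(1 - e^{-\lambda x}\right) U(x)$;
• Survival-, survivor-, or reliability-function: $P[X > x] = e^{-\lambda x} 1_{[0,\infty)}(x) + 1_{(-\infty,0)}(x)$;
• $\varphi_X(u) = \frac{\lambda}{\lambda - iu}$;
• $M_X(s) = \frac{\lambda}{\lambda - s}$ for $\mathrm{Re}\{s\} < \lambda$.
6.14. $EX = \sigma_X = \frac{1}{\lambda}$, $\mathrm{Var}[X] = \frac{1}{\lambda^2}$.
6.15. $\mathrm{median}(X) = \frac{1}{\lambda}\ln 2$, $\mathrm{mode}(X) = 0$, $E[X^n] = \frac{n!}{\lambda^n}$.
6.16. Coefficient of variation: $CV = \frac{\sigma_X}{EX} = 1$.
6.17. It is a continuous version of the geometric distribution. In fact, $\lfloor X \rfloor \sim \mathcal{G}_0(e^{-\lambda})$ and $\lceil X \rceil \sim \mathcal{G}_1(e^{-\lambda})$.
6.18. $X \sim \mathcal{E}(\lambda)$ is simply $\frac{1}{\lambda}X_1$ where $X_1 \sim \mathcal{E}(1)$.
6.19. Suppose $X_1 \sim \mathcal{E}(1)$. Then $E[X_1^n] = n!$ for $n \in \mathbb{N} \cup \{0\}$. In general, for $X \sim \mathcal{E}(\lambda)$, we have $E[X^\alpha] = \frac{1}{\lambda^\alpha}\Gamma(\alpha + 1)$ for any $\alpha > -1$. In particular, for $n \in \mathbb{N} \cup \{0\}$, the moment $E[X^n] = \frac{n!}{\lambda^n}$.
6.20. $\mu_3 = E\left[(X - EX)^3\right] = \frac{2}{\lambda^3}$ and $\mu_4 = \frac{9}{\lambda^4}$.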
6.21. Hazard function: $\frac{f(x)}{P[X > x]} = \lambda$.
6.22. $h(X) = \log\frac{e}{\lambda}$.
6.23. Can be generated by $X = -\frac{1}{\lambda}\ln U$ where $U \sim \mathcal{U}(0,1)$.
6.24. MATLAB:
• X = exprnd(1/lambda)
• $f_X(x)$ = exppdf(x,1/lambda)
• $F_X(x)$ = expcdf(x,1/lambda)
6.25. Memoryless property: The exponential r.v. is the only continuous r.v. on $[0, \infty)$ that satisfies the memoryless property: $P[X > s + x \mid X > s] = P[X > x]$ for all $x > 0$ and all $s > 0$ [13, p 157–159]. In words, the future is independent of the past. The fact that it hasn't happened yet tells us nothing about how much longer it will take before it does happen.
• In particular, suppose we define the set $B + x$ to be $\{x + b : b \in B\}$. For any $x > 0$ and set $B \subset [0, \infty)$, we have $P[X \in B + x \mid X > x] = P[X \in B]$ because
$\frac{P[X \in B + x]}{P[X > x]} = \frac{\int_{B+x} \lambda e^{-\lambda t}\,dt}{e^{-\lambda x}} \overset{\tau = t - x}{=} \frac{\int_{B} \lambda e^{-\lambda(\tau + x)}\,d\tau}{e^{-\lambda x}} = \int_B \lambda e^{-\lambda\tau}\,d\tau$.
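The inverse-transform generator in 6.23 and the memoryless property in 6.25 can be illustrated with a short sketch. It is written in Python rather than MATLAB, uses only the standard library, and the helper names, seed, and parameter values are arbitrary choices for illustration.

```python
import math
import random

def sample_exponential(lam, rng):
    """Inverse-transform sample X = -(1/lam) * ln(U), with U uniform on (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() lies in [0, 1), so u lies in (0, 1]
    return -math.log(u) / lam

def survival(x, lam):
    """P[X > x] for X ~ E(lam)."""
    return math.exp(-lam * x) if x >= 0 else 1.0

rng = random.Random(0)          # fixed seed so the run is repeatable
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(200_000)]
sample_mean = sum(samples) / len(samples)   # should be close to 1/lam

# Memoryless property: P[X > s + x | X > s] equals P[X > x].
s, x = 0.7, 0.4
conditional_tail = survival(s + x, lam) / survival(s, lam)
```

The sample mean estimates $EX = 1/\lambda$, while the ratio of survival functions reproduces $P[X > x]$ up to floating-point rounding.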
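6.26. The difference of two independent $\mathcal{E}(\lambda)$ is $\mathcal{L}(\lambda)$.
6.27. Consider independent $X_i \sim \mathcal{E}(\lambda)$. Let $S_n = \sum_{i=1}^n X_i$.
(a) $S_n \sim \Gamma(n, \lambda)$, i.e. it has an $n$-Erlang distribution.
(b) Let $N = \inf\{n : S_n \ge s\}$. Then, $N \sim \Lambda + 1$ where $\Lambda \sim \mathcal{P}(\lambda s)$.
6.28. If independent $X_i \sim \mathcal{E}(\lambda_i)$, then
(a) $\min_i X_i \sim \mathcal{E}\left(\sum_i \lambda_i\right)$. Recall order statistics. Let $Y_1 = \min_i X_i$.
(b) $P\left[\arg\min_i X_i = j\right] = \frac{\lambda_j}{\sum_i \lambda_i}$. Note that this is $\int_0^\infty f_{X_j}(t) \prod_{i \ne j} P[X_i > t]\,dt$.
6.29. If $S_i, T_i \overset{\text{i.i.d.}}{\sim} \mathcal{E}(\alpha)$, then
$P\left[\sum_{i=1}^m S_i > \sum_{j=1}^n T_j\right] = \sum_{i=0}^{m-1} \binom{n+m-1}{i} \left(\frac{1}{2}\right)^{n+m-1} = \sum_{i=0}^{m-1} \binom{n+i-1}{i} \left(\frac{1}{2}\right)^{n+i}$.
Note that we can set up two Poisson processes. Consider the superposed process. We want the $n$th arrival from the $T$ process to come before the $m$th one of the $S$ process.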
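The two sums in 6.29 count the same event in different ways, so they should agree exactly. A small sketch with exact rational arithmetic (function names are illustrative):

```python
from fractions import Fraction
from math import comb

def prob_via_first_arrivals(m, n):
    """Sum_{i=0}^{m-1} C(n+m-1, i) (1/2)^(n+m-1): at most m-1 of the first
    n+m-1 arrivals of the superposed process come from the S process."""
    return Fraction(sum(comb(n + m - 1, i) for i in range(m)), 2 ** (n + m - 1))

def prob_via_nth_T_arrival(m, n):
    """Sum_{i=0}^{m-1} C(n+i-1, i) (1/2)^(n+i): the n-th T arrival is
    preceded by exactly i arrivals from the S process."""
    return sum(Fraction(comb(n + i - 1, i), 2 ** (n + i)) for i in range(m))
```

When $m = n$ both expressions reduce to $\frac{1}{2}$, as symmetry between the two processes suggests.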
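6.4 Pareto: Par(α)–heavy-tailed model/density
6.30. Characterizations: Fix $\alpha > 0$.
(a) $f(x) = \alpha x^{-\alpha-1} U(x - 1)$
(b) $F(x) = \left(1 - \frac{1}{x^\alpha}\right) U(x - 1) = \begin{cases} 0, & x < 1 \\ 1 - \frac{1}{x^\alpha}, & x \ge 1 \end{cases}$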
Example 6.31. • distribution of wealth • flood heights of the Nile river • designing dam height • (discrete) sizes of files requested by web users • waiting times between successive keystrokes at computer terminals • (discrete) sizes of files stored on Unix system file servers • running times for NP-hard problems as a function of certain parameters
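6.5 Laplacian: L(α)
6.32. Characterization: $\alpha > 0$
(a) Also known as Laplace or double exponential.
(b) $f(x) = \frac{\alpha}{2} e^{-\alpha|x|}$
(c) $F(x) = \begin{cases} \frac{1}{2} e^{\alpha x}, & x < 0 \\ 1 - \frac{1}{2} e^{-\alpha x}, & x \ge 0 \end{cases}$
(d) $\varphi_X(u) = \frac{\alpha^2}{\alpha^2 + u^2}$
(e) $M_X(s) = \frac{\alpha^2}{\alpha^2 - s^2}$, $-\alpha < \mathrm{Re}\{s\} < \alpha$.
6.33. $EX = 0$, $\mathrm{Var}\,X = \frac{2}{\alpha^2}$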
Example 6.34. • amplitudes of speech signals • amplitudes of differences of intensities between adjacent pixels in an image • If X and Y are independent E(λ), then X − Y is L(λ). (Easy proof via ch.f.)
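6.6 Rayleigh
6.35. Characterizations:
(a) $F(x) = \left(1 - e^{-\alpha x^2}\right) U(x)$
(b) $f(x) = 2\alpha x e^{-\alpha x^2} u(x)$
(c) $P[X > t] = 1 - F(t) = \begin{cases} e^{-\alpha t^2}, & t \ge 0 \\ 1, & t < 0 \end{cases}$
(d) Use $\sqrt{-2\sigma^2 \ln U}$ to generate Rayleigh$\left(\frac{1}{2\sigma^2}\right)$ from $U \sim \mathcal{U}(0,1)$.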
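6.36. Read "ray-lee".
Example 6.37. • noise $X$ at the output of an AM envelope detector when no signal is present
6.38. Relationship with other distributions
(a) Let $X$ be a Rayleigh($\alpha$) r.v., then $Y = X^2$ is $\mathcal{E}(\alpha)$. Hence,
$\mathcal{E}(\alpha) \;\underset{(\cdot)^2}{\overset{\sqrt{\cdot}}{\rightleftharpoons}}\; \text{Rayleigh}(\alpha)$.   (17)
(b) Suppose $X, Y \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. $R = \sqrt{X^2 + Y^2}$ has a Rayleigh distribution with density
$f_R(r) = 2r\,\frac{1}{2\sigma^2}\, e^{-\frac{1}{2\sigma^2} r^2}$.
• Note that $X^2, Y^2 \overset{\text{i.i.d.}}{\sim} \Gamma\left(\frac{1}{2}, \frac{1}{2\sigma^2}\right)$. Hence, $X^2 + Y^2 \sim \Gamma\left(1, \frac{1}{2\sigma^2}\right)$, which is exponential. By (17), $\sqrt{X^2 + Y^2}$ is a Rayleigh r.v. with $\alpha = \frac{1}{2\sigma^2}$.
• Alternatively, use the transformation from Cartesian coordinates $(x, y)$ to polar coordinates $(r, \theta)$:
$f_{R,\Theta}(r, \theta) = r f_{X,Y}(r\cos\theta, r\sin\theta) = r\,\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{r\cos\theta}{\sigma}\right)^2} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{r\sin\theta}{\sigma}\right)^2} = \frac{1}{2\pi}\, 2r\,\frac{1}{2\sigma^2}\, e^{-\frac{1}{2\sigma^2} r^2}$.
Hence, the radius $R$ and the angle $\Theta$ are independent, with the radius $R$ having a Rayleigh distribution while the angle $\Theta$ is uniformly distributed in the interval $(0, 2\pi)$.

6.7 Cauchy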
6.39. Characterizations: Fix $\alpha > 0$.
(a) $f_X(x) = \frac{1}{\pi}\frac{\alpha}{\alpha^2 + x^2}$
(b) $F_X(x) = \frac{1}{\pi}\tan^{-1}\left(\frac{x}{\alpha}\right) + \frac{1}{2}$.
(c) $\varphi_X(u) = e^{-\alpha|u|}$
Note that Cau($\alpha$) is simply $\alpha X_1$ where $X_1 \sim$ Cau(1). Also, because $f_X$ is even, $f_{-X} = f_X$ and thus $-X$ is still $\sim$ Cau($\alpha$).
6.40. Odd moments are not defined; even moments are infinite.
• Because the first moment is not defined, central moments, including the variance, are not defined.
• Mean and variance do not exist.
• Note that even though the pdf of the Cauchy distribution is an even function, the mean is not 0; it is undefined.
6.41. Suppose $X_i$ are independent Cau($\alpha_i$). Then, $\sum_i a_i X_i \sim$ Cau$\left(\sum_i |a_i|\,\alpha_i\right)$.
6.42. Suppose $X \sim$ Cau($\alpha$). Then $\frac{1}{X} \sim$ Cau$\left(\frac{1}{\alpha}\right)$.

6.8 Weibull
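6.43. For $\lambda > 0$ and $p > 0$, the Weibull($p, \lambda$) distribution [9] is characterized by
(a) $X = \left(\frac{Y}{\lambda}\right)^{\frac{1}{p}}$ where $Y \sim \mathcal{E}(1)$.
(b) $f_X(x) = \lambda p x^{p-1} e^{-\lambda x^p}$, $x > 0$
(c) $F_X(x) = 1 - e^{-\lambda x^p}$, $x > 0$
(d) $E[X^n] = \frac{\Gamma\left(1 + \frac{n}{p}\right)}{\lambda^{n/p}}$.

7 Expectation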
Consider a probability space $(\Omega, \mathcal{A}, P)$.
7.1. Let $X^+ = \max(X, 0)$, and $X^- = -\min(X, 0) = \max(-X, 0)$. Then, $X = X^+ - X^-$, and $X^+$, $X^-$ are nonnegative r.v.'s. Also, $|X| = X^+ + X^-$.
7.2. A random variable $X$ is integrable if and only if
≡ $X$ has a finite expectation
≡ both $E[X^+]$ and $E[X^-]$ are finite
≡ $E|X|$ is finite
≡ $EX$ is finite
≡ $EX$ is defined
≡ $X \in \mathcal{L}^1$
≡ $|X| \in \mathcal{L}^1$.
In which case,
$EX = E[X^+] - E[X^-] = \int X(\omega)\,P(d\omega) = \int X\,dP = \int x\,dP^X(x) = \int x\,P^X(dx)$
and $\int_A X\,dP = E[1_A X]$.
Definition 7.3. A r.v. $X$ admits (has) an expectation if $E[X^+]$ and $E[X^-]$ are not both equal to $+\infty$. Then, the expectation of $X$ is still given by $EX = E[X^+] - E[X^-]$ with the conventions $+\infty + a = +\infty$ and $-\infty + a = -\infty$ when $a \in \mathbb{R}$.
7.4. $\mathcal{L}^1 = \mathcal{L}^1(\Omega, \mathcal{A}, P)$ = the set of all integrable random variables.
7.5. For $1 \le p < \infty$, the following are equivalent:
(a) $\left(E[|X|^p]\right)^{\frac{1}{p}} = 0$
(b) $E[|X|^p] = 0$
(c) $X = 0$ a.s.
7.6. $X \overset{\text{a.s.}}{=} Y \Rightarrow EX = EY$
7.7. $E[1_B(X)] = P(X^{-1}(B)) = P^X(B) = P[X \in B]$
• $F_X(x) = E\left[1_{(-\infty,x]}(X)\right]$
7.8. Expectation rule: Let $X$ be a r.v. on $(\Omega, \mathcal{A}, P)$, with values in $(E, \mathcal{E})$, and distribution $P^X$. Let $h : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B})$ be measurable. If
• $h \ge 0$ or
• $h(X) \in \mathcal{L}^1(\Omega, \mathcal{A}, P)$, which is equivalent to $h \in \mathcal{L}^1(E, \mathcal{E}, P^X)$,
then
• $E[h(X)] = \int h(X(\omega))\,P(d\omega) = \int h(x)\,P^X(dx)$
• $\int_{[X \in G]} h(X(\omega))\,P(d\omega) = \int_G h(x)\,P^X(dx)$
7.9. Expectation of an absolutely continuous random variable: Suppose $X$ has density $f_X$, then $h$ is $P^X$-integrable if and only if $h \cdot f_X$ is integrable w.r.t. Lebesgue measure. In which case,
$E[h(X)] = \int h(x)\,P^X(dx) = \int h(x) f_X(x)\,dx$
and
$\int_G h(x)\,P^X(dx) = \int_G h(x) f_X(x)\,dx$.
• Caution: Suppose $h$ is an odd function and $f_X$ is an even function. We cannot conclude that $E[h(X)] = 0$. One obvious odd function is $h(x) = x$. For example, in (6.40), when $X$ is Cauchy, the expectation does not exist even though the pdf is an even function. Of course, in general, if we also know that $h(X)$ is integrable, then $E[h(X)]$ is 0.
Expectation of a discrete random variable: Suppose $X$ is a discrete random variable. Then
$EX = \sum_x x P[X = x]$
and
$E[g(X)] = \sum_x g(x) P[X = x]$.
Similarly,
$E[g(X, Y)] = \sum_x \sum_y g(x, y) P[X = x, Y = y]$.
These are called the law/rule of the lazy statistician (LOTUS) [21, Thm 3.6 p 48], [9, p 149].
7.10. $\int_E P[X \ge t]\,dt = \int_E P[X > t]\,dt$ and $\int_E P[X \le t]\,dt = \int_E P[X < t]\,dt$.
7.11. Expectation and cdf:
(a) For nonnegative $X$,
$EX = \int_0^\infty (1 - F_X(y))\,dy = \int_0^\infty P[X > y]\,dy = \int_0^\infty P[X \ge y]\,dy$.   (18)
For $p > 0$,
$E[X^p] = \int_0^\infty p x^{p-1} P[X > x]\,dx$.
(b) For integrable $X$,
$EX = \int_0^\infty (1 - F_X(x))\,dx - \int_{-\infty}^0 F_X(x)\,dx$.
(c) For nonnegative integer-valued $X$,
$EX = \sum_{k=0}^\infty P[X > k] = \sum_{k=1}^\infty P[X \ge k]$.
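The tail-sum formula in (c) can be checked on any pmf with finite support on the nonnegative integers. The sketch below uses exact fractions; the pmf and helper names are made up for illustration.

```python
from fractions import Fraction

def mean_direct(pmf):
    """E[X] = sum_k k * P[X = k] for a pmf given as {value: probability}."""
    return sum(k * p for k, p in pmf.items())

def mean_tail_sum(pmf):
    """E[X] = sum_{k>=0} P[X > k]; terms vanish once k reaches the largest value."""
    top = max(pmf)
    return sum(sum(p for j, p in pmf.items() if j > k) for k in range(top))

# Hypothetical pmf on {0, 1, 3}.
example_pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
```

Here $P[X > 0] = \frac{3}{4}$, $P[X > 1] = \frac{1}{2}$, and $P[X > 2] = \frac{1}{2}$, which sum to $\frac{7}{4}$, the same value as the direct computation.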
Definition 7.12.
(a) Absolute moment: $E\left[|X|^k\right] = \int |x|^k\,P^X(dx)$, where we define $E\left[|X|^0\right] = 1$.
(b) Moment: If $E\left[|X|^k\right] < \infty$, then $m_k = E\left[X^k\right] = \int x^k\,P^X(dx)$ = the $k$th moment of $X$.
(c) Variance: If $E[X^2] < \infty$, then we define
$\mathrm{Var}\,X = E\left[(X - EX)^2\right] = \int (x - EX)^2\,P^X(dx) = E[X^2] - (EX)^2 = E[X(X - EX)]$.
• Notation: $D_X$, or $\sigma^2(X)$, or $\sigma_X^2$, or $VX$ [21, p 51]
• If $X_i \in \mathcal{L}^2$, then $\sum_{i=1}^k X_i \in \mathcal{L}^2$, $\mathrm{Var}\left[\sum_{i=1}^k X_i\right]$ exists, and $E\left[\sum_{i=1}^k X_i\right] = \sum_{i=1}^k EX_i$.
• Suppose $E[X^2] < \infty$. If $\mathrm{Var}\,X = 0$, then $X = EX$ a.s.
(d) Standard deviation: $\sigma_X = \sqrt{\mathrm{Var}[X]}$
(e) Coefficient of variation: $CV_X = \frac{\sigma_X}{EX}$
(f) Fano factor (index of dispersion): $\frac{\mathrm{Var}\,X}{EX}$
(g) Central moments: the $n$th central moment is $\mu_n = E\left[(X - EX)^n\right]$.
(i) $\mu_1 = E[X - EX] = 0$.
(ii) $\mu_2 = \sigma_X^2 = \mathrm{Var}\,X$
(iii) $\mu_n = \sum_{k=0}^n \binom{n}{k} m_{n-k}(-m_1)^k$, with $m_0 = 1$
(iv) $m_n = \sum_{k=0}^n \binom{n}{k} \mu_{n-k}\, m_1^k$, with $\mu_0 = 1$
(h) Skewness coefficient: $\gamma_X = \frac{\mu_3}{\sigma_X^3}$
(i) Describes the deviation of the distribution from a symmetric shape (around the mean)
(ii) 0 for any symmetric distribution
(i) Kurtosis: $\kappa_X = \frac{\mu_4}{\sigma_X^4}$.
• $\kappa_X = 3$ for the Gaussian distribution
(j) Excess coefficient: $\varepsilon_X = \kappa_X - 3 = \frac{\mu_4}{\sigma_X^4} - 3$
(k) Cumulants or semivariants: For one variable: $\gamma_k = \frac{1}{j^k}\left.\frac{\partial^k}{\partial v^k}\ln(\varphi_X(v))\right|_{v=0}$.
(i) $\gamma_1 = EX = m_1$,
$\gamma_2 = E[X - EX]^2 = m_2 - m_1^2 = \mu_2$,
$\gamma_3 = E[X - EX]^3 = m_3 - 3m_1 m_2 + 2m_1^3 = \mu_3$,
$\gamma_4 = m_4 - 3m_2^2 - 4m_1 m_3 + 12 m_1^2 m_2 - 6m_1^4 = \mu_4 - 3\mu_2^2$
(ii) $m_1 = \gamma_1$,
$m_2 = \gamma_2 + \gamma_1^2$,
$m_3 = \gamma_3 + 3\gamma_1\gamma_2 + \gamma_1^3$,
$m_4 = \gamma_4 + 3\gamma_2^2 + 4\gamma_1\gamma_3 + 6\gamma_1^2\gamma_2 + \gamma_1^4$
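Moment | Measure of | Definition | Continuous variable | Discrete variable | Sample estimator
First | Central location | Mean, expected value $E(X) = \mu_x$ | $\mu_x = \int_{-\infty}^{\infty} x f_x(x)\,dx$ | $\mu_x = \sum_{\text{all } x_k} x_k\, p_x(x_k)$ | $\bar{x} = \sum x_i / n$
Second | Dispersion | Variance, $\mathrm{Var}(X) = \mu_2 = \sigma_x^2$ | $\sigma_x^2 = \int_{-\infty}^{\infty} (x - \mu_x)^2 f_x(x)\,dx$ | $\sigma_x^2 = \sum_{\text{all } x_k} (x_k - \mu_x)^2 p_x(x_k)$ | $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$
 | | Standard deviation, $\sigma_x$ | $\sigma_x = \sqrt{\mathrm{Var}(X)}$ | $\sigma_x = \sqrt{\mathrm{Var}(X)}$ | $s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$
 | | Coefficient of variation | $\sigma_x / \mu_x$ | $\sigma_x / \mu_x$ | $C_v = s / \bar{x}$
Third | Asymmetry | Skewness | $\mu_3 = \int_{-\infty}^{\infty} (x - \mu_x)^3 f_x(x)\,dx$ | $\mu_3 = \sum_{\text{all } x_k} (x_k - \mu_x)^3 p_x(x_k)$ | $m_3 = \frac{n}{(n-1)(n-2)}\sum (x_i - \bar{x})^3$
 | | Skewness coefficient, $\gamma_x$ | $\gamma_x = \mu_3 / \sigma_x^3$ | $\gamma_x = \mu_3 / \sigma_x^3$ | $g = m_3 / s^3$
Fourth | Peakedness | Kurtosis, $\kappa_x$ | $\mu_4 = \int_{-\infty}^{\infty} (x - \mu_x)^4 f_x(x)\,dx$ | $\mu_4 = \sum_{\text{all } x_k} (x_k - \mu_x)^4 p_x(x_k)$ | $m_4 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum (x_i - \bar{x})^4$
 | | Excess coefficient, $\varepsilon_x$ | $\kappa_x = \mu_4 / \sigma_x^4$, $\varepsilon_x = \kappa_x - 3$ | $\kappa_x = \mu_4 / \sigma_x^4$, $\varepsilon_x = \kappa_x - 3$ | $k = m_4 / s^4$
Figure 17: Product-Moments of Random Variables [20]
7.13.
• For $c \in \mathbb{R}$, $E[c] = c$
• $E[\cdot]$ is a linear operator: $E[aX + bY] = aEX + bEY$.
• In general, $\mathrm{Var}[\cdot]$ is not a linear operator.
7.14. All pairs of mean and variance are possible. A random variable $X$ with $EX = m$ and $\mathrm{Var}\,X = \sigma^2$ can be constructed by setting $P[X = m - a] = P[X = m + a] = \frac{1}{2}$ with $a = \sigma$.
Definition 7.15.
• Correlation between $X$ and $Y$: $E[XY]$.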
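Model | $E[X]$ | $\mathrm{Var}[X]$
$\mathcal{U}_{\{0,1,\ldots,n-1\}}$ | $\frac{n-1}{2}$ | $\frac{n^2-1}{12}$
$\mathcal{B}(n, p)$ | $np$ | $np(1-p)$
$\mathcal{G}(\beta)$ | $\frac{\beta}{1-\beta}$ | $\frac{\beta}{(1-\beta)^2}$
$\mathcal{P}(\lambda)$ | $\lambda$ | $\lambda$
$\mathcal{U}(a, b)$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$
$\mathcal{E}(\lambda)$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$
Par($\alpha$) | $\frac{\alpha}{\alpha-1}$ for $\alpha > 1$; $\infty$ for $0 < \alpha \le 1$ | undefined for $0 < \alpha < 1$; $\infty$ for $1 < \alpha < 2$; $\frac{\alpha}{(\alpha-2)(\alpha-1)^2}$ for $\alpha > 2$
$\mathcal{L}(\alpha)$ | $0$ | $\frac{2}{\alpha^2}$
$N(m, \sigma^2)$ | $m$ | $\sigma^2$
$N(m, \Lambda)$ | $m$ | $\Lambda = [\mathrm{Cov}[X_i, X_j]]$
$\Gamma(q, \lambda)$ | $\frac{q}{\lambda}$ | $\frac{q}{\lambda^2}$
Table 4: Expectations and Variances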
• Covariance between $X$ and $Y$: $\mathrm{Cov}[X, Y] = E[(X - EX)(Y - EY)] = E[XY] - EXEY = E[X(Y - EY)] = E[Y(X - EX)]$.
• $X$ and $Y$ are said to be uncorrelated if and only if $\mathrm{Cov}[X, Y] = 0$ ≡ $E[XY] = EXEY$.
• $X$ and $Y$ are said to be orthogonal if $E[XY] = 0$.
• Correlation coefficient, autocorrelation, normalized covariance:
$\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sigma_X \sigma_Y} = \frac{E[XY] - EXEY}{\sigma_X \sigma_Y} = E\left[\left(\frac{X - EX}{\sigma_X}\right)\left(\frac{Y - EY}{\sigma_Y}\right)\right]$.
7.16. Properties
(a) $\mathrm{Var}\,X = \mathrm{Cov}[X, X]$, $\rho_{X,X} = 1$
(b) $\mathrm{Var}[aX] = a^2\,\mathrm{Var}\,X$, $\sigma_{aX} = |a|\,\sigma_X$.
(c) If $X$ and $Y$ are independent ($X \perp\!\!\!\perp Y$), then $\mathrm{Cov}[X, Y] = 0$. The converse is not true.
(d) $\rho_{XY} \in [-1, 1]$.
(e) By the Cauchy–Schwarz inequality, $(\mathrm{Cov}[X, Y])^2 \le \sigma_X^2 \sigma_Y^2$ with equality if and only if $\sigma_Y^2 (X - EX)^2 = \sigma_X^2 (Y - EY)^2$ a.s.
• This implies $|\rho_{X,Y}| \le 1$.
When $\sigma_Y, \sigma_X > 0$, equality occurs if and only if the following equivalent conditions hold:
≡ $\exists a \ne 0$ such that $(X - EX) = a(Y - EY)$
≡ $\exists c \ne 0$ and $b \in \mathbb{R}$ such that $Y = cX + b$
≡ $\exists a \ne 0$ and $b \in \mathbb{R}$ such that $X = aY + b$
≡ $|\rho_{XY}| = 1$
In which case, $|a| = \frac{\sigma_X}{\sigma_Y}$ and $\rho_{XY} = \frac{a}{|a|} = \mathrm{sgn}\,a$. Hence, $\rho_{XY}$ is used to quantify linear dependence between $X$ and $Y$. The closer $|\rho_{XY}|$ is to 1, the higher the degree of linear dependence between $X$ and $Y$.
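(f) Linearity:
(i) Let $Y_i = a_i X_i + b_i$.
i. $\mathrm{Cov}[Y_1, Y_2] = \mathrm{Cov}[a_1 X_1 + b_1, a_2 X_2 + b_2] = a_1 a_2\,\mathrm{Cov}[X_1, X_2]$.
ii. The $\rho$ is preserved, up to sign, under linear transformation: $\rho_{Y_1,Y_2} = \mathrm{sgn}(a_1 a_2)\,\rho_{X_1,X_2}$ for $a_1, a_2 \ne 0$.
(ii) $\mathrm{Cov}[a_1 X + b_1, a_2 X + b_2] = a_1 a_2\,\mathrm{Var}\,X$.
(iii) $\rho_{a_1 X + b_1, a_2 X + b_2} = \mathrm{sgn}(a_1 a_2)$. In particular, if $Y = aX + b$ with $a > 0$, then $\rho_{X,Y} = 1$.
(g) $\rho_{X,Y} = 0$ if and only if $X$ and $Y$ are uncorrelated.
(h) When $EX = 0$ or $EY = 0$, orthogonality is equivalent to uncorrelatedness.
(i) For finite index set $I$,
$\mathrm{Var}\left[\sum_{i \in I} a_i X_i\right] = \sum_{i \in I} a_i^2\,\mathrm{Var}\,X_i + 2\sum_{\substack{(i,j) \in I \times I \\ i < j}} a_i a_j\,\mathrm{Cov}[X_i, X_j]$.
In particular $\mathrm{Var}(X + Y) = \mathrm{Var}\,X + \mathrm{Var}\,Y + 2\,\mathrm{Cov}[X, Y]$ and $\mathrm{Var}(X - Y) = \mathrm{Var}\,X + \mathrm{Var}\,Y - 2\,\mathrm{Cov}[X, Y]$.
(j) For finite index sets $I$ and $J$,
$\mathrm{Cov}\left[\sum_{i \in I} a_i X_i, \sum_{j \in J} b_j Y_j\right] = \sum_{i \in I}\sum_{j \in J} a_i b_j\,\mathrm{Cov}[X_i, Y_j]$.
(k) Covariance Inequality: Let $X$ be any random variable and $g$ and $h$ any functions such that $E[g(X)]$, $E[h(X)]$, and $E[g(X)h(X)]$ exist.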
• If $g$ and $h$ are either both non-decreasing or both non-increasing, then
$\mathrm{Cov}[g(X), h(X)] \ge 0$.   (19)
• If $g$ is non-decreasing and $h$ is non-increasing, then
$\mathrm{Cov}[g(X), h(X)] \le 0$.   (20)
See also [2, p 191–192] and (8.13).
(l) Being uncorrelated does not imply independence.
• Discrete: Suppose $p_X$ is an even function with $p_X(0) = 0$. Let $Y = g(X)$ where $g$ is also an even function. Then, $E[XY] = 0 = E[X] = E[X]E[Y]$, so $\mathrm{Cov}[X, Y] = 0$. Consider a point $x_0$ such that $p_X(x_0) > 0$. Then, $p_{X,Y}(x_0, g(x_0)) = p_X(x_0)$. We only need to show that $p_Y(g(x_0)) \ne 1$ to show that $X$ and $Y$ are not independent. For example, let $X$ be uniform on $\{\pm 1, \pm 2\}$ and $Y = |X|$. Consider the point $x_0 = 1$.
• Continuous: Let $\Theta$ be uniform on an interval of length $2\pi$. Set $X = \cos\Theta$ and $Y = \sin\Theta$. See (11.6).
7.17. See (4.48) for relationships between expectation and independence.
Example 7.18 (Martingale betting strategy). Fix $a > 0$. Suppose $X_0, X_1, X_2, \ldots$ are independent random variables with $P[X_i = 1] = p$ and $P[X_i = 0] = 1 - p$. Let $N = \inf\{i : X_i = 1\}$. Also define
$L(N) = \begin{cases} 0, & N = 0 \\ a\sum_{i=0}^{N-1} r^i, & N \in \mathbb{N} \end{cases}$
and $G(N) = a r^N - L(N)$. To have $G(N) > 0$, we need $\sum_{i=0}^{k-1} r^i < r^k$ $\forall k \in \mathbb{N}$, which turns out to require $r \ge 2$. In fact, for $r \ge 2$, we have $G(N) \ge a$ $\forall N \in \mathbb{N} \cup \{0\}$. Hence, $E[G(N)] \ge a$. It is exactly $a$ when $r = 2$.
Now, $E[L(N)] = a\sum_{n=1}^{\infty} \frac{r^n - 1}{r - 1}\, p(1-p)^n = \infty$ if and only if $r(1 - p) \ge 1$. When $1 - p \ge \frac{1}{2}$, because we already have $r \ge 2$, it is true that $r(1 - p) \ge 1$.
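The quantities in Example 7.18 are easy to tabulate. The following sketch evaluates $L(n)$, $G(n)$, and partial sums of $E[L(N)]$ using $P[N = n] = p(1-p)^n$; the function names and the numerical values of $a$, $r$, and $p$ are illustrative assumptions.

```python
def loss_before_win(a, r, n):
    """L(n) = a * sum_{i=0}^{n-1} r^i: total stake lost in rounds 0..n-1."""
    return a * sum(r ** i for i in range(n))

def net_gain(a, r, n):
    """G(n) = a * r^n - L(n): payoff of the winning round minus earlier losses."""
    return a * r ** n - loss_before_win(a, r, n)

def expected_loss_partial(a, r, p, terms):
    """Partial sum of E[L(N)] = sum_{n>=1} L(n) * p * (1 - p)^n."""
    return sum(loss_before_win(a, r, n) * p * (1 - p) ** n
               for n in range(1, terms + 1))

# With r = 2 the net gain is exactly a regardless of when the first win occurs.
gains_r2 = [net_gain(1, 2, n) for n in range(10)]

# r(1 - p) = 1: partial sums of E[L(N)] keep growing (divergence).
diverging = (expected_loss_partial(1, 2, 0.5, 100),
             expected_loss_partial(1, 2, 0.5, 200))

# r(1 - p) = 0.8 < 1: the series converges.
converging = expected_loss_partial(1, 2, 0.6, 500)
```

With $r = 2$ and $p = 0.6$ the series sums to $0.6\left(\frac{0.8}{0.2} - \frac{0.4}{0.6}\right) = 2$, whereas with $p = 0.5$ each term approaches $\frac{1}{2}$ and the partial sums grow without bound.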
8 Inequalities
8.1. Let $(A_i : i \in I)$ be a finite family of events. Then
$\frac{\left(\sum_i P(A_i)\right)^2}{\sum_i \sum_j P(A_i \cap A_j)} \le P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$.
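8.2. [19, p 14]
$-\left(1 - \frac{1}{n}\right)^n \le P\left(\bigcap_{i=1}^n A_i\right) - \prod_{i=1}^n P(A_i) \le (n-1)\, n^{-\frac{n}{n-1}}$.
As $n \to \infty$, the lower bound decreases to $-\frac{1}{e} \approx -0.37$ and the upper bound tends to 1. See figure 18.
• $|P(A_1 \cap A_2) - P(A_1)P(A_2)| \le \frac{1}{4}$.
Figure 18: Bound for $P\left(\bigcap_{i=1}^n A_i\right) - \prod_{i=1}^n P(A_i)$ (plots of $\left(1 - \frac{1}{x}\right)^x$ and $(x-1)\,x^{-\frac{x}{x-1}}$ against $x$).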
8.3. Markov's Inequality: $P[|X| \ge a] \le \frac{1}{a}E|X|$, $a > 0$.
Figure 19: Proof of Markov's Inequality (the function $|x|$ dominates $a\,1[|x| \ge a]$).
(a) Useless when $a \le E|X|$. Hence, good for bounding the "tails" of a distribution.
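(b) Remark: $P[|X| > a] \le P[|X| \ge a]$
(c) $P[|X| \ge aE|X|] \le \frac{1}{a}$, $a > 0$.
(d) Suppose $g$ is a nonnegative function. Then, $\forall \alpha > 0$ and $p > 0$, we have
(i) $P[g(X) \ge \alpha] \le \frac{1}{\alpha^p} E\left[(g(X))^p\right]$
(ii) $P[g(X - EX) \ge \alpha] \le \frac{1}{\alpha^p} E\left[(g(X - EX))^p\right]$
(e) Chebyshev's Inequality: $P[|X| > a] \le P[|X| \ge a] \le \frac{1}{a^2}EX^2$, $a > 0$.
(i) $P[|X| \ge \alpha] \le \frac{1}{\alpha^p} E\left[|X|^p\right]$
(ii) $P[|X - EX| \ge \alpha] \le \frac{\sigma_X^2}{\alpha^2}$; that is $P[|X - EX| \ge n\sigma_X] \le \frac{1}{n^2}$
• Useful only when $\alpha > \sigma_X$
(iii) For $a < b$, $P[a \le X \le b] \ge 1 - \frac{4}{(b-a)^2}\left(\sigma_X^2 + \left(EX - \frac{a+b}{2}\right)^2\right)$
(f) One-sided Chebyshev inequalities: If $X \in L^2$, for $a > 0$,
(i) If $EX = 0$, $P[X \ge a] \le \frac{EX^2}{EX^2 + a^2}$
(ii) For general $X$,
i. $P[X \ge EX + a] \le \frac{\sigma_X^2}{\sigma_X^2 + a^2}$; that is $P[X \ge EX + n\sigma_X] \le \frac{1}{1 + n^2}$
ii. $P[X \le EX - a] \le \frac{\sigma_X^2}{\sigma_X^2 + a^2}$; that is $P[X \le EX - n\sigma_X] \le \frac{1}{1 + n^2}$
iii. $P[|X - EX| \ge a] \le \frac{2\sigma_X^2}{\sigma_X^2 + a^2}$; that is $P[|X - EX| \ge n\sigma_X] \le \frac{2}{1 + n^2}$. This is a better bound than $\frac{\sigma_X^2}{a^2}$ iff $\sigma_X > a$.
(g) Chernoff bounds:
(i) $P[X \le b] \le \frac{E\left[e^{-\theta X}\right]}{e^{-\theta b}}$ $\forall \theta > 0$
(ii) $P[X \ge b] \le \frac{E\left[e^{\theta X}\right]}{e^{\theta b}}$ $\forall \theta > 0$
This can be optimized over $\theta$.
8.4. Suppose $|X| \le M$ a.s., then $P[|X| \ge a] \ge \frac{E|X| - a}{M - a}$ $\forall a \in [0, M)$.
8.5. $X \ge 0$ and $E[X^2] < \infty \Rightarrow P[X > 0] \ge \frac{(EX)^2}{E[X^2]}$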
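To see how the bounds in 8.3 compare, one can evaluate them for a distribution whose tail is known in closed form. The sketch below takes $X \sim \mathcal{E}(1)$ (so $EX = \mathrm{Var}\,X = 1$, $M_X(\theta) = \frac{1}{1-\theta}$ for $\theta < 1$) as an assumed test case; the function name and the choice of distribution are illustrative. The Chernoff bound is minimized at $\theta = 1 - \frac{1}{a}$.

```python
import math

def tail_bounds_exp1(a):
    """Exact tail and four upper bounds on P[X >= a] for X ~ E(1), a > 1."""
    exact = math.exp(-a)
    markov = 1.0 / a                          # E|X| / a with EX = 1
    chebyshev = 1.0 / (a - 1.0) ** 2          # Var X / (a - EX)^2
    one_sided = 1.0 / (1.0 + (a - 1.0) ** 2)  # Var X / (Var X + (a - EX)^2)
    theta = 1.0 - 1.0 / a                     # minimizer of M_X(theta) e^{-theta a}
    chernoff = math.exp(-theta * a) / (1.0 - theta)
    return exact, markov, chebyshev, one_sided, chernoff
```

For example, `tail_bounds_exp1(5.0)` compares the exact tail $e^{-5}$ with the Markov bound $\frac{1}{5}$, the Chebyshev bound $\frac{1}{16}$, the one-sided bound $\frac{1}{17}$, and the optimized Chernoff bound $5e^{-4}$.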
Definition 8.6. If $p$ and $q$ are positive real numbers such that $p + q = pq$, or equivalently, $\frac{1}{p} + \frac{1}{q} = 1$, then we call $p$ and $q$ a pair of conjugate exponents.
• $1 < p, q < \infty$
• As $p \to 1$, $q \to \infty$. Consequently, 1 and $\infty$ are also regarded as a pair of conjugate exponents.
8.7. Hölder's Inequality: $X \in L^p$, $Y \in L^q$, $p > 1$, $\frac{1}{p} + \frac{1}{q} = 1$. Then,
(a) $XY \in L^1$
(b) $E[|XY|] \le \left(E[|X|^p]\right)^{\frac{1}{p}} \left(E[|Y|^q]\right)^{\frac{1}{q}}$ with equality if and only if $E[|Y|^q]\,|X(\omega)|^p = E[|X|^p]\,|Y(\omega)|^q$ a.s.
8.8. Cauchy–Bunyakovskii–Schwarz Inequality: If $X, Y \in L^2$, then $XY \in L^1$ and
$|E[XY]| \le E[|XY|] \le \left(E\left[|X|^2\right]\right)^{\frac{1}{2}} \left(E\left[|Y|^2\right]\right)^{\frac{1}{2}}$
or equivalently,
$(E[XY])^2 \le E[X^2]\,E[Y^2]$
with equality if and only if $E[Y^2]\,X^2 = E[X^2]\,Y^2$ a.s.
(a) $|\mathrm{Cov}(X, Y)| \le \sigma_X \sigma_Y$.
(b) $(EX)^2 \le EX^2$
(c) $(P(A \cap B))^2 \le P(A)P(B)$
8.9. Minkowski's Inequality: $p \ge 1$, $X, Y \in L^p \Rightarrow X + Y \in L^p$ and
$\left(E[|X + Y|^p]\right)^{\frac{1}{p}} \le \left(E[|X|^p]\right)^{\frac{1}{p}} + \left(E[|Y|^p]\right)^{\frac{1}{p}}$.
8.10. $p > q > 0$
(a) $E[|X|^q] \le 1 + E[|X|^p]$
(b) Lyapounov's inequality: $\left(E[|X|^q]\right)^{\frac{1}{q}} \le \left(E[|X|^p]\right)^{\frac{1}{p}}$
• $E[|X|] \le \left(E\left[|X|^2\right]\right)^{\frac{1}{2}} \le \left(E\left[|X|^3\right]\right)^{\frac{1}{3}} \le \cdots$
8.11. Jensen's Inequality: For a random variable $X$, if 1) $X \in L^1$ (and $\varphi(X) \in L^1$); 2) $X \in (a, b)$ a.s.; and 3) $\varphi$ is convex on $(a, b)$, then $\varphi(EX) \le E[\varphi(X)]$.
• For $X > 0$ (a.s.), $E\left[\frac{1}{X}\right] \ge \frac{1}{EX}$.
8.12.
• For $p \in (0, 1]$, $E[|X + Y|^p] \le E[|X|^p] + E[|Y|^p]$.
• For $p \ge 1$, $E[|X + Y|^p] \le 2^{p-1}\left(E[|X|^p] + E[|Y|^p]\right)$.
8.13 (Covariance Inequality). Let $X$ be any random variable and $g$ and $h$ any functions such that $E[g(X)]$, $E[h(X)]$, and $E[g(X)h(X)]$ exist.
• If $g$ and $h$ are either both non-decreasing or both non-increasing, then $E[g(X)h(X)] \ge E[g(X)]\,E[h(X)]$. In particular, for nondecreasing $g$, $E[g(X)(X - EX)] \ge 0$.
• If $g$ is non-decreasing and $h$ is non-increasing, then $E[g(X)h(X)] \le E[g(X)]\,E[h(X)]$.
See also (19), (20), and [2, p 191–192].
9 Random Vectors
In this article, a vector is a column matrix with dimension $n \times 1$ for some $n \in \mathbb{N}$. We use $\mathbf{1}$ to denote a vector with all elements being 1. Note that $\mathbf{1}(\mathbf{1}^T)$ is a square matrix with all elements being 1. Finally, for any matrix $A$ and constant $a$, we define the matrix $A + a$ to be the matrix $A$ with $a$ added to each of its components. If $A$ is a square matrix, then $A + a = A + a\mathbf{1}(\mathbf{1}^T)$.
Definition 9.1. Suppose $I$ is an index set. When the $X_i$'s are random variables, we define a random vector $X_I$ by $X_I = (X_i : i \in I)$. For example, if $I = [n]$, we have $X_I = (X_1, X_2, \ldots, X_n)$. Note also that $X_{[n]}$ is usually denoted by $X_1^n$. Sometimes, we simply write $X$ to denote $X_1^n$.
• For disjoint $A$, $B$, $X_{A \cup B} = (X_A, X_B)$.
• For vectors $x_I$, $y_I$, we say $x \le y$ if $\forall i \in I$, $x_i \le y_i$ [7, p 206].
• When the dimension of $X$ is implicit, we simply write $X$ and $x$ to represent $X_1^n$ and $x_1^n$, respectively.
• For random vectors $X$, $Y$, we use $(X, Y)$ to represent the random vector $\begin{pmatrix} X \\ Y \end{pmatrix}$, or equivalently $\left(X^T\; Y^T\right)^T$.
Definition 9.2. A half-open cell or bounded rectangle in $\mathbb{R}^k$ is a set of the form $I_{a,b} = \{x : a_i < x_i \le b_i, \forall i \in [k]\} = \times_{i=1}^k (a_i, b_i]$. For a real function $F$ on $\mathbb{R}^k$, the difference of $F$ around the vertices of $I_{a,b}$ is
$\Delta_{I_{a,b}} F = \sum_v \mathrm{sgn}_{I_{a,b}}(v)\, F(v) = \sum_v (-1)^{|\{i : v_i = a_i\}|} F(v)$   (21)
where the sum extends over the $2^k$ vertices $v$ of $I_{a,b}$. (The $i$th coordinate of the vertex $v$ could be either $a_i$ or $b_i$.) In particular, for $k = 2$, we have $F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2)$.
9.3 (Joint cdf).
$F_X(x) = F_{X_1^k}\left(x_1^k\right) = P[X_1 \le x_1, \ldots, X_k \le x_k] = P[X \in S_x] = P^X(S_x)$
where $S_x = \{y : y_i \le x_i, i = 1, \ldots, k\}$ consists of the points "southwest" of $x$.
• $\Delta_{I_{a,b}} F_X \ge 0$
• The set $S_x$ is an orthant-like or semi-infinite corner with "northeast" vertex (vertex in the direction of the first orthant) specified by the point $x$ [7, p 206].
C1 $F_X$ is nondecreasing in each variable. Suppose $\forall i$ $y_i \ge x_i$, then $F_X(y) \ge F_X(x)$.
C2 $F_X$ is continuous from above: $\lim_{h \searrow 0} F_X(x_1 + h, \ldots, x_k + h) = F_X(x)$.
C3 If $x_i \to -\infty$ for some $i$ (the other coordinates held fixed), then $F_X(x) \to 0$. If $\forall i$ $x_i \to \infty$, then $F_X(x) \to 1$.
• $\lim_{h \searrow 0} F_X(x_1 - h, \ldots, x_k - h) = P^X(S_x^\circ)$ where $S_x^\circ = \{y : y_i < x_i, i = 1, \ldots, k\}$ is the interior of $S_x$.
• Given $a \le b$, $P[X \in I_{a,b}] = \Delta_{I_{a,b}} F_X$. This comes from (9) with $A_i = [X_i \le a_i]$ and $B = [\forall i\; X_i \le b_i]$. Note that $P\left[\bigcap_{i \in I} A_i \cap B\right] = F(v)$ where $v_i = \begin{cases} a_i, & i \in I, \\ b_i, & \text{otherwise.} \end{cases}$
• For any function $F$ on $\mathbb{R}^k$ which satisfies (C1), (C2), and (C3), there is a unique probability measure $\mu$ on $\mathcal{B}_{\mathbb{R}^k}$ such that $\forall a, \forall b \in \mathbb{R}^k$ with $a \le b$, we have $\mu(I_{a,b}) = \Delta_{I_{a,b}} F$ (and $\forall x \in \mathbb{R}^k$ $\mu(S_x) = F(x)$).
• TFAE:
(a) $F_X$ is continuous at $x$
(b) $F_X$ is continuous from below
(c) $F_X(x) = P^X(S_x^\circ)$
(d) $P^X(S_x) = P^X(S_x^\circ)$
(e) $P^X(\partial S_x) = 0$ where $\partial S_x = S_x - S_x^\circ = \{y : y_i \le x_i\; \forall i, \exists j\; y_j = x_j\}$
• If $k > 1$, $F_X$ can have discontinuity points even if $P^X$ has no point masses.
• $F_X$ can be discontinuous at uncountably many points.
• The continuity points of $F_X$ are dense.
• For any $j$, we have $\lim_{x_j \to \infty} F_X(x) = F_{X_{I \setminus \{j\}}}(x_{I \setminus \{j\}})$.
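9.4 (Joint pdf). A function $f$ is a multivariate or joint pdf (commonly called a density) if and only if it satisfies the following two conditions:
(a) $f \ge 0$;
(b) $\int f(x)\,dx = 1$.
• The integrability of the pdf $f$ implies that for all $i \in I$, $\lim_{x_i \to \pm\infty} f(x_I) = 0$.
• $P[X \in A] = \int_A f_X(x)\,dx$.
• Remarks: Roughly, we may say the following:
(a) $f_X(x) = \lim_{\forall i,\, \Delta x_i \to 0} \frac{P[\forall i,\; x_i < X_i \le x_i + \Delta x_i]}{\prod_i \Delta x_i} = \lim_{\Delta x \to 0} \frac{P[x < X \le x + \Delta x]}{\prod_i \Delta x_i}$
(b) For $I = [n]$,
◦ $F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_X(x)\,dx_n \cdots dx_1$
◦ $f_X(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_X(x)$.
◦ $\frac{\partial}{\partial u} F_X(u, \ldots, u) = \sum_{k=1}^{n} \int_{(-\infty,u]^{n-1}} f_X\left(v^{(k)}\right) dx_{[n] \setminus \{k\}}$ where $v_j^{(k)} = \begin{cases} u, & j = k \\ x_j, & j \ne k \end{cases}$.
For example, $\frac{\partial}{\partial u} F_{X,Y}(u, u) = \int_{-\infty}^{u} f_{X,Y}(x, u)\,dx + \int_{-\infty}^{u} f_{X,Y}(u, y)\,dy$.
(c) $f_{X_1^n}(x_1^n) = E\left[\prod_{i=1}^n \delta(X_i - x_i)\right]$.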
• The level sets of a density are sets where the density is constant.
9.5. Consider two random vectors $X : \Omega \to \mathbb{R}^{d_1}$ and $Y : \Omega \to \mathbb{R}^{d_2}$. Define $Z = (X, Y) : \Omega \to \mathbb{R}^{d_1 + d_2}$. Suppose that $Z$ has density $f_{X,Y}(x, y)$.
(a) Marginal Density: $f_Y(y) = \int_{\mathbb{R}^{d_1}} f_{X,Y}(x, y)\,dx$ and $f_X(x) = \int_{\mathbb{R}^{d_2}} f_{X,Y}(x, y)\,dy$.
• In other words, to obtain the marginal densities, integrate out the unwanted variables.
• $f_{X_{I \setminus \{i\}}}(x_{I \setminus \{i\}}) = \int f_{X_I}(x_I)\,dx_i$.
(b) $f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$.
(c) $F_{Y|X}(y|x) = \int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_{d_2}} f_{Y|X}(t|x)\,dt_{d_2} \cdots dt_1$.
9.6. $P[(X + a_1, X + b_1) \cap (Y + a_2, Y + b_2) = \emptyset] = \int_A f_{X,Y}(x, y)\,dx\,dy$ where $A$ is defined in (1.10).
9.7. Expectation and covariance:
(a) The expectation of a random vector $X$ is defined to be the vector of expectations of its entries. $EX$ is usually denoted by $\mu_X$ or $m_X$.
(b) For non-random matrices $A$, $B$, $C$ and a random vector $X$, $E[AXB + C] = A\,EX\,B + C$.
(c) The correlation matrix $R_X$ of a random vector $X$ is defined by $R_X = E\left[XX^T\right]$. Note that it is symmetric.
(d) The covariance matrix $C_X$ of a random vector $X$ is defined as
$C_X = \Lambda_X = \mathrm{Cov}[X] = E\left[(X - EX)(X - EX)^T\right] = E\left[XX^T\right] - (EX)(EX)^T = R_X - (EX)(EX)^T$.
(i) The $ij$-entry of $\mathrm{Cov}[X]$ is simply $\mathrm{Cov}[X_i, X_j]$.
(ii) $\Lambda_X$ is symmetric.
i. Properties of a symmetric matrix
A. All eigenvalues are real.
B. Eigenvectors corresponding to different eigenvalues are not just linearly independent, but mutually orthogonal.
C. Diagonalizable.
ii. Spectral theorem: The following equivalent statements hold for a symmetric matrix.
A. There exists a complete set of eigenvectors; that is, there exists an orthonormal basis $u^{(1)}, \ldots, u^{(n)}$ of $\mathbb{R}^n$ with $C_X u^{(k)} = \lambda_k u^{(k)}$.
B. $C_X$ is diagonalizable by an orthogonal matrix $U$ ($UU^T = U^T U = I$).
C. $C_X$ can be represented as $C_X = U\Lambda U^T$ where $U$ is an orthogonal matrix whose columns are eigenvectors of $C_X$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix with the eigenvalues of $C_X$.
(iii) Always nonnegative definite (positive semidefinite). That is, $\forall a \in \mathbb{R}^n$ where $n$ is the dimension of $X$, $a^T C_X a = E\left[\left(a^T(X - \mu_X)\right)^2\right] \ge 0$.
• $\det(C_X) \ge 0$.
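(iv) We can define $C_X^{\frac{1}{2}} = \sqrt{C_X}$ to be $\sqrt{C_X} = U\sqrt{\Lambda}\,U^T$ where $\sqrt{\Lambda} = \mathrm{diag}\left(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}\right)$.
i. $\det\sqrt{C_X} = \sqrt{\det C_X}$.
ii. $\sqrt{C_X}$ is nonnegative definite.
iii. $\left(\sqrt{C_X}\right)^2 = \sqrt{C_X}\sqrt{C_X} = C_X$.
(v) Suppose, furthermore, that $C_X$ is positive definite.
i. $C_X^{-1} = U\Lambda^{-1}U^T$ where $\Lambda^{-1} = \mathrm{diag}\left(\frac{1}{\lambda_1}, \ldots, \frac{1}{\lambda_n}\right)$.
ii. $C_X^{-\frac{1}{2}} = \sqrt{C_X^{-1}} = \left(\sqrt{C_X}\right)^{-1} = UDU^T$ where $D = \mathrm{diag}\left(\frac{1}{\sqrt{\lambda_1}}, \ldots, \frac{1}{\sqrt{\lambda_n}}\right)$.
iii. $\sqrt{C_X}\, C_X^{-1} \sqrt{C_X} = I$.
iv. $C_X^{-1}$, $C_X^{\frac{1}{2}}$, $C_X^{-\frac{1}{2}}$ are all positive definite (and hence are all symmetric).
v. $\left(C_X^{-\frac{1}{2}}\right)^2 = C_X^{-1}$.
vi. Let $Y = C_X^{-\frac{1}{2}}(X - EX)$. Then, $EY = 0$ and $C_Y = I$.
(vi) For i.i.d. $X_i$, each with variance $\sigma^2$, $\Lambda_X = \sigma^2 I$.
(e) $\mathrm{Cov}[AX + b] = A\,\mathrm{Cov}[X]\,A^T$
• $\mathrm{Cov}\left[X^T h\right] = \mathrm{Cov}\left[h^T X\right] = h^T\,\mathrm{Cov}[X]\,h$ where $h$ is a vector with the same dimension as $X$.
(f) For $Y = X + Z$, $\Lambda_Y = \Lambda_X + \Lambda_{XZ} + \Lambda_{ZX} + \Lambda_Z$.
• When $X$ and $Z$ are independent, $\Lambda_Y = \Lambda_X + \Lambda_Z$.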
• For $Y_i = X + Z_i$ where $X$ and $Z$ are independent, $\Lambda_Y = \sigma_X^2 + \Lambda_Z$.
(g) $\Lambda_{X+Y} + \Lambda_{X-Y} = 2\Lambda_X + 2\Lambda_Y$
(h) $\det(\Lambda_{X+Y}) \le 2^n \det(\Lambda_X + \Lambda_Y)$ where $n$ is the dimension of $X$ and $Y$.
(i) If $Y = (X, X, \ldots, X)$ where $X$ is a random variable with variance $\sigma_X^2$, then $\Lambda_Y = \sigma_X^2 \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix}$. Note that $Y = \mathbf{1}X$ where $\mathbf{1}$ has the same dimension as $Y$.
(j) Let $X$ be a zero-mean random vector whose covariance matrix is singular. Then, one of the $X_i$ is a deterministic linear combination of the remaining components. In other words, there is a nonzero vector $a$ such that $a^T X = 0$. In general, if $\Lambda_X$ is singular, then there is a nonzero vector $a$ such that $a^T X = a^T EX$.
(k) If $X$ and $Y$ are both random vectors (not necessarily of the same dimension), then their cross-covariance matrix is
$\Lambda_{XY} = C_{XY} = \mathrm{Cov}[X, Y] = E\left[(X - EX)(Y - EY)^T\right]$.
Note that the $ij$-entry of $C_{XY}$ is $\mathrm{Cov}[X_i, Y_j]$.
• $C_{YX} = (C_{XY})^T$.
(l) $R_{XY} = E\left[XY^T\right]$.
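(m) If we stack $X$ and $Y$ into a composite vector $Z = \begin{pmatrix} X \\ Y \end{pmatrix}$, then
$C_Z = \begin{pmatrix} C_X & C_{XY} \\ C_{YX} & C_Y \end{pmatrix}$.
(n) $X$ and $Y$ are said to be uncorrelated if $C_{XY} = 0$, the zero matrix. In which case,
$C_{\binom{X}{Y}} = \begin{pmatrix} C_X & 0 \\ 0 & C_Y \end{pmatrix}$,
a block diagonal matrix.
9.8. The joint characteristic function of an $n$-dimensional random vector $X$ is defined by
$\varphi_X(v) = E\left[e^{jv^T X}\right] = E\left[e^{j\sum_i v_i X_i}\right]$.
When $X$ has a joint density $f_X$, $\varphi_X$ is just the $n$-dimensional Fourier transform:
$\varphi_X(v) = \int e^{jv^T x} f_X(x)\,dx$,
and the joint density can be recovered using the multivariate inverse Fourier transform:
$f_X(x) = \frac{1}{(2\pi)^n}\int e^{-jv^T x}\varphi_X(v)\,dv$.
(a) $\varphi_X(u) = E e^{iu^T X}$.
(b) $f_X(x) = \frac{1}{(2\pi)^n}\int e^{-jv^T x}\varphi_X(v)\,dv$.
(c) For $Y = AX + b$, $\varphi_Y(u) = e^{ib^T u}\varphi_X\left(A^T u\right)$.
(d) $\varphi_X(-u) = \overline{\varphi_X(u)}$.
(e) $\varphi_X(u) = \varphi_{X,Y}(u, 0)$
(f) Moment:
$\frac{\partial^{\sum_{i=1}^n \upsilon_i}}{\partial v_1^{\upsilon_1}\partial v_2^{\upsilon_2}\cdots\partial v_n^{\upsilon_n}}\varphi_X(0) = j^{\sum_{i=1}^n \upsilon_i}\, E\left[\prod_{i=1}^n X_i^{\upsilon_i}\right]$
(i) $\frac{\partial}{\partial v_i}\varphi_X(0) = jEX_i$.
(ii) $\frac{\partial^2}{\partial v_i \partial v_j}\varphi_X(0) = j^2 E[X_i X_j]$.
(g) Central Moment:
(i) $\left.\frac{\partial}{\partial v_i}\ln(\varphi_X(v))\right|_{v=0} = jEX_i$.
(ii) $\left.\frac{\partial^2}{\partial v_i \partial v_j}\ln(\varphi_X(v))\right|_{v=0} = -\mathrm{Cov}[X_i, X_j]$.
(iii) $\left.\frac{\partial^3}{\partial v_i \partial v_j \partial v_k}\ln(\varphi_X(v))\right|_{v=0} = j^3 E[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)]$.
(iv) $E[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)(X_\ell - EX_\ell)] = \Psi_{ijk\ell} + \Psi_{ij}\Psi_{k\ell} + \Psi_{ik}\Psi_{j\ell} + \Psi_{i\ell}\Psi_{jk}$ where $\Psi_{ijk\ell} = \left.\frac{\partial^4}{\partial v_i \partial v_j \partial v_k \partial v_\ell}\ln(\varphi_X(v))\right|_{v=0}$.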
Remark: we do not require that any or all of $i$, $j$, $k$, and $\ell$ be distinct.
9.9 (Decorrelation and the Karhunen-Loève expansion). Let $X$ be an $n$-dimensional random vector with zero mean and covariance matrix $C$. $X$ has the representation $X = PY$, where the components of $Y$ are uncorrelated and $P$ is an $n \times n$ orthonormal matrix. This representation is called the Karhunen-Loève expansion.
• $Y = P^T X$
• $P^T = P^{-1}$ is called a decorrelating transformation.
• Diagonalize $C = PDP^T$ where $D = \mathrm{diag}(\lambda_i)$. Then, $\mathrm{Cov}[Y] = D$.
• In MATLAB, use [P,D] = eig(C). To extract the diagonal elements of D as a vector, use the command d = diag(D).
• If $C$ is singular (equivalently, if some of the $\lambda_i$ are zero), we only need to keep around the $Y_i$ for which $\lambda_i > 0$ and can throw away the other components of $Y$ without any loss of information. This is because $\lambda_i = EY_i^2$ and $EY_i^2 = 0$ if and only if $Y_i \equiv 0$ a.s. [9, p 338–339].

9.1 Random Sequence
9.10. [10, p 9–10] Given a countable family $X_1, X_2, \ldots$ of r.v.'s, their statistical properties are regarded as defined by prescribing, for each integer $n \ge 1$ and every finite set $I \subset \mathbb{N}$, the joint distribution function $F_{X_I}$ of the random vector $X_I = (X_i : i \in I)$. Of course, some consistency requirements must be imposed upon the infinite family $F_{X_I}$, namely, that for $j \in I$
(a) $F_{X_{I \setminus \{j\}}}\left(x_{I \setminus \{j\}}\right) = \lim_{x_j \to \infty} F_{X_I}(x_I)$ and that
(b) the distribution function obtained from $F_{X_I}(x_I)$ by interchanging two of the indices $i_1, i_2 \in I$ and the corresponding variables $x_{i_1}$ and $x_{i_2}$ should be invariant. This simply means that the manner of labeling the random variables $X_1, X_2, \ldots$ is not relevant.
The joint distributions $\{F_{X_I}\}$ are called the finite-dimensional distributions associated with $X_{\mathbb{N}} = (X_n)_{n=1}^{\infty}$.

10 Transform Methods
10.1 Probability Generating Function
Definition 10.1. [9][10, p 11] Let $X$ be a discrete random variable taking only nonnegative integer values. The probability generating function (pgf) of $X$ is
$G_X(z) = E\left[z^X\right] = \sum_{k=0}^{\infty} z^k P[X = k]$.
• In the summation, the first term (the $k = 0$ term) is $P[X = 0]$ even when $z = 0$.
• $G_X(0) = P[X = 0]$
• $G(z^{-1})$ is the $z$ transform of the pmf.
• $G_X(1) = 1$.
• The name derives from the fact that it can be used to compute the pmf.
• It is finite at least for any complex $z$ with $|z| \le 1$. Hence the pgf is well defined for $|z| \le 1$.
Definition 10.2. $G_X^{(k)}(1) = \lim_{z \nearrow 1} G_X^{(k)}(z)$.
10.3. Properties
(a) $G_X$ is infinitely differentiable at least for $|z| < 1$.
(b) Probability generating property:
$\frac{1}{k!}\left.\frac{d^{(k)}}{dz^{(k)}} G_X(z)\right|_{z=0} = P[X = k]$.
(c) Moment generating property:
$\left.\frac{d^{(k)}}{dz^{(k)}} G_X(z)\right|_{z=1} = E\left[\prod_{i=0}^{k-1}(X - i)\right]$.
The RHS is called the $k$th factorial moment of $X$.
(d) In particular,
$EX = G_X'(1)$
$EX^2 = G_X''(1) + G_X'(1)$
$\mathrm{Var}\,X = G_X''(1) + G_X'(1) - \left(G_X'(1)\right)^2$
(e) The pgf of a sum of independent random variables is the product of the individual pgfs. Let $S = \sum_{i=1}^n X_i$ where the $X_i$'s are independent. Then
$G_S(z) = \prod_{i=1}^n G_{X_i}(z)$.
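For a pmf with finite support, the pgf is a polynomial whose coefficients are the probabilities, so properties (c) to (e) can be checked directly. The sketch below stores a pmf as a coefficient list (index $k$ holds $P[X = k]$); this representation and the function names are illustrative choices.

```python
from fractions import Fraction

def pgf_product(p, q):
    """Coefficients of G_X(z) * G_Y(z): the pmf of X + Y for independent X, Y."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def factorial_moment(pmf, k):
    """k-th derivative of the pgf at z = 1: E[X (X-1) ... (X-k+1)]."""
    total = Fraction(0)
    for x, px in enumerate(pmf):
        term = Fraction(1)
        for i in range(k):
            term *= (x - i)
        total += term * px
    return total

def mean_and_variance(pmf):
    """EX = G'(1) and Var X = G''(1) + G'(1) - (G'(1))^2."""
    g1 = factorial_moment(pmf, 1)
    g2 = factorial_moment(pmf, 2)
    return g1, g2 + g1 - g1 ** 2

# Two independent Bernoulli(1/2) variables: the product pgf gives B(2, 1/2).
bernoulli = [Fraction(1, 2), Fraction(1, 2)]
binomial2 = pgf_product(bernoulli, bernoulli)
```

Multiplying the two Bernoulli pgfs yields the coefficients $\frac{1}{4}, \frac{1}{2}, \frac{1}{4}$, and the derivative formulas return mean $1$ and variance $\frac{1}{2}$, matching $np$ and $np(1-p)$.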
10.2
Moment Generating Function
Definition 10.4. The moment generating function of a random variable X is defined as $M_X(s) = E\left[e^{sX}\right] = \int e^{sx} P^X(dx)$ for all s for which this is finite.
10.5. Properties of moment generating function
(a) $M_X(s)$ is defined on some interval containing 0. It is possible that the interval consists of 0 alone.
(i) If $X \ge 0$, this interval contains $(-\infty, 0]$.
(ii) If $X \le 0$, this interval contains $[0, \infty)$.
(b) Suppose that $M(s)$ is defined throughout an interval $(-s_0, s_0)$ where $s_0 > 0$, i.e. it exists (is finite) in some neighborhood of 0. Then,
(i) X has finite moments of all order: $E\left[|X|^k\right] < \infty$ $\forall k \ge 0$
(ii) $M(s) = \sum_{k=0}^{\infty} \frac{s^k}{k!} E\left[X^k\right]$, for complex-valued s with $|s| < s_0$ [1, eqn (21.22) p 278]. Thus $M(s)$ has a Taylor expansion about 0 with positive radius of convergence.
i. If $M(s)$ can somehow be calculated and expanded in a series $\sum_{k=0}^{\infty} a_k s^k$, and if the coefficients $a_k$ can be identified, then $a_k = \frac{1}{k!} E\left[X^k\right]$. That is, $E\left[X^k\right] = k!\, a_k$.
(iii) $M^{(k)}(0) = E\left[X^k\right] = \int x^k P^X(dx)$
(c) If M is defined in some neighborhood of s, then $M^{(k)}(s) = \int x^k e^{sx} P^X(dx)$
(d) See also Chernoff bound.
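As an illustration of 10.5(b)(ii), consider an exponential random variable with rate $\lambda$ (an assumed example, not taken from the notes). Its mgf is $M(s) = \lambda/(\lambda - s) = \sum_k (s/\lambda)^k$ for $|s| < \lambda$, so $a_k = \lambda^{-k}$ and the rule $E[X^k] = k!\,a_k$ gives $k!/\lambda^k$. The sketch below compares this with a direct numerical integration of $x^k \lambda e^{-\lambda x}$; the function names and the trapezoidal rule settings are illustrative choices.

```python
from math import exp, factorial

def moment_from_series(k, lam):
    """E[X^k] = k! * a_k with a_k = lam**(-k), the k-th Taylor coefficient of
    M(s) = lam / (lam - s) about s = 0 (exponential distribution, rate lam)."""
    a_k = lam ** (-k)
    return factorial(k) * a_k

def moment_by_quadrature(k, lam, upper=60.0, steps=200_000):
    """Trapezoidal approximation of the integral of x**k * lam * exp(-lam x)
    over [0, upper]; the tail beyond `upper` is negligible for lam = 2."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x**k * lam * exp(-lam * x)
    return total * h

lam = 2.0
series_moments = [moment_from_series(k, lam) for k in range(5)]
numeric_moments = [moment_by_quadrature(k, lam) for k in range(5)]

for k, (a, b) in enumerate(zip(series_moments, numeric_moments)):
    print(k, a, b)
```

The two columns should agree to several digits; the series route needs no integration once the coefficients $a_k$ are identified.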
10.3 One-Sided Laplace Transform
10.6. The one-sided Laplace transform of a nonnegative random variable X is defined for $s \ge 0$ by
$$L(s) = M(-s) = E\left[e^{-sX}\right] = \int_{[0,\infty)} e^{-sx} P^X(dx).$$
• Note that 0 is included in the range of integration.
• Always finite because $e^{-sx} \le 1$. In fact, it is a decreasing function of s.
• $L(0) = 1$
• $L(s) \in [0, 1]$
(a) Derivative: For $s > 0$, $L^{(k)}(s) = (-1)^k \int x^k e^{-sx} P^X(dx) = (-1)^k E\left[X^k e^{-sX}\right]$
(i) $\lim_{s \downarrow 0} \frac{d^n}{ds^n} L(s) = (-1)^n E\left[X^n\right]$
• Because the value at 0 can be $\infty$, it does not make sense to talk about $\frac{d^n}{ds^n} L(0)$ for $n > 1$.
(ii) $\forall s \ge 0$, $L(s)$ is differentiable and $\frac{d}{ds} L(s) = -E\left[X e^{-sX}\right]$, where at 0, this is the right-derivative. $\frac{d}{ds} L(s)$ is finite for $s > 0$.
• $\frac{d}{ds} L(0) = -E[X]$, where $E[X] \in [0, \infty]$.
(b) Inversion formula: If $F_X$ is continuous at $t > 0$, then
$$F_X(t) = \lim_{s \to \infty} \sum_{k=0}^{\lfloor st \rfloor} \frac{(-1)^k}{k!} s^k L^{(k)}(s).$$
(c) $F_X$ and $P^X$ are determined by $L_X(s)$.
(i) In fact, they are determined by the values of $L_X(s)$ for s beyond any arbitrary $s_0$. (That is, we don't need to know $L_X(s)$ for small s.) Knowing $L_X(s)$ on $\mathbb{N}$ is also sufficient.
(ii) Let $\mu$ and $\nu$ be probability measures on $[0, \infty)$. If $\exists s_0 \ge 0$ such that $\int e^{-sx} \mu(dx) = \int e^{-sx} \nu(dx)$ $\forall s \ge s_0$, then $\mu = \nu$.
(iii) Let $f_1, f_2$ be real functions on $[0, \infty)$. If $\exists s_0 \ge 0$ such that $\forall s \ge s_0$, $\int_{[0,\infty)} e^{-sx} f_1(x)\,dx = \int_{[0,\infty)} e^{-sx} f_2(x)\,dx$, then $f_1 = f_2$ Lebesgue-a.e.
(d) Let $X_1, \ldots, X_n$ be independent nonnegative random variables. Then
$$L_{\sum_{i=1}^{n} X_i}(s) = \prod_{i=1}^{n} L_{X_i}(s).$$
(e) Suppose F is a distribution function with corresponding Laplace transform L. Then
(i) $\int_{[0,\infty)} e^{-sx} F(x)\,dx = \frac{1}{s} L(s)$
(ii) $\int_{[0,\infty)} e^{-sx} (1 - F(x))\,dx = \frac{1}{s} (1 - L(s))$
[16, p 183].
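Two of the properties in 10.6 are easy to check numerically: the product rule (d) for independent summands and the right-derivative at 0 in (a). The sketch below uses an exponential distribution with rate `lam` as an assumed example, with $L(s) = \lambda/(\lambda + s)$, and the fact that the sum of two independent such variables has the Erlang density $\lambda^2 x e^{-\lambda x}$. The function names and quadrature settings are illustrative.

```python
from math import exp

def laplace_by_quadrature(density, s, upper=40.0, steps=100_000):
    """Trapezoidal approximation of the integral of exp(-s x) * density(x)
    over [0, upper]; assumes the density has negligible mass beyond `upper`."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * exp(-s * x) * density(x)
    return total * h

lam = 1.5

def expo_density(x):
    return lam * exp(-lam * x)

def erlang2_density(x):
    # density of the sum of two independent exponential(lam) variables
    return lam * lam * x * exp(-lam * x)

def expo_laplace(s):
    # closed form L(s) = lam / (lam + s) for the exponential distribution
    return lam / (lam + s)

s = 0.8
numeric_single = laplace_by_quadrature(expo_density, s)
numeric_sum = laplace_by_quadrature(erlang2_density, s)
product = expo_laplace(s) ** 2          # property (d)

# Property (a): the right-derivative of L at 0 approximates -E[X] = -1/lam.
step = 1e-6
right_derivative_at_0 = (expo_laplace(step) - expo_laplace(0.0)) / step

print(numeric_single, expo_laplace(s))
print(numeric_sum, product)
print(right_derivative_at_0, -1.0 / lam)
```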
10.4 Characteristic Function
10.7. The characteristic function (abbreviated c.f. or ch.f.) of a probability measure $\mu$ on the line is defined for real t by $\varphi(t) = \int e^{itx} \mu(dx)$. A random variable X has characteristic function $\varphi_X(t) = E\left[e^{itX}\right] = \int e^{itx} P^X(dx)$.
(a) Always exists because $|\varphi(t)| \le \int \left|e^{itx}\right| \mu(dx) = \int 1\, \mu(dx) = 1 < \infty$
(b) If X has a density, then $\varphi_X(t) = \int e^{itx} f_X(x)\,dx$
(c) $\varphi(0) = 1$
(d) $\forall t \in \mathbb{R}$, $|\varphi(t)| \le 1$
(e) $\varphi$ is uniformly continuous.
(f) Suppose that all moments of X exist and $\forall t \in \mathbb{R}$, $E e^{tX} < \infty$. Then $\varphi_X(t) = \sum_{k=0}^{\infty} \frac{(it)^k}{k!} EX^k$
(g) If $E\left[|X|^k\right] < \infty$, then $\varphi^{(k)}(t) = i^k E\left[X^k e^{itX}\right]$ and $\varphi^{(k)}(0) = i^k E\left[X^k\right]$.
(h) Riemann-Lebesgue theorem: If X has a density, then $\varphi_X(t) \to 0$ as $|t| \to \infty$
(i) $\varphi_{aX+b}(t) = e^{itb} \varphi(at)$
(j) Conjugate Symmetry Property: $\varphi_{-X}(t) = \varphi_X(-t) = \overline{\varphi_X(t)}$
• $X \overset{D}{=} -X$ iff $\varphi_X$ is real-valued.
• $|\varphi_X|$ is even.
(k) X is a.s. integer-valued if and only if $\varphi_X(2\pi) = 1$
(l) If $X_1, X_2, \ldots, X_n$ are independent, then $\varphi_{\sum_{j=1}^{n} X_j}(t) = \prod_{j=1}^{n} \varphi_{X_j}(t)$
(m) Inversion
(i) The inversion formula: If the probability measure $\mu$ has characteristic function $\varphi$ and if $\mu\{a\} = \mu\{b\} = 0$, then
$$\mu(a, b] = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \varphi(t)\,dt.$$
i. In fact, if $a < b$, then $\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \varphi(t)\,dt = \mu(a, b) + \frac{1}{2} \mu\{a, b\}$
ii. Equivalently, if F is the distribution function, and a, b are continuity points of F, then $F(b) - F(a) = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \varphi(t)\,dt$
(ii) Fourier inversion: Suppose that $\int |\varphi_X(t)|\,dt < \infty$. Then X is absolutely continuous with
i. bounded continuous density $f(x) = \frac{1}{2\pi} \int e^{-itx} \varphi(t)\,dt$
ii. $\mu(a, b] = \frac{1}{2\pi} \int \frac{e^{-ita} - e^{-itb}}{it} \varphi(t)\,dt$
(n) Continuity Theorem: $X_n \xrightarrow{D} X$ if and only if $\forall t$, $\varphi_{X_n}(t) \to \varphi_X(t)$ (pointwise).
(o) $\varphi_X$ on the complex plane for $X \ge 0$
(i) $\varphi_X(z)$ is defined in the complex plane for $\operatorname{Im} z \ge 0$
i. $|\varphi_X(z)| \le 1$ for such z
(ii) In the domain $\operatorname{Im}\{z\} > 0$, $\varphi_X$ is analytic and continuous including the boundary $\operatorname{Im}\{z\} = 0$
(iii) $\varphi_X$ determines uniquely a function $L_X(s)$ of real argument $s \ge 0$ which is equal to $L_X(s) = \varphi_X(is) = E e^{-sX}$. Conversely, $L_X(s)$ on the half-line $s \ge 0$ determines uniquely $\varphi_X$
(p) If $E|X|^n < \infty$, then
$$\left| \varphi_X(t) - \sum_{k=0}^{n} \frac{(it)^k}{k!} EX^k \right| \le E\left[\min\left\{ \frac{|t|^{n+1}}{(n+1)!} |X|^{n+1},\ \frac{2|t|^n}{n!} |X|^n \right\}\right] \qquad (22)$$
and $\varphi_X(t) = \sum_{k=0}^{n} \frac{(it)^k}{k!} EX^k + t^n \beta(t)$ where $\lim_{|t| \to 0} \beta(t) = 0$, or equivalently,
$$\varphi_X(t) = \sum_{k=0}^{n} \frac{(it)^k}{k!} EX^k + o(t^n) \qquad (23)$$
(i) $|\varphi_X(t) - 1| \le E\left[\min\{|tX|, 2\}\right]$
• If $EX = 0$, then $|\varphi_X(t) - 1| \le E\left[\min\left\{\frac{t^2}{2} X^2, 2|t||X|\right\}\right] \le \frac{t^2}{2} EX^2$
(ii) For integrable X, $|\varphi_X(t) - 1 - itEX| \le E\left[\min\left\{\frac{t^2}{2} X^2, 2|t||X|\right\}\right] \le \frac{t^2}{2} EX^2$
(iii) For X with finite $EX^2$,
i. $\left|\varphi_X(t) - 1 - itEX + \frac{1}{2} t^2 EX^2\right| \le E\left[\min\left\{\frac{|t|^3}{6} |X|^3, t^2 |X|^2\right\}\right]$
ii. $\varphi_X(t) = 1 + itEX - \frac{1}{2} t^2 EX^2 + t^2 \beta(t)$ where $\lim_{|t| \to 0} \beta(t) = 0$
10.8. φX (u) = MX (ju).
10.9. If X is a continuous r.v. with density $f_X$, then $\varphi_X(t) = \int e^{jtx} f_X(x)\,dx$ (Fourier transform) and $f_X(x) = \frac{1}{2\pi} \int e^{-jtx} \varphi_X(t)\,dt$ (Fourier inversion formula).
• $\varphi_X(t)$ is the Fourier transform of $f_X$ evaluated at $-t$.
• $\varphi_X$ inherits the properties of a Fourier transform.
(a) For nonnegative $a_i$ such that $\sum_i a_i = 1$, if $f_Y = \sum_i a_i f_{X_i}$, then $\varphi_Y = \sum_i a_i \varphi_{X_i}$.
(b) If $f_X$ is even, then $\varphi_X$ is also even.
• If $f_X$ is even, $\varphi_X = \varphi_{-X}$.
10.10. Linear combination of independent random variables: Suppose $X_1, \ldots, X_n$ are independent. Let $Y = \sum_{i=1}^{n} a_i X_i$. Then, $\varphi_Y(t) = \prod_{i=1}^{n} \varphi_{X_i}(a_i t)$. Furthermore, if $|a_i| = 1$ and all $f_{X_i}$ are even, then $\varphi_Y(t) = \prod_{i=1}^{n} \varphi_{X_i}(t)$.
(a) $\varphi_{X+Y}(t) = \varphi_X(t) \varphi_Y(t)$.
(b) $\varphi_{X-Y}(t) = \varphi_X(t) \varphi_Y(-t)$.
(c) If $f_Y$ is even, $\varphi_{X-Y} = \varphi_{X+Y} = \varphi_X \varphi_Y$.
10.11. Characteristic function for a mixture of distributions: Consider nonnegative $a_i$ such that $\sum_i a_i = 1$. Let $P_i$ be a probability measure with corresponding ch.f. $\varphi_i$. Then, the ch.f. of $\sum_i a_i P_i$ is $\sum_i a_i \varphi_i$.
(a) Discrete r.v.: Suppose $p_i$ is a pmf with corresponding ch.f. $\varphi_i$. Then, the ch.f. of $\sum_i a_i p_i$ is $\sum_i a_i \varphi_i$.
(b) Absolutely continuous r.v.: Suppose $f_i$ is a pdf with corresponding ch.f. $\varphi_i$. Then, the ch.f. of $\sum_i a_i f_i$ is $\sum_i a_i \varphi_i$.
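For discrete random variables the characteristic function is a finite sum, so several of the properties in 10.7 and 10.10 can be verified directly. The sketch below is illustrative only: the fair die and biased coin, the constants `a`, `b`, `t`, and the helper `char_fn` are assumptions made for the example.

```python
import cmath
from collections import defaultdict

def char_fn(pmf, t):
    """phi(t) = E[exp(i t X)] for a pmf given as a dict {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}     # fair die, integer-valued
coin = {0: 0.4, 1: 0.6}                   # biased coin, independent of the die
a, b, t = 2.5, -1.0, 0.7

# 10.7(i): phi_{aX+b}(t) = exp(i t b) * phi_X(a t)
affine_pmf = {a * x + b: p for x, p in die.items()}
lhs_affine = char_fn(affine_pmf, t)
rhs_affine = cmath.exp(1j * t * b) * char_fn(die, a * t)

# 10.10(a): phi_{X+Y} = phi_X * phi_Y for independent X and Y
sum_pmf = defaultdict(float)
for x, px in die.items():
    for y, py in coin.items():
        sum_pmf[x + y] += px * py
lhs_sum = char_fn(sum_pmf, t)
rhs_sum = char_fn(die, t) * char_fn(coin, t)

# 10.7(j): phi_X(-t) is the complex conjugate of phi_X(t)
conjugate_gap = abs(char_fn(die, -t) - char_fn(die, t).conjugate())

# 10.7(k): an integer-valued X has phi_X(2 pi) = 1
phi_at_2pi = char_fn(die, 2 * cmath.pi)

print(abs(lhs_affine - rhs_affine), abs(lhs_sum - rhs_sum), conjugate_gap, phi_at_2pi)
```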
11 Functions of random variables
Definition 11.1. The preimage or inverse image of a set B is defined by $g^{-1}(B) = \{x : g(x) \in B\}$.
11.2. For discrete X, suppose $Y = g(X)$. Then, $p_Y(y) = \sum_{x \in g^{-1}(\{y\})} p_X(x)$. The joint pmf of Y and X is given by $p_{X,Y}(x, y) = p_X(x) 1[y = g(x)]$.
• In most cases, we can show that X and Y are not independent: pick a point $x^{(0)}$ such that $p_X\left(x^{(0)}\right) > 0$. Pick a point $y^{(1)}$ such that $y^{(1)} \ne g\left(x^{(0)}\right)$ and $p_Y\left(y^{(1)}\right) > 0$. Then, $p_{X,Y}\left(x^{(0)}, y^{(1)}\right) = 0$ but $p_X\left(x^{(0)}\right) p_Y\left(y^{(1)}\right) > 0$. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c, then we won't be able to find $y^{(1)}$. Of course, this is to be expected because we know that a constant is always independent of other random variables.
11.1 SISO case
There are many techniques for finding the cdf and pdf of $Y = g(X)$.
(a) One may first find $F_Y(y) = P[g(X) \le y]$ and then find $f_Y$ from $F_Y(y)$, in which case Leibniz' rule in (38) will be useful.
(b) Formula (25) below provides a convenient way of arriving at $f_Y$ from $f_X$ without going through $F_Y$.
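11.3 (Linear transformation). $Y = aX + b$ where $a \ne 0$.
$$F_Y(y) = \begin{cases} F_X\left(\frac{y-b}{a}\right), & a > 0 \\ 1 - F_X\left(\left(\frac{y-b}{a}\right)^-\right), & a < 0 \end{cases}$$
(a) Suppose X is absolutely continuous,
$$f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right). \qquad (24)$$
In fact (24) holds even for mixed r.v. if we allow delta functions because $\frac{1}{|a|} \delta\left(\frac{y-b}{a} - x_k\right) = \delta(y - (a x_k + b))$.
(b) Suppose X is discrete,
$$p_Y(y) = p_X\left(\frac{y-b}{a}\right).$$
If we write $f_X(x) = \sum_k p_X(x_k) \delta(x - x_k)$, we have $f_Y(y) = \sum_k p_X(x_k) \delta(y - (a x_k + b))$.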
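11.4 (Power Law Function). $Y = X^n$, $n \in \mathbb{N}$ or $n \in (0, \infty)$.
(a) n odd: $F_Y(y) = F_X\left(y^{1/n}\right)$ and $f_Y(y) = \frac{1}{n} y^{\frac{1}{n} - 1} f_X\left(y^{1/n}\right)$.
(b) n even:
$$F_Y(y) = \begin{cases} F_X\left(y^{1/n}\right) - F_X\left(\left(-y^{1/n}\right)^-\right), & y \ge 0 \\ 0, & y < 0 \end{cases}$$
and
$$f_Y(y) = \begin{cases} \frac{1}{n} y^{\frac{1}{n} - 1} \left( f_X\left(y^{1/n}\right) + f_X\left(-y^{1/n}\right) \right), & y \ge 0 \\ 0, & y < 0. \end{cases}$$
Again, the density $f_Y$ in the above formula holds when X is absolutely continuous. Note that when $n > 1$, the factor $y^{\frac{1}{n} - 1}$ is unbounded near 0, so $f_Y$ is not defined at 0. If we allow delta functions, then the density formulas above are also valid for mixed r.v. because $\frac{1}{n} y^{\frac{1}{n} - 1} \delta\left(\pm y^{1/n} - x_k\right) = \delta\left(y - (\pm x_k)^n\right)$.
• Let X be an absolutely continuous random variable. The density of $Y = X^2$ is
$$f_Y(y) = \begin{cases} 0, & y < 0 \\ \frac{1}{2\sqrt{y}} \left( f_X\left(\sqrt{y}\right) + f_X\left(-\sqrt{y}\right) \right), & y \ge 0. \end{cases}$$
11.5. In general, for $Y = g(X)$, we solve the equation $y = g(x)$. Denoting its real roots by $x_k$, we have
$$f_Y(y) = \sum_k \frac{f_X(x_k)}{|g'(x_k)|}. \qquad (25)$$
If $g(x) = c$ = constant for every x in the interval $(a, b)$, then $F_Y(y)$ is discontinuous at $y = c$. Hence, $f_Y(y)$ contains an impulse $(F_X(b) - F_X(a)) \delta(y - c)$ [14, p 93–94].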
• To see this, consider the case when there is a unique x such that $g(x) = y$. Then, for small $\Delta x$ and $\Delta y$, $P[y < Y \le y + \Delta y] = P[x < X \le x + \Delta x]$ where $(y, y + \Delta y] = g((x, x + \Delta x])$ is the image of the interval $(x, x + \Delta x]$. (Equivalently, $(x, x + \Delta x]$ is the inverse image of $(y, y + \Delta y]$.) This gives $f_Y(y) \Delta y = f_X(x) \Delta x$.
• The joint density $f_{X,Y}$ is
$$f_{X,Y}(x, y) = f_X(x) \delta(y - g(x)). \qquad (26)$$
Let the $x_k$ be the solutions for x of $g(x) = y$. Then, by integrating (26) w.r.t. x, we have (25) via the use of (3).
• When g is bijective,
$$f_Y(y) = \left| \frac{d}{dy} g^{-1}(y) \right| f_X\left(g^{-1}(y)\right).$$
• For $Y = \frac{a}{X}$, $f_Y(y) = \left| \frac{a}{y^2} \right| f_X\left(\frac{a}{y}\right)$.
• Suppose X is nonnegative. For $Y = \sqrt{X}$, $f_Y(y) = 2y f_X(y^2)$.
11.6. Given $Y = g(X)$ where $X \sim U(a, b)$. Then, to get $f_Y(y_0)$, plot g on $(a, b)$. Let $A = g^{-1}(\{y_0\})$ be the set of all points x such that $g(x) = y_0$. Suppose A can be written as a countable disjoint union $A = B \cup \bigcup_i I_i$ where B is countable and the $I_i$'s are intervals. We have
$$f_Y(y) = \frac{1}{b - a} \left( \sum_{x \in B} \frac{1}{|g'(x)|} + \sum_i \ell(I_i) \delta(y - y_0) \right)$$
at $y = y_0$ where $\ell(I)$ is the length of the interval I.
• Suppose $\Theta$ is uniform on an interval of length $2\pi$. $Y_1 = \cos\Theta$ and $Y_2 = \sin\Theta$ are both arcsine random variables with $F_{Y_i}(y) = 1 - \frac{1}{\pi} \cos^{-1} y = \frac{1}{2} + \frac{1}{\pi} \sin^{-1}(y)$ and $f_{Y_i}(y) = \frac{1}{\pi} \frac{1}{\sqrt{1 - y^2}}$ for $y \in [-1, 1]$. Note also that $E[Y_1 Y_2] = EY_i = \operatorname{Cov}[Y_1, Y_2] = 0$. Hence, $Y_1$ and $Y_2$ are uncorrelated. However, it is easy to see that $Y_1$ and $Y_2$ are not independent by considering the joint and marginal densities at $y_1 = y_2 = 0$.
Example 11.7. $Y = X^2 1[X \ge 0]$
(a) $F_Y(y) = \begin{cases} 0, & y < 0 \\ F_X(0), & y = 0 \\ F_X\left(\sqrt{y}\right), & y > 0 \end{cases}$
(b) $f_Y(y) = \begin{cases} 0, & y < 0 \\ \frac{1}{2\sqrt{y}} f_X\left(\sqrt{y}\right), & y > 0 \end{cases} \; + \; F_X(0)\, \delta(y)$
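The density formula for $Y = X^2$ in 11.4 can be cross-checked against the cdf route described at the start of this subsection. The sketch below assumes $X \sim N(0, 1)$ (an illustrative choice), evaluates $\frac{1}{2\sqrt{y}}\left(f_X(\sqrt{y}) + f_X(-\sqrt{y})\right)$, and compares it with a numerical derivative of $F_Y(y) = F_X(\sqrt{y}) - F_X(-\sqrt{y})$ and with the closed form $e^{-y/2}/\sqrt{2\pi y}$. All function names are hypothetical.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2 * pi)

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def density_of_square(y):
    """f_Y(y) for Y = X^2 via the two roots +sqrt(y) and -sqrt(y); the formula
    is not defined at y = 0, so this sketch returns 0 for y <= 0."""
    if y <= 0:
        return 0.0
    root = sqrt(y)
    return (normal_pdf(root) + normal_pdf(-root)) / (2 * root)

def cdf_of_square(y):
    """F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)) for continuous X and y >= 0."""
    if y < 0:
        return 0.0
    root = sqrt(y)
    return normal_cdf(root) - normal_cdf(-root)

def numeric_derivative(fn, y, h=1e-5):
    # central difference approximation of fn'(y)
    return (fn(y + h) - fn(y - h)) / (2 * h)

points = [0.5, 1.0, 2.5]
formula_values = [density_of_square(y) for y in points]
derivative_values = [numeric_derivative(cdf_of_square, y) for y in points]
closed_form_values = [exp(-y / 2) / sqrt(2 * pi * y) for y in points]

for row in zip(points, formula_values, derivative_values, closed_form_values):
    print(row)
```

The three columns should agree, which is consistent with $X^2$ being chi-square with one degree of freedom when $X$ is standard normal.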
Figure 20: Relationships among univariate distributions [20]
Figure 21: Another diagram demonstrating relationships among univariate distributions
11.2 MISO case
11.8. If X and Y are jointly continuous random variables with joint density $f_{X,Y}$, the following two methods give the density of $Z = g(X, Y)$.
• Condition on one of the variables, say $Y = y$. Then, being conditioned, Z is simply a function of one variable $g(X, y)$; hence, we can use the one-variable technique to find $f_{Z|Y}(z|y)$. Finally, $f_Z(z) = \int f_{Z|Y}(z|y) f_Y(y)\,dy$.
• Directly find the joint density of the random vector $\begin{pmatrix} Z \\ Y \end{pmatrix} = \begin{pmatrix} g(X, Y) \\ Y \end{pmatrix}$. Observe that the Jacobian is $\begin{pmatrix} \frac{\partial g}{\partial x} & \frac{\partial g}{\partial y} \\ 0 & 1 \end{pmatrix}$. Hence, the magnitude of the determinant is $\left| \frac{\partial g}{\partial x} \right|$.
Of course, the standard way of finding the pdf of Z is by finding the derivative of the cdf $F_Z(z) = \int_{(x,y): g(x,y) \le z} f_{X,Y}(x, y)\,d(x, y)$. This is still good for solving specific examples. It is also a good starting point for those who haven't learned conditional probability or Jacobians.
Let the $x^{(k)}$ be the solutions of $g(x, y) = z$ for fixed z and y. The first method gives
$$f_{Z|Y}(z|y) = \sum_k \left. \frac{f_{X|Y}(x|y)}{\left| \frac{\partial}{\partial x} g(x, y) \right|} \right|_{x = x^{(k)}}.$$
Hence,
$$f_{Z,Y}(z, y) = \sum_k \left. \frac{f_{X,Y}(x, y)}{\left| \frac{\partial}{\partial x} g(x, y) \right|} \right|_{x = x^{(k)}},$$
which comes out of the second method directly. Both methods then give
$$f_Z(z) = \int \sum_k \left. \frac{f_{X,Y}(x, y)}{\left| \frac{\partial}{\partial x} g(x, y) \right|} \right|_{x = x^{(k)}} dy.$$
The integration for a given z is only over the values of y such that there is at least one solution for x in $z = g(x, y)$. If there is no such solution, $f_Z(z) = 0$. The same technique works for a function of more than two random variables $Z = g(X_1, \ldots, X_n)$. For any $j \in [n]$, let the $x_j^{(k)}$ be the solutions for $x_j$ in $z = g(x_1, \ldots, x_n)$. Then,
$$f_Z(z) = \int \sum_k \left. \frac{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}{\left| \frac{\partial}{\partial x_j} g(x_1, \ldots, x_n) \right|} \right|_{x_j = x_j^{(k)}} dx_{[n] \setminus \{j\}}.$$
For the second method, we consider the random vector $(h_r(X_1, \ldots, X_n), r \in [n])$ where $h_r(X_1, \ldots, X_n) = X_r$ for $r \ne j$ and $h_j = g$. The Jacobian is of the form
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \frac{\partial g}{\partial x_1} & \frac{\partial g}{\partial x_2} & \frac{\partial g}{\partial x_j} & \frac{\partial g}{\partial x_n} \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
By swapping the row with all the partial derivatives to the first row, the magnitude of the determinant is unchanged and we also end up with an upper triangular matrix whose determinant is simply the product of the diagonal elements.
(a) For $Z = aX + bY$,
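$$f_Z(z) = \frac{1}{|a|} \int f_{X,Y}\left(\frac{z - by}{a}, y\right) dy = \frac{1}{|b|} \int f_{X,Y}\left(x, \frac{z - ax}{b}\right) dx.$$
• Note that the Jacobian is $\frac{\partial(ax + by,\, y)}{\partial(x, y)} = \begin{pmatrix} a & b \\ 0 & 1 \end{pmatrix}$.
(i) When $a = 1$, $b = -1$,
$$f_{X-Y}(z) = \int f_{X,Y}(z + y, y)\,dy = \int f_{X,Y}(x, x - z)\,dx.$$
(ii) Note that when X and Y are independent and $a = b = 1$, we have the convolution formula
$$f_Z(z) = \int f_X(z - y) f_Y(y)\,dy = \int f_X(x) f_Y(z - x)\,dx.$$
(b) For $Z = XY$,
$$f_Z(z) = \int f_{X,Y}\left(x, \frac{z}{x}\right) \frac{1}{|x|}\,dx = \int f_{X,Y}\left(\frac{z}{y}, y\right) \frac{1}{|y|}\,dy$$
[9, Ex 7.2, 7.11, 7.15]. Note that the Jacobian is $\frac{\partial(xy,\, y)}{\partial(x, y)} = \begin{pmatrix} y & x \\ 0 & 1 \end{pmatrix}$.
(c) For $Z = X^2 + Y^2$,
$$f_Z(z) = \int_{-\sqrt{z}}^{\sqrt{z}} \frac{f_{X|Y}\left(\sqrt{z - y^2}\,\middle|\,y\right) + f_{X|Y}\left(-\sqrt{z - y^2}\,\middle|\,y\right)}{2\sqrt{z - y^2}} f_Y(y)\,dy$$
[9, Ex 7.16]. Alternatively,
$$f_Z(z) = \frac{1}{2} \int_0^{2\pi} f_{X,Y}\left(\sqrt{z} \cos\theta, \sqrt{z} \sin\theta\right) d\theta, \quad z > 0 \qquad (27)$$
[15, eq (9.14) p 318].
• This can be used to show that when $X, Y \overset{\text{i.i.d.}}{\sim} N(0, 1)$, $Z = X^2 + Y^2 \sim E\left(\frac{1}{2}\right)$.
(d) For $R = \sqrt{X^2 + Y^2}$,
$$f_R(r) = r \int_0^{2\pi} f_{X,Y}(r \cos\theta, r \sin\theta)\,d\theta, \quad r > 0$$
[15, eq (9.13) p 318].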
(e) For $Z = \frac{Y}{X}$,
$$f_Z(z) = \int \left| \frac{y}{z^2} \right| f_{X,Y}\left(\frac{y}{z}, y\right) dy = \int |x|\, f_{X,Y}(x, xz)\,dx.$$
Similarly, when $Z = \frac{X}{Y}$,
$$f_Z(z) = \int |y|\, f_{X,Y}(yz, y)\,dy.$$
(f) For $Z = \frac{\min(X, Y)}{\max(X, Y)}$ where X and Y are strictly positive,
$$F_Z(z) = \int_0^{\infty} F_{Y|X}(zx|x) f_X(x)\,dx + \int_0^{\infty} F_{X|Y}(zy|y) f_Y(y)\,dy,$$
$$f_Z(z) = \int_0^{\infty} x f_{Y|X}(zx|x) f_X(x)\,dx + \int_0^{\infty} y f_{X|Y}(zy|y) f_Y(y)\,dy, \quad 0 < z < 1.$$
[9, Ex 7.17].
11.9 (Random sum). Let $S = \sum_{i=1}^{N} V_i$ where the $V_i$'s are i.i.d. $\sim V$, independent of N.
(a) $\varphi_S(u) = \varphi_N(-i \ln(\varphi_V(u)))$.
• $\varphi_S(u) = G_N(\varphi_V(u))$
• For non-negative integer-valued summands, we have $G_S(z) = G_N(G_V(z))$
(b) $ES = EN\, EV$.
(c) $\operatorname{Var}[S] = EN (\operatorname{Var} V) + (EV)^2 (\operatorname{Var} N)$.
Remark: If $N \sim P(\lambda)$, then $\varphi_S(u) = \exp(\lambda(\varphi_V(u) - 1))$, the compound Poisson distribution $CP(\lambda, \mathcal{L}(V))$. Hence, the mean and variance of $CP(\lambda, \mathcal{L}(V))$ are $\lambda EV$ and $\lambda EV^2$ respectively.
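The random-sum identities in 11.9 can be verified exactly when N and V take finitely many nonnegative integer values, because the pmf of S is then a finite mixture of convolution powers of the pmf of V. The following is a minimal sketch under that assumption; the two pmfs and all function names are made up for illustration.

```python
def convolve(a, b):
    """pmf of the sum of two independent nonnegative integer variables."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def random_sum_pmf(pmf_n, pmf_v):
    """pmf of S = V_1 + ... + V_N as sum_n P[N = n] * (n-fold convolution of pmf_v)."""
    size = (len(pmf_n) - 1) * (len(pmf_v) - 1) + 1
    result = [0.0] * size
    power = [1.0]                      # empty sum: point mass at 0
    for prob_n in pmf_n:
        for s, prob_s in enumerate(power):
            result[s] += prob_n * prob_s
        power = convolve(power, pmf_v)
    return result

def mean(pmf):
    return sum(k * p for k, p in enumerate(pmf))

def variance(pmf):
    m = mean(pmf)
    return sum((k - m) ** 2 * p for k, p in enumerate(pmf))

def pgf(pmf, z):
    return sum(p * z**k for k, p in enumerate(pmf))

pmf_n = [0.2, 0.5, 0.3]   # hypothetical pmf of N on {0, 1, 2}
pmf_v = [0.1, 0.6, 0.3]   # hypothetical pmf of V on {0, 1, 2}
pmf_s = random_sum_pmf(pmf_n, pmf_v)

mean_formula = mean(pmf_n) * mean(pmf_v)                                      # (b)
var_formula = mean(pmf_n) * variance(pmf_v) + mean(pmf_v) ** 2 * variance(pmf_n)  # (c)
z = 0.37
pgf_gap = abs(pgf(pmf_s, z) - pgf(pmf_n, pgf(pmf_v, z)))                      # G_S = G_N(G_V)

print(mean(pmf_s), mean_formula)
print(variance(pmf_s), var_formula)
print(pgf_gap)
```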
11.3 MIMO case
Definition 11.10 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function from a subset D of $\mathbb{R}^n$ to $\mathbb{R}^m$. If g is differentiable at $z \in D$, then all partial derivatives exist at z and the Jacobian matrix of g at a point $z \in D$ is
$$dg(z) = \begin{pmatrix} \frac{\partial g_1}{\partial x_1}(z) & \cdots & \frac{\partial g_1}{\partial x_n}(z) \\ \vdots & \ddots & \vdots \\ \frac{\partial g_m}{\partial x_1}(z) & \cdots & \frac{\partial g_m}{\partial x_n}(z) \end{pmatrix} = \left( \frac{\partial g}{\partial x_1}(z), \ldots, \frac{\partial g}{\partial x_n}(z) \right).$$
Alternative notations for the Jacobian matrix are J, $\frac{\partial(g_1, \ldots, g_n)}{\partial(x_1, \ldots, x_n)}$ [7, p 242], and $J_g(x)$, where it is assumed that the Jacobian matrix is evaluated at $z = x = (x_1, \ldots, x_n)$.
• Let A be an n-dimensional “box” defined by the corners x and $x + \Delta x$. The “volume” of the image $g(A)$ is $\left(\prod_i \Delta x_i\right) |\det dg(x)|$. Hence, the magnitude of the Jacobian determinant gives the ratios (scaling factor) of n-dimensional volumes (contents). In other words,
$$dy_1 \cdots dy_n = \left| \frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)} \right| dx_1 \cdots dx_n.$$
• $d(g^{-1}(y))$ is the Jacobian of the inverse transformation.
• In MATLAB, use jacobian. See also (A.14).
11.11 (Jacobian formulas). Suppose g is a vector-valued function of $x \in \mathbb{R}^n$, and X is an $\mathbb{R}^n$-valued random vector. Define $Y = g(X)$. (Then, Y is also an $\mathbb{R}^n$-valued random vector.) If X has joint density $f_X$, and g is a suitable invertible mapping (such that the inverse mapping theorem is applicable), then
$$f_Y(y) = \frac{1}{\left|\det\left(dg\left(g^{-1}(y)\right)\right)\right|} f_X\left(g^{-1}(y)\right) = \left|\det d\left(g^{-1}(y)\right)\right| f_X\left(g^{-1}(y)\right).$$
• Note that for any matrix A, $\det(A) = \det(A^T)$. Hence, the formula above could tolerate the incorrect “Jacobian”.
In general, let $X = (X_1, X_2, \ldots, X_n)$ be a random vector with pdf $f_X(x)$. Let $S = \{x : f_X(x) > 0\}$. Consider a new random vector $Y = (Y_1, Y_2, \ldots, Y_n)$, defined by $Y_i = g_i(X)$. Suppose that $A_0, A_1, \ldots, A_r$ form a partition of S with these properties. The set $A_0$, which may be empty, satisfies $P[X \in A_0] = 0$. The transformation $Y = g(X) = (g_1(X), \ldots, g_n(X))$ is a one-to-one transformation from $A_k$ onto some common set B for each $k \in [r]$. Then, for each k, the inverse function from B to $A_k$ can be found. Denote the kth inverse $x = h^{(k)}(y)$ by $x_j = h_j^{(k)}(y)$. This kth inverse gives, for $y \in B$, the unique $x \in A_k$ such that $y = g(x)$. Assuming that the Jacobians $\det(dh^{(k)}(y))$ do not vanish identically on B, we have
$$f_Y(y) = \sum_{k=1}^{r} f_X\left(h^{(k)}(y)\right) \left|\det\left(dh^{(k)}(y)\right)\right|, \quad y \in B$$
[2, p 185].
• Suppose for some k, $Y_k$ is some function of other $Y_i$. In particular, suppose $Y_k = h(y_I)$ for some index set I and some deterministic function h. Then, the kth row of the Jacobian matrix is a linear combination of other rows. In particular,
$$\frac{\partial y_k}{\partial x_j} = \sum_{i \in I} \left( \frac{\partial}{\partial y_i} h(y_I) \right) \frac{\partial y_i}{\partial x_j}.$$
Hence, the Jacobian determinant is 0.
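The Jacobian formula in 11.11 can be checked on the simplest invertible map, an affine map $g(x) = Ax + b$ on $\mathbb{R}^2$, for which $\det dg = \det A$. The sketch below assumes X has i.i.d. standard normal components (an illustrative choice) and compares $f_X(g^{-1}(y))/|\det A|$ with the bivariate normal density with mean b and covariance $AA^T$, which is the known distribution of $AX + b$ in that case. The matrix, vector, and function names are hypothetical.

```python
from math import exp, pi, sqrt

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def std_normal_2d(x1, x2):
    """Joint density of two i.i.d. N(0, 1) components."""
    return exp(-0.5 * (x1 * x1 + x2 * x2)) / (2 * pi)

def density_via_jacobian(a, b, y):
    """f_Y(y) = f_X(A^{-1}(y - b)) / |det A| for Y = A X + b."""
    a_inv = inv2(a)
    d0, d1 = y[0] - b[0], y[1] - b[1]
    x1 = a_inv[0][0] * d0 + a_inv[0][1] * d1
    x2 = a_inv[1][0] * d0 + a_inv[1][1] * d1
    return std_normal_2d(x1, x2) / abs(det2(a))

def gaussian_density(cov, mean, y):
    """Bivariate normal density with the given mean and covariance matrix."""
    c_inv = inv2(cov)
    d0, d1 = y[0] - mean[0], y[1] - mean[1]
    quad = (d0 * (c_inv[0][0] * d0 + c_inv[0][1] * d1)
            + d1 * (c_inv[1][0] * d0 + c_inv[1][1] * d1))
    return exp(-0.5 * quad) / (2 * pi * sqrt(det2(cov)))

a = [[2.0, 1.0], [0.5, 3.0]]
b = [1.0, -2.0]
# covariance of A X + b when X has identity covariance: A A^T
cov = [[a[0][0] ** 2 + a[0][1] ** 2, a[0][0] * a[1][0] + a[0][1] * a[1][1]],
       [a[0][0] * a[1][0] + a[0][1] * a[1][1], a[1][0] ** 2 + a[1][1] ** 2]]

y = [0.3, 1.1]
via_jacobian = density_via_jacobian(a, b, y)
via_gaussian = gaussian_density(cov, b, y)
print(via_jacobian, via_gaussian)
```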
11.12. Suppose $Y = g(X)$ where both X and Y have the same dimension. Then the joint density of X and Y is $f_{X,Y}(x, y) = f_X(x) \delta(y - g(x))$.
• In most cases, we can show that X and Y are not independent: pick a point $x^{(0)}$ such that $f_X\left(x^{(0)}\right) > 0$. Pick a point $y^{(1)}$ such that $y^{(1)} \ne g\left(x^{(0)}\right)$ and $f_Y\left(y^{(1)}\right) > 0$. Then, $f_{X,Y}\left(x^{(0)}, y^{(1)}\right) = 0$ but $f_X\left(x^{(0)}\right) f_Y\left(y^{(1)}\right) > 0$. Note that this technique does not always work. For example, if g is a constant function which maps all values of x to a constant c, then we won't be able to find $y^{(1)}$. Of course, this is to be expected because we know that a constant is always independent of other random variables.
Example 11.13.
(a) For $Y = AX + b$, where A is a square, invertible matrix,
$$f_Y(y) = \frac{1}{|\det A|} f_X\left(A^{-1}(y - b)\right). \qquad (28)$$
(b) Transformation between Cartesian coordinates $(x, y)$ and polar coordinates $(r, \theta)$
• $x = r \cos\theta$, $y = r \sin\theta$, $r = \sqrt{x^2 + y^2}$, $\theta = \tan^{-1}\left(\frac{y}{x}\right)$.
• $\left| \det \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{pmatrix} \right| = \left| \det \begin{pmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{pmatrix} \right| = r$. (Recall that $dx\,dy = r\,dr\,d\theta$.)
We have
$$f_{R,\Theta}(r, \theta) = r f_{X,Y}(r \cos\theta, r \sin\theta), \quad r \ge 0 \text{ and } \theta \in (-\pi, \pi),$$
and
$$f_{X,Y}(x, y) = \frac{1}{\sqrt{x^2 + y^2}} f_{R,\Theta}\left(\sqrt{x^2 + y^2}, \tan^{-1}\left(\frac{y}{x}\right)\right).$$
If, furthermore, $\Theta$ is uniform on $(0, 2\pi)$ and independent of R, then
$$f_{X,Y}(x, y) = \frac{1}{2\pi} \frac{1}{\sqrt{x^2 + y^2}} f_R\left(\sqrt{x^2 + y^2}\right).$$
(c) A related transformation is given by $Z = X^2 + Y^2$ and $\Theta = \tan^{-1}\left(\frac{Y}{X}\right)$. In this case, $X = \sqrt{Z} \cos\Theta$, $Y = \sqrt{Z} \sin\Theta$, and
$$f_{Z,\Theta}(z, \theta) = \frac{1}{2} f_{X,Y}\left(\sqrt{z} \cos\theta, \sqrt{z} \sin\theta\right),$$
which gives (27).
11.14. Suppose X, Y are i.i.d. $N(0, \sigma^2)$. Then, $R = \sqrt{X^2 + Y^2}$ and $\Theta = \arctan\frac{Y}{X}$ are independent with R being Rayleigh, $f_R(r) = \frac{r}{\sigma^2} e^{-\frac{1}{2}\left(\frac{r}{\sigma}\right)^2} U(r)$, and $\Theta$ being uniform on $[0, 2\pi]$.
11.15 (Generation of a random sample of a normally distributed random variable). Let $U_1, U_2$ be i.i.d. $U(0, 1)$. Then, the random variables
$$X_1 = \sqrt{-2 \ln U_1} \cos(2\pi U_2)$$
$$X_2 = \sqrt{-2 \ln U_1} \sin(2\pi U_2)$$
are i.i.d. $N(0, 1)$. Moreover,
$$Z_1 = \sqrt{-2\sigma^2 \ln U_1} \cos(2\pi U_2)$$
$$Z_2 = \sqrt{-2\sigma^2 \ln U_1} \sin(2\pi U_2)$$
are i.i.d. $N(0, \sigma^2)$.
• The idea is to generate R and $\Theta$ according to (11.14) first.
• $\det(dx(u)) = -\frac{2\pi}{u_1}$, $u_1 = e^{-\frac{x_1^2 + x_2^2}{2}}$.
11.16. In (11.11), suppose $\dim(Y) = \dim(g) \le \dim(X)$. To find the joint pdf of Y, we introduce an “arbitrary” $Z = h(X)$ so that $\dim\begin{pmatrix} Y \\ Z \end{pmatrix} = \dim(X)$.
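The construction in 11.15 translates directly into code. The sketch below is an illustration using Python's standard `random` module; the seed, the sample size, and the use of `1 - rng.random()` (so that the logarithm never receives 0) are assumptions of this example, not part of the notes.

```python
import math
import random

def box_muller(rng, sigma=1.0):
    """Return one pair (Z1, Z2) built from two uniforms as in 11.15."""
    u1 = 1.0 - rng.random()        # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * sigma**2 * math.log(u1))   # Rayleigh-distributed R
    angle = 2.0 * math.pi * u2                           # uniform angle Theta
    return radius * math.cos(angle), radius * math.sin(angle)

def sample_mean(values):
    return sum(values) / len(values)

def sample_var(values):
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(2008)          # fixed seed so the run is reproducible
pairs = [box_muller(rng) for _ in range(100_000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

mean_x, mean_y = sample_mean(xs), sample_mean(ys)
var_x, var_y = sample_var(xs), sample_var(ys)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (len(xs) - 1)

print(mean_x, var_x, mean_y, var_y, cov_xy)
```

With this many samples the sample means and the sample covariance should be close to 0 and the sample variances close to 1. A near-zero covariance is only a necessary condition for independence, so this is a sanity check rather than a proof.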
11.4 Order Statistics
Given a sample of n random variables $X_1, \ldots, X_n$, reorder them so that $Y_1 \le Y_2 \le \cdots \le Y_n$. Then, $Y_i$ is called the ith order statistic, sometimes also denoted $X_{i:n}$, $X^{\langle i \rangle}$, $X_{(i)}$, $X_{n:i}$, $X_{i,n}$, or $X_{(i)n}$. In particular
• $Y_1 = X_{1:n} = X_{\min}$ is the first order statistic denoting the smallest of the $X_i$'s,
• $Y_2 = X_{2:n}$ is the second order statistic denoting the second smallest of the $X_i$'s, ..., and
• $Y_n = X_{n:n} = X_{\max}$ is the nth order statistic denoting the largest of the $X_i$'s.
In words, the order statistics of a random sample are the sample values placed in ascending order [2, p 226]. Many results in this section can be found in [4].
11.17. Events properties:
$$[X_{\min} \ge y] = \bigcap_i [X_i \ge y], \qquad [X_{\max} \ge y] = \bigcup_i [X_i \ge y],$$
$$[X_{\min} > y] = \bigcap_i [X_i > y], \qquad [X_{\max} > y] = \bigcup_i [X_i > y],$$
$$[X_{\min} \le y] = \bigcup_i [X_i \le y], \qquad [X_{\max} \le y] = \bigcap_i [X_i \le y],$$
$$[X_{\min} < y] = \bigcup_i [X_i < y], \qquad [X_{\max} < y] = \bigcap_i [X_i < y].$$
Let $A_y = [X_{\max} \le y]$, $B_y = [X_{\min} > y]$. Then, $A_y = [\forall i\ X_i \le y]$ and $B_y = [\forall i\ X_i > y]$.
11.18 (Densities). Suppose the $X_i$ are absolutely continuous with joint density $f_X$. Let $S_y$ be the set of all n! vectors which come from permuting the coordinates of y. Then
$$f_Y(y) = \sum_{x \in S_y} f_X(x), \quad y_1 \le y_2 \le \cdots \le y_n. \qquad (29)$$
To see this, note that $f_Y(y) \prod_j \Delta y_j$ is the probability that each $Y_j$ is in the small interval of length $\Delta y_j$ around $y_j$. This probability can be calculated by finding the probability that all $X_k$ fall into the above small regions. From the joint density, we can find the joint pdf/cdf of $Y_I$ for any $I \subset [n]$. However, in many cases, we can directly reapply the above technique to find the joint pdf of $Y_I$. This is especially useful when the $X_i$ are independent or i.i.d.
(a) The marginal density $f_{Y_k}$ can be found by approximating $f_{Y_k}(y) \Delta y$ with
$$\sum_{j=1}^{n} \sum_{I \in \binom{[n] \setminus \{j\}}{k-1}} P\left[X_j \in [y, y + \Delta y) \text{ and } \forall i \in I,\ X_i \le y \text{ and } \forall r \in (I \cup \{j\})^c,\ X_r > y\right],$$
where for any set A and integer $\ell \le |A|$, we define $\binom{A}{\ell}$ to be the set of all $\ell$-element subsets of A. Note also that we assume $(I \cup \{j\})^c = [n] \setminus (I \cup \{j\})$. To see this, we first choose the $X_j$ that will be $Y_k$ with value around y. Then, we must have $k - 1$ of the $X_i$ below y and have the rest of the $X_i > y$.
(b) For integers $r < s$, the joint density $f_{Y_r, Y_s}(y_r, y_s) \Delta y_r \Delta y_s$ can be approximated by the probability that two of the $X_i$ are inside small regions around $y_r$ and $y_s$. To make them $Y_r$ and $Y_s$, for the other $X_i$, $r - 1$ of them must be before $y_r$, $s - r - 1$ of them between $y_r$ and $y_s$, and $n - s$ of them beyond $y_s$.
• $f_{X_{\max}, X_{\min}}(u, v) \Delta u \Delta v$ can be approximated by
$$\sum_{(j,k) \in S} P\left[X_j \in [u, u + \Delta u),\ X_k \in [v, v + \Delta v), \text{ and } \forall i \in [n] \setminus \{j, k\},\ v < X_i \le u\right], \quad v \le u,$$
where S is the set of all $n(n-1)$ pairs $(j, k)$ from $[n] \times [n]$ with $j \ne k$. This is simply choosing the j, k so that $X_j$ will be the maximum with value around u, and $X_k$ will be the minimum with value around v. Of course, the rest of the $X_i$ have to be between the min and max.
◦ When $n = 2$, we can use (29) to get
$$f_{X_{\max}, X_{\min}}(u, v) = f_{X_1, X_2}(u, v) + f_{X_1, X_2}(v, u), \quad v \le u.$$
Note that the joint density at point $y_I$ is 0 if the elements in $y_I$ are not arranged in the “right” order.
11.19 (Distribution functions). We note again that the cdf may be obtained by integration of the densities in (11.18) as well as by direct arguments valid also in the discrete case.
(a) The marginal cdf is
$$F_{Y_k}(y) = \sum_{j=k}^{n} \sum_{I \in \binom{[n]}{j}} P\left[\forall i \in I,\ X_i \le y \text{ and } \forall r \in [n] \setminus I,\ X_r > y\right].$$
This is because the event $[Y_k \le y]$ is the same as the event that at least k of the $X_i$ are $\le y$. In other words, let $N(a) = \sum_{i=1}^{n} 1[X_i \le a]$ be the number of $X_i$ which are $\le a$. Then
$$[Y_k \le y] = [N(y) \ge k] = \bigcup_{j \ge k} [N(y) = j], \qquad (30)$$
where the union is a disjoint union. Hence, we sum the probability that exactly j of the $X_i$ are $\le y$ for $j \ge k$. Alternatively, note that the event $[Y_k \le y]$ can also be expressed as a disjoint union
$$\bigcup_{j \ge k} \left[X_j \le y \text{ and exactly } k - 1 \text{ of the } X_1, \ldots, X_{j-1} \text{ are } \le y\right].$$
This gives
$$F_{Y_k}(y) = \sum_{j=k}^{n} \sum_{I \in \binom{[j-1]}{k-1}} P\left[X_j \le y,\ \forall i \in I,\ X_i \le y, \text{ and } \forall r \in [j-1] \setminus I,\ X_r > y\right].$$
(b) For $r < s$, because $Y_r \le Y_s$, we have
$$[Y_r \le y_r] \cap [Y_s \le y_s] = [Y_s \le y_s], \quad y_s \le y_r.$$
By (30), for $y_r < y_s$,
$$[Y_r \le y_r] \cap [Y_s \le y_s] = \left( \bigcup_{j=r}^{n} [N(y_r) = j] \right) \cap \left( \bigcup_{m=s}^{n} [N(y_s) = m] \right) = \bigcup_{m=s}^{n} \bigcup_{j=r}^{m} [N(y_r) = j \text{ and } N(y_s) = m],$$
where the upper limit of the second union is changed from n to m because we must have $N(y_r) \le N(y_s)$. Now, to have $N(y_r) = j$ and $N(y_s) = m$ for $m > j$ is to put j of the $X_i$ in $(-\infty, y_r]$, $m - j$ of the $X_i$ in $(y_r, y_s]$, and $n - m$ of the $X_i$ in $(y_s, \infty)$.
(c) Alternatively, for $X_{\max}, X_{\min}$, we have
$$F_{X_{\max}, X_{\min}}(u, v) = P(A_u \cap B_v^c) = P(A_u) - P(A_u \cap B_v) = P[\forall i\ X_i \le u] - P[\forall i\ v < X_i \le u]$$
where the second term is 0 when $u < v$. So,
$$F_{X_{\max}, X_{\min}}(u, v) = F_{X_1, \ldots, X_n}(u, \ldots, u)$$
when $u < v$. When $v \le u$, the second term can be found by (21) which gives
$$F_{X_{\max}, X_{\min}}(u, v) = F_{X_1, \ldots, X_n}(u, \ldots, u) - \sum_{w \in S} (-1)^{|\{i : w_i = v\}|} F_{X_1, \ldots, X_n}(w) = \sum_{w \in S \setminus \{(u, \ldots, u)\}} (-1)^{|\{i : w_i = v\}| + 1} F_{X_1, \ldots, X_n}(w),$$
where $S = \{u, v\}^n$ is the set of all $2^n$ vertices w of the “box” $\times_{i \in [n]} (a_i, b_i]$ with $a_i = v$ and $b_i = u$. The joint density is 0 for $u < v$.
11.20. For independent $X_i$'s,
(a) $f_Y(y) = \sum_{x \in S_y} \prod_{i=1}^{n} f_{X_i}(x_i)$
(b) Two forms of marginal cdf:
$$F_{Y_k}(y) = \sum_{j=k}^{n} \sum_{I \in \binom{[n]}{j}} \left( \prod_{i \in I} F_{X_i}(y) \right) \left( \prod_{r \in [n] \setminus I} (1 - F_{X_r}(y)) \right) = \sum_{j=k}^{n} F_{X_j}(y) \sum_{I \in \binom{[j-1]}{k-1}} \left( \prod_{i \in I} F_{X_i}(y) \right) \left( \prod_{r \in [j-1] \setminus I} (1 - F_{X_r}(y)) \right)$$
(c) $F_{X_{\max}, X_{\min}}(u, v) = \prod_{k \in [n]} F_{X_k}(u) - \begin{cases} 0, & u \le v \\ \prod_{k \in [n]} (F_{X_k}(u) - F_{X_k}(v)), & v < u \end{cases}$
(d) The marginal cdf is
$$F_{Y_k}(y) = \sum_{j=k}^{n} \sum_{I \in \binom{[n]}{j}} \left( \prod_{i \in I} F_{X_i}(y) \right) \left( \prod_{r \in [n] \setminus I} (1 - F_{X_r}(y)) \right).$$
(e) $F_{X_{\min}}(v) = 1 - \prod_i (1 - F_{X_i}(v))$.
(f) $F_{X_{\max}}(u) = \prod_i F_{X_i}(u)$.
11.21. Suppose $X_i \overset{\text{i.i.d.}}{\sim} X$ with common density f and distribution function F.
(a) The joint density is given by
$$f_Y(y) = n!\, f(y_1) f(y_2) \cdots f(y_n), \quad y_1 \le y_2 \le \cdots \le y_n.$$
If we define $y_{n_0} = -\infty$, $y_{n_{k+1}} = \infty$, $n_0 = 0$, $n_{k+1} = n + 1$, then for $k \in [n]$ and $1 \le n_1 < \cdots < n_k \le n$, the joint density $f_{Y_{n_1}, Y_{n_2}, \ldots, Y_{n_k}}(y_{n_1}, y_{n_2}, \ldots, y_{n_k})$ is given by
$$n! \left( \prod_{j=1}^{k} f(y_{n_j}) \right) \prod_{j=0}^{k} \frac{\left(F(y_{n_{j+1}}) - F(y_{n_j})\right)^{n_{j+1} - n_j - 1}}{(n_{j+1} - n_j - 1)!}.$$
In particular, for $r < s$, the joint density $f_{Y_r, Y_s}(y_r, y_s)$ is given by
$$\frac{n!}{(r-1)!(s-r-1)!(n-s)!} f(y_r) f(y_s) F^{r-1}(y_r) \left(F(y_s) - F(y_r)\right)^{s-r-1} \left(1 - F(y_s)\right)^{n-s}$$
[2, Theorem 5.4.6 p 230].
(b) The joint cdf $F_{Y_r, Y_s}(y_r, y_s)$ is given by
$$\begin{cases} F_{Y_s}(y_s), & y_s \le y_r, \\ \sum_{m=s}^{n} \sum_{j=r}^{m} \frac{n!}{j!(m-j)!(n-m)!} (F(y_r))^j (F(y_s) - F(y_r))^{m-j} (1 - F(y_s))^{n-m}, & y_r < y_s. \end{cases}$$
(c) $F_{X_{\max}, X_{\min}}(u, v) = (F(u))^n - \begin{cases} 0, & u \le v \\ (F(u) - F(v))^n, & v < u \end{cases}$
(d) $f_{X_{\max}, X_{\min}}(u, v) = \begin{cases} 0, & u \le v \\ n(n-1) f(u) f(v) (F(u) - F(v))^{n-2}, & v < u \end{cases}$
(e) Marginal cdf:
$$F_{Y_k}(y) = \sum_{j=k}^{n} \binom{n}{j} (F(y))^j (1 - F(y))^{n-j} = \sum_{j=k}^{n} \binom{j-1}{k-1} (F(y))^k (1 - F(y))^{j-k}$$
$$= (F(y))^k \sum_{m=0}^{n-k} \binom{k+m-1}{k-1} (1 - F(y))^m = \frac{n!}{(k-1)!(n-k)!} \int_0^{F(y)} t^{k-1} (1 - t)^{n-k}\,dt.$$
Note that $N(y) \sim B(n, F(y))$. The last equality comes from integrating the marginal density $f_{Y_k}$ in (31) with the change of variable $t = F(y)$.
(i) $F_{X_{\max}}(y) = (F(y))^n$ and $f_{X_{\max}}(y) = n (F(y))^{n-1} f_X(y)$.
(ii) $F_{X_{\min}}(y) = 1 - (1 - F(y))^n$ and $f_{X_{\min}}(y) = n (1 - F(y))^{n-1} f_X(y)$.
(f) Marginal density:
$$f_{Y_k}(y) = \frac{n!}{(k-1)!(n-k)!} (F(y))^{k-1} (1 - F(y))^{n-k} f_X(y) \qquad (31)$$
[2, Theorem 5.4.4 p 229]. Consider a small neighborhood $\Delta_y$ around y. To have $Y_k \in \Delta_y$, we must have exactly one of the $X_i$'s in $\Delta_y$, exactly $k - 1$ of them less than y, and exactly $n - k$ of them greater than y. There are $n \binom{n-1}{k-1} = \frac{n!}{(k-1)!(n-k)!} = \frac{1}{B(k, n-k+1)}$ possible setups.
(g) The range R is defined as $R = X_{\max} - X_{\min}$.
(i) For $x > 0$, $f_R(x) = n(n-1) \int (F(u) - F(u - x))^{n-2} f(u - x) f(u)\,du$.
(ii) For $x \ge 0$, $F_R(x) = n \int (F(u) - F(u - x))^{n-1} f(u)\,du$.
Both the pdf and cdf above are derived by first finding the distribution of the range conditioned on the value of $X_{\max} = u$.
See also [4, Sec. 2.2] and [2, Sec. 5.4].
11.22. Let $X_1, X_2, \ldots, X_n$ be a random sample from a discrete distribution with pmf $p_X(x_i) = p_i$, where $x_1 < x_2 < \ldots$ are the possible values of X in ascending order. Define $P_0 = 0$ and $P_i = \sum_{k=1}^{i} p_k$. Then
$$P[Y_j \le x_i] = \sum_{k=j}^{n} \binom{n}{k} P_i^k (1 - P_i)^{n-k}$$
and
$$P[Y_j = x_i] = \sum_{k=j}^{n} \binom{n}{k} \left( P_i^k (1 - P_i)^{n-k} - P_{i-1}^k (1 - P_{i-1})^{n-k} \right).$$
Example 11.23. If $U_1, U_2, \ldots, U_k$ are independently uniformly distributed on the interval 0 to $t_0$, then they have joint pdf
$$f_{U_1^k}\left(u_1^k\right) = \begin{cases} \frac{1}{t_0^k}, & 0 \le u_i \le t_0 \\ 0, & \text{otherwise} \end{cases}$$
The order statistics $\tau_1, \tau_2, \ldots, \tau_k$ corresponding to $U_1, U_2, \ldots, U_k$ have joint pdf
$$f_{\tau_1^k}\left(t_1^k\right) = \begin{cases} \frac{k!}{t_0^k}, & 0 \le t_1 \le t_2 \le \cdots \le t_k \le t_0 \\ 0, & \text{otherwise} \end{cases}$$
Example 11.24 (n = 2). Suppose $U = \max(X, Y)$ and $V = \min(X, Y)$ where X and Y have joint cdf $F_{X,Y}$. Then
$$F_{U,V}(u, v) = \begin{cases} F_{X,Y}(u, u), & u \le v, \\ F_{X,Y}(v, u) + F_{X,Y}(u, v) - F_{X,Y}(v, v), & u > v, \end{cases}$$
$$F_U(u) = F_{X,Y}(u, u),$$
$$F_V(v) = F_X(v) + F_Y(v) - F_{X,Y}(v, v)$$
[9, Ex 7.5, 7.6]. The joint density is
$$f_{U,V}(u, v) = f_{X,Y}(u, v) + f_{X,Y}(v, u), \quad v < u.$$
The marginal densities are given by
$$f_U(u) = \left. \frac{\partial}{\partial x} F_{X,Y}(x, y) \right|_{x=u,\, y=u} + \left. \frac{\partial}{\partial y} F_{X,Y}(x, y) \right|_{x=u,\, y=u} = \int_{-\infty}^{u} f_{X,Y}(x, u)\,dx + \int_{-\infty}^{u} f_{X,Y}(u, y)\,dy,$$
$$f_V(v) = f_X(v) + f_Y(v) - f_U(v)$$
$$= f_X(v) + f_Y(v) - \left( \left. \frac{\partial}{\partial x} F_{X,Y}(x, y) \right|_{x=v,\, y=v} + \left. \frac{\partial}{\partial y} F_{X,Y}(x, y) \right|_{x=v,\, y=v} \right)$$
$$= f_X(v) + f_Y(v) - \left( \int_{-\infty}^{v} f_{X,Y}(x, v)\,dx + \int_{-\infty}^{v} f_{X,Y}(v, y)\,dy \right)$$
$$= \int_{v}^{\infty} f_{X,Y}(v, y)\,dy + \int_{v}^{\infty} f_{X,Y}(x, v)\,dx.$$
If, furthermore, X and Y are independent, then
$$F_{U,V}(u, v) = \begin{cases} F_X(u) F_Y(u), & u \le v \\ F_X(v) F_Y(u) + F_X(u) F_Y(v) - F_X(v) F_Y(v), & u > v, \end{cases}$$
$$F_U(u) = F_X(u) F_Y(u),$$
$$F_V(v) = F_X(v) + F_Y(v) - F_X(v) F_Y(v),$$
$$f_U(u) = f_X(u) F_Y(u) + F_X(u) f_Y(u),$$
$$f_V(v) = f_X(v) + f_Y(v) - f_X(v) F_Y(v) - F_X(v) f_Y(v).$$
If, furthermore, X and Y are i.i.d., then
$$F_U(u) = F^2(u), \quad F_V(v) = 2F(v) - F^2(v), \quad f_U(u) = 2 f(u) F(u), \quad f_V(v) = 2 f(v) (1 - F(v)).$$
11.25. Let the $X_i$ be i.i.d. with density f and cdf F. The range R is defined as $R = X_{\max} - X_{\min}$. Then
$$F_R(x) = \begin{cases} 0, & x < 0 \\ n \int (F(v) - F(v - x))^{n-1} f(v)\,dv, & x \ge 0 \end{cases}$$
$$f_R(x) = \begin{cases} 0, & x < 0 \\ n(n-1) \int (F(v) - F(v - x))^{n-2} f(v - x) f(v)\,dv, & x > 0. \end{cases}$$
For example, when $X_i \overset{\text{i.i.d.}}{\sim} U(0, 1)$,
$$f_R(x) = n(n-1) x^{n-2} (1 - x), \quad 0 \le x \le 1$$
[15, Ex 9F p 322–323].
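The equivalent expressions for the marginal cdf of the kth order statistic in 11.21(e) can be compared numerically. The sketch below takes $p = F(y)$ as an input, so it applies to any i.i.d. sample; with $U(0, 1)$ samples, $p = y$. The function names and the midpoint-rule integration are illustrative assumptions.

```python
from math import comb, factorial

def cdf_binomial_form(k, n, p):
    """P[at least k of the n samples are <= y], where p = F(y)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def cdf_negative_binomial_form(k, n, p):
    """Sum over the index j of the sample that supplies the k-th value <= y."""
    return sum(comb(j - 1, k - 1) * p**k * (1 - p) ** (j - k) for j in range(k, n + 1))

def cdf_integral_form(k, n, p, steps=20_000):
    """n!/((k-1)!(n-k)!) times the integral of t^(k-1) (1-t)^(n-k) over [0, p],
    approximated with the midpoint rule."""
    const = factorial(n) / (factorial(k - 1) * factorial(n - k))
    h = p / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t ** (k - 1) * (1 - t) ** (n - k)
    return const * total * h

n, k, p = 5, 2, 0.4
binomial_value = cdf_binomial_form(k, n, p)
negative_binomial_value = cdf_negative_binomial_form(k, n, p)
integral_value = cdf_integral_form(k, n, p)

print(binomial_value, negative_binomial_value, integral_value)
```

For $n = 5$, $k = 2$, $p = 0.4$ the first form can also be evaluated by hand as $1 - 0.6^5 - 5(0.4)(0.6)^4 = 0.66304$.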
12 Convergences
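Definition 12.1. A sequence of random variables $(X_n)$ converges pointwise to X if $\forall \omega \in \Omega$, $\lim_{n \to \infty} X_n(\omega) = X(\omega)$.
Definition 12.2 (Strong Convergence). The following statements are all equivalent conditions/notations for a sequence of random variables $(X_n)$ to converge almost surely to a random variable X.
(a) $X_n \xrightarrow{a.s.} X$
(i) $X_n \to X$ a.s.
(ii) $X_n \to X$ with probability 1.
(iii) $X_n \to X$ w.p. 1
(iv) $\lim_{n \to \infty} X_n = X$ a.s.
(b) $(X_n - X) \xrightarrow{a.s.} 0$
(c) $P[X_n \to X] = 1$
(i) $P\left[\left\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right\}\right] = 1$
(ii) $P\left[\left\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right\}^c\right] = 0$
(iii) $P[X_n \nrightarrow X] = 0$
(iv) $P\left[\left\{\omega : \lim_{n \to \infty} |X_n(\omega) - X(\omega)| = 0\right\}\right] = 1$
(d) $\forall \varepsilon > 0$, $P\left[\left\{\omega : \lim_{n \to \infty} |X_n(\omega) - X(\omega)| < \varepsilon\right\}\right] = 1$
12.3. Properties of convergence a.s.
(a) Uniqueness: if $X_n \xrightarrow{a.s.} X$ and $X_n \xrightarrow{a.s.} Y$, then $X = Y$ a.s.
(b) If $X_n \xrightarrow{a.s.} X$ and $Y_n \xrightarrow{a.s.} Y$, then
(i) $X_n + Y_n \xrightarrow{a.s.} X + Y$
(ii) $X_n Y_n \xrightarrow{a.s.} XY$
(c) g continuous, $X_n \xrightarrow{a.s.} X \Rightarrow g(X_n) \xrightarrow{a.s.} g(X)$
(d) Suppose that $\forall \varepsilon > 0$, $\sum_{n=1}^{\infty} P[|X_n - X| > \varepsilon] < \infty$. Then $X_n \xrightarrow{a.s.} X$.
(e) Let $A_1, A_2, \ldots$ be independent. Then, $1_{A_n} \xrightarrow{a.s.} 0$ if and only if $\sum_n P(A_n) < \infty$.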
Definition 12.4 (Convergence in probability). The following statements are all equivalent conditions/notations for a sequence of random variables $(X_n)$ to converge in probability to a random variable $X$:
(a) $X_n \xrightarrow{P} X$
(i) $X_n \to_P X$
(ii) $\operatorname{p\,lim}_{n\to\infty} X_n = X$
(b) $(X_n - X) \xrightarrow{P} 0$
(c) $\forall \varepsilon > 0$, $\lim_{n\to\infty} P[|X_n - X| < \varepsilon] = 1$
(i) $\forall \varepsilon > 0$, $\lim_{n\to\infty} P(\{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}) = 0$
(ii) $\forall \varepsilon > 0$ $\forall \delta > 0$ $\exists N_\delta \in \mathbb{N}$ such that $\forall n \ge N_\delta$, $P[|X_n - X| > \varepsilon] < \delta$
(iii) $\forall \varepsilon > 0$, $\lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0$
(iv) The strict inequality between $|X_n - X|$ and $\varepsilon$ can be replaced by the corresponding “non-strict” version.
12.5. Properties of convergence in probability
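(a) Uniqueness: If $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{P} Y$, then $X = Y$ a.s.
(b) Suppose $X_n \xrightarrow{P} X$, $Y_n \xrightarrow{P} Y$, and $a_n \to a$; then
(i) $X_n + Y_n \xrightarrow{P} X + Y$
(ii) $a_n X_n \xrightarrow{P} aX$
(iii) $X_n Y_n \xrightarrow{P} XY$
(c) Suppose $(X_n)$ i.i.d. with distribution $U[0, \theta]$. Let $Z_n = \max\{X_i : 1 \le i \le n\}$. Then, $Z_n \xrightarrow{P} \theta$.
(d) $g$ continuous, $X_n \xrightarrow{P} X \Rightarrow g(X_n) \xrightarrow{P} g(X)$
(i) Suppose that $g : \mathbb{R}^d \to \mathbb{R}$ is continuous. Then, $\forall i$ $X_{i,n} \xrightarrow[n\to\infty]{P} X_i$ implies $g(X_{1,n}, \ldots, X_{d,n}) \xrightarrow[n\to\infty]{P} g(X_1, \ldots, X_d)$.
(e) Let $g$ be a continuous function at $c$. Then, $X_n \xrightarrow{P} c \Rightarrow g(X_n) \xrightarrow{P} g(c)$.
(f) Fatou’s lemma: $0 \le X_n \xrightarrow{P} X \Rightarrow \liminf_{n\to\infty} EX_n \ge EX$.
(g) Suppose $X_n \xrightarrow{P} X$ and $|X_n| \le Y$ with $EY < \infty$; then $EX_n \to EX$.
(h) Let $A_1, A_2, \ldots$ be independent. Then, $1_{A_n} \xrightarrow{P} 0$ iff $P(A_n) \to 0$.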
(i) $X_n \xrightarrow{P} 0$ iff $\exists \delta > 0$ such that $\forall t \in [-\delta, \delta]$ we have $\varphi_{X_n}(t) \to 1$.
Definition 12.6 (Weak convergence for probability measures). Let $P_n$ and $P$ be probability measures on $\mathbb{R}^d$ ($d \ge 1$). The sequence $P_n$ converges weakly to $P$ if the sequence of real numbers $\int g\,dP_n \to \int g\,dP$ for any $g$ which is real-valued, continuous, and bounded on $\mathbb{R}^d$.
12.7. Let $(X_n)$, $X$ be $\mathbb{R}^d$-valued random variables with distribution functions $(F_n)$, $F$, distributions $(\mu_n)$, $\mu$, and ch.f. $(\varphi_n)$, $\varphi$ respectively. The following are equivalent conditions for a sequence of random variables $(X_n)$ to converge in distribution to a random variable $X$:
(a) $(X_n)$ converges in distribution (or in law) to $X$
(i) $X_n \Rightarrow X$
i. $X_n \xrightarrow{L} X$
ii. $X_n \xrightarrow{D} X$
(ii) $F_n \Rightarrow F$
i. $F_{X_n} \Rightarrow F_X$
ii. $F_n$ converges weakly to $F$
iii. $\lim_{n\to\infty} P^{X_n}(A) = P^X(A)$ for every $A$ of the form $A = (-\infty, x]$ for which $P^X\{x\} = 0$
(iii) $\mu_n \Rightarrow \mu$
i. $P^{X_n}$ converges weakly to $P^X$
(b) Skorohod’s theorem: $\exists$ random variables $Y_n$ and $Y$ on a common probability space $(\Omega, \mathcal{F}, P)$ such that $Y_n \overset{D}{=} X_n$, $Y \overset{D}{=} X$, and $Y_n \to Y$ on (the whole) $\Omega$
(c) $\lim_{n\to\infty} F_n = F$ for all continuity points of $F$
(i) $F_{X_n}(x) \to F_X(x)$ $\forall x$ such that $P[X = x] = 0$
(d) $\exists$ a (countable) set $D$ dense in $\mathbb{R}$ such that $F_n(x) \to F(x)$ $\forall x \in D$
(e) Continuous Mapping theorem: $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all $g$ which is real-valued, continuous, and bounded on $\mathbb{R}^d$
(i) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all bounded real-valued functions $g$ such that $P[X \in D_g] = 0$, where $D_g$ is the set of points of discontinuity of $g$
(ii) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all bounded Lipschitz continuous functions $g$
(iii) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all bounded uniformly continuous functions $g$
(iv) $\lim_{n\to\infty} Eg(X_n) = Eg(X)$ for all complex-valued functions $g$ whose real and imaginary parts are bounded and continuous
(f) Continuity Theorem: $\varphi_{X_n} \to \varphi_X$
(i) For nonnegative random variables: $\forall s \ge 0$, $L_{X_n}(s) \to L_X(s)$ where $L_X(s) = Ee^{-sX}$
Note that there is no requirement that $(X_n)$ and $X$ be defined on the same probability space $(\Omega, \mathcal{A}, P)$.
12.8. Continuity Theorem: Suppose $\lim_{n\to\infty} \varphi_n(t)$ exists $\forall t$; call this limit $\varphi_\infty$. Furthermore, suppose $\varphi_\infty$ is continuous at 0. Then there exists a probability distribution $\mu_\infty$ such that $\mu_n \Rightarrow \mu_\infty$ and $\varphi_\infty$ is the characteristic function of $\mu_\infty$.
12.9. Properties of convergence in distribution
(a) If $F_n \Rightarrow F$ and $F_n \Rightarrow G$, then $F = G$
(b) Suppose $X_n \Rightarrow X$
(i) If $P[X \text{ is a discontinuity point of } g] = 0$, then $g(X_n) \Rightarrow g(X)$
(ii) $Eg(X_n) \to Eg(X)$ for every bounded real-valued function $g$ such that $P[X \in D_g] = 0$, where $D_g$ is the set of points of discontinuity of $g$
i. $g(X_n) \Rightarrow g(X)$ for $g$ continuous.
(iii) If $Y_n \xrightarrow{P} 0$, then $X_n + Y_n \Rightarrow X$
(iv) If $X_n - Y_n \Rightarrow 0$, then $Y_n \Rightarrow X$
(c) If $X_n \Rightarrow a$ and $g$ is continuous at $a$, then $g(X_n) \Rightarrow g(a)$
(d) Suppose $(\mu_n)$ is a sequence of probability measures on $\mathbb{R}$ that are all point masses with $\mu_n(\{\alpha_n\}) = 1$. Then, $\mu_n$ converges weakly to a limit $\mu$ iff $\alpha_n \to \alpha$; and in this case $\mu$ is a point mass at $\alpha$
(e) Scheffé’s theorem:
(i) Suppose $P^{X_n}$ and $P^X$ have densities $\delta_n$ and $\delta$ w.r.t. the same measure $\mu$. Then, $\delta_n \to \delta$ $\mu$-a.e. implies
i. $\forall B \in \mathcal{B}_{\mathbb{R}}$, $P^{X_n}(B) \to P^X(B)$
• $F_{X_n} \to F_X$
• $F_{X_n} \Rightarrow F_X$
ii. Suppose $g$ is bounded. Then, $\int g(x)\,P^{X_n}(dx) \to \int g(x)\,P^X(dx)$. Equivalently, $E[g(X_n)] \to E[g(X)]$ where each $E$ is defined with respect to the appropriate $P$.
(ii) Remarks:
i. For absolutely continuous random variables, $\mu$ is the Lebesgue measure and $\delta$ is the probability density function.
ii. For discrete random variables, $\mu$ is the counting measure and $\delta$ is the probability mass function.
(f) Normal r.v.
(i) Let $X_n \sim N(\mu_n, \sigma_n^2)$. Suppose $\mu_n \to \mu \in \mathbb{R}$ and $\sigma_n^2 \to \sigma^2 \ge 0$. Then, $X_n \Rightarrow N(\mu, \sigma^2)$
(ii) Suppose that $X_n$ are normal random variables and let $X_n \Rightarrow X$. Then, (1) the mean and variance of $X_n$ converge to some limits $m$ and $\sigma^2$; (2) $X$ is normal with mean $m$ and variance $\sigma^2$
(g) $X_n \Rightarrow 0$ if and only if $\exists \delta > 0$ such that $\forall t \in [-\delta, \delta]$ we have $\varphi_{X_n}(t) \to 1$
12.10. Relationship between convergences
Figure 22: Relationship between convergences
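(a) For discrete probability space, $X_n \xrightarrow{\text{a.s.}} X$ if and only if $X_n \xrightarrow{P} X$
(b) Suppose $X_n \xrightarrow{P} X$, $\forall n$ $|X_n| \le Y$, $Y \in L^p$. Then, $X \in L^p$ and $X_n \xrightarrow{L^p} X$
(c) Suppose $X_n \xrightarrow{\text{a.s.}} X$, $\forall n$ $|X_n| \le Y$, $Y \in L^p$. Then, $X \in L^p$ and $X_n \xrightarrow{L^p} X$
(d) $X_n \xrightarrow{P} X \Rightarrow X_n \xrightarrow{D} X$
(e) If $X_n \xrightarrow{D} X$ and if $\exists a \in \mathbb{R}$ such that $X = a$ a.s., then $X_n \xrightarrow{P} X$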
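(i) Hence, when $X = a$ a.s., $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{D} X$ are equivalent.
See also Figure 22.
Example 12.11.
(a) $\Omega = [0, 1]$, $P$ is uniform on $[0, 1]$.
\[ X_n(\omega) = \begin{cases} 0, & \omega \in \left[0, 1 - \frac{1}{n^2}\right], \\ e^n, & \omega \in \left(1 - \frac{1}{n^2}, 1\right]. \end{cases} \]
(i) $X_n \overset{L^p}{\nrightarrow} 0$
(ii) $X_n \xrightarrow{\text{a.s.}} 0$
(iii) $X_n \xrightarrow{P} 0$
(b) Let $\Omega = [0, 1]$ with uniform probability distribution. Define $X_n(\omega) = \omega + 1_{\left[\frac{j-1}{k}, \frac{j}{k}\right]}(\omega)$, where $k = \left\lceil \frac{1}{2}\left(\sqrt{1 + 8n} - 1\right) \right\rceil$ and $j = n - \frac{1}{2}k(k-1)$. The sequence of intervals $\left[\frac{j-1}{k}, \frac{j}{k}\right]$ under the indicator function is shown in figure 23.
$[0, 1]$
$\left[0, \frac12\right] \to \left[\frac12, 1\right]$
$\left[0, \frac13\right] \to \left[\frac13, \frac23\right] \to \left[\frac23, 1\right]$
$\left[0, \frac14\right] \to \left[\frac14, \frac12\right] \to \left[\frac12, \frac34\right] \to \left[\frac34, 1\right]$
Figure 23: Diagram for example (12.11)
Let $X(\omega) = \omega$.
(i) The sequence of real numbers $(X_n(\omega))$ does not converge for any $\omega$
(ii) $X_n \xrightarrow{L^p} X$
(iii) $X_n \xrightarrow{P} X$
(c) $X_n = \begin{cases} \frac{1}{n}, & \text{w.p. } 1 - \frac{1}{n}, \\ n, & \text{w.p. } \frac{1}{n}. \end{cases}$
(i) $X_n \xrightarrow{P} 0$
(ii) $X_n \overset{L^p}{\nrightarrow} 0$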
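The indexing in Example 12.11(b) can be checked with a short sketch. The function names `interval_index` and `x_n` are illustrative choices, not notation from the notes; the code only assumes the formulas for k and j stated in the example.

```python
import math


def interval_index(n: int) -> tuple[int, int]:
    """Return (k, j) such that the n-th interval is [(j-1)/k, j/k]."""
    # k = ceil((sqrt(1 + 8n) - 1) / 2), i.e. the smallest k with k(k+1)/2 >= n.
    # Integer arithmetic avoids floating-point rounding near perfect squares.
    k = (math.isqrt(8 * n + 1) - 1) // 2
    if k * (k + 1) // 2 < n:
        k += 1
    j = n - k * (k - 1) // 2
    return k, j


def x_n(n: int, omega: float) -> float:
    """X_n(omega) = omega + indicator that omega lies in the n-th interval."""
    k, j = interval_index(n)
    return omega + (1.0 if (j - 1) / k <= omega <= j / k else 0.0)


if __name__ == "__main__":
    omega = 0.3
    # For every k the sweep of intervals covers omega at least once, so
    # X_n(omega) keeps jumping between omega and omega + 1 (no pointwise limit).
    jumps = [n for n in range(1, 56) if x_n(n, omega) > omega + 0.5]
    print(jumps)
    # P[|X_n - X| > eps] equals the interval length 1/k, which tends to 0.
    print([1 / interval_index(n)[0] for n in (1, 10, 100, 1000)])
```

The interval length 1/k shrinking to zero is what gives convergence in probability, while the repeated sweeps across [0, 1] are what prevent convergence at any fixed ω.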
12.1 Summation of random variables
Set $S_n = \sum_{i=1}^{n} X_i$.
12.12 (Markov’s theorem; Chebyshev’s inequality). For finite variance $(X_i)$, if $\frac{1}{n^2}\operatorname{Var} S_n \to 0$, then $\frac{1}{n}S_n - \frac{1}{n}ES_n \xrightarrow{P} 0$.
• If $(X_i)$ are pairwise independent, then $\frac{1}{n^2}\operatorname{Var} S_n = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i$. See also (12.16).
12.2 Summation of independent random variables
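12.13. For independent $X_n$, the probability that $\sum_{i=1}^{n} X_i$ converges (as $n \to \infty$) is either 0 or 1.
12.14 (Kolmogorov’s SLLN). Consider a sequence $(X_n)$ of independent random variables. If $\sum_n \frac{\operatorname{Var} X_n}{n^2} < \infty$, then $\frac{1}{n}S_n - \frac{1}{n}ES_n \xrightarrow{\text{a.s.}} 0$.
• In particular, for independent $X_n$, if $\frac{1}{n}\sum_{i=1}^{n} EX_i \to a$ or $EX_n \to a$, then $\frac{1}{n}S_n \xrightarrow{\text{a.s.}} a$.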
12.15. Suppose that $X_1, X_2, \ldots$ are independent and that the series $X = X_1 + X_2 + \cdots$ converges a.s. Then, $\varphi_X = \varphi_{X_1}\varphi_{X_2}\cdots$
12.16. For pairwise independent $(X_i)$ with finite variances, if $\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i \to 0$, then $\frac{1}{n}S_n - \frac{1}{n}ES_n \xrightarrow{P} 0$.
(a) Chebyshev’s Theorem: For pairwise independent $(X_i)$ with uniformly bounded variances, $\frac{1}{n}S_n - \frac{1}{n}ES_n \xrightarrow{P} 0$.
12.3 Summation of i.i.d. random variable
Let $(X_i)$ be i.i.d. random variables.
12.17 (Chebyshev’s inequality). $P\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - EX_1\right| \ge \varepsilon\right] \le \frac{1}{\varepsilon^2}\frac{\operatorname{Var}[X_1]}{n}$
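• The $X_i$’s don’t have to be independent; they only need to be pairwise uncorrelated.
12.18 (WLLN). Weak Law of Large Numbers:
(a) $L^2$ weak law (finite variance): Let $(X_i)$ be uncorrelated (or pairwise independent) random variables.
(i) $\operatorname{Var} S_n = \sum_{i=1}^{n}\operatorname{Var} X_i$
(ii) If $\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var} X_i \to 0$, then $\frac{1}{n}S_n - \frac{1}{n}ES_n \xrightarrow{P} 0$
(iii) If $EX_i = \mu$ and $\operatorname{Var} X_i \le C < \infty$, then $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$
(b) Let $(X_i)$ be i.i.d. random variables with $EX_i = \mu$ and $\operatorname{Var} X_i = \sigma^2 < \infty$. Then,
(i) $P\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - EX_1\right| \ge \varepsilon\right] \le \frac{1}{\varepsilon^2}\frac{\operatorname{Var}[X_1]}{n}$
(ii) $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$
(The fact that $\sigma^2 < \infty$ implies $|\mu| < \infty$.)
(c) $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{L^2} \mu$ implies $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu$, which in turn implies $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{D} \mu$
(d) If $X_n$ are i.i.d. random variables such that $\lim_{t\to\infty} tP[|X_1| > t] = 0$, then $\frac{1}{n}S_n - EX_1^{(n)} \xrightarrow{P} 0$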
(e) Khintchine’s theorem: If $X_n$ are i.i.d. random variables and $E|X_1| < \infty$, then $EX_1^{(n)} \to EX_1$ and $\frac{1}{n}S_n \xrightarrow{P} EX_1$.
• No assumption about the finiteness of variance.
12.19 (SLLN).
(a) Kolmogorov’s SLLN: Consider a sequence $(X_n)$ of independent random variables. If $\sum_n \frac{\operatorname{Var} X_n}{n^2} < \infty$, then $\frac{1}{n}S_n - \frac{1}{n}ES_n \xrightarrow{\text{a.s.}} 0$.
• In particular, for independent $X_n$, if $\frac{1}{n}\sum_{i=1}^{n} EX_i \to a$ or $EX_n \to a$, then $\frac{1}{n}S_n \xrightarrow{\text{a.s.}} a$.
(b) Khintchin’s SLLN: If $X_i$’s are i.i.d. with finite mean $\mu$, then $\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu$.
(c) Consider the sequence $(X_n)$ of i.i.d. random variables. Suppose $EX_1^- < \infty$ and $EX_1^+ = \infty$; then $\frac{1}{n}S_n \xrightarrow{\text{a.s.}} \infty$.
• Suppose that $X_n \ge 0$ are i.i.d. random variables and $EX_n = \infty$. Then, $\frac{1}{n}S_n \xrightarrow{\text{a.s.}} \infty$.
12.20 (Relationship between LLN and the convergence of relative frequency to the probability). Consider i.i.d. $Z_i \sim Z$. Let $X_i = 1_A(Z_i)$. Then, $\frac{1}{n}S_n = \frac{1}{n}\sum_{i=1}^{n} 1_A(Z_i) = r_n(A)$, the relative frequency of an event $A$. Via LLN and appropriate conditions, $r_n(A)$ converges to $E\left[\frac{1}{n}\sum_{i=1}^{n} 1_A(Z_i)\right] = P[Z \in A]$.
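The relative-frequency statement in 12.20 can be illustrated with a small simulation. This is only a sketch: the helper `relative_frequency`, the choice Z ~ U(0, 1), and the event A = [0, 0.25] are assumptions made for the illustration.

```python
import random


def relative_frequency(n, in_event, sample, seed=0):
    """Compute r_n(A) = (1/n) * sum of 1_A(Z_i) for i.i.d. draws Z_i."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(n) if in_event(sample(rng)))
    return hits / n


if __name__ == "__main__":
    # Z ~ U(0, 1) and A = [0, 0.25], so P[Z in A] = 0.25.
    for n in (100, 10_000, 1_000_000):
        r = relative_frequency(n, lambda z: z <= 0.25, lambda rng: rng.random())
        print(n, r)
```

By Chebyshev’s inequality (12.17), the deviation of r_n(A) from P[Z ∈ A] is of order 1/√n, which is the rate one would expect to see in the printed values.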
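12.4 Central Limit Theorem (CLT)
Suppose that $(X_k)_{k\ge1}$ is a sequence of i.i.d. random variables with mean $m$ and variance $0 < \sigma^2 < \infty$. Let $S_n = \sum_{k=1}^{n} X_k$.
12.21 (Lindeberg-Lévy theorem).
(a) $\frac{S_n - nm}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{k=1}^{n}\frac{X_k - m}{\sigma} \Rightarrow N(0, 1)$.
(b) $\frac{S_n - nm}{\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{k=1}^{n}(X_k - m) \Rightarrow N(0, \sigma^2)$.
To see this, let $Z_k = \frac{X_k - m}{\sigma} \overset{\text{iid}}{\sim} Z$ and $Y_n = \frac{1}{\sqrt{n}}\sum_{k=1}^{n} Z_k$. Then, $EZ = 0$, $\operatorname{Var} Z = 1$, and $\varphi_{Y_n}(t) = \left(\varphi_Z\left(\frac{t}{\sqrt{n}}\right)\right)^n$. By approximating $e^x \approx 1 + x + \frac12 x^2$, we have $\varphi_X(t) \approx 1 + jtEX - \frac12 t^2 E[X^2]$ (see also (22)) and
\[ \varphi_{Y_n}(t) \approx \left(1 - \frac{1}{2}\frac{t^2}{n}\right)^n \to e^{-\frac{t^2}{2}}. \]
• The case of Bernoulli(1/2) was derived by Abraham de Moivre around 1733. The case of Bernoulli($p$) for $0 < p < 1$ was considered by Pierre-Simon Laplace [9, p 208].
12.22 (Approximation of densities and pmfs using the CLT). Approximate the distribution of $S_n$ by $N(nm, n\sigma^2)$.
• $F_{S_n}(s) \approx \Phi\left(\frac{s - nm}{\sigma\sqrt{n}}\right)$
• $f_{S_n}(s) \approx \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{n}}\, e^{-\frac12\left(\frac{s - nm}{\sigma\sqrt{n}}\right)^2}$
• If the $X_i$ are integer-valued, then
\[ P[S_n = k] = P\left[k - \frac12 < S_n \le k + \frac12\right] \approx \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{n}}\, e^{-\frac12\left(\frac{k - nm}{\sigma\sqrt{n}}\right)^2} \]
[9, eq (5.14) p 213]. The approximation is best for $s$, $k$ near $nm$ [9, p 211].
• The approximation $n! \approx \sqrt{2\pi}\, n^{n+\frac12} e^{-n}$ can be derived from approximating the density of $S_n$ when $X_i \sim \mathcal{E}(1)$. We know that $f_{S_n}(s) = \frac{s^{n-1}e^{-s}}{(n-1)!}$. Approximating the density at $s = n$ gives $(n-1)! \approx \sqrt{2\pi}\, n^{n-\frac12} e^{-n}$. Multiply through by $n$. [9, Ex 5.18 p 212]
• See also normal approximation for the binomial in (13).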
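The two approximations in 12.22 can be compared against exact values. The sketch below assumes Bernoulli(p) summands for the pmf check (so S_n is binomial, m = p, σ² = p(1 − p)); the function names are illustrative.

```python
import math


def clt_pmf_approx(k, n, m, sigma):
    """Normal-density approximation to P[S_n = k] from 12.22."""
    z = (k - n * m) / (sigma * math.sqrt(n))
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma * math.sqrt(n))


def binomial_pmf(k, n, p):
    """Exact P[S_n = k] when the X_i are i.i.d. Bernoulli(p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def stirling(n):
    """Stirling-type approximation n! ~ sqrt(2*pi) * n^(n + 1/2) * e^(-n)."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)


if __name__ == "__main__":
    n, p = 100, 0.5
    sigma = math.sqrt(p * (1 - p))
    for k in (50, 55, 60, 70):
        print(k, binomial_pmf(k, n, p), clt_pmf_approx(k, n, p, sigma))
    for q in (5, 10, 20):
        print(q, stirling(q) / math.factorial(q))
```

Printing several values of k makes it visible that the relative error grows as k moves away from nm, in line with the remark that the approximation is best near nm.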
13 Conditional Probability and Expectation
13.1 Conditional Probability
Definition 13.1. Suppose conditioned on $X = x$, $Y$ has distribution $Q$. Then, we write $Y|X = x \sim Q$ [21, p 40]. It might be clearer to write $P^{Y|X=x} = Q$.
13.2. Discrete random variables
(a) $p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}$
(b) $p_{X,Y}(x,y) = p_{X|Y}(x|y)p_Y(y) = p_{Y|X}(y|x)p_X(x)$
(c) The law of total probability: $P[Y \in C] = \sum_x P[Y \in C|X = x]P[X = x]$. In particular, $p_Y(y) = \sum_x p_{Y|X}(y|x)p_X(x)$.
(d) $p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)p_X(x)}{\sum_{x'} p_{Y|X}(y|x')p_X(x')}$
(e) If $Y = X + Z$ where $X$ and $Z$ are independent, then
• $p_{Y|X}(y|x) = p_Z(y - x)$
• $p_Y(y) = \sum_x p_Z(y - x)p_X(x)$
• $p_{X|Y}(x|y) = \frac{p_Z(y - x)p_X(x)}{\sum_{x'} p_Z(y - x')p_X(x')}$
(f) The substitution law of conditional probability: $P[g(X,Y) = z|X = x] = P[g(x,Y) = z|X = x]$.
• When $X$ and $Y$ are independent, we can “drop the conditioning”.
13.3. Absolutely continuous random variables
(a) $f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$
(b) $F_{Y|X}(y|x) = \int_{-\infty}^{y} f_{Y|X}(t|x)\,dt$
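The discrete formulas in 13.2(d) and (e) translate directly into code. The following is a minimal sketch for Y = X + Z with independent X and Z; representing pmfs as dictionaries and the name `posterior_x_given_y` are choices made for this illustration.

```python
def posterior_x_given_y(y, p_x, p_z):
    """Return p_{X|Y}(.|y) for Y = X + Z with X and Z independent.

    p_x and p_z map values to probabilities (pmfs as dictionaries).
    """
    # Numerator of 13.2(e): p_Z(y - x) * p_X(x) for each candidate x.
    joint = {x: p_z.get(y - x, 0.0) * px for x, px in p_x.items()}
    # Denominator: p_Y(y) from the law of total probability, 13.2(c).
    p_y = sum(joint.values())
    if p_y == 0.0:
        raise ValueError("y has zero probability under the model")
    return {x: v / p_y for x, v in joint.items()}


if __name__ == "__main__":
    p_x = {0: 0.5, 1: 0.5}   # equally likely input
    p_z = {0: 0.9, 1: 0.1}   # additive (integer, not modulo-2) noise
    for y in (0, 1, 2):
        print(y, posterior_x_given_y(y, p_x, p_z))
```

With these numbers, observing y = 1 gives posterior weight 0.45/0.50 on x = 1, since y = 1 arises either from (x, z) = (1, 0) or from (0, 1).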
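13.4. $P[(X,Y) \in A] = E[P[(X,Y) \in A|X]] = \int_A f_{X,Y}(x,y)\,d(x,y)$
13.5. $\frac{\partial}{\partial z}F_{Y|X}(g(z,x)|x) = \frac{\partial}{\partial z}\int_{-\infty}^{g(z,x)} f_{Y|X}(y|x)\,dy = f_{Y|X}(g(z,x)|x)\,\frac{\partial}{\partial z}g(z,x)$
13.6. $F_{X|Y}(x|y) = P[X \le x|Y = y] = \lim_{\Delta\to0}\frac{P[X \le x,\ y - \Delta < Y \le y]}{P[y - \Delta < Y \le y]} = E\left[1_{(-\infty,x]}(X)\,\middle|\,Y = y\right]$.
13.7. Define $f_{X|A}(x)$ to be $\frac{f_{X,A}(x)}{P(A)}$. Then,
\[ P[A|X = x] = \lim_{\Delta\to0}\frac{\int_{x-\Delta}^{x} f_{X,A}(x')\,dx'}{\int_{x-\Delta}^{x} f_X(x')\,dx'} = \frac{f_{X,A}(x)}{f_X(x)} = \frac{P(A)f_{X|A}(x)}{f_X(x)}. \]
13.8. For independent $X_i$, $P[\forall i\ X_i \in B_i \,|\, \forall i\ X_i \in C_i] = \prod_i P[X_i \in B_i|X_i \in C_i]$.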
13.2 Conditional Expectation
Definition 13.9. $E[g(Y)|X = x] = \sum_y g(y)p_{Y|X}(y|x)$.
• In particular, $E[Y|X = x] = \sum_y y\,p_{Y|X}(y|x)$.
• Note that $E[Y|X = x]$ is a function of $x$.
13.10. Properties of conditional expectation.
(a) Substitution law for conditional expectation: $E[g(X,Y)|X = x] = E[g(x,Y)|X = x]$
(b) $E[h(X)g(Y)|X = x] = h(x)E[g(Y)|X = x]$.
(c) (The Rule of Iterated Expectations) $EY = E[E[Y|X]]$.
(d) Law of total probability for expectation: $E[g(X,Y)] = E[E[g(X,Y)|X]]$.
(i) $E[g(Y)] = E[E[g(Y)|X]]$.
(ii) $F_Y(y) = E\left[F_{Y|X}(y|X)\right]$. (Take $g(z) = 1_{(-\infty,y]}(z)$.)
(e) $E[g(X)h(Y)|X = x] = g(x)E[h(Y)|X = x]$
(f) $E[g(X)h(Y)] = E[g(X)E[h(Y)|X]]$
(g) $E[X + Y|Z = z] = E[X|Z = z] + E[Y|Z = z]$
(h) $E[AX|z] = AE[X|z]$.
(i) $E[(X - E[X|Y])|Y] = 0$ and $E[X - E[X|Y]] = 0$
(j) $\min_{g(x)} E[(Y - g(X))^2] = E[(Y - E[Y|X])^2]$ where $g$ ranges over all functions. Hence, $E[Y|X]$ is sometimes called the regression of $Y$ on $X$, the “best” predictor of $Y$ conditional on $X$.
Definition 13.11. The conditional variance is defined as
\[ \operatorname{Var}[Y|X = x] = \int (y - m(x))^2 f(y|x)\,dy \]
where $m(x) = E[Y|X = x]$.
13.12. Properties of conditional variance
(a) $\operatorname{Var} Y = E[\operatorname{Var}[Y|X]] + \operatorname{Var}[E[Y|X]]$. In other words, suppose that, given $X = x$, the mean and variance of $Y$ are $m(x)$ and $v(x)$. Then, the variance of $Y$ is $\operatorname{Var} Y = E[v(X)] + \operatorname{Var}[m(X)]$.
Recall that for any function $g$, we have $E[g(Y)] = E[E[g(Y)|X]]$. Because $EY$ is just a constant, say $\mu$, we can define $g(y) = (y - \mu)^2$, which then implies $\operatorname{Var} Y = E[E[g(Y)|X]]$. Note, however, that $E[g(Y)|X]$ and $\operatorname{Var}[Y|X]$ are not the same. Suppose conditioned on $X = x$, $Y$ has distribution $Q$ with mean $m(x)$ and variance $v(x)$. Then, $v(x) = \int (y - m(x))^2\,Q(dy)$. However, $E[g(Y)|X = x] = \int (y - \mu)^2\,Q(dy)$; note the use of $\mu$ instead of $m(x)$. Therefore, in general, $\operatorname{Var} Y \ne E[\operatorname{Var}[Y|X]]$.
• All three terms in the expression are nonnegative. $\operatorname{Var} Y$ is an upper bound for each of the terms on the RHS.
(b) Suppose $N \perp\!\!\!\perp (X, Z)$; then $\operatorname{Var}[X + N|Z] = \operatorname{Var}[X|Z] + \operatorname{Var} N$.
(c) $\operatorname{Var}[AX|z] = A\operatorname{Var}[X|z]A^H$.
13.13. Suppose $E[Y|X] = X$. Then, $\operatorname{Cov}[X,Y] = \operatorname{Var} X$. See also (5.19). This is also true for $Y = X + N$ with $X \perp\!\!\!\perp N$ and $N$ zero-mean noise.
Definition 13.14. $\mu_n[Y|X] = E[(Y - E[Y|X])^n|X]$
13.15. Properties
(a) $\mu_3[Y] = E[\mu_3[Y|X]] + \mu_3[E[Y|X]]$
$\mu_4[Y] = E[\mu_4[Y|X]] + 6E[\operatorname{Var}[Y|X]]\operatorname{Var}[E[Y|X]] + \mu_4[E[Y|X]]$
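The variance decomposition in 13.12(a) can be verified exactly on any finite joint pmf. The sketch below uses a small made-up pmf; the function name `total_variance_terms` and the dictionary representation are assumptions of the illustration.

```python
def total_variance_terms(p_xy):
    """Return (Var Y, E[Var[Y|X]], Var[E[Y|X]]) for a joint pmf {(x, y): prob}."""
    # Marginal pmf of X.
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    # Conditional mean m(x) and conditional variance v(x).
    m = {x: sum(y * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x] for x in p_x}
    v = {
        x: sum((y - m[x]) ** 2 * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x]
        for x in p_x
    }
    mean_y = sum(y * p for (_, y), p in p_xy.items())
    var_y = sum((y - mean_y) ** 2 * p for (_, y), p in p_xy.items())
    e_v = sum(v[x] * p_x[x] for x in p_x)                      # E[v(X)]
    var_m = sum((m[x] - mean_y) ** 2 * p_x[x] for x in p_x)    # Var[m(X)]
    return var_y, e_v, var_m


if __name__ == "__main__":
    p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 2): 0.2}
    var_y, e_v, var_m = total_variance_terms(p_xy)
    print(var_y, e_v + var_m)  # the two numbers should agree up to rounding
```

Because both terms on the right-hand side are nonnegative, the output also shows numerically that Var Y bounds each of them from above.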
13.3 Conditional Independence
13.16. The following statements are equivalent conditions for $X_1, X_2, \ldots, X_n$ to be mutually independent conditioning on $Y$ (a.s.).
(a) $p(x_1^n|y) = \prod_{i=1}^{n} p(x_i|y)$.
(b) $\forall i \in [n] \setminus \{1\}$, $p\left(x_i\,\middle|\,x_1^{i-1}, y\right) = p(x_i|y)$.
(c) $\forall i \in [n]$, $X_i$ and the vector $(X_j)_{[n]\setminus\{i\}}$ are independent conditioning on $Y$.
Example 13.17. Suppose $X$ and $Y$ are independent. Conditioned on another random variable $Z$, it is not true in general that $X$ and $Y$ are still independent. See example (4.45). Recall that $Z = X \oplus Y$, which can be rewritten as $Y = X \oplus Z$. Hence, when $Z = 0$, we must have $Y = X$.
13.18. Suppose we know that $f_{X|Y,Z}(x|y,z) = g(x,y)$; that is, $f_{X|Y,Z}$ does not depend on $z$. Then, conditioned on $Y$, $X$ and $Z$ are independent. In which case, $f_{X|Y,Z}(x|y,z) = f_{X|Y}(x|y) = g(x,y)$.
13.19. Suppose we know that $f_{Z|V,U_1,U_2}(z|v,u_1,u_2) = f_{Z|V}(z|v)$ for all $z, v, u_1, u_2$; then, conditioned on $V$, we can conclude that $Z$ and $(U_1, U_2)$ are independent. This further implies that, conditioned on $V$, $Z$ and $U_i$ are independent. Moreover, $f_{Z|V,U_1,U_2}(z|v,u_1,u_2) = f_{Z|V,U_1}(z|v,u_1) = f_{Z|V,U_2}(z|v,u_2) = f_{Z|V}(z|v)$.
14 Real-valued Jointly Gaussian
Definition 14.1. A random vector $X$ in $\mathbb{R}^d$ is jointly Gaussian or jointly normal if and only if $\forall v \in \mathbb{R}^d$, the random variable $v^T X$ is Gaussian.
• In order for this definition to make sense when $v = 0$ or when $X$ has a singular covariance matrix, we agree that any constant random variable is considered to be Gaussian.
• Of course, the mean and variance of $v^T X$ are $v^T EX$ and $v^T \Lambda_X v$, respectively.
If $X$ is a Gaussian random vector with mean vector $m$ and covariance matrix $\Lambda$, we write $X \sim N(m, \Lambda)$.
14.2. Properties of jointly Gaussian random vector $X \sim N(m, \Lambda)$
(a) $m = EX$, $\Lambda = \operatorname{Cov}[X] = E\left[(X - EX)(X - EX)^T\right]$.
(b) $f_X(x) = \frac{1}{(2\pi)^{\frac{n}{2}}\sqrt{\det(\Lambda)}}\, e^{-\frac12 (x-m)^T \Lambda^{-1}(x-m)}$, where $n$ is the dimension of $X$.
• To remember the form of the above formula, note that the exponent has to be a scalar. So, we had better have $(x - m)^T \Lambda^{-1}(x - m)$ instead of having the transpose on the last term. To make this clearer, set $\Lambda = I$; then we must have a dot product. Note also that $v^T A v = \sum_k \sum_\ell v_k A_{k\ell} v_\ell$.
• The above formula can be derived by starting from a random vector $Z$ whose components are i.i.d. $N(0, 1)$. Let $X = C_X^{\frac12} Z + m$. Use (28) and (32).
• For independent $X_i \sim N(m_i, \sigma_i^2)$,
\[ f_X(x) = \frac{1}{(2\pi)^{\frac{n}{2}}\prod_i \sigma_i}\, e^{-\frac12 \sum_i \left(\frac{x_i - m_i}{\sigma_i}\right)^2}. \]
In particular, if $X_i \overset{\text{i.i.d.}}{\sim} N(0, 1)$,
\[ f_X(x) = \frac{1}{(2\pi)^{\frac{n}{2}}}\, e^{-\frac12 x^T x}. \tag{32} \]
• $(2\pi)^{\frac{n}{2}}\sqrt{\det(\Lambda)} = \sqrt{\det(2\pi\Lambda)}$
• The Gaussian density is constant on the “ellipsoids” centered at $m$, $\left\{x \in \mathbb{R}^n : (x - m)^T C_X^{-1}(x - m) = \text{constant}\right\}$.
(c) $\varphi_X(v) = e^{jv^T m - \frac12 v^T \Lambda v} = e^{j\sum_i v_i EX_i - \frac12 \sum_k \sum_\ell v_k v_\ell \operatorname{Cov}[X_k, X_\ell]}$.
• This can be derived from definition (14.1) by noting that $\varphi_X(v) = E\left[e^{jv^T X}\right] = E\left[e^{j \cdot 1 \cdot Y}\right]$ is simply $\varphi_Y(1)$, where $Y = v^T X$, which by definition is normal.
(d) A random vector $X$ in $\mathbb{R}^d$ is jointly Gaussian if and only if $\forall v \in \mathbb{R}^d$, the random variable $v^T X$ is Gaussian.
• Independent Gaussian random variables are jointly Gaussian.
(e) Joint normality is preserved under linear transformation: suppose $Y = AX + b$; then $Y \sim N\left(Am + b, A\Lambda A^T\right)$.
(f) If $(X, Y)$ is jointly Gaussian, then $X$ and $Y$ are independent if and only if $\operatorname{Cov}[X, Y] = 0$. Hence, uncorrelated jointly Gaussian random variables are independent.
(g) Note that the joint density does not exist when the covariance matrix $\Lambda$ is singular.
(h) For i.i.d. $N(\mu, \sigma^2)$:
\[ f_{X_1^n}(x_1^n) = \frac{1}{(2\pi)^{\frac{n}{2}}\sigma^n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right\} = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\right\} \]
(i) Third order joint Gaussian moments are 0: $E[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)] = 0$ $\forall i, j, k$ not necessarily distinct. In particular, $E\left[(X - EX)^3\right] = 0$.
(j) Isserlis’s Theorem: Any fourth-order central moment of jointly Gaussian r.v. is expressible as the sum of all possible products of pairs of their covariances:
\[ E[(X_i - EX_i)(X_j - EX_j)(X_k - EX_k)(X_\ell - EX_\ell)] = \operatorname{Cov}[X_i, X_j]\operatorname{Cov}[X_k, X_\ell] + \operatorname{Cov}[X_i, X_k]\operatorname{Cov}[X_j, X_\ell] + \operatorname{Cov}[X_i, X_\ell]\operatorname{Cov}[X_j, X_k]. \]
Note that $\frac12\binom{4}{2} = 3$.
• In particular, $E\left[(X - EX)^4\right] = 3\sigma^4$.
(k) To generate $N(m, \Lambda)$: first, by the spectral theorem, $\Lambda = VDV^T$ where $V$ is an orthogonal matrix whose columns are eigenvectors of $\Lambda$ and $D$ is a diagonal matrix with the eigenvalues of $\Lambda$. The random variable we want is $VX + m$ where $X \sim N(0, D)$.
14.3. For bivariate normal, $f_{X,Y}(x, y)$ is
\[ \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}}\exp\left\{-\frac{\left(\frac{x - EX}{\sigma_X}\right)^2 - 2\rho\,\frac{x - EX}{\sigma_X}\,\frac{y - EY}{\sigma_Y} + \left(\frac{y - EY}{\sigma_Y}\right)^2}{2(1 - \rho^2)}\right\}, \]
where $\rho = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y} \in [-1, 1]$. Here, $x, y \in \mathbb{R}$.
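• $f_{X,Y}(x, y) = \frac{1}{\sigma_X\sigma_Y}\,\psi_\rho\left(\frac{x - m_X}{\sigma_X}, \frac{y - m_Y}{\sigma_Y}\right)$
• $f_{X,Y}$ is constant on ellipses; in the zero-mean case with $\rho = 0$ these take the form $\left(\frac{x}{\sigma_X}\right)^2 + \left(\frac{y}{\sigma_Y}\right)^2 = r^2$.
• $\Lambda = \begin{pmatrix} \sigma_X^2 & \operatorname{Cov}[X,Y] \\ \operatorname{Cov}[X,Y] & \sigma_Y^2 \end{pmatrix} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}$.
• The following are equivalent: (a) $\rho = 0$ (b) $\operatorname{Cov}[X, Y] = 0$ (c) $X$ and $Y$ are independent.
• $|\rho| = 1$ if and only if $(X - EX) = k(Y - EY)$. In which case
◦ $\rho = \frac{k}{|k|} = \operatorname{sign}(k)$
◦ $|k| = \frac{\sigma_X}{\sigma_Y}$
• Suppose $f_{X,Y}(x, y)$ only depends on $\sqrt{x^2 + y^2}$ and $X$ and $Y$ are independent; then $X$ and $Y$ are normal with zero mean and equal variance.
• $X|Y \sim N\left(\rho\frac{\sigma_X}{\sigma_Y}(Y - m_Y) + m_X,\ \sigma_X^2(1 - \rho^2)\right)$ and $Y|X \sim N\left(\rho\frac{\sigma_Y}{\sigma_X}(X - m_X) + m_Y,\ \sigma_Y^2(1 - \rho^2)\right)$
• The standard bivariate density is defined as
\[ \psi_\rho(u, v) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{-\frac{u^2 - 2\rho uv + v^2}{2(1 - \rho^2)}} = \underbrace{\psi(u)}_{f_U(u)}\,\underbrace{\frac{1}{\sqrt{1 - \rho^2}}\,\psi\left(\frac{v - \rho u}{\sqrt{1 - \rho^2}}\right)}_{f_{V|U}(v|u),\ V|U \sim N(\rho U,\,1 - \rho^2)} = \underbrace{\frac{1}{\sqrt{1 - \rho^2}}\,\psi\left(\frac{u - \rho v}{\sqrt{1 - \rho^2}}\right)}_{f_{U|V}(u|v),\ U|V \sim N(\rho V,\,1 - \rho^2)}\,\underbrace{\psi(v)}_{f_V(v)} \]
[9, eq. (7.22) p 309, eq. (7.23) p 311, eq. (7.26) p 313]. This is the joint density of $U, V$ where $U, V \sim N(0, 1)$ with $\operatorname{Cov}[U, V] = E[UV] = \rho$.
◦ The general bivariate Gaussian pair is obtained from the transformation
\[ \begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \sigma_X U + m_X \\ \sigma_Y V + m_Y \end{pmatrix} = \begin{pmatrix} \sigma_X & 0 \\ 0 & \sigma_Y \end{pmatrix}\begin{pmatrix} U \\ V \end{pmatrix} + \begin{pmatrix} m_X \\ m_Y \end{pmatrix}. \]
◦ $f_{U|V}(u|v)$ is $N(\rho v, 1 - \rho^2)$. In other words, $U|V \sim N(\rho V, 1 - \rho^2)$.
14.4 (Conditional Gaussian).
(a) Suppose $(X, Y)$ are jointly Gaussian; that is, $\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \Lambda_X & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_Y \end{pmatrix}\right)$. Then, $f_{X|Y}(x|y)$ is $N\left(E[X|y], \Lambda_{X|y}\right)$ where $E[X|y] = \mu_X + \Lambda_{XY}\Lambda_Y^{-1}(y - \mu_Y)$ and $\Lambda_{X|y} = \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{YX}$.
• Note the direction of the formula for $\Lambda_{X|y}$: the blocks of the covariance matrix are visited clockwise, $\Lambda_X \to \Lambda_{XY} \to \Lambda_Y \to \Lambda_{YX}$.
(b) Suppose $(X, Y, W)$ are jointly Gaussian with $W \perp\!\!\!\perp (X, Y)$. Set $V = BX + W$. Then, $V|y \sim N\left(E[V|y], \Lambda_{V|y}\right)$ where $E[V|y] = BE[X|y] + EW$ and $\Lambda_{V|y} = B\Lambda_{X|y}B^T + \Lambda_W$.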
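For scalar X and Y, the conditional formulas in 14.4(a) reduce to the bivariate expressions of 14.3, and a correlated pair can be generated from the factorization V|U ∼ N(ρU, 1 − ρ²). The sketch below is an illustration under those assumptions; all function names are hypothetical.

```python
import math
import random


def conditional_gaussian(mu_x, mu_y, var_x, var_y, cov_xy, y):
    """Scalar case of 14.4(a): mean and variance of X given Y = y."""
    mean = mu_x + cov_xy / var_y * (y - mu_y)
    var = var_x - cov_xy / var_y * cov_xy
    return mean, var


def sample_pair(rng, m_x, m_y, s_x, s_y, rho):
    """Draw (X, Y) via U ~ N(0,1), V = rho*U + sqrt(1-rho^2)*W, then rescale."""
    u = rng.gauss(0.0, 1.0)
    w = rng.gauss(0.0, 1.0)
    v = rho * u + math.sqrt(1.0 - rho * rho) * w
    return s_x * u + m_x, s_y * v + m_y


def empirical_correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(1)
    pairs = [sample_pair(rng, 1.0, 2.0, 2.0, 3.0, 0.5) for _ in range(50_000)]
    print(empirical_correlation(pairs))
    # Cov[X, Y] = rho * s_x * s_y = 3, so E[X | Y = 5] = 1 + (3/9) * 3.
    print(conditional_gaussian(1.0, 2.0, 4.0, 9.0, 3.0, 5.0))
```

The scalar helper matches the 14.3 form because ΛXY ΛY⁻¹ = ρσXσY/σY² = ρσX/σY and ΛX − ΛXY² / ΛY = σX²(1 − ρ²).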
15 Bayesian Detection and Estimation
Consider a pair of random vectors $\Theta$ and $Y$, where $\Theta$ is not observed, but $Y$ is observed. We know the joint distribution of the pair $(\Theta, Y)$, which is usually given in the form of the prior distribution $p_\Theta(\theta)$ and the conditional distribution $p_{Y|\Theta}(y|\theta)$. By an estimator of $\Theta$ based on $Y$, we mean a function $g$ such that $\hat\Theta(Y) = g(Y)$ is our estimate or “guess” of the value of $\Theta$.
15.1 (Orthogonality Principle). Let $D$ be a collection of random vectors with the same dimension as $\Theta$. For a random vector $Z$, suppose that
\[ \forall X \in D,\quad Z - X \in D. \tag{33} \]
If
\[ E\left[X^T(\Theta - Z)\right] = 0,\quad \forall X \in D, \tag{34} \]
then
\[ E|\Theta - X|^2 = E|\Theta - Z|^2 + E|Z - X|^2 + 2\underbrace{E[(\underbrace{Z - X}_{\in D})^T(\Theta - Z)]}_{=0}, \]
which implies
\[ E|\Theta - Z|^2 \le E|\Theta - X|^2,\quad \forall X \in D. \]
• If $D$ is a subspace and $Z \in D$, then (33) is automatically satisfied.
• (34) says that the vector $\Theta - Z$ is orthogonal to all vectors in $D$.
Example 15.2. Suppose $\Theta$ and $N$ are independent Poisson random variables with respective parameters $\lambda$ and $\mu$. Let $Y = \Theta + N$.
• $Y$ is $\mathcal{P}(\lambda + \mu)$
• Conditioned on $Y = y$, $\Theta$ is $\mathcal{B}\left(y, \frac{\lambda}{\lambda + \mu}\right)$.
• $\hat\Theta_{\text{MMSE}}(Y) = E[\Theta|Y] = \frac{\lambda}{\lambda + \mu}Y$.
• $\operatorname{Var}[\Theta|Y] = Y\frac{\lambda\mu}{(\lambda + \mu)^2}$, and $\text{MSE} = E[\operatorname{Var}[\Theta|Y]] = \frac{\lambda\mu}{\lambda + \mu} < \operatorname{Var} Y = \lambda + \mu$.
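See also [7, Q 15.17].
15.3 (Weighted Error). Suppose we define the error by $\mathcal{E} = \left(\Theta - \hat\Theta(Y)\right)^T W\left(\Theta - \hat\Theta(Y)\right)$ for some positive definite matrix $W$. (Note that the usual MSE uses $W = I$.) The MSE $E\mathcal{E}$ is uniquely minimized by the MMSE estimator $\hat\Theta(Y) = E[\Theta|Y]$. The resulting MSE is $E\left[(\Theta - E[\Theta|Y])^T W(\Theta - E[\Theta|Y])\right]$.
In fact, for any function $g(Y)$, the conditional weighted error $E\left[(\Theta - g(Y))^T W(\Theta - g(Y))\,\middle|\,Y\right]$ is given by
\[ E\left[(\Theta - E[\Theta|Y])^T W(\Theta - E[\Theta|Y])\,\middle|\,Y\right] + (E[\Theta|Y] - g(Y))^T W(E[\Theta|Y] - g(Y)). \]
Hence, for each $Y$, it is minimized by having $g(Y) = E[\Theta|Y]$.
15.4 (Linear minimum mean-squared-error estimator). A linear MMSE estimator $\hat\Theta_{\text{LMMSE}} = g_{\text{LMMSE}}(Y)$ minimizes the MSE $E|\Theta - g(Y)|^2$ among all affine estimators of the form $g(y) = Ay + b$.
(a) It is sometimes called a Wiener filter.
(b) The scalar linear (affine) MMSE estimator is given by
\[ \hat\Theta_{\text{LMMSE}}(Y) = E\Theta + \frac{\operatorname{Cov}[Y, \Theta]}{\operatorname{Var} Y}(Y - EY). \]
• To see this in Hilbert space, note that we want the orthogonal projection of $\Theta$ onto the subspace spanned by two elements: $Y$ and 1. The orthogonal basis of the subspace is $\{1, Y - EY\}$. Hence, the orthogonal projection is
\[ \frac{(\Theta, 1)}{(1, 1)} + \frac{(\Theta, Y - EY)}{(Y - EY, Y - EY)}(Y - EY). \]
• The above discussion suggests an alternative way of arriving at the LMMSE: find $a, b$ in $\hat\Theta(Y) = aY + b$ such that the error $\mathcal{E} = \Theta - \hat\Theta(Y)$ is orthogonal to both 1 and $Y - EY$. The condition $(\mathcal{E}, 1) = 0$ requires $\hat\Theta(Y)$ to be unbiased. The condition $(\mathcal{E}, Y - EY) = 0$ gives $a = \frac{\operatorname{Cov}[Y, \Theta]}{\operatorname{Var} Y}$.
(c) The vector linear (affine) MMSE estimator is given by
\[ \hat\Theta_{\text{LMMSE}}(Y) = E\Theta + \Sigma_{\Theta Y}\Sigma_Y^{-1}(Y - EY) \]
and
\[ \text{MMSE} = \operatorname{Cov}\left[\Theta - \hat\Theta_{\text{LMMSE}}(Y)\right] = \Sigma_\Theta - \Sigma_{\Theta Y}\Sigma_Y^{-1}\Sigma_{Y\Theta}. \]
In fact, the optimal choice of $A$ is any solution of $A\Sigma_Y = \Sigma_{\Theta Y}$. In which case,
\[ \operatorname{Cov}\left[\Theta - \hat\Theta_{\text{LMMSE}}(Y)\right] = \Sigma_\Theta - A\Sigma_{Y\Theta} - \Sigma_{\Theta Y}A^T + A\Sigma_Y A^T = \Sigma_\Theta - \Sigma_{\Theta Y}A^T = \Sigma_\Theta - A\Sigma_{Y\Theta} = \Sigma_\Theta - A\Sigma_Y A^T. \]
When $\Sigma_Y$ is invertible, $A = \Sigma_{\Theta Y}\Sigma_Y^{-1}$. When $\Sigma_Y$ is singular, see [9, Q8.38 p 359].
• The MSE can be rewritten as $E\left|(\Theta - E\Theta) - A(Y - EY)\right|^2 + |E\Theta - AEY - b|^2$, which shows that the optimal choice of $b$ is $b = E\Theta - AEY$. This is the $b$ which makes the estimator unbiased.
• Fix $A$. Let $\tilde\Theta = \Theta - E\Theta$ and $\tilde Y = Y - EY$. For any matrix $B$, we have $E\left[(B\tilde Y)^T(\tilde\Theta - A\tilde Y)\right] = 0$ if and only if, for all matrices $B$, $E\left|\tilde\Theta - A\tilde Y\right|^2 \le E\left|\tilde\Theta - B\tilde Y\right|^2$. In which case,
\[ E\left|\tilde\Theta - B\tilde Y\right|^2 = E\left|\tilde\Theta - A\tilde Y\right|^2 + E\left|(A - B)\tilde Y\right|^2. \]
• Additive Noise in 1-D: $Y = \Theta + N$ where $\operatorname{Cov}[\Theta, N] = 0$. With $\text{SNR} = \frac{\operatorname{Var}\Theta}{\operatorname{Var} N}$,
\[ \hat\Theta_{\text{LMMSE}}(Y) = E\Theta + \frac{\operatorname{Cov}[\Theta, Y]}{\operatorname{Var} Y}(Y - EY) = E\Theta + \frac{\operatorname{Var}\Theta}{\operatorname{Var}\Theta + \operatorname{Var} N}(Y - (E\Theta + EN)) = E\Theta + \frac{\text{SNR}}{1 + \text{SNR}}(Y - (E\Theta + EN)) \]
and
\[ \text{MMSE} = \operatorname{Var}\Theta - \frac{\operatorname{Cov}^2[\Theta, Y]}{\operatorname{Var} Y} = \frac{\operatorname{Var}\Theta\operatorname{Var} N}{\operatorname{Var}\Theta + \operatorname{Var} N}. \]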
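The additive-noise formulas can be cross-checked on the Poisson setting of Example 15.2, where Var Θ = λ and Var N = μ. The sketch below computes E[Θ|Y = y] by direct summation over the pmfs and compares it with the LMMSE expression; function names are illustrative, and the agreement of the two estimators is a feature of this particular example rather than a general fact.

```python
import math


def poisson_pmf(k, lam):
    """P[K = k] for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def posterior_mean(y, lam, mu):
    """E[Theta | Y = y] for Y = Theta + N, computed by summing over theta."""
    weights = [poisson_pmf(t, lam) * poisson_pmf(y - t, mu) for t in range(y + 1)]
    return sum(t * w for t, w in enumerate(weights)) / sum(weights)


def lmmse_additive(y, mean_theta, mean_n, var_theta, var_n):
    """Scalar LMMSE estimate for Y = Theta + N with uncorrelated noise."""
    snr = var_theta / var_n
    return mean_theta + snr / (1 + snr) * (y - (mean_theta + mean_n))


def mmse_additive(var_theta, var_n):
    """Resulting mean-squared error Var(Theta) Var(N) / (Var(Theta) + Var(N))."""
    return var_theta * var_n / (var_theta + var_n)


if __name__ == "__main__":
    lam, mu = 2.0, 3.0
    for y in (0, 3, 5, 7):
        print(y, posterior_mean(y, lam, mu), lmmse_additive(y, lam, mu, lam, mu))
    print(mmse_additive(lam, mu))  # lambda * mu / (lambda + mu)
```

Here the conditional mean λy/(λ + μ) is already affine in y, which is why the linear estimator loses nothing relative to the MMSE estimator in this example.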
A More Math
A.1 Inequalities
A.1. By definition,
• $\sum_{n=1}^{\infty} a_n = \sum_{n\in\mathbb{N}} a_n = \lim_{N\to\infty}\sum_{n=1}^{N} a_n$ and
• $\prod_{n=1}^{\infty} a_n = \prod_{n\in\mathbb{N}} a_n = \lim_{N\to\infty}\prod_{n=1}^{N} a_n$.
A.2. For $|x| \le 0.5$, we have $e^{x - x^2} \le 1 + x \le e^x$. This is because $x - x^2 \le \ln(1 + x) \le x$, which is semi-proved by the plot in figure 24.
[Plot of $x$, $\ln(1 + x)$, $x - x^2$, and $\frac{x}{x+1}$ against $x$ for $-0.8 \le x \le 0.8$; the point $x = -0.6838$ is marked.]
Figure 24: Bounds for $\ln(1 + x)$ when $x$ is small.
A.3. Let $\alpha_i$ and $\beta_i$ be complex numbers with $|\alpha_i| \le 1$ and $|\beta_i| \le 1$. Then,
\[ \left|\prod_{i=1}^{m}\alpha_i - \prod_{i=1}^{m}\beta_i\right| \le \sum_{i=1}^{m}|\alpha_i - \beta_i|. \]
In particular, $|\alpha^m - \beta^m| \le m|\alpha - \beta|$.
A.4. Consider a triangular array of real numbers $(x_{n,k})$. Suppose (i) $\sum_{k=1}^{r_n} x_{n,k} \to x$ and (ii) $\sum_{k=1}^{r_n} x_{n,k}^2 \to 0$. Then,
\[ \prod_{k=1}^{r_n}(1 + x_{n,k}) \to e^x. \]
Proof. Use (A.2).
Suppose the sum $\sum_{k=1}^{r_n}|x_{n,k}|$ converges as $n \to \infty$. Then conditions (i) and (ii) are equivalent to conditions (i) and (iii), where (iii) is the requirement that $\max_{k\in[r_n]}|x_{n,k}| \to 0$ as $n \to \infty$.
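Proof. Suppose $\sum_{k=1}^{r_n}|x_{n,k}| \to x_0$. To show that (iii) implies (ii) under (i), let $a_n = \max_{k\in[r_n]}|x_{n,k}|$. Then,
\[ 0 \le \sum_{k=1}^{r_n} x_{n,k}^2 \le a_n\sum_{k=1}^{r_n}|x_{n,k}| \to 0 \times x_0 = 0. \]
On the other hand, suppose we have (i) and (ii). Given any $\varepsilon > 0$, by (ii), $\exists n_0$ such that $\forall n \ge n_0$, $\sum_{k=1}^{r_n} x_{n,k}^2 \le \varepsilon^2$. Hence, for any $k$, $x_{n,k}^2 \le \sum_{k=1}^{r_n} x_{n,k}^2 \le \varepsilon^2$ and hence $|x_{n,k}| \le \varepsilon$, which implies $a_n \le \varepsilon$.
Note that when the $x_{n,k}$ are non-negative, condition (i) already implies that the sum $\sum_{k=1}^{r_n}|x_{n,k}|$ converges as $n \to \infty$.
A.5. Suppose $\lim_{n\to\infty} a_n = a$. Then $\lim_{n\to\infty}\left(1 - \frac{a_n}{n}\right)^n = e^{-a}$ [9, p 584].
Proof. Use (A.4) with $r_n = n$, $x_{n,k} = -\frac{a_n}{n}$. Then, $\sum_{k=1}^{n} x_{n,k} = -a_n \to -a$ and $\sum_{k=1}^{n} x_{n,k}^2 = a_n^2\frac{1}{n} \to a^2 \cdot 0 = 0$.
Alternatively, from L’Hôpital’s rule, $\lim_{n\to\infty}\left(1 - \frac{a}{n}\right)^n = e^{-a}$. (See also [18, Theorem 3.31, p 64].) This gives a direct proof for the case when $a > 0$. For $n$ large enough, note that both $\left|1 - \frac{a_n}{n}\right|$ and $\left|1 - \frac{a}{n}\right|$ are $\le 1$, where we need $a > 0$ here. Applying (A.3), we get
\[ \left|\left(1 - \frac{a_n}{n}\right)^n - \left(1 - \frac{a}{n}\right)^n\right| \le n\left|\frac{a_n}{n} - \frac{a}{n}\right| = |a_n - a| \to 0. \]
For $a < 0$, we use the fact that, for $b_n \to b > 0$, (1) $\left(\left(1 + \frac{b}{n}\right)^{-1}\right)^n = \left(\left(1 + \frac{b}{n}\right)^n\right)^{-1} \to e^{-b}$ and (2) for $n$ large enough, both $\left|\left(1 + \frac{b}{n}\right)^{-1}\right|$ and $\left|\left(1 + \frac{b_n}{n}\right)^{-1}\right|$ are $\le 1$ and hence
\[ \left|\left(\left(1 + \frac{b_n}{n}\right)^{-1}\right)^n - \left(\left(1 + \frac{b}{n}\right)^{-1}\right)^n\right| \le \frac{|b_n - b|}{\left(1 + \frac{b_n}{n}\right)\left(1 + \frac{b}{n}\right)} \to 0. \]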
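A quick numerical look at A.5 can make the limit concrete. The sketch below assumes the particular sequence a_n = a + 1/n, chosen only for illustration; `power_limit` is a hypothetical helper name.

```python
import math


def power_limit(a_seq, n):
    """Evaluate (1 - a_n / n)^n for a sequence given as a function n -> a_n."""
    return (1.0 - a_seq(n) / n) ** n


if __name__ == "__main__":
    a = 2.0
    for n in (10, 1_000, 1_000_000):
        print(n, power_limit(lambda m: a + 1.0 / m, n), math.exp(-a))
    # A negative limit a = -1.5 is handled by the same expression.
    print(power_limit(lambda m: -1.5 + 1.0 / m, 1_000_000), math.exp(1.5))
```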
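A.2 Summations
A.6. Basic formulas:
(a) $\sum_{k=0}^{n} k = \frac{n(n+1)}{2}$
(b) $\sum_{k=0}^{n} k^2 = \frac{n(n+1)(2n+1)}{6} = \frac16\left(2n^3 + 3n^2 + n\right)$
(c) $\sum_{k=0}^{n} k^3 = \left(\sum_{k=0}^{n} k\right)^2 = \frac14 n^2(n+1)^2 = \frac14\left(n^4 + 2n^3 + n^2\right)$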
A nicer formula is given by
\[ \sum_{k=1}^{n} k(k+1)\cdots(k+d) = \frac{1}{d+2}\,n(n+1)\cdots(n+d+1). \tag{35} \]
A.7. Let $g(n) = \sum_{k=0}^{n} h(k)$ where $h$ is a polynomial of degree $d$. Then, $g$ is a polynomial of degree $d + 1$; when $h(0) = 0$, it has the form $g(n) = \sum_{m=1}^{d+1} a_m n^m$.
• To find the coefficients $a_m$, evaluate $g(n)$ for $n = 1, 2, \ldots, d + 1$. Note that when $h(0) = 0$, the case $n = 0$ gives $a_0 = 0$ and hence the sum starts with $m = 1$.
• Alternatively, first express $h(k)$ in terms of a summation of polynomials:
\[ h(k) = \left(\sum_{i=0}^{d-1} b_i\,k(k+1)\cdots(k+i)\right) + c. \]
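To do this, substitute $k = 0, -1, -2, \ldots, -(d-1)$.
• $k^3 = k(k+1)(k+2) - 3k(k+1) + k$
Then, to get $g(n)$, use (35).
A.8. Geometric Sums:
(a) $\sum_{i=0}^{\infty}\rho^i = \frac{1}{1-\rho}$ for $|\rho| < 1$
(b) $\sum_{i=k}^{\infty}\rho^i = \frac{\rho^k}{1-\rho}$
(c) $\sum_{i=a}^{b}\rho^i = \frac{\rho^a - \rho^{b+1}}{1-\rho}$
(d) $\sum_{i=0}^{\infty} i\rho^i = \frac{\rho}{(1-\rho)^2}$
(e) $\sum_{i=a}^{b} i\rho^i = \frac{\rho^{b+1}(b\rho - b - 1) - \rho^a(a\rho - a - \rho)}{(1-\rho)^2}$
(f) $\sum_{i=k}^{\infty} i\rho^i = \frac{k\rho^k}{1-\rho} + \frac{\rho^{k+1}}{(1-\rho)^2}$
(g) $\sum_{i=0}^{\infty} i^2\rho^i = \frac{\rho + \rho^2}{(1-\rho)^3}$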
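The closed forms in A.8 are easy to mistype, so a brute-force comparison is a useful sanity check. The sketch below covers (c), (e), (f), and (g); the function names are illustrative, and the infinite sums are truncated at a large index, which is an approximation.

```python
def geometric_finite(rho, a, b):
    """Closed form (c): sum of rho^i for i = a..b."""
    return (rho**a - rho ** (b + 1)) / (1 - rho)


def weighted_finite(rho, a, b):
    """Closed form (e): sum of i * rho^i for i = a..b."""
    num = rho ** (b + 1) * (b * rho - b - 1) - rho**a * (a * rho - a - rho)
    return num / (1 - rho) ** 2


def weighted_tail(rho, k):
    """Closed form (f): sum of i * rho^i for i = k..infinity, |rho| < 1."""
    return k * rho**k / (1 - rho) + rho ** (k + 1) / (1 - rho) ** 2


def squared_weighted(rho):
    """Closed form (g): sum of i^2 * rho^i for i = 0..infinity, |rho| < 1."""
    return (rho + rho**2) / (1 - rho) ** 3


if __name__ == "__main__":
    rho, a, b = 0.3, 2, 7
    print(geometric_finite(rho, a, b), sum(rho**i for i in range(a, b + 1)))
    print(weighted_finite(rho, a, b), sum(i * rho**i for i in range(a, b + 1)))
    print(weighted_tail(rho, 3), sum(i * rho**i for i in range(3, 400)))
    print(squared_weighted(rho), sum(i * i * rho**i for i in range(400)))
```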
A.9. Double Sums:
(a) $\left(\sum_{i=1}^{n} a_i\right)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j$
(b) $\sum_{j=1}^{\infty}\sum_{i=j}^{\infty} f(i,j) = \sum_{i=1}^{\infty}\sum_{j=1}^{i} f(i,j) = \sum_{(i,j)} 1[i \ge j]\,f(i,j)$
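A.10. Exponential Sums:
• $e^\lambda = \sum_{k=0}^{\infty}\frac{\lambda^k}{k!} = 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \ldots$
• $\lambda e^\lambda + e^\lambda = 1 + 2\lambda + 3\frac{\lambda^2}{2!} + 4\frac{\lambda^3}{3!} + \ldots = \sum_{k=1}^{\infty} k\,\frac{\lambda^{k-1}}{(k-1)!} = \sum_{k=0}^{\infty}(k+1)\frac{\lambda^k}{k!}$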
A.11. Zeta function $\xi(s)$ is defined for any complex number $s$ with $\operatorname{Re}\{s\} > 1$ by the Dirichlet series: $\xi(s) = \sum_{n=1}^{\infty}\frac{1}{n^s}$.
• For real-valued nonnegative $x$
(a) $\xi(x)$ converges for $x > 1$
(b) $\xi(x)$ diverges for $0 < x \le 1$
[9, Q2.48 p 105].
• $\xi(1) = \infty$ corresponds to the harmonic series.
A.12. Abel’s theorem: Let $a = (a_i : i \in \mathbb{N})$ be any sequence of real or complex numbers and let
\[ G_a(z) = \sum_{i=0}^{\infty} a_i z^i \]
be the power series with coefficients $a$. Suppose that the series $\sum_{i=0}^{\infty} a_i$ converges. Then,
\[ \lim_{z\to1^-} G_a(z) = \sum_{i=0}^{\infty} a_i. \tag{36} \]
In the special case where all the coefficients $a_i$ are nonnegative real numbers, the above formula (36) holds also when the series $\sum_{i=0}^{\infty} a_i$ does not converge; i.e., in that case both sides of the formula equal $+\infty$.
A.3 Derivatives
A.13. Basic Formulas
(a) $\frac{d}{dx}a^u = a^u\ln a\,\frac{du}{dx}$
(b) $\frac{d}{dx}\log_a u = \frac{\log_a e}{u}\frac{du}{dx}$, $a \ne 0, 1$
(c) Derivatives of the products: Suppose $f(x) = g(x)h(x)$; then
\[ f^{(n)}(x) = \sum_{k=0}^{n}\binom{n}{k}g^{(n-k)}(x)\,h^{(k)}(x). \]
In fact,
\[ \frac{d^n}{dt^n}\prod_{i=1}^{r} f_i(t) = \sum_{n_1 + \cdots + n_r = n}\frac{n!}{n_1!\,n_2!\cdots n_r!}\prod_{i=1}^{r}\frac{d^{n_i}}{dt^{n_i}}f_i(t). \]
Definition A.14 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let $g$ be a function from a subset $D$ of $\mathbb{R}^n$ to $\mathbb{R}^m$. If $g$ is differentiable at $z \in D$, then all partial derivatives exist at $z$ and the Jacobian matrix of $g$ at a point $z \in D$ is
\[ dg(z) = \begin{pmatrix} \frac{\partial g_1}{\partial x_1}(z) & \cdots & \frac{\partial g_1}{\partial x_n}(z) \\ \vdots & \ddots & \vdots \\ \frac{\partial g_m}{\partial x_1}(z) & \cdots & \frac{\partial g_m}{\partial x_n}(z) \end{pmatrix} = \left(\frac{\partial g}{\partial x_1}(z), \ldots, \frac{\partial g}{\partial x_n}(z)\right). \]
Alternative notations for the Jacobian matrix are $J$, $\frac{\partial(g_1, \ldots, g_n)}{\partial(x_1, \ldots, x_n)}$ [7, p 242], and $J_g(x)$, where it is assumed that the Jacobian matrix is evaluated at $z = x = (x_1, \ldots, x_n)$.
• Let $g : D \to \mathbb{R}^m$ with open $D \subset \mathbb{R}^n$, $z \in D$. If $\forall k\ \forall j$ the partial derivatives $\frac{\partial g_k}{\partial x_j}$ exist in a neighborhood of $z$ and are continuous at $z$, then $g$ is differentiable at $z$, and $dg$ is continuous at $z$.
• Let $A$ be an $n$-dimensional “box” defined by the corners $x$ and $x + \Delta x$. The “volume” of the image $g(A)$ is approximately $\left(\prod_i \Delta x_i\right)|\det dg(x)|$. Hence, the magnitude of the Jacobian determinant gives the ratios (scaling factor) of $n$-dimensional volumes (contents). In other words,
\[ dy_1\cdots dy_n = \left|\frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)}\right|dx_1\cdots dx_n. \]
• $d(g^{-1}(y))$ is the Jacobian of the inverse transformation.
• In MATLAB, use jacobian.
• Change of variable: Let $g$ be a continuously differentiable map of the open set $U$ onto $V$. Suppose that $g$ is one-to-one and that $\det(dg(x)) \ne 0$ for all $x$. Then
\[ \int_U h(g(x))\,|\det(dg(x))|\,dx = \int_V h(y)\,dy. \]
Definition A.15. The gradient (or gradient vector field) of a scalar function $f(\theta)$ with respect to a vector variable $\theta = (\theta_1, \ldots, \theta_n)$ is
\[ \nabla_\theta f(\theta) = \begin{pmatrix} \frac{\partial f}{\partial\theta_1} \\ \vdots \\ \frac{\partial f}{\partial\theta_n} \end{pmatrix}. \]
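If the argument is a row vector, then
\[ \nabla_\theta f^T(\theta) = \begin{pmatrix} \frac{\partial f_1}{\partial\theta_1} & \frac{\partial f_2}{\partial\theta_1} & \cdots & \frac{\partial f_m}{\partial\theta_1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial\theta_n} & \frac{\partial f_2}{\partial\theta_n} & \cdots & \frac{\partial f_m}{\partial\theta_n} \end{pmatrix}. \]
Definition A.16. Given a scalar-valued function $f$, the Hessian matrix is the square matrix of second partial derivatives
\[ \nabla_\theta^2 f(\theta) = \nabla_\theta\left(\nabla_\theta f(\theta)\right)^T = \begin{pmatrix} \frac{\partial^2 f}{\partial\theta_1^2} & \cdots & \frac{\partial^2 f}{\partial\theta_1\partial\theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial\theta_n\partial\theta_1} & \cdots & \frac{\partial^2 f}{\partial\theta_n^2} \end{pmatrix}. \]
It is symmetric for nice functions.
A.17. $\nabla_x f^T(x) = (df(x))^T$.
A.18. Let $f, g : \Omega \to \mathbb{R}^m$, $\Omega \subset \mathbb{R}^n$, and $h(x) = (f(x), g(x)) : \Omega \to \mathbb{R}$. Then
\[ dh(x) = (f(x))^T dg(x) + (g(x))^T df(x). \]
• For an $n \times n$ matrix $A$, let $f(x) = (Ax, x)$. Then $df(x) = (Ax)^T I + x^T A = x^T A^T + x^T A$.
◦ If $A$ is symmetric, then $df(x) = 2x^T A$. So, $\frac{\partial}{\partial x_j}(Ax, x) = 2(Ax)_j$.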
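The derivative of the quadratic form in A.18 can be checked against central finite differences. The sketch below uses plain nested lists for matrices; the function names and the step size h are choices made for the illustration.

```python
def quad_form(a, x):
    """Inner product (Ax, x) = sum_i sum_j A[i][j] * x[j] * x[i]."""
    n = len(x)
    return sum(a[i][j] * x[j] * x[i] for i in range(n) for j in range(n))


def gradient_formula(a, x):
    """Transpose of df(x) = x^T A^T + x^T A, i.e. (A + A^T) x."""
    n = len(x)
    return [sum((a[i][j] + a[j][i]) * x[j] for j in range(n)) for i in range(n)]


def gradient_numeric(a, x, h=1e-6):
    """Central-difference approximation of each partial derivative."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((quad_form(a, xp) - quad_form(a, xm)) / (2 * h))
    return grad


if __name__ == "__main__":
    a = [[2.0, 1.0, 0.0], [-1.0, 3.0, 0.5], [4.0, 0.0, 1.0]]  # not symmetric
    x = [1.0, -2.0, 0.5]
    print(gradient_formula(a, x))
    print(gradient_numeric(a, x))
```

For a symmetric matrix the formula collapses to 2Ax, matching the last line of A.18.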
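A.19. Chain rule: If $f$ is differentiable at $y$ and $g$ is differentiable at $z = f(y)$, then $g \circ f$ is differentiable at $y$ and
\[ \underbrace{d(g \circ f)(y)}_{p \times n} = \underbrace{dg(z)}_{p \times m}\,\underbrace{df(y)}_{m \times n} \]
(matrix multiplication).
• In particular,
\[ \frac{d}{dt}g(x(t), y(t), z(t)) = \left.\left(\frac{\partial}{\partial x}g(x,y,z)\frac{d}{dt}x(t) + \frac{\partial}{\partial y}g(x,y,z)\frac{d}{dt}y(t) + \frac{\partial}{\partial z}g(x,y,z)\frac{d}{dt}z(t)\right)\right|_{(x,y,z) = (x(t),y(t),z(t))}. \]
A.20. Let $f : D \to \mathbb{R}^m$ where $D \subset \mathbb{R}^n$ is open and connected (so arcwise-connected). Then, $df(x) = 0$ $\forall x \in D \Rightarrow f$ is constant.
A.21. If $f$ is differentiable at $y$, then all partial and directional derivatives exist at $y$ and
\[ df(y) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(y) & \cdots & \frac{\partial f_1}{\partial x_n}(y) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(y) & \cdots & \frac{\partial f_m}{\partial x_n}(y) \end{pmatrix} = \begin{pmatrix} (\nabla f_1(y))^T \\ \vdots \\ (\nabla f_m(y))^T \end{pmatrix} = \left(\frac{\partial f}{\partial x_1}(y), \ldots, \frac{\partial f}{\partial x_n}(y)\right). \]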
A.22. Inversion (Mapping) Theorem: Let $\Omega \subset \mathbb{R}^n$ be open, let $f : \Omega \to \mathbb{R}^n$ be $C^1$, and let $c \in \Omega$. If the $n \times n$ matrix $df(c)$ is bijective, then there exists an open neighborhood $U$ of $c$ such that
(a) $V = f(U)$ is an open neighborhood of $f(c)$.
(b) $f|_U : U \to V$ is bijective.
(c) $g = (f|_U)^{-1} : V \to U$ is $C^1$.
(d) $\forall y \in V$, $dg(y) = [df(g(y))]^{-1}$.
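\[
\begin{array}{ll}
\text{Gradient} & \text{Differential} \\
\nabla_x x^T = I & d(x) = I \\
\nabla_x \|x\|^2 = 2x & d\|x\|^2 = 2x^T \\
\nabla_x (Ax + b)^T = A^T & d(Ax + b) = A \\
\nabla_x \left(a^T x\right) = a & d\left(a^T x\right) = a^T \\
\nabla_x \left(f^T(x) g(x)\right) = (dg(x))^T f(x) + (df(x))^T g(x) & d\left(f^T(x) g(x)\right) = f^T(x)\, dg(x) + g^T(x)\, df(x) \\
\qquad = \nabla_x g^T(x)\, f(x) + \nabla_x f^T(x)\, g(x) & \\
\text{For symmetric } Q,\ \nabla_x \left(f^T(x) Q f(x)\right) = 2 (df(x))^T Q f(x) & \text{For symmetric } Q,\ d\left(f^T(x) Q f(x)\right) = f^T(x) Q\, df(x) + f^T(x) Q^T df(x) \\
\qquad = 2 \nabla_x f^T(x)\, Q f(x) & \qquad = 2 f^T(x) Q\, df(x) \\
\nabla_x \left(x^T Q x\right) = 2Qx & d\left(x^T Q x\right) = 2x^T Q \\
\nabla_x \|f(x)\|^2 = 2 \nabla_x f^T(x)\, f(x) & d\|f(x)\|^2 = 2 f^T(x)\, df(x)
\end{array}
\]

A.4 Integration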
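A.23. Basic Formulas
(a) $\int a^u\,du = \frac{a^u}{\ln a}$, $a > 0$, $a \neq 1$.
(b) $\int_0^1 t^\alpha\,dt = \begin{cases} \frac{1}{\alpha+1}, & \alpha > -1 \\ \infty, & \alpha \le -1 \end{cases}$ and $\int_1^\infty t^\alpha\,dt = \begin{cases} -\frac{1}{\alpha+1}, & \alpha < -1 \\ \infty, & \alpha \ge -1. \end{cases}$ So, the integration of the function $\frac{1}{t}$ is the test case. In fact, $\int_0^1 \frac{1}{t}\,dt = \int_1^\infty \frac{1}{t}\,dt = \infty$.
(c) $\int x^m \ln x\,dx = \begin{cases} \frac{x^{m+1}}{m+1} \left( \ln x - \frac{1}{m+1} \right), & m \neq -1 \\ \frac{1}{2} \ln^2 x, & m = -1. \end{cases}$
A.24. Integration by Parts:
\[ \int u\,dv = uv - \int v\,du \]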
[Figure 25: Plots of $t^\alpha$.]
(a) Basic idea: Start with an integral of the form $\int f(x) g(x)\,dx$. Match this with an integral of the form $\int u\,dv$ by choosing $dv$ to be part of the integrand including $dx$ and possibly $f(x)$ or $g(x)$.
(b) In particular, repeated application of integration by parts gives
\[ \int f(x) g(x)\,dx = f(x) G_1(x) + \sum_{i=1}^{n-1} (-1)^i f^{(i)}(x) G_{i+1}(x) + (-1)^n \int f^{(n)}(x) G_n(x)\,dx \tag{37} \]
where $f^{(i)}(x) = \frac{d^i f}{dx^i}(x)$, $G_1(x) = \int g(x)\,dx$, and $G_{i+1}(x) = \int G_i(x)\,dx$. Figure 26 can be used to derive (37).
[Figure 26: Integration by Parts. A two-column table: the "Differentiate" column lists $f(x), f^{(1)}(x), \dots, f^{(n)}(x)$; the "Integrate" column lists $g(x), G_1(x), G_2(x), \dots, G_n(x)$; the products are combined with alternating signs.]
To see this, note that
\[ \int f(x) g(x)\,dx = f(x) G_1(x) - \int f'(x) G_1(x)\,dx, \]
and
\[ \int f^{(n)}(x) G_n(x)\,dx = f^{(n)}(x) G_{n+1}(x) - \int f^{(n+1)}(x) G_{n+1}(x)\,dx. \]
[Figure 27: Examples of Integration by Parts using Figure 26: $\int x^2 e^{3x}\,dx = \left( \frac{1}{3} x^2 - \frac{2}{9} x + \frac{2}{27} \right) e^{3x}$ and $\int \sin x\, e^x\,dx = \frac{1}{2} (\sin x - \cos x)\, e^x$.]
(c) $\int x^n e^{ax}\,dx = \frac{x^n e^{ax}}{a} - \frac{n}{a} \int x^{n-1} e^{ax}\,dx$
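The repeated integration by parts in (37) terminates when $f$ is a polynomial, since its derivatives eventually vanish. The Python sketch below applies it to $\int p(x) e^{ax}\,dx$ with $g(x) = e^{ax}$, so that $G_i(x) = e^{ax}/a^i$. The function name `integrate_poly_times_exp` and the coefficient-list representation are assumptions made for illustration. Exact rational arithmetic is used so the result can be compared directly with the $x^2 e^{3x}$ example of Figure 27.

```python
from fractions import Fraction

def integrate_poly_times_exp(coeffs, a):
    """Return the coefficients of P(x) such that the antiderivative of
    p(x) * exp(a*x) is P(x) * exp(a*x) (plus a constant).

    coeffs[i] is the coefficient of x**i in p(x); a must be nonzero.
    Implements (37) with f = p and G_i(x) = exp(a*x) / a**i.
    """
    a = Fraction(a)
    deriv = [Fraction(c) for c in coeffs]   # coefficients of f^(i), starting at i = 0
    result = [Fraction(0)] * len(coeffs)
    sign = 1
    power = a                               # a**(i+1), the divisor in G_{i+1}
    while deriv:
        for i, c in enumerate(deriv):
            result[i] += sign * c / power
        # differentiate: d/dx sum c_i x^i = sum i * c_i x^(i-1)
        deriv = [i * c for i, c in enumerate(deriv)][1:]
        sign = -sign
        power *= a
    return result

# p(x) = x^2, a = 3: coefficients of x^0, x^1, x^2 in P(x)
example = integrate_poly_times_exp([0, 0, 1], 3)
print(example)
```

Each pass of the loop corresponds to one row of the differentiate/integrate table: the current derivative is multiplied by the next repeated integral and added with alternating sign.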
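A.25. If $n$ is a positive integer,
\[ \int x^n e^{ax}\,dx = \frac{e^{ax}}{a} \sum_{k=0}^{n} \frac{(-1)^k n!}{a^k (n-k)!} x^{n-k}. \]
(a) $n = 1$: $\frac{e^{ax}}{a} \left( x - \frac{1}{a} \right)$
(b) $n = 2$: $\frac{e^{ax}}{a} \left( x^2 - \frac{2}{a} x + \frac{2}{a^2} \right)$
(c) $\int_0^t x^n e^{ax}\,dx = \frac{e^{at}}{a} \sum_{k=0}^{n} \frac{(-1)^k n!}{a^k (n-k)!} t^{n-k} - \frac{(-1)^n n!}{a^{n+1}} = \frac{e^{at}}{a} \sum_{k=0}^{n} \frac{(-1)^k n!}{a^k (n-k)!} t^{n-k} + \frac{n!}{(-a)^{n+1}}$
(d) $\int_t^\infty x^n e^{-ax}\,dx = \frac{e^{-at}}{a} \sum_{k=0}^{n} \frac{n!}{a^k (n-k)!} t^{n-k} = \frac{e^{-at}}{a} \sum_{j=0}^{n} \frac{n!}{a^{n-j} j!} t^j$, $a > 0$.
(e) $\int_0^\infty x^n e^{ax}\,dx = \frac{n!}{(-a)^{n+1}}$, $a < 0$. (See also Gamma function)
• $n! = \int_0^\infty e^{-t} t^n\,dt$.
• In MATLAB, consider using gamma(n+1) instead of factorial(n). Note also that gamma() allows vector input.
(f) $\int_0^1 x^\beta e^{-x}\,dx$ is finite if and only if $\beta > -1$. Note that
\[ \frac{1}{e} \int_0^1 x^\beta\,dx \le \int_0^1 x^\beta e^{-x}\,dx \le \int_0^1 x^\beta\,dx. \]
(g) $\forall \beta \in \mathbb{R}$, $\int_1^\infty x^\beta e^{-x}\,dx < \infty$.
For $\beta \le 0$,
\[ \int_1^\infty x^\beta e^{-x}\,dx \le \int_1^\infty e^{-x}\,dx < \int_0^\infty e^{-x}\,dx = 1. \]
For $\beta > 0$,
\[ \int_1^\infty x^\beta e^{-x}\,dx \le \int_1^\infty x^{\lceil \beta \rceil} e^{-x}\,dx \le \int_0^\infty x^{\lceil \beta \rceil} e^{-x}\,dx = \lceil \beta \rceil!. \]
(h) $\int_0^\infty x^\beta e^{-x}\,dx$ is finite if and only if $\beta > -1$.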
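The closed form in A.25(d) is easy to cross-check numerically. The Python sketch below is illustrative only: the names `upper_tail` and `simpson`, the truncation of the infinite range at $x = 40$, and the chosen parameters are assumptions. It compares the finite sum with a composite Simpson approximation and, at $t = 0$, with $\Gamma(n+1)/a^{n+1}$.

```python
import math

def upper_tail(n, a, t):
    """Closed form of the integral of x^n e^{-a x} over [t, infinity), for a > 0."""
    s = sum(math.factorial(n) / (a**k * math.factorial(n - k)) * t**(n - k)
            for k in range(n + 1))
    return math.exp(-a * t) / a * s

def simpson(f, lo, hi, steps=20000):
    """Composite Simpson rule; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

n, a, t = 3, 2.0, 1.0
tail_closed = upper_tail(n, a, t)
# The integrand is negligible beyond x = 40 for these parameters.
tail_numeric = simpson(lambda x: x**n * math.exp(-a * x), t, 40.0)
full_closed = upper_tail(n, a, 0.0)          # should equal n! / a^(n+1)
full_gamma = math.gamma(n + 1) / a**(n + 1)
print(tail_closed, tail_numeric, full_closed, full_gamma)
```

Setting $t = 0$ leaves only the $k = n$ term of the sum, which recovers $n!/a^{n+1}$, consistent with (e) after replacing $a$ by $-a$.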
A.26 (Differential of integral). Leibniz's Rule: Let $g : \mathbb{R}^2 \to \mathbb{R}$, $a : \mathbb{R} \to \mathbb{R}$, and $b : \mathbb{R} \to \mathbb{R}$ be $C^1$. Then $f(x) = \int_{a(x)}^{b(x)} g(x, y)\,dy$ is $C^1$ and
\[ f'(x) = b'(x)\, g(x, b(x)) - a'(x)\, g(x, a(x)) + \int_{a(x)}^{b(x)} \frac{\partial g}{\partial x}(x, y)\,dy. \tag{38} \]
In particular, we have
\[ \frac{d}{dx} \int_a^x f(t)\,dt = f(x), \tag{39} \]
\[ \frac{d}{dx} \int_a^{v(x)} f(t)\,dt = \frac{dv}{dx} \frac{d}{dv} \int_a^{v} f(t)\,dt = f(v(x))\, v'(x), \tag{40} \]
\[ \frac{d}{dx} \int_{u(x)}^{v(x)} f(t)\,dt = \frac{d}{dx} \left( \int_a^{v(x)} f(t)\,dt - \int_a^{u(x)} f(t)\,dt \right) = f(v(x))\, v'(x) - f(u(x))\, u'(x). \tag{41} \]
Note that (38) can be derived from (A.19) by considering $f(x) = h(a(x), b(x), x)$ where $h(a, b, c) = \int_a^b g(c, y)\,dy$. [9, p 318–319].
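Formula (38) can be verified numerically on a concrete case. The Python sketch below is an illustration, not from the original notes: the choices $g(x, y) = \sin(xy)$, $a(x) = x$, $b(x) = x^2$, the evaluation point, and all function names are assumptions. It compares the right-hand side of (38) with a central-difference derivative of $f$.

```python
import math

def simpson(func, lo, hi, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3

def g(x, y):
    return math.sin(x * y)

def dg_dx(x, y):
    return y * math.cos(x * y)       # partial derivative of g with respect to x

def a(x):
    return x

def b(x):
    return x * x

def da(x):
    return 1.0

def db(x):
    return 2.0 * x

def f(x):
    """f(x) = integral of g(x, y) dy from a(x) to b(x)."""
    return simpson(lambda y: g(x, y), a(x), b(x))

def leibniz_derivative(x):
    """Right-hand side of (38): boundary terms plus the integral of dg/dx."""
    boundary = db(x) * g(x, b(x)) - da(x) * g(x, a(x))
    interior = simpson(lambda y: dg_dx(x, y), a(x), b(x))
    return boundary + interior

x0, h = 1.3, 1e-5
finite_difference = (f(x0 + h) - f(x0 - h)) / (2 * h)
formula_value = leibniz_derivative(x0)
print(finite_difference, formula_value)
```

The two boundary terms account for the moving limits, and the remaining integral accounts for the dependence of the integrand itself on $x$.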
A.5 Gamma and Beta functions
A.27. Gamma function:
(a) $\Gamma(q) = \int_0^\infty x^{q-1} e^{-x}\,dx$, $q > 0$.
(b) $\Gamma(0) = \infty$
(c) $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$. $\Gamma(n+1) = n!$ if $n \in \mathbb{N} \cup \{0\}$.
(d) $0! = 1$.
(e) $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$.
(f) $\Gamma(x+1) = x\Gamma(x)$ (Integration by parts).
• This relationship is used to define the gamma function for negative numbers.
(g) $\frac{\Gamma(q)}{\alpha^q} = \int_0^\infty x^{q-1} e^{-\alpha x}\,dx$, $\alpha > 0$.
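Several of these identities can be checked with Python's standard `math.gamma`. In the sketch below, the helper `gamma_via_recurrence` is a hypothetical name introduced only to illustrate item (f): it extends the gamma function to negative non-integer arguments by repeatedly applying $\Gamma(x) = \Gamma(x+1)/x$.

```python
import math

def gamma_via_recurrence(x):
    """Gamma(x) for x > 0 or negative non-integer x, using Gamma(x) = Gamma(x + 1) / x."""
    if x > 0:
        return math.gamma(x)
    if x == int(x):
        raise ValueError("Gamma is not finite at 0, -1, -2, ...")
    return gamma_via_recurrence(x + 1) / x

# (c): Gamma(n + 1) = n! for n = 0, 1, ..., 5
factorial_checks = [(math.gamma(n + 1), math.factorial(n)) for n in range(6)]

half = math.gamma(0.5)                   # (e): should equal sqrt(pi)
neg_half = gamma_via_recurrence(-0.5)    # Gamma(0.5) / (-0.5)
print(factorial_checks, half, neg_half)
```

`math.gamma` already accepts negative non-integer inputs, so the recurrence here serves only to show how the extension in (f) works.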
A.28. The incomplete beta function is defined as
\[ B(x; a, b) = \int_0^x t^{a-1} (1-t)^{b-1}\,dt. \]
For $x = 1$, the incomplete beta function coincides with the (complete) beta function. The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the (complete) beta function:
\[ I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}. \]
• For integers $m, k$,
\[ I_x(m, k) = \sum_{j=m}^{m+k-1} \frac{(m+k-1)!}{j!\,(m+k-1-j)!}\, x^j (1-x)^{m+k-1-j}. \]
• $I_0(a, b) = 0$, $I_1(a, b) = 1$
• $I_x(a, b) = 1 - I_{1-x}(b, a)$
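The binomial-sum formula for integer arguments can be compared with the integral definition. The Python sketch below is illustrative: the function names are assumptions, and the numerical version assumes $a, b \ge 1$ so that the integrand is bounded on $[0, 1]$. It also checks the symmetry $I_x(a, b) = 1 - I_{1-x}(b, a)$.

```python
import math

def reg_inc_beta_int(x, m, k):
    """I_x(m, k) for positive integers m, k via the binomial sum."""
    n = m + k - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(m, n + 1))

def reg_inc_beta_numeric(x, a, b, steps=2000):
    """I_x(a, b) = B(x; a, b) / B(a, b), both integrals by composite Simpson.

    Assumes a >= 1 and b >= 1 so the integrand is bounded on [0, 1].
    """
    def integrand(t):
        return t**(a - 1) * (1 - t)**(b - 1)

    def simpson(lo, hi):
        h = (hi - lo) / steps
        total = integrand(lo) + integrand(hi)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * integrand(lo + i * h)
        return total * h / 3

    return simpson(0.0, x) / simpson(0.0, 1.0)

x, m, k = 0.3, 2, 3
from_sum = reg_inc_beta_int(x, m, k)
from_integral = reg_inc_beta_numeric(x, m, k)
symmetric = 1 - reg_inc_beta_int(1 - x, k, m)
print(from_sum, from_integral, symmetric)
```

The sum is the probability that a binomial random variable with $m + k - 1$ trials and success probability $x$ takes a value of at least $m$.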
References
[1] Patrick Billingsley. Probability and Measure. John Wiley & Sons, New York, 1995.
[2] George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2001.
[3] Donald G. Childers. Probability and Random Processes Using MATLAB. McGraw-Hill, 1997.
[4] Herbert A. David and H. N. Nagaraja. Order Statistics. Wiley-Interscience, 2003.
[5] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. John Wiley & Sons, 1971.
[6] William Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, 3 edition, 1968.
[7] Terrence L. Fine. Probability and Probabilistic Reasoning for Electrical Engineering. Prentice Hall, 2005.
[8] Boris Vladimirovich Gnedenko. Theory of Probability. Chelsea Pub. Co., New York, 4 edition, 1967. Translated from the Russian by B.D. Seckler.
[9] John A. Gubner. Probability and Random Processes for Electrical and Computer Engineers. Cambridge University Press, 2006.
[10] Samuel Karlin and Howard M. Taylor. A First Course in Stochastic Processes. Academic Press, 1975.
[11] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993. ISBN: 0198536933.
[12] A.N. Kolmogorov. The Foundations of Probability. 1933.
[13] Nabendu Pal, Chun Jin, and Wooi K. Lim. Handbook of Exponential and Related Distributions for Engineers and Scientists. Chapman & Hall/CRC, 2005.
[14] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill Companies, 1991.
[15] E. Parzen. Stochastic Processes. Holden Day, 1962.
[16] Sidney I. Resnick. Adventures in Stochastic Processes. Birkhäuser Boston, 1992.
[17] Kenneth H. Rosen, editor. Handbook of Discrete and Combinatorial Mathematics. CRC, 1999.
[18] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976.
[19] Gábor J. Székely. Paradoxes in Probability Theory and Mathematical Statistics. 1986.
[20] Tung. Fundamental of Probability and Statistics for Reliability Analysis. 2005.
[21] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.