MIT OpenCourseWare http://ocw.mit.edu

18.175 Theory of Probability Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Contents

1. Probability Spaces, Properties of Probability.
2. Random variables and their properties. Expectation.
3. Kolmogorov’s Theorem about consistent distributions.
4. Laws of Large Numbers.
5. Bernstein Polynomials. Hausdorff and de Finetti theorems.
6. 0-1 Laws. Convergence of random series.
7. Stopping times, Wald’s identity. Another proof of SLLN.
8. Convergence of Laws. Selection Theorem.
9. Characteristic Functions. Central Limit Theorem on R.
10. Multivariate normal distributions and CLT.
11. Lindeberg’s CLT. Levy’s Equivalence Theorem. Three Series Theorem.
12. Levy’s Continuity Theorem. Poisson Approximation. Conditional Expectation.
13. Martingales. Doob’s Decomposition. Uniform Integrability.
14. Optional stopping. Inequalities for martingales.
15. Convergence of martingales. Fundamental Wald’s identity.
16. Convergence on metric spaces. Portmanteau Theorem. Lipschitz Functions.
17. Metrics for convergence of laws. Empirical measures.
18. Convergence and uniform tightness.
19. Strassen’s Theorem. Relationships between metrics.
20. Kantorovich-Rubinstein Theorem.
21. Prekopa-Leindler inequality, entropy and concentration.
22. Stochastic Processes. Brownian Motion.
23. Donsker Invariance Principle.
24. Empirical process and Kolmogorov’s chaining.
25. Markov property of Brownian motion. Reflection principles.
26. Laws of Brownian motion at stopping times. Skorohod’s imbedding.

List of Figures

2.1  A random variable defined by quantile transformation.
2.2  σ(X) generated by X.
2.3  Pairwise independent but not independent r.v.s.
5.1  Polya urn model.
7.1  A sequence of stopping times.
8.1  Approximating indicator.
14.1 Stopping times of level crossings.
25.1 Reflecting the Brownian motion.

List of Tables

Section 1

Probability Spaces, Properties of Probability.

A pair (Ω, A) is a measurable space if A is a σ-algebra of subsets of Ω. A collection A of subsets of Ω is an algebra (ring) if:

1. Ω ∈ A.
2. C, B ∈ A =⇒ C ∩ B, C ∪ B ∈ A.
3. B ∈ A =⇒ Ω \ B ∈ A.
4. A is a σ-algebra if, in addition, Ci ∈ A, ∀i ≥ 1 =⇒ ∪_{i≥1} Ci ∈ A.

(Ω, A, P) is a probability space if P is a probability measure on A, i.e.

1. P(Ω) = 1.
2. P(A) ≥ 0, A ∈ A.
3. P is countably additive: Ai ∈ A, ∀i ≥ 1, Ai ∩ Aj = ∅ for all i ≠ j =⇒ P(∪_{i≥1} Ai) = ∑_{i≥1} P(Ai).

An equivalent formulation of Property 3 is:

3′. P is a finitely additive measure and Bn ⊇ Bn+1, ∩_{n≥1} Bn = B =⇒ P(B) = lim_n P(Bn).

Lemma 1 Properties 3 and 3′ are equivalent.

Proof.

3 =⇒ 3′: Let Cn = Bn \ Bn+1. Then Bn = B ∪ (∪_{k≥n} Ck), with all these sets disjoint. By 3,

P(Bn) = P(B) + ∑_{k≥n} P(Ck) → P(B) as n → ∞.

3′ =⇒ 3: We can write

∪_{i≥1} Ai = A1 ∪ A2 ∪ · · · ∪ An ∪ (∪_{i>n} Ai),

so that

P(∪_{i≥1} Ai) = P(A1) + · · · + P(An) + P(Bn), where Bn = ∪_{i>n} Ai.

Since Bn ⊇ Bn+1, we have P(Bn) → P(∩_{n≥1} Bn) = P(∅) = 0 because the Ai’s are disjoint.

Given an algebra A, let σ(A) denote the σ-algebra generated by A, i.e. the intersection of all σ-algebras that contain A. It is easy to see that the intersection of all such σ-algebras is itself a σ-algebra. Indeed, consider a sequence Ai for i ≥ 1 such that each Ai belongs to all σ-algebras that contain A. Then ∪_{i≥1} Ai belongs to all these σ-algebras and therefore to their intersection. Let us recall an important result from measure theory.

Theorem 1 (Caratheodory extension) If A is an algebra of sets and µ : A → R is a non-negative countably additive function on A, then µ can be extended to a measure on the σ-algebra σ(A). If µ is σ-finite, then this extension is unique. (σ-finite means that Ω = ∪_{i≥1} Ai for a disjoint sequence Ai with µ(Ai) < ∞.)

Example. Let A be the algebra of sets ∪_{i≤n} [ai, bi), where the intervals [ai, bi) are disjoint and n ≥ 1. Let

λ(∪_{i≤n} [ai, bi)) = ∑_{i=1}^n |bi − ai|.

One can prove that λ is countably additive on A and therefore can be extended to the Lebesgue measure λ on the σ-algebra σ(A) of Borel-measurable sets.

Lemma 2 (Approximation property) If A is an algebra of sets then for any B ∈ σ(A) there exists a sequence Bn ∈ A such that P(B △ Bn) → 0.

Remark. Here △ denotes the symmetric difference (B ∪ Bn) \ (B ∩ Bn). The lemma states that any B in σ(A) can be approximated by elements of A.

Proof. Let D = {B ∈ σ(A) : ∃ Bn ∈ A, P(B △ Bn) → 0}. We will prove that D is a σ-algebra, and since A ⊆ D this will imply that σ(A) ⊆ D. One can easily check that d(B, C) := P(B △ C) is a metric. It is also easy to check that

1. d(B ∩ C, D ∩ E) ≤ d(B, D) + d(C, E),
2. |P(B) − P(C)| ≤ d(B, C),
3. d(B^c, C^c) = d(B, C).

Consider D1, D2, . . . ∈ D. If for each i a sequence Cij ∈ A, j ≥ 1, approximates Di, i.e. P(Cij △ Di) → 0 as j → ∞, then by properties 1 - 3, C^n_j := ∪_{i≤n} Cij approximates D^n := ∪_{i≤n} Di, which means that D^n ∈ D. Now let D = ∪_{i≥1} Di. Then P(D) = P(D^n) + P(D \ D^n), and by continuity of measure P(D \ D^n) → 0 as n → ∞, so D is approximated by the sets D^n and hence by elements of A. Therefore D ∈ D and, together with closure under complements given by property 3, this shows that D is a σ-algebra.

Section 2

Random variables and their properties. Expectation.

Let (Ω, A, P) be a probability space and (S, B) be a measurable space, where B is a σ-algebra of subsets of S. A random variable X : Ω → S is a measurable function, i.e. B ∈ B =⇒ X⁻¹(B) ∈ A. When S = R we will usually consider the σ-algebra B of Borel measurable sets generated by the sets ∪_{i≤n} (ai, bi] (or, equivalently, generated by the sets (ai, bi) or by open sets).

Lemma 3 X : Ω → R is a random variable iff for all t ∈ R,

{X ≤ t} := {ω ∈ Ω : X(ω) ∈ (−∞, t]} ∈ A.

Proof. Only the ⇐ direction requires proof. We will prove that D = {D ⊆ R : X⁻¹(D) ∈ A} is a σ-algebra. Since the sets (−∞, t] ∈ D, this will imply that B ⊆ D. The result follows simply because taking the pre-image preserves set operations. For example, if we consider a sequence Di ∈ D for i ≥ 1 then

X⁻¹(∪_{i≥1} Di) = ∪_{i≥1} X⁻¹(Di) ∈ A

because X⁻¹(Di) ∈ A and A is a σ-algebra. Therefore, ∪_{i≥1} Di ∈ D. Other properties can be checked similarly, so D is a σ-algebra.

Let us define a measure PX on B by PX = P ◦ X⁻¹, i.e. for B ∈ B,

PX(B) = P(X ∈ B) = P(X⁻¹(B)) = P ◦ X⁻¹(B).

(S, B, PX) is called the sample space of the random variable X and PX is called the law of X. Clearly, on this space the random variable ξ : S → S defined by the identity ξ(s) = s has the same law as X. When S = R, the function F(t) = P(X ≤ t) is called the cumulative distribution function (c.d.f.) of X.

Lemma 4 F is a c.d.f. of some r.v. X iff

1. 0 ≤ F(t) ≤ 1,
2. F is non-decreasing and right-continuous,
3. lim_{t→−∞} F(t) = 0, lim_{t→+∞} F(t) = 1.

Proof. The fact that any c.d.f. satisfies properties 1 - 3 is obvious. Let us show that an F which satisfies properties 1 - 3 is the c.d.f. of some r.v. X. Consider the algebra A consisting of the sets ∪_{i≤n} (ai, bi] for disjoint intervals and all n ≥ 1. Let us define a function P on A by

P(∪_{i≤n} (ai, bi]) = ∑_{i≤n} (F(bi) − F(ai)).

One can show that P is countably additive on A. Then, by the Caratheodory extension Theorem 1, P extends uniquely to a measure P on σ(A) = B, the Borel measurable sets. This means that (R, B, P) is a probability space and, clearly, the random variable X : R → R defined by X(x) = x has c.d.f. P(X ≤ t) = F(t). Below we will sometimes abuse notation and let F denote both the c.d.f. and the probability measure P.

Alternative proof. Consider the probability space ([0, 1], B, λ), where λ is the Lebesgue measure. Define a r.v. X : [0, 1] → R by the quantile transformation

X(t) = inf{x ∈ R : F(x) ≥ t}.

The c.d.f. of X is λ(t : X(t) ≤ a) = F(a), since

X(t) ≤ a ⇐⇒ inf{x : F(x) ≥ t} ≤ a ⇐⇒ ∃ an → a, F(an) ≥ t ⇐⇒ F(a) ≥ t.

Figure 2.1: A random variable defined by quantile transformation.
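As a quick numerical illustration of the quantile transformation (a sketch in Python/NumPy, not part of the original notes; the Exponential(1) c.d.f. is chosen only for the example), applying the generalized inverse of F to uniform samples produces samples whose empirical c.d.f. matches F.

```python
import numpy as np

# Quantile transformation: X(t) = inf{x : F(x) >= t} applied to t ~ Uniform[0, 1].
# For the Exponential(1) c.d.f. F(x) = 1 - exp(-x) the quantile function is
# F^{-1}(t) = -log(1 - t).
rng = np.random.default_rng(0)
t = rng.uniform(size=200_000)
x = -np.log(1.0 - t)

# The empirical c.d.f. of the transformed sample should match F at any point a.
for a in (0.5, 1.0, 2.0):
    print(a, (x <= a).mean(), 1.0 - np.exp(-a))
```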

Definition. Given a probability space (Ω, A, P) and a r.v. X : Ω → S, let σ(X) be the σ-algebra generated by the collection of sets {X⁻¹(B) : B ∈ B}. Clearly, σ(X) ⊆ A. Moreover, the above collection of sets is itself a σ-algebra. Indeed, consider a sequence Ai = X⁻¹(Bi) for some Bi ∈ B. Then

∪_{i≥1} Ai = ∪_{i≥1} X⁻¹(Bi) = X⁻¹(∪_{i≥1} Bi) = X⁻¹(B),

where B = ∪_{i≥1} Bi ∈ B. σ(X) is called the σ-algebra generated by the r.v. X.

Figure 2.2: σ(X) generated by X.

Example. Consider the r.v. X in Figure 2.2, defined on ([0, 1], B, λ) and taking the value 0 on [0, 1/2) and the value 1 on [1/2, 1]. We have P(X = 0) = 1/2, P(X = 1) = 1/2 and

σ(X) = {∅, [0, 1/2), [1/2, 1], [0, 1]}.

Lemma 5 Consider a probability space (Ω, A, P), a measurable space (S, B) and random variables X : Ω → S and Y : Ω → R. Then the following are equivalent:

1. Y = g(X) for some (Borel) measurable function g : S → R.
2. Y : Ω → R is measurable on (Ω, σ(X)), i.e. with respect to the σ-algebra generated by X.

Remark. It should be obvious from the proof that R can be replaced by any separable metric space.

Proof. The fact that 1 implies 2 is obvious since for any Borel set B ⊆ R the set B′ := g⁻¹(B) ∈ B and, therefore,

{Y = g(X) ∈ B} = {X ∈ g⁻¹(B) = B′} = X⁻¹(B′) ∈ σ(X).

Let us show that 2 implies 1. For all integers n and k consider the sets

An,k = {ω : Y(ω) ∈ [k/2ⁿ, (k+1)/2ⁿ)} = Y⁻¹([k/2ⁿ, (k+1)/2ⁿ)).

By 2, An,k ∈ σ(X) = {X⁻¹(B) : B ∈ B} and, therefore, An,k = X⁻¹(Bn,k) for some Bn,k ∈ B. Let us consider the function

gn(X) = ∑_{k∈Z} (k/2ⁿ) I(X ∈ Bn,k).

By construction, |Y − gn(X)| ≤ 1/2ⁿ since

Y(ω) ∈ [k/2ⁿ, (k+1)/2ⁿ) ⇐⇒ X(ω) ∈ Bn,k ⇐⇒ gn(X(ω)) = k/2ⁿ.

It is easy to see that gn(x) ≤ gn+1(x) and, therefore, g(x) = lim_{n→∞} gn(x) is a measurable function on (S, B) and, clearly, Y = g(X).

Discrete random variables. A r.v. X : Ω → S is called discrete if PX({Si}_{i≥1}) = 1 for some sequence Si ∈ S.

Absolutely continuous random variables. On a measure space (S, B), a measure P is called absolutely continuous w.r.t. a measure λ if for all B ∈ B, λ(B) = 0 =⇒ P(B) = 0. The following is a well known result from measure theory.

Theorem 2 (Radon-Nikodym) If P and λ are σ-finite and P is absolutely continuous w.r.t. λ then there exists a Radon-Nikodym derivative f ≥ 0 such that for all B ∈ B

P(B) = ∫_B f(s) dλ(s).

f is uniquely defined up to λ-null sets. In the typical setting of S = R^k, a probability measure P and Lebesgue measure λ, f is called the density of the distribution P.

Independence. Consider a probability space (Ω, C, P) and two σ-algebras A, B ⊆ C. A and B are called independent if P(A ∩ B) = P(A)P(B) for all A ∈ A, B ∈ B.

σ-algebras Ai ⊆ C for i ≤ n are independent if P(A1 ∩ · · · ∩ An) = ∏_{i≤n} P(Ai) for all Ai ∈ Ai.

σ-algebras Ai ⊆ C for i ≤ n are pairwise independent if P(Ai ∩ Aj) = P(Ai)P(Aj) for all Ai ∈ Ai, Aj ∈ Aj, i ≠ j.

Random variables Xi : Ω → S for i ≤ n are (pairwise) independent if the σ-algebras σ(Xi), i ≤ n, are (pairwise) independent, which is just another convenient way to state the familiar

P(X1 ∈ B1, . . . , Xn ∈ Bn) = P(X1 ∈ B1) × . . . × P(Xn ∈ Bn)

for any events B1, . . . , Bn ∈ B.

Example. Consider a regular tetrahedron die (Figure 2.3) with a red, a green and a blue side and a red-green-blue base. If we roll this die then the indicators of the different colors provide an example of pairwise independent r.v.s that are not independent, since

P(r) = P(b) = P(g) = 1/2 and P(rb) = P(rg) = P(bg) = 1/4,

but

P(rbg) = 1/4 ≠ P(r)P(b)P(g) = (1/2)³.

Figure 2.3: Pairwise independent but not independent r.v.s.
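The tetrahedron example can be checked by direct enumeration. The following sketch (plain Python, not part of the notes; it simply encodes the four equally likely faces described above) verifies pairwise but not mutual independence.

```python
# Four equally likely faces: three single-color faces and one with all three colors.
faces = ["r", "g", "b", "rgb"]
prob = {f: 0.25 for f in faces}

def p(colors):
    # probability that every color in `colors` appears on the rolled face
    return sum(prob[f] for f in faces if all(c in f for c in colors))

print(p("r"), p("g"), p("b"))               # each 0.5
print(p("rg"), p("rb"), p("gb"))            # each 0.25 = 0.5 * 0.5 (pairwise independent)
print(p("rgb"), p("r") * p("g") * p("b"))   # 0.25 vs 0.125 (not mutually independent)
```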

Independence of σ-algebras can be checked on generating algebras:

Lemma 6 If algebras Ai, i ≤ n, are independent then the σ-algebras σ(Ai) are independent.

Proof. Obvious by the Approximation Lemma 2.

Lemma 7 Consider r.v.s Xi : Ω → R on a probability space (Ω, A, P).

1. The Xi’s are independent iff

P(X1 ≤ t1, . . . , Xn ≤ tn) = P(X1 ≤ t1) × . . . × P(Xn ≤ tn).   (2.0.1)

2. If the laws of the Xi’s have densities fi(x) then the Xi’s are independent iff a joint density exists and

f(x1, ..., xn) = ∏_{i≤n} fi(xi).

Proof. 1 is obvious by Lemma 6 because (2.0.1) implies the same equality for intervals,

P(X1 ∈ (a1, b1], . . . , Xn ∈ (an, bn]) = P(X1 ∈ (a1, b1]) × . . . × P(Xn ∈ (an, bn]),

and, therefore, for finite unions of such disjoint intervals. To check this for intervals (for example, for n = 2) we can write P(a1 < X1 ≤ b1, a2 < X2 ≤ b2) as

P(X1 ≤ b1, X2 ≤ b2) − P(X1 ≤ a1, X2 ≤ b2) − P(X1 ≤ b1, X2 ≤ a2) + P(X1 ≤ a1, X2 ≤ a2)
= P(X1 ≤ b1)P(X2 ≤ b2) − P(X1 ≤ a1)P(X2 ≤ b2) − P(X1 ≤ b1)P(X2 ≤ a2) + P(X1 ≤ a1)P(X2 ≤ a2)
= (P(X1 ≤ b1) − P(X1 ≤ a1))(P(X2 ≤ b2) − P(X2 ≤ a2)) = P(a1 < X1 ≤ b1)P(a2 < X2 ≤ b2).

To prove 2 we start with "⇐=":

P(∩{Xi ∈ Ai}) = P(X ∈ A1 × · · · × An) = ∫_{A1×···×An} ∏ fi(xi) dx = ∏ ∫_{Ai} fi(xi) dxi {by Fubini’s Theorem} = ∏_{i≤n} P(Xi ∈ Ai).

Next, we prove "=⇒". First of all, by independence,

P(X ∈ A1 × · · · × An) = ∏ P(Xi ∈ Ai) = ∫_{A1×···×An} ∏ fi(xi) dx {by Fubini}.

Therefore, the same equality holds for sets in the algebra A that consists of finite unions of disjoint sets A1 × · · · × An, i.e.

P(X ∈ B) = ∫_B ∏ fi(xi) dx for B ∈ A.

Both P(X ∈ B) and ∫_B ∏ fi(xi) dx are countably additive on A and finite,

P(X ∈ Rⁿ) = ∫_{Rⁿ} ∏ fi(xi) dx = 1.

By the Caratheodory extension Theorem 1, they extend uniquely to all Borel sets B = σ(A), so

P(X ∈ B) = ∫_B ∏ fi(xi) dx for B ∈ B.

Expectation. If X : Ω → R is a random variable on (Ω, A, P) then the expectation of X is defined as

EX = ∫_Ω X(ω) dP(ω).

In other words, expectation is just another term for the integral with respect to a probability measure and, as a result, expectation has all the usual properties of integrals. Let us emphasize some of them.

Lemma 8

1. If F is the c.d.f. of X then for any measurable function g : R → R,

Eg(X) = ∫_R g(x) dF(x).

2. If X is discrete, i.e. P(X ∈ {xi}_{i≥1}) = 1, then

EX = ∑_{i≥1} xi P(X = xi).

3. If X : Ω → R^k has a density f(x) on R^k and g : R^k → R then

Eg(X) = ∫ g(x) f(x) dx.

Proof. All these properties follow by making the change of variables x = X(ω), or ω = X⁻¹(x), i.e.

Eg(X) = ∫_Ω g(X(ω)) dP(ω) = ∫_R g(x) dP ◦ X⁻¹(x) = ∫_R g(x) dPX(x),

where PX = P ◦ X⁻¹ is the law of X. Another way to see this would be to start with indicator functions of sets, g(x) = I(x ∈ B), for which

Eg(X) = P(X ∈ B) = PX(B) = ∫_R I(x ∈ B) dPX(x)

and, therefore, the same is true for simple step functions

g(x) = ∑_{i≤n} wi I(x ∈ Bi)

for disjoint Bi. By approximation, this is true for any measurable function.
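To illustrate the change of variables in Lemma 8, the following sketch (Python/NumPy, not from the notes; X ~ Exponential(1) and g(x) = x² are arbitrary choices) compares a Monte Carlo estimate of Eg(X) over Ω with the integral of g against the density of the law of X.

```python
import numpy as np

# E g(X) computed two ways for X ~ Exponential(1), g(x) = x^2:
# (i) Monte Carlo average of g(X), (ii) integral of g(x) f(x) dx with f(x) = e^{-x}.
rng = np.random.default_rng(3)
g = lambda x: x**2

x = rng.exponential(1.0, size=500_000)
mc = g(x).mean()

grid = np.linspace(0.0, 50.0, 200_001)
integral = np.trapz(g(grid) * np.exp(-grid), grid)

print(mc, integral)  # both close to E X^2 = 2
```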

Section 3

Kolmogorov’s Theorem about consistent distributions.

The notion of a general probability space (Ω, A, P) and a random variable X : Ω → R on this space is rather abstract, and often one is really interested in the law PX of X on the sample space (R, B, PX). One can always define a random variable with this law by taking X : R → R to be the identity X(x) = x. Similarly, one can define a random vector X = (X1, . . . , Xk) on R^k by defining the distribution on the Borel σ-algebra Bk first. How can we define a distribution on an infinite dimensional space or, in other words, how can we define an infinite family of random variables

(Xt)_{t∈T} ∈ R^T = ∏_{t∈T} R_t = {f : T → R}

for some infinite set T? Obviously, there are various ways to do that; for example, we can define explicitly Xt = cos(tU) for some random variable U. In this section we will consider the typical situation when we start by defining the distribution on any finite subset of coordinates, i.e. for any finite subset N ⊆ T the law PN of (Xt)_{t∈N} on the Borel σ-algebra BN on R^N is given. Clearly, these laws must satisfy a natural consistency assumption: for any finite subsets N ⊆ M and any Borel set B ∈ BN,

PN(B) = PM(B × R^{M−N}).   (3.0.1)

Then the problem is to define a sample space simultaneously for the entire family (Xt)_{t∈T}, i.e. we need to define a σ-algebra A of measurable events in R^T and a probability measure P on it that agrees with our finite dimensional distributions PN. At the very least, A should contain events expressed in terms of a finite number of coordinates, i.e. the following algebra of sets on R^T:

A0 = {B × R^{T−N} : N ⊆ T finite, B ∈ BN}.

(It is easy to check that A0 is an algebra.) A set B × R^{T−N} is called a cylinder and B is the base of the cylinder. The probability P on such sets is of course defined by P(B × R^{T−N}) = PN(B). Notice that, by the consistency assumption, P is well defined. Given two finite subsets N1, N2 ⊂ T and B1 ∈ BN1, the same set can be represented as

B1 × R^{T−N1} = (B1 × R^{(N1∪N2)\N1}) × R^{T\(N1∪N2)}.

However, by consistency, P will not depend on the representation. Let A = σ(A0) be the σ-algebra generated by the algebra A0, i.e. the minimal σ-algebra that contains all cylinders.

Definition. A0 is called the cylindrical algebra and A = σ(A0) is the cylindrical σ-algebra on R^T.

Example. If N = {1, 2, . . .} ⊆ T then {sup_{i≥1} Xi ≤ 1} is a measurable event in the cylindrical σ-algebra A.

Theorem 3 (Kolmogorov) For a consistent family of distributions (3.0.1), P can be uniquely extended to a measure on the cylindrical σ-algebra A.

Proof. To use the Caratheodory extension Theorem 1, we need to show that P is countably additive on A0 or, equivalently, that it satisfies the continuity of measure property: given a sequence Bn ∈ A0,

Bn ⊇ Bn+1, ∩_{n≥1} Bn = ∅ =⇒ P(Bn) → 0.

We will prove that if there exists ε > 0 such that P(Bn) > ε for all n, then ∩_{n≥1} Bn ≠ ∅. We have Bn = Cn × R^{T−Nn}, where Nn is a finite subset of T and Cn ∈ B_{Nn}. Since Bn ⊇ Bn+1, we can assume that Nn ⊆ Nn+1. First of all, by regularity of the measure P_{Nn} there exists a compact set Kn ⊆ Cn such that

P_{Nn}(Cn \ Kn) ≤ ε/2^{n+1}.

We have

∩_{i≤n} (Ci × R^{T−Ni}) \ ∩_{i≤n} (Ki × R^{T−Ni}) ⊆ ∪_{i≤n} (Ci \ Ki) × R^{T−Ni}

and, therefore,

P(∩_{i≤n} (Ci × R^{T−Ni}) \ ∩_{i≤n} (Ki × R^{T−Ni})) ≤ P(∪_{i≤n} (Ci \ Ki) × R^{T−Ni}) ≤ ∑_{i≤n} P((Ci \ Ki) × R^{T−Ni}) ≤ ∑_{i≤n} ε/2^{i+1} ≤ ε/2.

Since P(Bn) = P(∩_{i≤n} (Ci × R^{T−Ni})) > ε, this implies that

P(∩_{i≤n} (Ki × R^{T−Ni})) ≥ ε/2 > 0.

We can write

∩_{i≤n} (Ki × R^{T−Ni}) = ∩_{i≤n} (Ki × R^{Nn−Ni}) × R^{T−Nn} = K^n × R^{T−Nn},

where K^n = ∩_{i≤n} (Ki × R^{Nn−Ni}) is a compact set in R^{Nn}, since each Ki is compact in R^{Ni}. We proved that

P_{Nn}(K^n) = P(K^n × R^{T−Nn}) = P(∩_{i≤n} (Ki × R^{T−Ni})) > 0

and, therefore, there exists a point x^n = (x^n_1, . . . , x^n_{Nn}, . . .) ∈ K^n × R^{T−Nn}. We also have the following inclusion property: for m > n,

x^m ∈ K^m × R^{T−Nm} ⊆ K^n × R^{T−Nn}

and, therefore, (x^m_1, ..., x^m_{Nn}) ∈ K^n. Any sequence in a compact set has a convergent subsequence. Let {n^1_k}_{k≥1} be such that

(x^{n^1_k}_1, . . . , x^{n^1_k}_{N1}) → (x_1, . . . , x_{N1}) ∈ K^1 as k → ∞.

Then we can take a subsequence {n^2_k}_{k≥1} ⊆ {n^1_k}_{k≥1} such that (x^{n^2_k}_1, . . . , x^{n^2_k}_{N2}) → (x_1, . . . , x_{N2}) ∈ K^2. By iteration, we can find a subsequence {n^m_k}_{k≥1} ⊆ {n^{m−1}_k}_{k≥1} such that

(x^{n^m_k}_1, ..., x^{n^m_k}_{Nm}) → (x_1, ..., x_{Nm}) ∈ K^m.

Therefore, the point (x_1, x_2, . . .) ∈ ∩_{n≥1} (K^n × R^{T−Nn}) ⊆ ∩_{n≥1} Bn, so this last set is not empty.

Section 4

Laws of Large Numbers.

Consider a r.v. X and a sequence of r.v.s (Xn)_{n≥1} on some probability space. We say that Xn converges to X in probability if for all ε > 0,

lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

We say that Xn converges to X almost surely, or with probability 1, if

P(ω : lim_{n→∞} Xn(ω) = X(ω)) = 1.

Lemma 9 (Chebyshev’s inequality) If a r.v. X ≥ 0 then for t > 0,

P(X ≥ t) ≤ EX/t.

Proof. EX = EXI(X < t) + EXI(X ≥ t) ≥ EXI(X ≥ t) ≥ tEI(X ≥ t) = tP(X ≥ t).

Theorem 4 (Weak law of large numbers) Consider a sequence of r.v.s (Xi)_{i≥1} that are centered, EXi = 0, have finite second moments, EXi² ≤ K < ∞, and are uncorrelated, EXiXj = 0 for i ≠ j. Then

Sn = (1/n) ∑_{i≤n} Xi → 0

in probability.

Proof. By Chebyshev’s inequality we have

P(|Sn − 0| ≥ ε) = P(Sn² ≥ ε²) ≤ ESn²/ε² = (1/(n²ε²)) E(X1 + · · · + Xn)² = (1/(n²ε²)) ∑_{i≤n} EXi² ≤ nK/(n²ε²) = K/(nε²) → 0 as n → ∞.
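A small numerical check of the weak LLN bound P(|Sn| ≥ ε) ≤ K/(nε²) (an illustrative Python/NumPy sketch, not from the notes; i.i.d. uniform variables on [−1, 1] with K = 1/3 are an arbitrary choice):

```python
import numpy as np

# Empirical frequency of |S_n| >= eps versus the Chebyshev bound K / (n eps^2)
# for centered, uncorrelated (here i.i.d. Uniform[-1, 1]) variables, K = 1/3.
rng = np.random.default_rng(1)
eps, K, trials = 0.1, 1.0 / 3.0, 2_000

for n in (100, 1_000, 10_000):
    S = rng.uniform(-1.0, 1.0, size=(trials, n)).mean(axis=1)
    freq = (np.abs(S) >= eps).mean()
    print(n, freq, "<=", K / (n * eps**2))
```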

Example. Before we turn to a.s. convergence results, let us note that convergence in probability is weaker than a.s. convergence. For example, consider a probability space which is a circle of circumference 1 with the uniform measure on it. Consider a sequence of r.v.s on this probability space defined by

Xk(x) = I(x ∈ [1 + 1/2 + · · · + 1/k, 1 + · · · + 1/(k+1)] mod 1).

Then Xk → 0 in probability, since for 0 < ε < 1,

P(|Xk − 0| ≥ ε) = 1/(k+1) → 0,

but, clearly, Xk does not converge a.s. because the series ∑_{k≥1} 1/k diverges and, as a result, each point x on the circle will fall into the above intervals infinitely many times, i.e. it will satisfy Xk(x) = 1 for infinitely many k.

Lemma 10 Consider a sequence (pi)_{i≥1} such that pi ∈ [0, 1). Then

∏_{i≥1} (1 − pi) = 0 ⇐⇒ ∑_{i≥1} pi = +∞.

Proof. "⇐=". Using that 1 − p ≤ e^{−p} we get

∏_{i≤n} (1 − pi) ≤ exp(−∑_{i≤n} pi) → 0 as n → ∞.

"=⇒". We can assume that pi ≤ 1/2 for i ≥ m for large enough m because, otherwise, the series obviously diverges. Since 1 − p ≥ e^{−2p} for p ≤ 1/2 we have

∏_{m≤i≤n} (1 − pi) ≥ exp(−2 ∑_{m≤i≤n} pi)

and the result follows.

Lemma 11 (Borel-Cantelli) Consider a sequence (An)_{n≥1} of events An ∈ A on a probability space (Ω, A, P). Consider the event

{An i.o.} := lim sup An := ∩_{n≥1} ∪_{m≥n} Am

that the An occur infinitely often. Then

1. ∑_{n≥1} P(An) < ∞ =⇒ P(An i.o.) = 0.
2. If the An are independent then ∑_{n≥1} P(An) = +∞ =⇒ P(An i.o.) = 1.

Proof. 1. If Bn = ∪_{m≥n} Am then Bn ⊇ Bn+1 and by continuity of measure

P(An i.o.) = P(∩_{n≥1} Bn) = lim_{n→∞} P(Bn).

We have

P(Bn) = P(∪_{m≥n} Am) ≤ ∑_{m≥n} P(Am) → 0 as n → +∞, because ∑_{m≥1} P(Am) < ∞.

2. We have

P(Ω \ Bn) = P(Ω \ ∪_{m≥n} Am) = P(∩_{m≥n} (Ω \ Am)) = ∏_{m≥n} P(Ω \ Am) (by independence) = ∏_{m≥n} (1 − P(Am)) = 0,

by Lemma 10, since ∑_{m≥n} P(Am) = +∞. Therefore, P(Bn) = 1 and P(An i.o.) = P(∩_{n≥1} Bn) = 1.
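A quick numerical illustration of the two parts of the Borel-Cantelli lemma (a Python/NumPy sketch, not in the notes; the specific choices pn = 1/n² and pn = 1/n are only examples): with summable probabilities the independent events stop occurring, with non-summable probabilities they keep occurring.

```python
import numpy as np

# Independent events A_n = {U_n <= p_n}. With p_n = 1/n^2 (summable) only finitely
# many occur; with p_n = 1/n (non-summable) occurrences keep appearing.
rng = np.random.default_rng(4)
N = 1_000_000
n = np.arange(1, N + 1)
u = rng.uniform(size=N)

for label, p in (("1/n^2", 1.0 / n**2), ("1/n", 1.0 / n)):
    hits = np.nonzero(u <= p)[0] + 1
    last = hits[-1] if hits.size else None
    print(label, "occurrences:", hits.size, "last occurrence at n =", last)
```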

Strong law of large numbers. The following simple observation will be useful. If a random variable X ≥ 0 then EX = ∫_0^∞ P(X ≥ x) dx. Indeed,

EX = ∫_0^∞ x dF(x) = ∫_0^∞ ∫_0^x 1 ds dF(x) = ∫_0^∞ ∫_s^∞ 1 dF(x) ds = ∫_0^∞ P(X ≥ s) ds.

For X ≥ 0 such that EX < ∞ this implies

∑_{i≥1} P(X ≥ i) ≤ ∫_0^∞ P(X ≥ s) ds = EX < ∞.

Theorem 5 (Strong law of large numbers) If E|X| < ∞ and (Xi)_{i≥1} are i.i.d. copies of X then

(1/n) ∑_{i=1}^n Xi → EX1 almost surely (a.s.).

Proof. The proof will proceed in several steps.

1. First, without loss of generality we can assume that Xi ≥ 0. Indeed, for signed r.v.s we can decompose Xi = Xi⁺ − Xi⁻ where Xi⁺ = Xi I(Xi ≥ 0) and Xi⁻ = |Xi| I(Xi < 0), and the general result would follow since

(1/n) ∑_{i=1}^n Xi = (1/n) ∑_{i=1}^n Xi⁺ − (1/n) ∑_{i=1}^n Xi⁻ → EX1⁺ − EX1⁻ = EX1.

Thus, from now on we assume that Xi ≥ 0.

2. (Truncation) Next, we can replace Xi by Yi = Xi I(Xi ≤ i) using the Borel-Cantelli lemma. We have

∑_{i≥1} P(Xi ≠ Yi) = ∑_{i≥1} P(Xi > i) ≤ EX1 < ∞

and the Borel-Cantelli lemma implies that P({Xi ≠ Yi} i.o.) = 0. This means that for some (random) i0 and for i ≥ i0 we have Xi = Yi and, therefore,

lim_{n→∞} (1/n) ∑_{i=1}^n Xi = lim_{n→∞} (1/n) ∑_{i=1}^n Yi.

It remains to show that if Tn = ∑_{i=1}^n Yi then Tn/n → EX a.s.

3. (Limit over subsequences) We will first prove this along the subsequences n(k) = ⌊α^k⌋ for α > 1. For any ε > 0,

∑_{k≥1} P(|Tn(k) − ETn(k)| ≥ εn(k)) ≤ ∑_{k≥1} (1/(ε²n(k)²)) Var(Tn(k)) = ∑_{k≥1} (1/(ε²n(k)²)) ∑_{i≤n(k)} Var(Yi)
≤ ∑_{k≥1} (1/(ε²n(k)²)) ∑_{i≤n(k)} EYi² = (1/ε²) ∑_{i≥1} EYi² ∑_{k : n(k)≥i} 1/n(k)²
(∗) ≤ (K/ε²) ∑_{i≥1} EYi²/i² = (K/ε²) ∑_{i≥1} (1/i²) ∫_0^i x² dF(x) ≤ (K/ε²) × (∗∗),

where F(x) is the law of X and the constant K = K(α) depends only on α. (∗) follows from α^k/2 ≤ n(k) = ⌊α^k⌋ ≤ α^k, since if k0 = min{k : α^k ≥ i} then

∑_{k : n(k)≥i} 1/n(k)² ≤ ∑_{k : α^k≥i} 4/α^{2k} = (4/α^{2k0}) · 1/(1 − 1/α²) ≤ K/i².

We can continue,

(∗∗) = ∑_{i≥1} (1/i²) ∑_{m<i} ∫_m^{m+1} x² dF(x) = ∑_{m≥0} (∑_{i>m} 1/i²) ∫_m^{m+1} x² dF(x)
≤ ∑_{m≥0} (2/(m+1)) ∫_m^{m+1} x² dF(x) ≤ 2 ∑_{m≥0} ∫_m^{m+1} x dF(x) = 2EX < ∞.

Thus, we proved that

∑_{k≥1} P(|Tn(k) − ETn(k)| ≥ εn(k)) < ∞

and the Borel-Cantelli lemma implies that

P(|Tn(k) − ETn(k)| ≥ εn(k) i.o.) = 0.

This means that for some (random) k0,

P(∀k ≥ k0, |Tn(k) − ETn(k)| ≤ εn(k)) = 1.

If we take the sequence εm = 1/m, this implies that

P(∀m ≥ 1, ∀k ≥ k0(m), |Tn(k) − ETn(k)| ≤ n(k)/m) = 1

and this proves that

Tn(k)/n(k) − ETn(k)/n(k) → 0 a.s.

On the other hand,

ETn(k)/n(k) = (1/n(k)) ∑_{i≤n(k)} EXi I(Xi ≤ i) → EX as k → ∞,

by Lebesgue’s dominated convergence theorem. We proved that Tn(k)/n(k) → EX a.s.

4. Finally, for j such that n(k) ≤ j < n(k+1) we can write, using n(k+1) ≤ α²n(k),

(1/α²) Tn(k)/n(k) ≤ Tj/j ≤ α² Tn(k+1)/n(k+1)

and, therefore,

(1/α²) EX ≤ lim inf Tj/j ≤ lim sup Tj/j ≤ α² EX a.s.

Taking α = 1 + m⁻¹ and letting m → ∞ proves that lim_{j→∞} Tj/j = EX a.s.
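For intuition, here is a one-path simulation sketch of the SLLN (Python/NumPy, not from the notes; Exponential(1) summands with EX = 1 are an arbitrary choice): the running average settles down to the mean along a single realization.

```python
import numpy as np

# One sample path of the running average S_n / n for i.i.d. Exponential(1) variables;
# by the strong law of large numbers it approaches EX = 1 almost surely.
rng = np.random.default_rng(2)
x = rng.exponential(1.0, size=1_000_000)
running_avg = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 1_000, 100_000, 1_000_000):
    print(n, running_avg[n - 1])
```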

Section 5

Bernstein Polynomials. Hausdorff and de Finetti theorems.

Let us look at some applications related to the law of large numbers. Consider an i.i.d. sequence of real valued r.v.s (Xi) with distribution Pθ from a family of distributions parametrized by θ ∈ Θ ⊆ R such that

Eθ Xi = θ, σ²(θ) := Varθ(Xi) ≤ K < +∞.

Let X̄n = (1/n) ∑_{i≤n} Xi. The following holds.

Theorem 6 If u : R → R is uniformly continuous and bounded then Eθ u(X̄n) → u(θ) uniformly over Θ.

Proof. For any ε > 0,

|Eθ u(X̄n) − u(θ)| ≤ Eθ |u(X̄n) − u(θ)|
= Eθ |u(X̄n) − u(θ)| (I(|X̄n − θ| ≤ ε) + I(|X̄n − θ| > ε))
≤ max_{|x−θ|≤ε} |u(x) − u(θ)| + 2 max_x |u(x)| Pθ(|X̄n − θ| > ε)
≤ δ(ε) + 2‖u‖∞ (1/ε²) Eθ(X̄n − θ)² ≤ δ(ε) + 2‖u‖∞ K/(nε²),

where δ(ε) is the modulus of continuity of u. Letting ε = εn → 0 so that nεn² → ∞ finishes the proof.

Example. Let (Xi) be i.i.d. with Bernoulli distribution B(θ) with probability of success θ ∈ [0, 1], i.e. Pθ(Xi = 1) = θ, Pθ(Xi = 0) = 1 − θ, and let u : [0, 1] → R be continuous. Then, by the above theorem, the Bernstein polynomials

Bn(θ) := Eθ u(X̄n) = ∑_{k=0}^n u(k/n) Pθ(∑_{i=1}^n Xi = k) = ∑_{k=0}^n u(k/n) C(n, k) θ^k (1 − θ)^{n−k} → u(θ)

uniformly on [0, 1].
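The Bernstein polynomial can be evaluated directly. The following sketch (Python/NumPy, not part of the notes; the test function u(θ) = |θ − 1/2| is an arbitrary continuous choice) shows the uniform approximation improving with n.

```python
import numpy as np
from math import comb

def bernstein(u, n, theta):
    """Evaluate B_n(theta) = sum_k u(k/n) C(n,k) theta^k (1-theta)^(n-k)."""
    k = np.arange(n + 1)
    coeffs = np.array([comb(n, int(j)) for j in k], dtype=float)
    weights = coeffs * theta**k * (1.0 - theta)**(n - k)
    return float(np.sum(u(k / n) * weights))

u = lambda t: np.abs(t - 0.5)            # continuous (non-smooth) test function
grid = np.linspace(0.0, 1.0, 101)
for n in (10, 100, 1000):
    err = max(abs(bernstein(u, n, t) - u(t)) for t in grid)
    print(n, err)                         # sup-norm error on the grid decreases with n
```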

Example. Let (Xi) have Poisson distribution Π(θ) with intensity parameter θ > 0 defined by

Pθ(Xi = k) = (θ^k/k!) e^{−θ} for integer k ≥ 0.

Then it is well known (and easy to check) that Eθ Xi = θ, σ²(θ) = θ, and the sum X1 + . . . + Xn has Poisson distribution Π(nθ). If u is bounded and continuous on [0, +∞) then

Eθ u(X̄n) = ∑_{k=0}^∞ u(k/n) Pθ(∑_{i=1}^n Xi = k) = ∑_{k=0}^∞ u(k/n) ((nθ)^k/k!) e^{−nθ} → u(θ)

uniformly on compact sets.

Moment problem. Consider a random variable X ∈ [0, 1] and let µk = EX^k be its moments. Given a sequence (c0, c1, c2, . . .) let us define the sequence of increments by Δck = ck+1 − ck. Then

−Δµk = µk − µk+1 = E(X^k − X^{k+1}) = EX^k(1 − X),
(−Δ)(−Δµk) = (−1)²Δ²µk = EX^k(1 − X) − EX^{k+1}(1 − X) = EX^k(1 − X)²,

and by induction

(−1)^r Δ^r µk = EX^k(1 − X)^r.

Clearly, (−1)^r Δ^r µk ≥ 0 since X ∈ [0, 1]. If u is a continuous function on [0, 1] and Bn is its corresponding Bernstein polynomial then

EBn(X) = ∑_{k=0}^n u(k/n) C(n, k) EX^k(1 − X)^{n−k} = ∑_{k=0}^n u(k/n) C(n, k) (−1)^{n−k} Δ^{n−k} µk.

Since Bn(X) converges uniformly to u(X), EBn(X) converges to Eu(X). Let us define

p_k^{(n)} = C(n, k) (−1)^{n−k} Δ^{n−k} µk ≥ 0,  ∑_{k=0}^n p_k^{(n)} = 1 (take u = 1).

We can think of (p_k^{(n)}) as the distribution of a r.v. X^{(n)} such that

P(X^{(n)} = k/n) = p_k^{(n)}.   (5.0.1)

We showed that

EBn(X) = Eu(X^{(n)}) → Eu(X)

for any continuous function u. We will later see that by definition this means that X^{(n)} converges to X in distribution. Given the moments of a r.v. X, this construction allows us to approximate the distribution of X and the expectation of u(X). Next, given a sequence (µk), when is it the sequence of moments of some [0, 1]-valued r.v. X? By the above, it is necessary that

µk ≥ 0, µ0 = 1 and (−1)^r Δ^r µk ≥ 0 for all k, r.   (5.0.2)

It turns out that this is also sufficient.

Theorem 7 (Hausdorff) There exists a r.v. X ∈ [0, 1] such that µk = EX^k iff (5.0.2) holds.

Proof. The idea of the proof is as follows. If the µk are the moments of the distribution of some r.v. X, then the discrete distributions defined in (5.0.1) should approximate it. Therefore, our goal will be to show that condition (5.0.2) ensures that (p_k^{(n)}) is indeed a distribution and then show that the moments of (5.0.1) converge to µk. As a result, any limit of these distributions will be a candidate for the distribution of X.

First of all, let us express µk in terms of (p_k^{(n)}). Since Δµk = µk+1 − µk we have the following inversion formula:

µk = µk+1 − Δµk = (µk+2 − Δµk+1) + (−Δµk+1 + Δ²µk) = µk+2 − 2Δµk+1 + Δ²µk = ∑_{j=0}^r C(r, j) (−1)^{r−j} Δ^{r−j} µ_{k+j},

by induction. Take r = n − k. Then

µk = ∑_{j=0}^{n−k} C(n−k, j) (−1)^{n−(k+j)} Δ^{n−(k+j)} µ_{k+j} = ∑_{j=0}^{n−k} (C(n−k, j)/C(n, k+j)) p_{k+j}^{(n)}.

We have

C(n−k, j)/C(n, k+j) = ((n−k)!/(j!(n−k−j)!)) · ((k+j)!(n−k−j)!/n!) = C(k+j, k)/C(n, k),

so that

µk = ∑_{j=0}^{n−k} (C(k+j, k)/C(n, k)) p_{k+j}^{(n)} = ∑_{m=k}^{n} (C(m, k)/C(n, k)) p_m^{(n)}.

By (5.0.2), p_m^{(n)} ≥ 0 and ∑_{m≤n} p_m^{(n)} = µ0 = 1, so we can consider a r.v. X^{(n)} such that

P(X^{(n)} = m/n) = p_m^{(n)} for 0 ≤ m ≤ n.

We have

µk = ∑_{m=k}^{n} (C(m, k)/C(n, k)) p_m^{(n)} = ∑_{m=k}^{n} (m(m−1)···(m−k+1))/(n(n−1)···(n−k+1)) p_m^{(n)}
= ∑_{m=0}^{n} [(m/n)((m/n) − 1/n)···((m/n) − (k−1)/n)] / [(1)(1 − 1/n)···(1 − (k−1)/n)] p_m^{(n)}.

As n → ∞ the right-hand side differs from ∑_{m=0}^{n} (m/n)^k p_m^{(n)} = E(X^{(n)})^k by a vanishing amount, and therefore

E(X^{(n)})^k → µk as n → ∞.

Any continuous function u can be approximated uniformly by (for example, Bernstein) polynomials, so the limit lim_{n→∞} Eu(X^{(n)}) exists. By the selection theorem that we will prove later in the course, one can choose a subsequence X^{(ni)} that converges to some r.v. X in distribution and, as a result,

E(X^{(ni)})^k → EX^k = µk,

which means that the µk are the moments of X.
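To make the construction concrete, here is a sketch (Python, not from the notes; the moments µk = 1/(k+1) of the uniform distribution on [0, 1] are an arbitrary example, and exact rational arithmetic is used to avoid cancellation in the finite differences): it computes p_m^{(n)} = C(n, m)(−1)^{n−m}Δ^{n−m}µ_m and checks that it is a distribution whose moments are close to the µk.

```python
from fractions import Fraction
from math import comb

def p_n(mu, n):
    """p_m^(n) = C(n, m) (-1)^(n-m) Delta^(n-m) mu_m for m = 0, ..., n (exact)."""
    p = []
    for m in range(n + 1):
        r = n - m
        # (-1)^r Delta^r mu_m = sum_j (-1)^j C(r, j) mu_{m+j}
        diff = sum((-1) ** j * comb(r, j) * mu(m + j) for j in range(r + 1))
        p.append(comb(n, m) * diff)
    return p

mu = lambda k: Fraction(1, k + 1)        # moments of Uniform[0, 1]
n = 50
p = p_n(mu, n)
print(sum(p))                                              # 1: it is a distribution
second = sum(pm * Fraction(m, n) ** 2 for m, pm in enumerate(p))
print(float(second), float(mu(2)))                          # close to mu_2 = 1/3
```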

1

� � n k x (1 − x)n−k dF (x). k

This means that to generate such exchangeable sequence we can first pick x ∈ [0, 1] from distribution F and then generate a sequence of i.i.d Bernoulli random variables with probability of success x. Proof. Let µ0 = 1 and for k ≥ 1 define µk = P(X1 = 1, ..., Xk = 1).

18

(5.0.3)

We have P(X1 = 1, ..., Xk = 1, Xk+1 = 0)

= P(X1 = 1, ..., Xk = 1) − P(X1 = 1, ..., Xk = 1, Xk+1 = 1) = µk − µk+1 = −Δµk .

Next, using exchangeability P(X1 = 1, ..., Xk = 1, Xk+1 = 0, Xk+2 = 0)

= P(X1 = 1, ..., Xk = 1, Xk+1 = 0) − P(X1 = 1, ..., Xk = 1, Xk+1 = 0, Xk+2 = 1) = −Δµk − (−Δµk+1 ) = Δ2 µk .

Similarly, by induction, P(X1 = 1, ..., Xk = 1, Xk+1 = 0, ..., Xn = 0) = (−1)n−k Δn−k µk ≥ 0. By the Hausdorff theorem, µk = EX k for some r.v. X ∈ [0, 1] and, therefore, P(X1 = 1, ..., Xk = 1, Xk+1 = 0, ..., Xn = 0)

=

(−1)n−k Δn−k µk

=

EX k (1 − X)n−k =



1

xk (1 − x)n−k dF (x).

0

Since, by exchangeability, changing the order of 1’s and 0’s does not affect the probability, we get � 1� � n k P(X1 + . . . + Xn = k) = x (1 − x)n−k dF (x). k 0

Example. (Polya urn model). Suppose we have b blue and r red balls in the urn. We pick a ball

Pick

b

r + c of the same color

Figure 5.1: Polya urn model. randomly and return it with c balls of the same color. Consider r.v.s � 1 if the ith ball picked is blue Xi = 0 otherwise. Xi ’s are not independent but exchangeable. For example, P(bbr) =

b b+c r b r b+r × × , P(brb) = × × b + r b + r + c b + r + 2c b + r b + r + c b + r + 2c

are equal. To identify the distribution F in de Finetti’s theorem, let us look at its moments µk in (5.0.3), � � b b+c b + (k − 1)c µk = P b� .�� . . �b = × × ··· × . b+r b+r+c b + r + (k − 1)c k times One can recognize or easily check that µk are the moments of Beta(α, β) distribution with the density Γ(α + β) α−1 x (1 − x)β−1

Γ(α)Γ(β) 19

on [0, 1] with parameters� α = b/c, � β = r/c. By de Finetti’s theorem, we can generate Xi ’s by first picking x from distribution Beta b/c, r/c and then generating i.i.d. Bernoulli (Xi )’s with probability of success x. By strong law of large numbers, the proportion of blue balls in the first n repetitions will converge to this probability of success x, i.e. in the limit it will be random with Beta distribution. This example will come up once more when we talk about convergence of martingales.

20

Section 6

0 - 1 Laws. Convergence of random series. � � Consider a sequence (Xi )i≥1 of real valued independent random variables and let σ (Xi )i≥1 be a σ-algebra of events generated by this sequence, i.e. {(Xi )i≥1 ∈ B} for B in the cylindrical σ-algebra on RN . � � � � Definition. An event A ∈ σ (Xi )i≥1 is called a tail event if A ∈ σ (Xi )i≥n for all n ≥ 1. For example, if Ai ∈ σ(Xi ) then �

Ai i.o. = Ai n≥1 i≥n

is a tail event. It turns out that such events have probability 0 or 1. Theorem 9 (Kolmogorov’s 0-1 law) If A is a tail event then P(A) = 0 or 1. Proof. For � � a finite subset F = {i1 , . . . , in } ⊂ N, let us denote by XF = (Xi1 , . . . , Xin ). A σ-algebra σ (Xi )i≥1 is generated by algebra {XF ∈ B : F - finite ⊆ N, B ∈ B(R|F | )}. � � By approximation lemma, we can approximate any event A ∈ σ (Xi )i≥1 by events in this generating algebra. Therefore, for any ε > 0 there exists a set A� in this algebra such that P(AΔA� ) ≤ ε and by definition A� ∈ σ(X1 , ..., Xn ) for large enough n. This implies |P(A) − P(A� )| ≤ ε, |P(A) − P(AA� )| ≤ ε. Since A is a tail event, A ∈ σ((Xi )i≥n+1 ) which means that A, A� are independent, i.e. P(AA� ) = P(A)P(A� ). We get P(A) ≈ P(AA� ) = P(A)P(A� ) ≈ P(A)P(A) and letting ε → 0 proves that P(A) = P(A)2 . Examples. �� � 1. i≥1 Xi − converges is a tail event, it has probability 0 or 1. � 2. Consider series i≥1 Xi z i on a complex plane, z ∈ C. Its radius of convergence is 1

r = lim inf |Xi |− i . i→∞

For any x ≥ 0, event {r ≤ x} is, obviously, a tail event. This implies that r = const with probability 1.

21

The Savage-Hewitt 0 - 1 law. Next we will prove a stronger result under more restrictive assumption that the r.v.s Xi , i ≥ 1 are not only independent but also identically distributed with the law µ. Without loss of generality, we can assume that each Xi is given by the identity Xi (x) = x on its sample space (R, B, µ). By Kolmogorov’s consistency theorem the entire sequence (Xi )i≥1 can be defined on the sample space (RN , B ∞ , P) where B ∞ is the cylindrical σ-algebra and P is the measure guaranteed by the Caratheodory extension theorem. In our case Xi ’s are i.i.d. and P = µ∞ is called the infinite product measure. It will be convenient to use the notation σ((Xi )i≥1 ) for the cylindrical σ-algebra since similar notation can be used for the cylindrical σ-algebra on any subset of coordinates. Definition. An event A ∈ σ((Xi )i≥1 ) - is called exchangeable/symmetric if for all n ≥ 1, (x1 , x2 , . . . , xn , xn+1 , . . .) ∈ A =⇒ (xn , x2 , ..., xn−1 , x1 , xn+1 , ...) ∈ A. In other words, the set A is symmetric under permutations of a finite number of coordinates. Note that any tail event is symmetric. Theorem 10 (Savage-Hewitt 0-1 law) If A is symmetric then P(A) = 0 or 1. Proof. Given a sequence x = (x1 , x2 , . . .) let us define an operator Γx = (xn+1 , . . . , x2n , x1 , . . . , xn , x2n+1 , . . .) that switches the first n coordinates with the second n coordinates. Since A is symmetric, ΓA = {Γx : x ∈ A} = A. By the Approximation Lemma 2 for any ε > 0 for large enough n, there exists An ∈ σ(X1 , ..., Xn ) such that P(An ΔA) ≤ ε. Clearly, Bn = ΓAn ∈ σ(Xn+1 , ..., X2n ) and by i.i.d. P(Bn ΔA) = P(ΓAn ΔΓA) = P(An ΔA) ≤ ε, � � which implies that P (An Bn )ΔA ≤ 2ε. Therefore, we can conclude that P(A) ≈ P(An ), P(A) ≈ P(An Bn ) = P(An )P(Bn ) = P(An )2 where we used the fact that the events An , Bn are defined in terms of different sets of coordinates and, thus, are independent. Letting ε → 0 implies that P(A) = P(A)2 . Example. Let Sn = X1 + . . . + Xn and let r = lim sup n→∞

Sn − an . bn

Event {r ≤ x} is symmetric since changing the order of any finite set of coordinates does not affect Sn for large enough n. As a result, P(r ≤ x) = 0 or 1, which implies that r = const with probability 1. � Random series. We already saw above that, by Kolmogorov’s 0-1 law, the series i≥1 Xi for inde­ pendent (Xi )i≥1 converges with probability 0 or 1. This means that either Sn = X1 + . . . + Xn converges to its limit S with probability one, or with probability one it does not converge. Two section back, before the proof of the strong law of large numbers, we saw the example of a sequence which with probability one does not converge yet converges to 0 in probability. In case when with probability one Sn does not converge, is it still possible that it converges to some random variable in probability? The answer is no because we will now prove that for random series convergence in probability implies a.s. convergence. 22

Theorem 11 (Kolmogorov’s inequality) Suppose that (Xi )i≥1 are independent and Sn = X1 + . . . + Xn . If for all j ≤ n, P(|Sn − Sj | ≥ a) ≤ p < 1 (6.0.1) then for x > a, � � P max |Sj | ≥ x ≤ 1≤j≤n

1 P(|Sn | > x − a). 1−p

Proof. First of all, let us notice that this inequality is obvious without the maximum because (6.0.1) is equivalent to 1 − p ≤ P(|Sn − Sj | < a) and we can write � � � � � � (1 − p)P |Sj | ≥ x ≤ P |Sn − Sj | < a P |Sj | ≥ x � � = P |Sn − Sj | < a, |Sj | ≥ x ≤ P(|Sn | > x − a). The equality is true because events {|Sj | ≥ x} and {|Sn −Sj | < a} are independent since the first depends only on X1 , ..., Xj and the second only on Xj+1 , ..., Xn . The last inequality is true simply by triangle inequality. To deal with the maximum, instead of looking at an arbitrary partial sum Sj we will look at the first partial sum that crosses level x. We define that first time by τ = min{j ≤ n : |Sj | ≥ x} and let τ = n + 1 if all |Sj | < x. Notice that event {τ = j} also depends only on X1 , ..., Xj so we can again write � � � � � � (1 − p)P τ = j ≤ P |Sn − Sj | < a P τ = j � � = P |Sn − Sj | < a, τ = j ≤ P(|Sn | > x − a, τ = j). The last inequality is again true by triangle inequality because when τ = j we have |Sj | ≥ x and � � � � |Sn − Sj | < a, τ = j ⊆ |Sn | > x − a, τ = j . It remains to add up over j ≤ n to get (1 − p)P(τ ≤ n) ≤ P(|Sn | > x − a, τ ≤ n) ≤ P(|Sn | > x − a) and notice that τ ≤ n is equivalent to maxj≤n |Sj | ≥ x.

Theorem 12 (Kolmogorov) If the series



i≥1

Xi converges in probability then it converges almost surely.

Proof. Suppose that partial sums Sn converge to some r.v. S in probability, i.e. for any ε > 0, for large enough n ≥ n0 (ε) we have P(|Sn − S| ≥ ε) ≤ ε. If k ≥ j ≥ n ≥ n0 (ε) then P(|Sk − Sj | ≥ 2ε) ≤ P(|Sk − S| ≥ ε) + P(|Sj − S| ≥ ε) ≤ 2ε. Next, we use Kolmogorov’s inequality for x = 4ε and a = 2ε (we let partial sums start at n): � � P max |Sj − Sn | ≥ 4ε ≤ n≤j≤k

1 2ε ≤ 3ε, P(|Sk − Sn | ≥ 2ε) ≤ 1 − 2ε 1 − 2ε

for small ε. The events {maxn≤j≤k |Sj − Sn | ≥ 4ε} are increasing as k ↑ ∞ and by continuity of measure � � P max |Sj − Sn | ≥ 4ε ≤ 3ε. n≤j

Finally, since P(|Sn − S | ≥ ε) ≤ ε we get � � P max |Sj − S| ≥ 5ε ≤ 4ε. n≤j

23

This kind of ”maximal” statement about any sequence Sj is actually equivalent to its a.s. convergence. To see this take ε = m12 , take n(m) = n0 (ε) and consider an event � � 5 Am = max |Sj − S| ≥ 2 . m n(m)≤j We proved that � 4 <∞ m2 i.o.) = 0. This means that with probability 1 for large enough

� and by the Borel-Cantelli lemma, P(Am (random) m,

P(Am ) ≤

max |Sj − S | ≤

j≥n(m)

5 m2

and, therefore, Sn → S a.s. Let us give one easy-to-check criterion for convergence of random series, which is called Kolmogorov’s strong law of large numbers. Theorem 13 If (Xi )i≥1 is a sequence of independent random variables such that � EXi = 0 and EXi2 < ∞ i≥1

then



i≥1

Xi converges a.s.

Proof. It is enough to prove convergence in probability. Let us first show the existence of a limit of partial sums Sn over some subsequence. For m < n, P(|Sn − Sm | ≥ ε) ≤

1 1 � m→∞ E(Sn − Sm )2 = 2 EXi2 −→ 0. 2 ε ε m
If we take ε =

1 l2

then for large enough m(l) and for any n ≥ m(l), � � 1 1 P |Sn − Sm(l) | ≥ 2 ≤ 2 . l l

(6.0.2)

W.l.o.g. we can assume that m(l + 1) ≥ m(l) so that � � 1 1 P |Sm(l+1) − Sm(l) | ≥ 2 ≤ 2 . l l Then, � � � � 1 1 P |Sm(l+1) − Sm(l) | ≥ 2 ≤ <∞ l l2 l≥1

l≥1

and by Borel-Cantelli lemma � � 1 P |Sm(l+1) − Sm(l) | ≥ 2 i.o = 0. l As a result, for large enough (random) l and for k > l, |Sm(k) − Sm(l) | ≤

� 1 1 < . i2 l−1 i≥l

This means that (Sm(l) )l≥1 is a Cauchy sequence and there exists the limit S = lim Sm(l) . (6.0.2) implies that Sn → S in probability.

24

Example. Consider random series

εi i≥1 iα



where P(εi = ±1) = 12 . We have

� � εi �2 � 1 1 E α = < ∞ if α > , i i2α 2 i≥1

i≥1

so the series converges a.s. for such α.

25

Section 7

Stopping times, Wald’s identity. Another proof of SLLN. Consider a sequence (Xi )i≥1 of independent r.v.s and an integer valued random variable V ∈ {1, 2, . . .}. We say that V is independent of the future if {V ≤ n} is independent of σ((Xi )i≥n+1 ). We say that V is a stopping time (Markov time) if {V ≤ n} ∈ σ(X1 , . . . , Xn ) for all n. Clearly, a stopping time is independent of the future. An example of stopping time is V = min{k ≥ 1, Sk ≥ 1}. Suppose that V is independent of the future. We can write � � ESV = ESV I(V = k) = ESk I(V = k) k≥1

=

k≥1

��

(∗)

EXn I(V = k) =

��

EXn I(V = k) =

n≥1 k≥n

k≥1 n≤k



EXn I(V ≥ n).

n≥1

In (*) we can interchange the order of summation if, for example, the double sequence is absolutely summable, by Fubini-Tonelli theorem. Since V is independent of the future, the event {V ≥ n} = {V ≤ n − 1}c is independent of σ(Xn ) and we get � ESV = EXn × P(V ≥ n). (7.0.1) n≥1

This implies the following. Theorem 14 (Wald’s identity.) If (Xi )i≥1 are i.i.d., E|X1 | < ∞ and EV < ∞, then ESV = EX1 EV. Proof. By (7.0.1) we have, ESV =



EXn P(V ≥ n) = EX1

n≥1



P(V ≥ n) = EX1 EV.

n≥1

The reason we can interchange the order of summation in (*) is because under our assumptions the double sequence is absolutely summable since �� � E|Xn |I(V = k) = E|Xn |I(V ≥ n) = E|X1 | EV < ∞, n≥1

n≥1 k≥n

so we can apply Fubini-Tonelli theorem. Theorem 15 (Markov property) Suppose that (Xi )i≥1 are i.i.d. and V is a stopping time. Then (V, X1 , . . . , XV ) is independent of (XV +1 , XV +2 , . . .) and dist

(XV +1 , XV +2 , . . .) = (X1 , X2 , . . .), 26

dist

where = means equality in distribution. Proof. Given a subset N ⊆ N and sequences (Bi ) and (Ci ) of Borel sets on R, define events � � A = V ∈ N, X1 ∈ B1 , ..., XV ∈ BV and for any k ≥ 1, � � D = XV +1 ∈ C1 , ..., XV +k ∈ Ck . We have, P(DA) =



P (DA{V = n}) =

n≥1



P (Dn A{V = n})

n≥1

where Dn = {Xn+1 ∈ C1 , . . . , Xn+k ∈ Ck }. The intersection of events � A{V = n} =

∅, {V = n, X1 ∈ B1 , . . . , Xn ∈ Bn },

n �∈ N otherwise.

Since V is a stopping time, {V = n} ∈ σ(X1 , . . . , Xn ) and A{V = n} ∈ σ(X1 , . . . , Xn ). On the other hand, Dn ∈ σ(Xn+1 , . . .) and, as a result, � � P(DA) = P(Dn )P(A{V = n}) = P(D0 )P(A{V = n}) = P(D0 )P(A), n≥1

n≥1

and this finishes the proof. Remark. One could be a little bit more careful when talking about the events generated by a vector (V, X1 , . . . , XV ) that has random length. In the proof we implicitly assumed that such events are generated by events � � A = V ∈ N, X1 ∈ B1 , ..., XV ∈ BV which is a rather intuitive definition. However, one could be more formal and define a σ-algebra of events generated by (V, X1 , . . . , XV ) as events A such that A ∩ {V ≤ n} ∈ σ(X1 , . . . , Xn ) for any n ≥ 1. This means that when V ≤ n the event A is expressed only in terms of X1 , . . . , Xn . It is easy to check that with this more formal definition the proof remains exactly the same. Let us give one interesting application of Markov property and Wald’s identity that will yield another proof of strong law of large numbers. Theorem 16 Suppose that (Xi )i≥1 are i.i.d. such that EX1 > 0. If Z = inf n≥1 Sn then P(Z > −∞) = 1. (Partial sums can not drift down to −∞ if EX1 > 0. Of course, this is obvious by SLLN.) Proof. Let us define (see figure 7.1), (2)

τ1 = min{k ≥ 1, Sk ≥ 1}, Z1 = min Sk , Sk = Sτ1 +k − Sτ1 , k≤τ1

� � (2) (2) (3) (2) τ2 = min k ≥ 1, Sk ≥ 1 , Z2 = min Sk , Sk = Sτ2 +k − Sτ(2) . 2 k≤τ2

By induction, � � (n) (n) (n+1) (n) τn = min k ≥ 1, Sk ≥ 1 , Zn = min Sk , Sk = Sτn +k − Sτ(n) . n k≤τn

Z1 , ..., Zn are i.i.d. by Markov property. 27

1 0 1 0

!2

z2

!1

z1

Figure 7.1: A sequence of stopping times. Notice that, by construction, Sτ1 +···+τn−1 ≥ n − 1 and Z = inf Sk = inf{Z1 , Sτ1 + Z2 , Sτ1 +τ2 + Z3 , ...}. k≥1

We have, {Z ≤ −N } =



{Sτ1 +...+τk−1 + Zk ≤ −N } ⊆

k≥1



{k − 1 + Zk ≤ −N }.

k≥1

Therefore, P(Z ≤ −N ) ≤



P(k − 1 + Zk ≤ −N ) =

k≥1

=





P(Zk ≤ −N − k + 1)

k≥1

P(Z1 ≤ −N − k + 1) =



P(Z1 ≤ −j) ≤

N →∞

P(|Z1 | ≥ j) −→ 0

j≥N

j≥N

k≥1



if we can show that E|Z1 | < ∞ since �

P(|Z1 | ≥ j) ≤ E|Z1 | < ∞.

j≥1

We can write E|Z1 | ≤ E



Wald

|Xi | = E|X1 |Eτ1 < ∞

i≤τ1

if we can show that Eτ1 < ∞. This is left as an exercise (hint: truncate Xi ’s and τ1 and use Wald’s identity). N →∞ We proved that P(Z ≤ −N ) −→ 0 which, of course, implies that P(Z > −∞) = 1. This result gives another proof of the SLLN. Theorem 17 If (Xi )i≥1 are i.i.d. and EX1 = 0 then

Sn n

→ 0 a.s.

Proof. Given ε > 0 we define Xiε = Xi + ε so that EX1ε = ε > 0. By the above result, inf n≥1 (Sn + nε) > −∞ with probability one. This means that for all n ≥ 1, Sn + nε ≥ −M > −∞ for some large enough M. Dividing both sides by n and letting n → ∞ we get lim inf n→∞

Sn ≥ −ε n

with probability one. We can then let ε → 0 over some sequence. Similarly, we prove that lim sup Skk ≤ 0 with probability one.

28

Section 8

Convergence of Laws. Selection Theorem. In this section we will begin the discussion of weak convergence of distributions on metric spaces. Let (S, d) be a metric space with a metric d. Consider a measurable space (S, B) with Borel σ-algebra B generated by open sets and let (Pn )n≥1 and P be probability distributions on B. We define Cb (S) = {f : S → R - continuous and bounded}. We say that Pn → P weakly if



� f dPn →

f dP for all f ∈ Cb (S).

Theorem 18 If S = R then Pn → P iff �� �� �� �� Fn (t) = Pn −∞, t → F (t) = P −∞, t for any point of continuity t of F (t). Proof. ”=⇒ ” Let us approximate an indicator function by a continuous functions as in figure 8.1, i.e. ϕ1 (X) ≤ I(X ≤ t) ≤ ϕ2 (X), ϕ1 , ϕ2 ∈ Cb (R).

For convenience of notations, instead of writing integrals w.r.t. Pn we will write expectations of a r.v. Xn "2

"1 t!!

t

t+ !

x

Figure 8.1: Approximating indicator. with distribution Pn . P(X ≤ t − ε) ≤ Eϕ1 (X) ← Eϕ1 (Xn ) ≤ Fn (t) = P(Xn ≤ t) ≤ Eϕ2 (Xn ) → Eϕ2 (X) ≤ P(X ≤ t + ε) as n → ∞. Therefore, for any ε > 0, F (t − ε) ≤ lim inf Fn (t) ≤ lim sup Fn (t) ≤ F (t + ε). Since t is a point of continuity of F, letting ε → 0 proves the result.

29

”⇐=” Let P C(F ) be the set of points of continuity of F. Since F is monotone, the set P C(F ) is dense in R. Take M large enough such that both M, −M ∈ P C(F ) and P([−M, M ]c ) ≤ ε. Clearly, for large enough k we have Pk ([−M, M ]c ) ≤ 2ε. For any n > 1, take a sequence of points −M = xn1 ≤ xn2 ≤ · · · ≤ xnn = M such that all xi ∈ P C(F ) and maxi |xni+1 − xni | → 0 as n → ∞. Given a function f ∈ Cb (R), consider an approximating function � fn (x) = f (xi )I(x ∈ (xni−1 , xni ]) + 0 · I(x ∈ / [−M, M ]). 1
Since f in continuous, sup |fn (x) − f (x)| ≤ δn → 0, n → ∞. |x|≤M

Since all xni are in P C(F ) we can write � � � � � � k→∞ Efn (Xk ) = fn (xni ) Fk (xni ) − Fk (xni−1 −→ fn (xni ) F (xni ) − F (xni−1 ) = Efn (X). 1
1
Also, � � � � / [−M, M ]) + δn ≤ �f �∞ ε + δn �Ef (X) − Efn (X)� ≤ �f �∞ P(X ∈ and, similarly, � � � � / [−M, M ]) + δn ≤ �f �∞ 2ε + δn . �Ef (Xk ) − Efn (Xk )� ≤ �f �∞ P(Xk ∈ Letting k → ∞, ε → 0 and n → ∞ proves that Ef (Xk ) → Ef (X). We say that a sequence of distributions (Pn )n≥1 on a metric space (S, d) is uniformly tight if for any ε > 0 there exists a compact K ⊆ S such that Pn (K) > 1 − ε for all n. Our next goal is to prove the following theorem. Theorem 19 (Selection theorem) If (Pn )n≥1 is a uniformly tight sequence of laws on the metric space (S, d) then there exists a subsequence (n(k)) such that Pn(k) converges weakly to some probability law P. We will prove the Selection Theorem for arbitrary metric spaces, since this result will be useful to us later when we study the convergence of laws on general metric spaces. However, when S = Rk one can proceed in a more intuitive way, based on the following Lemma. Lemma 12 (Cantor’s diagonalization) Let A be a countable ⊆ S and fn : S → R, n ≥ 1. Then there exists a subsequence (n(k)) such that fn(k) (a) converges for all a ∈ A (if fn(k) (a) is unbounded, maybe to ±∞). Proof. Let A = {a1 , a2 , . . .}. Take (n1 (k)) such that fn1 (k) (a1 ) converges. Take (n2 (k)) ⊆ (n1 (k)) such that fn2 (k) (a2 ) converges. By induction, take (nl (k)) ⊆ (nl−1 (k)) such that fnl (k) (al ) converges. Now consider a sequence (nk (k)). Clearly, fnk (k) (al ) converges for any l because for k ≥ l, nk (k) ∈ {nl (k)} by construction. Define a joint c.d.f. on Rk by F (t) = P(X1 ≤ t1 , . . . , Xk ≤ tk ) where t = (t1 , . . . , tk ). Let A be a dense set of points in Rk . By Lemma, there exists a subsequence (n(k)) such that Fn(k) (a) → F (a) for all a ∈ A. For x ∈ Rk \ A we can extend F by F (x) = inf{F (a) : a ∈ A, xi < ai }.

30

F (x) is a c.d.f. on Rk (exercise). The fact that Pn are uniformly tight ensures that F (x) → 0 or 1 if all xi → −∞ or +∞. Let x be a point of continuity of F (x) and let a, b ∈ A such that ai < xi < bi for all i. We have, F (a) ← Fn(k) (a) ≤ Fn(k) (x) ≤ Fn(k) (b) → F (b) as k → ∞. Since x is a point of continuity and A is dense, a→x

b→x

F (a) −→ F (x), F (b) −→ F (x), and this proves that Fn(k) (x) → F (x) for all such x. Similarly to one-dimensional case one can show that for any f ∈ Cb (Rk ), � � f dFn(k) → f dF. Proof of Theorem 19. If K is a compact then Cb (K) = C(K). Later in these lectures, when we deal in more detail with convergence on general metric spaces, we will prove the following fact which is well-known and is a consequence of the Stone-Weierstrass theorem. Fact. C(K) is separable w.r.t. �∞ norm ||f ||∞ = supx∈K |f (x|. Even though we are proving Selection theorem for a general metric space, right now we are mostly interested in the case S = Rk where this fact is a simple consequence of the Weierstrass theorem that any continuous function can be approximated by polynomials. Since Pn are uniformly tight, for any r ≥ 1 we can find a compact Kr such that Pn (Kr ) > 1 − 1r . Let Cr ⊂ C(Kr ) be a countable and dense subset of C(Kr ). By Cantor’s diagonalization argument there exists a subsequence (n(k)) such that Pn(k) (f ) converges for all f ∈ Cr for all r ≥ 1. Since Cr is dense in C(Kr ) this implies that Pn(k) (f ) converges for all f ∈ C(Kr ) for all r ≥ 1. Next, for any f ∈ Cb (S), �� � � � � � ||f ||∞ � f dPn(k) − f dPn(k) �� ≤ |f |dPn(k) ≤ ||f ||∞ Pn(k) (Krc ) ≤ . � r Kr Krc This implies that the limit � I(f ) := lim

k→∞

f dPn(k)

(8.0.1)

exists. The question is why this limit is an integral over some probability measure P? On each of the compacts Kr we could use Riesz’s representation theorem for continuous functionals on C(Kr ) and then extend this representation to the union of Kr . Instead, we will prove this as a consequence of a more general result, the Stone-Daniell theorem from measure theory, which says the following. A family of function α = {f : S → R} is called a vector lattice if f, g ∈ α =⇒ cf + g ∈ α, ∀c ∈ R and f ∧ g, f ∨ g ∈ α. A functional I : α → R is called a pre-integral if 1. I(cf + g) = cI(f ) + I(g), 2. f ≥ 0, I(f ) ≥ 0, 3. fn ↓ 0, ||fn ||∞ < ∞ =⇒ I(fn ) → 0. � Theorem 20 (Stone-Daniell) If α is a vector lattice and I is a pre-integral on α then I(f ) = f dµ for some unique measure µ on σ-algebra generated by functions in α (i.e. minimal σ-algebra on which all functions in α are measurable). We will use this theorem with α = Cb (S) and I defined in (8.0.1). The first two properties are obvious. To prove the third one let us consider a sequence such that fn ↓ 0, 0 ≤ fn (x) ≤ f1 (x) ≤ ||f1 ||∞ . 31

On any compact Kr , fn ↓ 0 uniformly, i.e. n→∞

||fn ||∞,Kr ≤ εn,r −→ 0. Since



� fn dPn(k) =

1 fn dPn(k) ≤ εn,r + �f1 ||∞ , r Kr ∪Krc

we get � I(fn ) = lim

k→∞

1 fn dPn(k) ≤ εn,r + ||f1 ||∞ . r

Letting n → ∞ and r → ∞ we get that I(fn ) → 0. By the Stone-Daniell theorem � I(f ) = f dP for some measure on σ(Cb (S)). The choice of f = 1 gives I(f ) = 1 = P(S) which means that P is a probability measure. Finally, let us show that σ(Cb (S)) = B - Borel σ-algebra generated by open sets. Since any f ∈ Cb (S) is measurable on B we get σ(Cb (S)) ⊆ B. On the other hand, let F ⊆ S be any closed set and take a function f (x) = min(1, d(x, F )). We have, |f (x) − f (y)| ≤ d(x, y) so f ∈ Cb (S) and f −1 ({0}) ∈ σ(Cb (S)). However, since F is closed, f −1 ({0}) = {x : d(x, F ) = 0} = F and this proves that B ⊆ σ(Cb (S)). Theorem 21 If Pn converges weakly to P on Rk then (Pn )n≥1 is uniformly tight. Proof. For any ε > 0 there exists large enough M > 0, such that P(|x| > M ) < ε. Consider a function ⎧ 0, s ≤ M, ⎨ 1, s ≥ 2M, α(s) = ⎩ 1 M (s − M ), M ≤ s ≤ 2M. and let α(x) := α(|x|) for x ∈ Rk . Since Pn → P weakly, � � α(x)dPn → α(x)dP. This implies that � � � � � � Pn |x| > 2M ≤ α(x)dPn → α(x)dP ≤ P |x| > M ≤ ε. For n large enough, n ≥ n0 , we get Pn (|x| > 2M ) ≤ 2ε. For n < n0 choose Mn so that Pn (|x| > Mn ) ≤ 2ε. Take M � = max{M1 , . . . , Mn0 −1 , 2M }. As a result, Pn (|x| > M � ) ≤ 2ε for all n ≥ 1. Lemma 13 If for any sequence (n(k))k≥1 there exists a subsequence (n(k(r)))r≥1 such that Pn(k(r)) → P weakly then Pn → P weakly. Proof. Suppose not. Then for some f ∈ Cb (S) and for some ε > 0 there exists a subsequence (n(k)) such that � �� � � � � f dPn(k) − f dP� > ε. But this contradicts the fact that for some subsequence Pn(k(r)) → P weakly. Consider r.v.s X and Xn on some probability space (Ω, A, P) with values in a metric space (S, d). Let P and Pn be their corresponding laws on Borel sets B in S. Convergence of Xn to X in probability and almost surely is defined exactly the same way as for S = R by replacing |Xn − X| with d(Xn , X). 32

Lemma 14 Xn → X in probability iff for any sequence (n(k)) there exists a subsequence (n(k(r))) such that Xn(k(r)) → X a.s. Proof. ”⇐=”. Suppose Xn does not converge to X in probability, Then for small enough ε > 0 there exists a subsequence (n(k)) such that � � P d(X, Xn(k) ) ≥ ε ≥ ε. This contradicts the existence of subsequence Xn(k(r)) that converges to X a.s. ”=⇒”. Given a subsequence (n(k)) let us choose (k(r)) so that � 1� 1 P d(Xn(k(r)) , X) ≥ ≤ 2. r r By Borel-Cantelli lemma, these events can occur i.o. with probability 0, which means that with probability one for large enough r 1 d(Xn(k(r)) , X) ≤ , r i.e. Xn(k(r)) → X a.s.

Lemma 15 Xn → X in probability then Xn → X weakly. Proof. By Lemma 14, for any subsequence (n(k)) there exists a subsequence (n(k(r))) such that Xn(k(r)) → X a.s. Given f ∈ Cb (R), by dominated convergence theorem, Ef (Xn(k(r)) ) → Ef (X), i.e. Xn(k(r)) → X weakly. By Lemma 13, Xn → X weakly.

33

Section 9

Characteristic Functions. Central Limit Theorem on R. Let X = (X1 , . . . , Xk ) be a random vector on Rk with distribution P and let t = (t1 , . . . , tk ) ∈ Rk . Charac­ teristic function of X is defined by � f (t) = Eei(t,X) = ei(t,x) dP(x). If X has standard normal distribution N (0, 1) and λ ∈ R then � � 2 (x−λ)2 λ2 λ2 1 1 λX λx− x2 2 √ e− 2 dx = e 2 . Ee =√ e dx = e 2π 2π For complex λ = it, consider analytic function x2 1 ϕ(x) = eitx √ e− 2 for x ∈ C. 2π

By Cauchy’s theorem, integral over a closed path is equal to 0. Let us take a closed path x + i0 for x from −∞ to +∞ and x + it for x from +∞ to −∞. Then � ∞ � ∞ 2 x2 1 1 1 f (t) = √ eitx− 2 dx = √ eit(it+x)− 2 (it+x) dx 2π −∞ 2π −∞ � ∞ � 2 2 1 1 2 t2 x2 t2 1 1 √ e− 2 dx = e− 2 . = √ e−t +itx+ 2 t −itx− 2 x dx = e− 2 (9.0.1) 2π −∞ 2π If Y has normal distribution N (m, σ 2 ) then EeitY = Eeit(m+σX) = eitm−

t2 σ 2 2

.

Lemma 16 If X is a real-valued r.v. such that E|X|r < ∞ for integer r then f (t) ∈ C r (R) and f (j) (t) = E(iX)j eitX for j ≤ r. Proof. If r = 0, then |eitX | ≤ 1 implies f (t) = EeitX → EeisX = f (s) if t → s,


by dominated convergence theorem. This means that f ∈ C(R). If r = 1 and E|X| < ∞, we can use

|(e^{itX} − e^{isX})/(t − s)| ≤ |X|

and, therefore, by dominated convergence theorem,

f′(t) = lim_{s→t} E (e^{itX} − e^{isX})/(t − s) = E iX e^{itX}.

Also, by dominated convergence theorem, EiXeitX ∈ C(R), which means that f ∈ C 1 (R). We proceed by induction. Suppose that we proved that f (j) (t) = E(iX)j eitX and that r = j + 1, E|X |j+1 < ∞. Then, we can use that � � � (iX)j eitX − (iX)j eisX � � � ≤ |X|j+1 � � t−s so that by dominated convergence theorem f (j+1) (t) = E(iX)j+1 eitX ∈ C(R). The main goal of this section is to prove one of the most famous results in Probability Theory. 2 2 Theorem 22 (Central Limit Theorem) � √ Consider an i.i.d. sequence (Xi )i≥1 2such that EX1 = 0, EX1 = σ < ∞ and let Sn = i≤n Xi . Then Sn / n converges in distribution to N (0, σ ).

We will start with the following.

Lemma 17 We have,

lim_{n→∞} E e^{it S_n/√n} = e^{−σ²t²/2}.

Proof. By independence,

E e^{it S_n/√n} = ∏_{i≤n} E e^{it X_i/√n} = (E e^{it X_1/√n})^n.

Since EX_1² < ∞, the previous lemma implies that ϕ(t) = E e^{itX_1} ∈ C²(R) and, therefore,

ϕ(t) = ϕ(0) + ϕ′(0) t + (1/2) ϕ″(0) t² + o(t²) as t → 0.

Since ϕ(0) = 1, ϕ′(0) = E iX e^{i·0·X} = iEX = 0 and ϕ″(0) = E(iX)² = −EX² = −σ², we get

ϕ(t) = 1 − σ²t²/2 + o(t²).

Finally,

E e^{it S_n/√n} = (ϕ(t/√n))^n = (1 − σ²t²/(2n) + o(t²/n))^n → e^{−σ²t²/2}, n → ∞.

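The following is a small numerical sketch, not part of the notes: it estimates the characteristic function of S_n/√n by Monte Carlo for a centered exponential distribution (an arbitrary illustrative choice with EX = 0 and σ² = 1) and compares it with the limit e^{−σ²t²/2} from Lemma 17. Only numpy is assumed.

```python
# Numerical check of Lemma 17 (illustration only; distribution and parameters are
# arbitrary choices, not part of the notes).
import numpy as np

rng = np.random.default_rng(0)
n, samples = 1000, 5000
sigma2 = 1.0                                        # Var of Exp(1) is 1
X = rng.exponential(1.0, size=(samples, n)) - 1.0   # centered: EX = 0
S = X.sum(axis=1)

for t in [0.5, 1.0, 2.0]:
    emp = np.exp(1j * t * S / np.sqrt(n)).mean()    # empirical characteristic function
    limit = np.exp(-sigma2 * t**2 / 2)
    print(f"t={t}: Re E e^(itS_n/sqrt(n)) = {emp.real:.4f}, limit = {limit:.4f}")
```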
Next, we want to show that characteristic function uniquely determines the distribution. Let X ∼ P, Y ∼ Q be two independent random vectors on Rk . We denote by P ∗ Q the convolution of P and Q which is the law L(X + Y ) of the sum X + Y. We have, �� P ∗ Q(A) = EI(X + Y ∈ A) = I(x + y ∈ A)dP(x)dQ(y) �� � = I(x ∈ A − y)dP(x)dQ(y) = P(A − y)dQ(y). 35

If P has density p then

P ∗ Q(A) = ∫∫ I(x + y ∈ A) p(x) dx dQ(y) = ∫∫ I(z ∈ A) p(z − y) dz dQ(y)
         = ∫∫_A p(z − y) dz dQ(y) = ∫_A ( ∫ p(z − y) dQ(y) ) dz,

which means that P ∗ Q has density

f(x) = ∫ p(x − y) dQ(y).   (9.0.2)

If, in addition, Q has density q then

f(x) = ∫ p(x − y) q(y) dy.

Denote by N(0, σ²I) the law of the random vector X = (X_1, . . . , X_k) of i.i.d. N(0, σ²) random variables whose density on R^k is

∏_{i=1}^k (1/(√(2π)σ)) e^{−x_i²/(2σ²)} = (1/(√(2π)σ))^k e^{−|x|²/(2σ²)}.

For a distribution P denote P_σ = P ∗ N(0, σ²I).

Lemma 18 P_σ = P ∗ N(0, σ²I) has density

p_σ(x) = (1/(2π))^k ∫ f(t) e^{−i(t,x) − σ²|t|²/2} dt, where f(t) = ∫ e^{i(t,x)} dP(x).

Proof. By (9.0.2), P ∗ N(0, σ²I) has density

p_σ(x) = (1/(√(2π)σ))^k ∫ e^{−|x−y|²/(2σ²)} dP(y).

Using (9.0.1), we can write

e^{−(x_i−y_i)²/(2σ²)} = (1/√(2π)) ∫ e^{−i(x_i−y_i)z_i/σ} e^{−z_i²/2} dz_i

and taking the product over i ≤ k we get

e^{−|x−y|²/(2σ²)} = (1/√(2π))^k ∫ e^{−i(x−y,z)/σ} e^{−|z|²/2} dz.

Then we can continue

p_σ(x) = (1/(2πσ))^k ∫∫ e^{−i(x−y,z)/σ − |z|²/2} dz dP(y)
       = (1/(2πσ))^k ∫∫ e^{−i(x−y,z)/σ − |z|²/2} dP(y) dz
       = (1/(2πσ))^k ∫ f(z/σ) e^{−i(x,z)/σ − |z|²/2} dz.

The change of variables z = tσ finishes the proof.

Theorem 23 (Uniqueness) If

∫ e^{i(t,x)} dP(x) = ∫ e^{i(t,x)} dQ(x)

then P = Q.

Proof. By the above Lemma, P_σ = Q_σ. If X ∼ P and µ ∼ N(0, I) then X + σµ → X almost surely as σ → 0 and, therefore, P_σ → P weakly. Similarly, Q_σ → Q.

We proved that the characteristic function of S_n/√n converges to the c.f. of N(0, σ²). Also, the sequence (L(S_n/√n))_{n≥1} is uniformly tight, since by Chebyshev’s inequality

P(|S_n/√n| > M) ≤ σ²/M² < ε

for large enough M. To finish the proof of the CLT on the real line we apply the following.

Lemma 19 If (P_n) is uniformly tight and

f_n(t) = ∫ e^{itx} dP_n(x) → f(t)

then there exists a law P such that P_n → P weakly and f(t) = ∫ e^{itx} dP(x).

Proof. For any sequence (n(k)), by Selection Theorem, there exists a subsequence (n(k(r))) such that P_{n(k(r))} converges weakly to some distribution P. Since e^{i(t,x)} is bounded and continuous,

∫ e^{i(t,x)} dP_{n(k(r))} → ∫ e^{i(t,x)} dP(x)

as r → ∞ and, therefore, f is the c.f. of P. By the uniqueness theorem, the distribution P does not depend on the sequence (n(k)). By Lemma 13, P_n → P weakly.

Section 10

Multivariate normal distributions and CLT. Let P be a probability distribution on R^k and let

g(t) = ∫ e^{i(t,x)} dP(x).

We proved that P_σ = P ∗ N(0, σ²I) has density

p_σ(x) = (2π)^{−k} ∫ g(t) e^{−i(t,x) − σ²|t|²/2} dt.

Lemma 20 (Fourier inversion formula) If ∫ |g(t)| dt < ∞ then P has density

p(x) = (2π)^{−k} ∫ g(t) e^{−i(t,x)} dt.

Proof. Since

g(t) e^{−i(t,x) − σ²|t|²/2} → g(t) e^{−i(t,x)}

pointwise as σ → 0 and

|g(t) e^{−i(t,x) − σ²|t|²/2}| ≤ |g(t)| - integrable,

by dominated convergence theorem p_σ(x) → p(x). Since P_σ → P weakly, for any f ∈ C_b(R^k),

∫ f(x) p_σ(x) dx → ∫ f(x) dP(x).

On the other hand, since

|p_σ(x)| ≤ (2π)^{−k} ∫ |g(t)| dt < ∞,

by dominated convergence theorem, for any compactly supported f ∈ C_b(R^k),

∫ f(x) p_σ(x) dx → ∫ f(x) p(x) dx.

Therefore, for any such f,

∫ f(x) dP(x) = ∫ f(x) p(x) dx.

It is now a simple exercise to show that for any bounded open set U,

∫_U dP(x) = ∫_U p(x) dx.

This means that P restricted to bounded sets has density p(x) and, hence, on the entire R^k. For a random vector X = (X_1, . . . , X_k) ∈ R^k we denote EX = (EX_1, . . . , EX_k).

Theorem 24 Consider a sequence (X_i)_{i≥1} of i.i.d. random vectors on R^k such that EX_1 = 0, E|X_1|² < ∞. Then L(S_n/√n) converges weakly to the distribution P which has characteristic function

f_P(t) = e^{−(Ct,t)/2}, where C = Cov(X_1) = (EX_{1,i}X_{1,j})_{1≤i,j≤k}.   (10.0.1)

Proof. Consider any t ∈ R^k. Then Z_i = (t, X_i) is an i.i.d. real-valued sequence and by the proof of the CLT on the real line,

E e^{i(t, S_n/√n)} = E e^{(i/√n) Σ_{i≤n} (t, X_i)} → e^{−Var((t,X_1))/2} = e^{−(Ct,t)/2}

as n → ∞, since

Var(t_1X_{1,1} + · · · + t_kX_{1,k}) = Σ_{i,j} t_i t_j EX_{1,i}X_{1,j} = (Ct, t) = tᵀCt.

The sequence (L(S_n/√n)) is uniformly tight on R^k since

P(|S_n/√n| ≥ M) ≤ (1/M²) E|S_n/√n|² = (1/(nM²)) E|(S_{n,1}, . . . , S_{n,k})|² = (1/(nM²)) Σ_{i≤k} ES_{n,i}² = (1/M²) E|X_1|² → 0 as M → ∞.

It remains to apply Lemma 19 from the previous section.

The covariance matrix C = Cov(X) is symmetric and non-negative definite, (Ct, t) = E(t, X)² ≥ 0. The unique distribution with c.f. in (10.0.1) is called a multivariate normal distribution with covariance C and is denoted by N(0, C). It can also be defined more constructively as follows. Consider an i.i.d. sequence g_1, . . . , g_n of N(0, 1) r.v.s and let g = (g_1, . . . , g_n)ᵀ. Given a k × n matrix A, the covariance matrix of Ag ∈ R^k is

C = Cov(Ag) = E Ag(Ag)ᵀ = A E ggᵀ Aᵀ = AAᵀ.

The c.f. of Ag is

E e^{i(t,Ag)} = E e^{i(Aᵀt,g)} = e^{−|Aᵀt|²/2} = e^{−tᵀAAᵀt/2} = e^{−(Ct,t)/2}.

This means that Ag ∼ N(0, C). Interestingly, the distribution of Ag depends only on AAᵀ and does not depend on the choice of n and A.

Exercise. Show constructively, using linear algebra, that the distribution of Ag and Bg′ is the same if AAᵀ = BBᵀ.

On the other hand, given a covariance matrix C one can always find A such that C = AAᵀ. For example, let C = QDQᵀ be its eigenvalue decomposition for an orthogonal matrix Q and a diagonal D. Since C is nonnegative definite, the elements of D are nonnegative. Then, one can take n = k and A = C^{1/2} := QD^{1/2}Qᵀ or A = QD^{1/2}.

Density in the invertible case. Suppose det(C) ≠ 0. Take A such that C = AAᵀ so that Ag ∼ N(0, C). Since the density of g is

∏_{i≤k} (1/√(2π)) exp(−x_i²/2) = (1/√(2π))^k exp(−|x|²/2),

for any set Ω ⊆ R^k we can write

P(Ag ∈ Ω) = P(g ∈ A^{−1}Ω) = ∫_{A^{−1}Ω} (1/√(2π))^k exp(−|x|²/2) dx.

Let us now make the change of variables y = Ax or x = A^{−1}y. Then

P(Ag ∈ Ω) = ∫_Ω (1/√(2π))^k (1/|det(A)|) exp(−|A^{−1}y|²/2) dy.

But since

det(C) = det(AAᵀ) = det(A) det(Aᵀ) = det(A)²

we have |det(A)| = √(det C). Also,

|A^{−1}y|² = (A^{−1}y)ᵀ(A^{−1}y) = yᵀ(Aᵀ)^{−1}A^{−1}y = yᵀ(AAᵀ)^{−1}y = yᵀC^{−1}y.

Therefore, we get

P(Ag ∈ Ω) = ∫_Ω (1/√(2π))^k (1/√(det C)) exp(−yᵀC^{−1}y/2) dy.

This means that the distribution N(0, C) has the density

(1/√(2π))^k (1/√(det C)) exp(−yᵀC^{−1}y/2).

General case. Let us take, for example, a vector X = QD^{1/2}g for an i.i.d. standard normal vector g so that X ∼ N(0, C). If q_1, . . . , q_k are the column vectors of Q then

X = QD^{1/2}g = (λ_1^{1/2}g_1)q_1 + . . . + (λ_k^{1/2}g_k)q_k.

Therefore, in the orthonormal coordinate basis q_1, . . . , q_k the random vector X has coordinates λ_1^{1/2}g_1, . . . , λ_k^{1/2}g_k. These coordinates are independent with normal distributions with variances λ_1, . . . , λ_k correspondingly. When det(C) = 0, i.e. C is not invertible, some of its eigenvalues will be zero, say, λ_{n+1} = . . . = λ_k = 0. Then the random vector X will be concentrated on the subspace spanned by the vectors q_1, . . . , q_n but it will not have a density on the entire space R^k. On the subspace spanned by q_1, . . . , q_n the vector X will have the density

f(x_1, . . . , x_n) = ∏_{i=1}^n (1/√(2πλ_i)) exp(−x_i²/(2λ_i)).
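The constructive definition above is easy to use in practice. Here is a minimal sketch (not part of the notes; only numpy is assumed): sample Ag with A = QD^{1/2} built from the eigenvalue decomposition of a chosen covariance matrix C, and check that the empirical covariance is close to C.

```python
# Illustration: construct N(0, C) as Ag with A = Q D^{1/2} from C = Q D Q^T.
# The matrix C below is an arbitrary positive definite example.
import numpy as np

rng = np.random.default_rng(1)
C = np.array([[2.0, 0.6, 0.0],
              [0.6, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
lam, Q = np.linalg.eigh(C)               # C = Q diag(lam) Q^T
A = Q @ np.diag(np.sqrt(lam))            # then A A^T = C

g = rng.standard_normal((3, 100000))     # i.i.d. N(0,1) coordinates
X = A @ g                                # columns are samples of N(0, C)
print(np.round(np.cov(X), 3))            # should be close to C
```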

Let us look at a couple of properties of normal distributions.

Lemma 21 If X ∼ N(0, C) on R^k and A : R^k → R^m is linear then AX ∼ N(0, ACAᵀ) on R^m.

Proof. The c.f. of AX is

E e^{i(t,AX)} = E e^{i(Aᵀt,X)} = e^{−(CAᵀt,Aᵀt)/2} = e^{−(ACAᵀt,t)/2}.

Lemma 22 X is normal on R^k iff (t, X) is normal on R for all t ∈ R^k.

Proof. ”=⇒”. The c.f. of the real-valued random variable (t, X) is

f(λ) = E e^{iλ(t,X)} = E e^{i(λt,X)} = e^{−(Cλt,λt)/2} = e^{−λ²(Ct,t)/2}

which means that (t, X) ∼ N(0, (Ct, t)). ”⇐=”. If (t, X) is normal then

E e^{i(t,X)} = e^{−(Ct,t)/2}

because the variance of (t, X) is (Ct, t).

Lemma 23 Let Z = (X, Y ) where X = (X1 , . . . , Xi ) and Y = (Y1 , . . . , Yj ) and suppose that Z is normal on Ri+j . Then X and Y are independent iff Cov(Xm , Yn ) = 0 for all m, n. Proof. One way is obvious. The other way around, suppose that � � D 0 C = Cov(Z) = . 0 F Then the c.f. of Z is 1

1

1

Eei(t,Z) = e− 2 (Ct,t) = e− 2 (Dt1 ,t1 )− 2 (F t2 ,t2 ) = Eei(t1 ,X) Eei(t2 ,Y ) , where t = (t1 , t2 ). By uniqueness, X and Y are independent. Lemma 24 (Continuous Mapping.) Suppose that Pn → P on X and G : X → Y is a continuous map. Then Pn ◦ G−1 → P ◦ G−1 on Y. In other words, if r.v. Zn → Z weakly then G(Zn ) → G(Z) weakly. Proof. This is obvious, because for any f ∈ Cb (Y ), we have f ◦ G ∈ Cb (X) and therefore, Ef (G(Zn )) → Ef (G(Z)).

Lemma 25 If Pn → P on Rk and Qn → Q on Rm then Pn × Qn → P × Q on Rk+m . Proof. By Fubini theorem, The c.f. � � � � � � i(t,x) i(t1 ,x1 ) i(t2 ,x2 ) i(t1 ,x1 ) i(t2 ,x2 ) e dPn × Qn (x) = e dPn e dQn → e dP e dQ = ei(t,x) dP × Q. By Lemma 19 it remains to show that (Pn × Qn ) is uniformly tight. By Theorem 21, since Pn → P, (Pn ) is uniformly tight. Therefore, there exists a compact K on Rk such that Pn (K) > 1 − ε. Similarly, for some compact K � on Rm , Qn (K � ) > 1 − ε. We have, Pn × Qn (K × K � ) > 1 − 2ε and K × K � is a compact on Rk+m . Corollary 1 If Pn → P and Qn → Q both on Rk then Pn ∗ Qn → P ∗ Q. Proof. Since a function G : Rk+k → Rk given by G(x, y) = x + y is continuous, by continuous mapping lemma, Pn ∗ Qn = (Pn × Qn ) ◦ G−1 → (P × Q) ◦ G−1 = P ∗ Q.


Section 11

Lindeberg’s CLT. Levy’s Equivalence Theorem. Three Series Theorem. Instead of considering i.i.d. sequences, for each n ≥ 1 we will consider a vector (X_1^n, . . . , X_n^n) of independent r.v.s, not necessarily identically distributed. This setting is called triangular arrays because the entire vector may change with n.

Theorem 25 Consider a vector (X_i^n)_{1≤i≤n} of independent r.v.s such that

EX_i^n = 0, Var(S_n) = Σ_{i≤n} E(X_i^n)² = 1.

Suppose that the following Lindeberg’s condition is satisfied:

Σ_{i=1}^n E(X_i^n)² I(|X_i^n| > ε) → 0 as n → ∞ for all ε > 0.   (11.0.1)

Then L(Σ_{i≤n} X_i^n) → N(0, 1).

Proof. First of all, (L(Σ_{i≤n} X_i^n)) is uniformly tight, because by Chebyshev’s inequality

P(|Σ_{i≤n} X_i^n| > M) ≤ 1/M² ≤ ε

for large enough M. It remains to show that the characteristic function of S_n converges to e^{−λ²/2}. For simplicity of notation let us omit the upper index n and write X_i instead of X_i^n. Since

E e^{iλS_n} = ∏_{i≤n} E e^{iλX_i},

it is enough to show that � �� � λ2 log 1 + EeiλXi − 1 → − . 2 It is an easy exercise to prove, by induction on m, that for any a ∈ R, log EeiλSn =



� � |a|m+1 � ia � (ia)k � . �e − �≤ k! (m + 1)! k≤m


(11.0.2)

(11.0.3)

(Just integrate this inequality to make the induction step.) Using this for m = 1, � � � � λ2 λ2 2 λ2 1 � iλXi � � � − 1� = �EeiλXi − 1 − iλEXi � ≤ EXi2 ≤ ε + EXi2 I(|Xi | > ε) ≤ λ2 ε2 ≤ �Ee 2 2 2 2

(11.0.4)

for large n by (11.0.1) and for small enough ε. Using the expansion of log(1 + z) it is easy to check that � � 1 � � �log(1 + z) − z � ≤ |z|2 for |z| ≤ 2 and, therefore, we can write ��� � � �� � ��� �log 1 + EeiλXi − 1 − EeiλXi − 1 � ≤ i≤n

�� � λ4 � � �2 �EeiλXi − 1�2 ≤ EXi2 4 i≤n



i≤n

�� λ4 � λ4 max EXi2 EXi2 = max EXi2 → 0 4 i≤n 4 i≤n i≤n

because, as in (11.0.4), EXi2 ≤ ε2 + EXi2 I(|Xi | > ε) → 0 for large n by (11.0.1) and for ε → 0. Finally, to show (11.0.2) it remains to show that �� i≤n

� λ2 EeiλXi − 1 → − . 2

� � Using (11.0.3) for m = 1, on the event |Xi | > ε , � �� � λ2 2 � � � iλXi � − 1 − iλXi �I |Xi | > ε ≤ Xi I (|Xi | > ε �e 2 and, therefore, � �� � � � λ2 � iλXi � − 1 − iλXi + Xi2 �I |Xi | > ε ≤ λ2 Xi2 I |Xi | > ε . �e 2 � � Using (11.0.3) for m = 2, on the event |Xi | ≤ ε , � �� � λ3 � � λ3 ε 2 λ2 � iλXi � − 1 − iλXi + Xi2 �I |Xi | ≤ ε ≤ |Xi |3 I |Xi | ≤ ε ≤ X . �e 2 6 6 i Combining the last two equations and using that EXi = 0, � � � � λ3 ε λ2 � iλXi � − 1 + EXi2 � ≤ λ2 EXi2 I |Xi | > ε + EXi2 . �Ee 2 6 Finally, ��� � λ2 �� � EeiλXi − 1 + � � 2 i≤n

��� � � λ2 � � � = � EeiλXi − 1 + EXi2 � 2 i≤n 3



λ ε + λ2 6

i≤n



EXi2 I(|Xi | > ε) → 0

as n → ∞ (using Lindeberg’s condition) and ε → 0.
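A small numerical illustration (not in the notes) of Lindeberg’s condition: for the triangular array X_i^n = Y_i/√n with bounded Y_i of unit variance, the Lindeberg sum vanishes once √(3/n) < ε, while Var(S_n) stays equal to 1. The uniform distribution and the parameters below are arbitrary choices; only numpy is assumed.

```python
# Monte Carlo estimate of the Lindeberg sum for X_i^n = Y_i / sqrt(n),
# Y_i uniform on [-sqrt(3), sqrt(3)] (EY = 0, EY^2 = 1).
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1
for n in [10, 100, 1000]:
    Y = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(5000, n))
    Xn = Y / np.sqrt(n)
    lindeberg = (Xn**2 * (np.abs(Xn) > eps)).sum(axis=1).mean()
    S = Xn.sum(axis=1)
    print(f"n={n}: Lindeberg sum ~ {lindeberg:.4f}, Var(S_n) ~ {S.var():.3f}")
```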

Lemma 26 If P, Q are distributions on R such that P ∗ Q = P then Q({0}) = 1.


Proof. Let us define

� fP (t) =

eitx dP(x), fQ (t) =



eitx dQ(x).

The condition P ∗ Q = P implies that fP (t)fQ (t) = fP (t). Since fP (0) = 1 and fP (t) is continuous, for small enough |t| ≤ ε we have |fP (t)| > 0 and, as a result, fQ (t) = 1. Since � � fQ (t) = cos(tx)dQ(x) + i sin(tx)dQ(x) for |t| ≤ ε this implies that



cos(tx)dQ(x) = 1 and since cos(s) ≤ 1 this can happen only if �� �� Q x : xt = 0 mod 2π = 1 for all |t| ≤ ε.

Take s, t such that |s|, |t| ≤ ε and s/t is irrational. For x to be in the support of Q we must have xs = 2πk and xt = 2πm for some integer k, m. This can happen only if x = 0.

Theorem 26 (Levy’s equivalence) If (X_i) is a sequence of independent r.v.s then Σ_{i≥1} X_i converges a.s. iff in probability iff in law.

Proof. We already proved (a Kolmogorov’s theorem) that convergence in probability implies a.s. convergence. It remains to prove only that convergence in law implies convergence in probability. Suppose that L(Sn ) → P. Convergence in law implies that {L(Sn )} is uniformly tight which easily implies that {L(Sn − Sk )}n,k≤1 is uniformly tight. This will imply that for any ε > 0 P(|Sn − Sk | > ε) < ε

(11.0.5)

for n ≥ k ≥ N for large enough N. Suppose not. Then there exists ε > 0 and sequences (n(l)) and (n� (l)) such that n(l) ≤ n� (l) and P(|Sn� (l) − Sn(l) | > ε) ≥ ε. Let us denote Yl = Sn� (l) − Sn(l) . Since {L(Yl )} is uniformly tight, by selection theorem, there exists a subsequence (l(r)) such that L(Yl(r) ) → Q. Since Sn� (l(r)) = Sn(l(r)) + Yl(r) and L(Sn� (l(r)) ) = L(Sn(l(r)) ) ∗ L(Yl(r) ) letting r → ∞ we get that P = P∗ Q. By the above Lemma, Q({0}) = 1, which implies that P(|Yl(r) | > ε) ≤ ε for large r - a contradiction. Once (11.0.5) is proved, by Borel-Cantelli lemma we can choose a.s. converging subsequence as in Kolmogorov’s theorem, and then by (11.0.5) Sn converges in probability to the same limit.

Theorem 27 (Three series theorem) Let (X_i)_{i≥1} be a sequence of independent r.v.s and let Z_i = X_i I(|X_i| ≤ 1). Then Σ_{i≥1} X_i converges iff

1. Σ_{i≥1} P(|X_i| > 1) < ∞,

2. Σ_{i≥1} EZ_i converges,

3. Σ_{i≥1} Var(Z_i) < ∞.

Proof. ”⇐=”. Suppose 1 - 3 hold. Since � � P(|Xi | > 1) = P(Xi = � Zi ) < ∞ i≥1

i≥1


� � by Borel-Cantelli lemma P({Xi �= �Zi } i.o.) = 0 which means that i≥1 Xi converges iff i≥1 Zi converges. By 2, it is enough to show that i≥1 (Zi − EZi ) converges, but this follows from Theorem 13 by 3. � ”=⇒”. If i≥1 Xi converges a.s.,

P({|Xi | > 1} i.o) = 0

and since (Xi ) are independent, by Borel-Cantelli, � P(|Xi | > 1) < ∞. i≥1

If



i≥1

Xi converges then, obviously,



i≥1

Zi converges. Let Smn =



Zk .

m≤k≤n

Since Smn → 0 as m, n → ∞, P(|Smn | > δ) ≤ δ for any δ > 0 for m, n large enough. Suppose that � Var(Z i ) = ∞. Then i≥1 � 2 σmn = Var(Smn ) = Var(Zk ) → ∞ m≤k≤n

as n → ∞ for any fixed m. Intuitively, this should not happen: Smn → 0 in probability but their variance goes to infinity. In principle, one can construct such sequence of random variables but in our case it will be ruled out by Lindeberg’s CLT. Because σmn → ∞, Lindeberg’s theorem will imply that, Tmn =

Smn − ESmn = σmn

� m≤k≤n

Zk − EZk → N (0, 1), σmn

2 if m, n → ∞ and σmn → ∞. We only need to check that



E

m≤k≤n

� Z − EZ �2 � � Z − EZ � � � k k k k� I � �>ε →0 σmn σmn

as m, n, n − m → ∞. Since |Zk − EZk | < 2 and σmn → ∞, the event in the indicator does not occur for large m, n and, therefore, the sum is equal to 0. Next, since P(|Smn | ≤ δ) ≥ 1 − δ we get �� ESmn �� δ � � P �Tmn + ≥ 1 − δ. �≤ σmn σmn But this is impossible, since Tmn is approximately standard normal and standard normal distribution does � not concentrate near any constant. We proved that Var(Z i ) < ∞. By Kolmogorov’s SLLN this implies i≥1 � � that i≥1 (Zi − EZi ) converges and, therefore, i≥1 EZi converges.
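For example (not from the notes), for X_i = ε_i/i with independent random signs the three series of the theorem are easy to check numerically, and the partial sums indeed settle down. A minimal sketch, assuming only numpy:

```python
# Three-series check for X_i = eps_i / i, P(eps_i = +-1) = 1/2: here |X_i| <= 1, so
# Z_i = X_i, sum P(|X_i|>1) = 0, E Z_i = 0, and sum Var(Z_i) = sum 1/i^2 < infinity.
import numpy as np

rng = np.random.default_rng(3)
i = np.arange(1, 100001)
print("sum P(|X_i|>1) = 0,  sum Var(Z_i) ~", (1.0 / i**2).sum())

eps = rng.choice([-1.0, 1.0], size=i.shape)
partial = np.cumsum(eps / i)
print("partial sums at n = 10^3, 10^4, 10^5:",
      partial[999], partial[9999], partial[99999])
```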


Section 12

Levy’s Continuity Theorem. Poisson Approximation. Conditional Expectation. Let us start with the following bound.

Lemma 27 Let X be a real-valued r.v. with distribution P and let f(t) = E e^{itX} = ∫ e^{itx} dP(x). Then,

P(|X| > 1/u) ≤ (7/u) ∫_0^u (1 − Re f(t)) dt.

Proof. Since

Re f(t) = ∫ cos(tx) dP(x),

we have

(1/u) ∫_0^u ∫_R (1 − cos tx) dP(x) dt = ∫_R (1/u) ∫_0^u (1 − cos tx) dt dP(x)
   = ∫_R (1 − sin(xu)/(xu)) dP(x)
   ≥ ∫_{|xu|≥1} (1 − sin(xu)/(xu)) dP(x)     (since sin y / y < sin 1 if y > 1)
   ≥ (1 − sin 1) ∫_{|xu|≥1} dP(x) ≥ (1/7) P(|X| ≥ 1/u).

Theorem 28 (Levy continuity) Let (Xn ) be a sequence of r.v. on Rk . Suppose that fn (t) = Eei(t,Xn ) → f (t)


and f (t) is continuous at 0 along each axis. Then there exists a probability distribution P such that � f (t) = ei(t,x) dP(x) and L(Xn ) → P. Proof. By Lemma 19 we only need to show that {L(Xn )} is uniformly tight. If we denote Xn = (Xn,1 , . . . , Xn,k ) then the c.f.s along the ith coordinate: fni (ti ) := fn (0, . . . , ti , 0, . . . 0) = Eeiti Xn,i → f (0, . . . , ti , . . . 0) =: f i (ti ). Since fn (0) = 1 and, therefore, f (0) = 1, for any ε > 0 we can find δ > 0 such that for all i ≤ k |f i (ti ) − 1| ≤ ε if |ti | ≤ δ. This implies that for large enough n |fni (ti ) − 1| ≤ 2ε if |ti | ≤ δ. Using previous Lemma, � � � � � 1 � 7 δ� 7 δ �� ≤ P |Xn,i | > 1 − Refni (ti ) dti ≤ 1 − fni (ti )� dti ≤ 7 · 2ε. δ δ 0 δ 0 The union bound implies that

√ � k ≤ 14kε P |Xn | > δ �

and {L(Xn )}n≥1 is uniformly tight.
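A tiny numerical illustration (not from the notes) of why the continuity assumption at 0 in Theorem 28 matters: for X_n ∼ N(0, n) the characteristic functions converge pointwise to a function equal to 1 at t = 0 and to 0 elsewhere, which is discontinuous at 0, and indeed the sequence is not uniformly tight.

```python
# c.f. of N(0, n) is exp(-n t^2 / 2); the pointwise limit is discontinuous at 0
# and the mass on any fixed interval tends to 0 (no tightness).
import numpy as np
from math import erf, sqrt

t = np.array([0.0, 0.01, 0.1, 1.0])
for n in [1, 100, 10000]:
    fn = np.exp(-n * t**2 / 2)          # characteristic function at the chosen t's
    mass = erf(10 / sqrt(2 * n))        # P(|X_n| <= 10) for X_n ~ N(0, n)
    print(f"n={n}: f_n(t) = {np.round(fn, 4)}, P(|X_n| <= 10) = {mass:.4f}")
```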

CLT describes how sums of independent r.v.s are approximated by normal distribution. We will now give a

simple example of a different approximation. Consider independent Bernoulli random variables X_i^n ∼ B(p_i^n) for i ≤ n, i.e. P(X_i^n = 1) = p_i^n and P(X_i^n = 0) = 1 − p_i^n. If p_i^n = p > 0 then by CLT

(S_n − np)/√(np(1 − p)) → N(0, 1).

However, if p = p_i^n → 0 fast enough then, for example, the Lindeberg conditions will be violated. It is well-known that if p_i^n = p_n and np_n → λ then S_n has approximately Poisson distribution Π_λ with p.f.

f(k) = (λ^k/k!) e^{−λ} for k = 0, 1, 2, . . .

Here is a version of this result.

Theorem 29 Consider independent X_i ∼ B(p_i) for i ≤ n and let S_n = X_1 + . . . + X_n and λ = p_1 + . . . + p_n. Then for any subset of integers B ⊆ Z,

|P(S_n ∈ B) − Π_λ(B)| ≤ Σ_{i≤n} p_i².

Proof. The proof is based on the construction on ”one probability space”. Let us construct a Bernoulli r.v. X_i ∼ B(p_i) and a Poisson r.v. X_i* ∼ Π_{p_i} on the same probability space as follows. Let us consider a probability space ([0, 1], B, λ) with Lebesgue measure λ. Define

X_i = X_i(x) = 0 for 0 ≤ x ≤ 1 − p_i, and 1 for 1 − p_i < x ≤ 1.

Clearly, X_i ∼ B(p_i). Let us construct X_i* as follows. If for k ≥ 0 we define

c_k = Σ_{0≤l≤k} (p_i^l/l!) e^{−p_i}

then

X_i* = X_i*(x) = 0 for 0 ≤ x ≤ c_0, 1 for c_0 < x ≤ c_1, 2 for c_1 < x ≤ c_2, . . .

Clearly, X_i* ∼ Π_{p_i}. When is X_i ≠ X_i*? Since 1 − p_i ≤ e^{−p_i} = c_0, this can only happen for 1 − p_i < x ≤ c_0 or c_1 < x ≤ 1, i.e.

P(X_i ≠ X_i*) = e^{−p_i} − (1 − p_i) + (1 − e^{−p_i} − p_i e^{−p_i}) = p_i(1 − e^{−p_i}) ≤ p_i².

We construct the pairs (X_i, X_i*) on separate coordinates of a product space, thus making them independent for i ≤ n. It is well-known that Σ_{i≤n} X_i* ∼ Π_λ and, finally, we get

P(S_n ≠ S_n*) ≤ Σ_{i≤n} P(X_i ≠ X_i*) ≤ Σ_{i≤n} p_i².
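The coupling in the proof is easy to simulate. The following sketch (not from the notes; only numpy and the standard library are assumed, and p is an arbitrary choice) builds X_i and X_i* from the same uniform point and estimates P(X_i ≠ X_i*).

```python
# Quantile coupling of Bernoulli(p) and Poisson(p) on the same uniform x in [0,1].
import numpy as np
from math import exp, factorial

def couple(p, x):
    bern = 1 if x > 1 - p else 0
    c, k = exp(-p), 0                 # c = c_0; find smallest k with x <= c_k
    while x > c:
        k += 1
        c += p**k * exp(-p) / factorial(k)
    return bern, k

rng = np.random.default_rng(4)
p = 0.05
xs = rng.uniform(size=200000)
diff = np.mean([b != k for b, k in (couple(p, x) for x in xs)])
print(f"empirical P(X != X*) = {diff:.5f}, "
      f"exact p(1-e^-p) = {p * (1 - exp(-p)):.5f}, bound p^2 = {p**2:.5f}")
```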

Conditional expectation. Let (Ω, B, P) be a probability space and X : Ω → R be a random variable such that E|X| < ∞. Let A be a σ-subalgebra of B, A ⊆ B. Definition. Y = E(X|A) is called conditional expectation of X given A if 1. Y : Ω → R is measurable on A, i.e. if B is a Borel set on R then Y −1 (B) ∈ A. � 1 ω∈A 2. For any set A ∈ A we have EXIA = EY IA , where IA (ω) = 0 ω∈ / A. Definition. If X, Z are random variables then conditional expectation of X given Z is defined by Y = E(X|Z) = E(X|σ(Z)). Since Y is measurable on σ(Z), Y = f (Z) for some measurable function f . Properties of conditional expectation. 1. (Existence of conditional expectation.) Let us define � µ(A) = XdP for A ∈ A. A

µ(A) is a σ-additive signed measure on A. Since X is integrable, if P(A) = 0 then µ(A) = 0 which means that µ is absolutely continuous w.r.t. P. By Radon-Nikodym theorem, there exists Y = dµ dP measurable on A such that for A ∈ A � � µ(A) = XdP = Y dP. A

A


By definition Y = E(X|A). 2. (Uniqueness) Suppose there exists Y � = E(X|A) such that P(Y = � Y � ) > 0, i.e. P(Y > Y � ) > 0 or P(Y < Y � ) > 0. Since both Y, Y � are measurable on A the set A = {Y > Y � } ∈ A. One one hand, E(Y − Y � )IA > 0. On the other hand, E(Y − Y � )IA = EXIA − EXIA = 0 - a contradiction. 3. E(cX + Y |A) = cE(X|A) + E(Y |A). 4. If σ-algebras C ⊆ A ⊆ B then E(E(X|A)|C) = E(X|C). Consider a set C ∈ C ⊆ A. Then EIC (E(E(X|A)|C)) = EIC E(X|A) = EIC X and EIC (E(X|C)) = EXIC . We conclude by uniqueness. 5. E(X|B) = X, E(X|{∅, Ω}) = EX, E(X|A) = EX if X is independent of A. 6. If X ≤ Z then E(X|A) ≤ E(Z|A) a.s.; proof is similar to proof of uniqueness. 7. (Monotone convergence) If E|Xn | < ∞, E|X| < ∞ and Xn ↑ X then E(Xn |A) ↑ E(X|A). Since E(Xn |A) ≤ E(Xn+1 |A) ≤ E(X|A) there exists a limit g = lim E(Xn |A) ≤ E(X|A). n→∞

Since E(Xn |A) are measurable on A, so is g = lim E(Xn |A). It remains to check that for any set A ∈ A, EgIA = EXIA . Since Xn IA ↑ XIA and E(Xn |A)IA ↑ gIA , by monotone convergence theorem, EXn IA ↑ EXIA and EIA E(Xn |A) ↑ EgIA . But since EIA E(Xn |A) = EXn IA this implies that EgIA = EXIA and, therefore, g = E(X|A) a.s. 8. (Dominated convergence) If |Xn | ≤ Y, EY < ∞, and Xn → X then

lim E(Xn |A) = E(X|A).

We can write, −Y ≤ gn = inf Xm ≤ Xn ≤ hn = sup Xm ≤ Y. m≥n

m≥n

Since gn ↑ X, hn ↓ X, |gn | ≤ Y, |hn | ≤ Y by monotone convergence E(gn |A) ↑ E(X|A), E(hn |A) ↓ E(X|A) =⇒ EXn |A) → E(X|A). 9. If E|X| < ∞, E|XY | < ∞ and Y is measurable on A then

E(XY |A) = Y E(X|A).


We can assume that X, Y ≥ 0 by decomposing X = X + − X − , Y = Y + − Y − . Consider a sequence of simple functions � Yn = wk ICk , Ck ∈ A measurable on A such that 0 ≤ Yn ↑ Y. By monotone convergence theorem, it is enough to prove that E(XICk |A) = ICk E(X|A). Take B ∈ A. Since BCk ∈ A, EIB ICk E(X|A) = EIBCk E(X|A) = EXIBCk = E(XICk )IB . 10. (Jensen’s inequality) If f : R → R is convex then f (E(X|A)) ≤ E(f (X|A)). By convexity, f (X) − f (E(X|A)) ≥ ∂f (E(X|A))(X − E(X|A)). Taking condition expectation of both sides, E(f (X)|A) − f (E(X|A)) ≥ ∂f (E(X|A))(E(X|A) − E(X|A)) = 0.
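For a σ-algebra generated by a finite partition, the conditional expectation is just the block-wise average of X. The following small sketch (not from the notes) checks the defining property EXI_A = E(E(X|A)I_A) numerically; the partition of [0, 1] into four intervals and the choice of X are arbitrary illustrative choices, and only numpy is assumed.

```python
# Conditional expectation given the sigma-algebra generated by a finite partition.
import numpy as np

rng = np.random.default_rng(5)
omega = rng.uniform(size=100000)          # sample points in [0, 1]
X = omega**2                              # a random variable X(omega)
blocks = np.floor(omega * 4).astype(int)  # partition [0,1) into 4 blocks

cond = np.zeros_like(X)                   # E(X|A): block-wise averages
for b in range(4):
    mask = blocks == b
    cond[mask] = X[mask].mean()

for b in range(4):                        # check E[X I_A] = E[E(X|A) I_A] per block
    mask = blocks == b
    print(b, (X * mask).mean(), (cond * mask).mean())
```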


Section 13

Martingales. Doob’s Decomposition. Uniform Integrability. Let (Ω, B, P) be a probability space and let (T, ≤) be a linearly ordered set. Consider a family of σ-algebras Bt , t ∈ T such that for t ≤ u, Bt ⊆ Bu ⊆ B. Definition. A family (Xt , Bt )t∈T is called a martingale if 1. Xt : Ω → R is measurable w.r.t. Bt ; in other words, Xt is adapted to Bt . 2. E|Xt | < ∞. 3. E(Xu |Bt ) = Xt for t ≤ u. If the last equality is replaced by E(Xu |Bt ) ≤ Xt then the process is called a supermartingale and if E(Xu |Bt ) ≥ Xt then it is called a submartingale. Examples. � 1. Consider a sequence (Xn )n≥1 of independent random variables such that EXi = 0 and let Sn = i≤n Xi . If Bn = σ(X1 , . . . , Xn ) is a σ-algebra generated by the first n r.v.s then (Sn , Bn )n≥1 is a martingale since E(Sn+1 |Bn ) = E(Xn+1 + Sn |Bn ) = 0 + Sn = Sn . 2. Consider a sequence of σ-algebras . . . ⊆ Bm ⊆ Bn ⊆ . . . ⊆ B and a r.v. X on B and let Xn = E(X|Bn ). Then (Xn , Bn ) is a martingale since for m < n E(Xn |Bm ) = E(E(X|Bn )|Bm ) = E(X|Bm ) = Xm . Definition. If (Xn , Bn ) is a martingale and for some r.v. X, Xn = E(X|Bn ), then the martingale is called right-closable. If X∞ = X, B∞ = B then (Xn , Bn )n≤∞ is called right-closed. � 3. Let (Xi )i≥1 be i.i.d. and let Sn = i≤n Xi . Let us take T = {. . . , −2, −1} and for n ≥ 1 define B−n = σ(Sn , Sn+1 , . . .) = σ(Sn , Xn+1 , Xn+2 , . . .). Clearly, B−(n+1) ⊆ B−n . For 1 ≤ k ≤ n, by symmetry, E(X1 |B−n ) = E(Xk |B−n ). Therefore, Sn = E(Sn |B−n ) =



E(Xk |B−n ) = nE(X1 |B−n ) =⇒ Z−n :=

1≤k≤n


Sn = E(X1 |B−n ). n

Thus, (Z−n , B−n )−n≤−1 is a right-closed martingale.

Lemma 28 Let f : R → R be a convex function. Suppose that either one of two conditions holds: 1. (Xt , Bt ) is a martingale, 2. (Xt , Bt ) is a submartingale and f is increasing. Then (f (Xt ), Bt ) is a submartingale. Proof. 1. For t ≤ u, by Jensen’s inequality, f (Xt ) = f (E(Xu |Bt )) ≤ E(f (Xu )|Bt ). 2. For t ≤ u, since Xt ≤ E(Xu |Bt ) and f is increasing, f (Xt ) ≤ f (E(Xu |Bt )) ≤ E(f (Xu )|Bt ), where the last step is again Jensen’s inequality.

Theorem 30 (Doob’s decomposition) If (Xn , Bn )n≥0 is a submartingale then it can be uniquely decomposed Xn = Zn + Yn , where (Yn , Bn ) is a martingale, Z0 = 0, Zn ≤ Zn+1 almost surely and Zn is Bn−1 -measurable. Proof. Let Dn = Xn − Xn−1 and Gn = E(Dn |Bn−1 ) = E(Xn |Bn−1 ) − Xn−1 ≥ 0 by the definition of submartingale. Let, Hn = Dn − Gn , Yn = H1 + . . . + Hn , Zn = G1 + · · · + Gn . Since Gn ≥ 0 a.s., Zn ≤ Zn+1 and, by construction, Zn is Bn−1 -measurable. We have, E(Hn |Bn−1 ) = E(Dn |Bn−1 ) − Gn = 0 and, therefore, E(Yn |Bn−1 ) = Yn−1 . Uniqueness follows by construction. Suppose that Xn = Zn + Yn with all stated properties. First, since Z0 = 0, Y0 = X0 . By induction, given a unique decomposition up to n − 1, we can write Zn = E(Zn |Bn−1 ) = E(Xn − Yn |Bn−1 ) = E(Xn |Bn−1 ) − Yn−1 and Yn = Xn − Zn . Definition. We say that (Xn )n≥1 is uniformly integrable if sup E|Xn | < ∞ and sup E|Xn |I(|Xn | > M ) → 0 as M → ∞. n

n

Lemma 29 The following holds. 1. If (Xn , Bn ) is a right-closable martingale then (Xn ) is uniformly integrable. 2. If (Xn , Bn )n≤∞ is a submartingale then for any a ∈ R, (max(Xn , a)) is uniformly integrable.


Proof. 1. If Xn = E(Y |Bn ) then |Xn | = |E(Y |Bn )| ≤ E(|Y ||Bn ) and E|Xn | ≤ E|Y | < ∞. Since {|Xn | > M } ∈ Bn , Xn I(|Xn | > M ) = I(|Xn | > M )E(Y |Bn ) = E(Y I(|Xn | > M )|Bn ) and, therefore, E|Xn |I(|Xn | > M ) ≤ E|Y |I(|Xn | > M ) ≤

KP(|Xn | > M ) + E|Y |I(|Y | > K) E|Xn | E|Y | ≤ K + E|Y |I(|Y | > K) ≤ K + E|Y |I(|Y | > K). M M

Letting M → ∞, K → ∞ proves that supn E|Xn |I(|Xn | > M ) → 0 as M → ∞. 2. Since (Xn , Bn )n≤∞ is a submartingale, for Y = X∞ we have Xn ≤ E(Y |Bn ). Below we will use the following observation. Since a function max(a, x) is convex and increasing in x, by Jensen’s inequality max(a, Xn ) ≤ E(max(a, Y )|Bn ).

(13.0.1)

Since, � � � � �max(Xn , a)� ≤ |a| + Xn I Xn > |a| and {|Xn | > |a|} ∈ Bn we can write � � � � � � E�max(Xn , a)� ≤ |a| + EXn I Xn > |a| ≤ |a| + EY I Xn > |a| ≤ |a| + E|Y | < ∞. If we take M > |a| then E| max(Xn , a)|I(| max(Xn , a)| > M )

by (13.0.1)

EXn I(Xn > M ) ≤ EY I(Xn > M ) ≤ KP(Xn > M ) + E|Y |I(|Y | > K) E max(Xn , 0) ≤ K + E|Y |I(|Y | > K) M E max(Y, 0) ≤ K + E|Y |I(|Y | > K). M =

Letting M → ∞ and K → ∞ finishes the proof. Uniform integrability plays an important role when studying the convergence of martingales. The following strengthening of the dominated convergence theorem will be useful. Lemma 30 Consider r.v.s (Xn ) and X such that E|Xn | < ∞, E|X| < ∞. Then the following are equivalent: 1. E|Xn − X| → 0, 2. (Xn ) is uniformly integrable and Xn → X in probability. Proof. 2=⇒1. We can write, E|Xn − X|

≤ ε + E|Xn − X|I(|Xn − X| > ε) ≤

ε + 2KP(|Xn − X| > ε) + 2E|Xn |I(|Xn | > K) + 2E|X|I(|X| > K)

≤ ε + 2KP(|Xn − X| > ε) + 2 sup E|Xn |I(|Xn | > K) + 2E|X|I(|X | > K). n

Letting n → ∞ and then ε → 0, K → ∞ proves the result. 1=⇒2. By Chebyshev’s inequality, P(|Xn − X| > ε) ≤ 53

1 E|Xn − X| → 0 ε

as n → ∞ so Xn → X in probability. To prove uniform integrability let us first show that for any ε > 0 there exists δ > 0 such that P(A) < δ =⇒ E|X|IA < ε. Suppose not. Then, for some ε > 0 one can find a sequence of events A(n) such that P(A(n)) ≤

1 2n

and E|X|IA(n) > ε.

� Since n≥1 P(A(n)) < ∞, by Borel-Cantelli lemma, P(A(n) i.o.) = 0. This means that |X|IA(n) → 0 almost surely and by the dominated convergence theorem E|X|IA(n) → 0 - a contradiction. Given ε > 0, take δ as above and take M > 0 large enough so that for all n ≥ 1 P(|Xn | > M ) ≤

E|Xn | < δ. M

Then, E|Xn |I(|Xn | > M ) ≤ E|Xn − X| + E|X|I(|Xn | > M ) ≤ E|Xn − X| + ε. For large enough n ≥ n0 , E|Xn − X| ≤ ε and, therefore, E|Xn |I(|Xn | > M ) ≤ 2ε. We can also choose M large enough so that E|Xn |I(|Xn | > M ) ≤ 2ε for n ≤ n0 and this finishes the proof.


Section 14

Optional stopping. Inequalities for martingales. Consider a sequence of σ-algebras (Bn )n≥0 such that Bn ⊆ Bn+1 . Integer valued r.v. τ ∈ {1, 2, . . .} is called a stopping time if {τ ≤ n} ∈ Bn . Let us denote by Bτ a σ-algebra of the events B such that {τ ≤ n} ∩ B ∈ Bn , ∀n ≥ 1. �τ If (Xn ) is adapted to (Bn ) then random variables such as Xτ or k=1 Xk are measurable on Bτ . For example, �

� � {Xτ ∈ A} = {τ = n} ∩ {Xn ∈ A} = {τ ≤ n} \ {τ ≤ n − 1} {Xn ∈ A} ∈ Bτ . n≥1

n≥1

Theorem 31 (Optional stopping) Let (Xn , Bn ) be a martingale and τ1 , τ2 < ∞ be stopping times such that E|Xτ2 | < ∞,

lim E|Xn |I(n ≤ τ2 ) = 0.

n→∞

(14.0.1)

Then on the event {τ1 ≤ τ2 } E(Xτ2 |Bτ1 ) = Xτ1 . More precisely, for any set A ∈ Bτ1 , EXτ2 IA I(τ1 ≤ τ2 ) = EXτ1 IA I(τ1 ≤ τ2 ). If (Xn , Bn ) is a submartingale then equality is replaced by ≥ . Remark. If stopping times τ1 , τ2 are bounded then (14.0.1) is satisfied. As the next example shows, without some control of the stopping times the statement is not true. Example. Consider an i.i.d. sequence (Xn ) such that P(Xn = ±2n ) =

1 . 2

If Bn = σ(X1 , . . . , Xn ) then (Sn , Bn ) is a martingale. Let τ1 = 1 and τ2 = min{k ≥ 1, Sk > 0}. Clearly, Sτ2 = 2 because if τ2 = k then Sτ2 = Sk = −2 − 22 − . . . − 2k−1 + 2k = 2. However, 2 = E(Sτ2 |B1 ) �= Sτ1 = X1 .


The second condition in (14.0.1) is violated since P(τ2 = n) = 2−n and E|Sn |I(n ≤ τ2 ) = 2P(τ2 = n) + (2n+1 − 2)P(n + 1 ≤ τ2 ) = 2 �→ 0. Proof of Theorem 31. Consider a set A ∈ Bτ1 . We have, � � � EXτ2 IA I(τ1 ≤ τ2 ) = EXτ2 I A ∩ {τ1 = n} I(n ≤ τ2 ) n≥1 (∗)

=



� � EXn I A ∩ {τ1 = n} I(n ≤ τ2 ) = EXτ1 IA I(τ1 ≤ τ2 ).

n≥1

To prove (*) it is enough to prove that for An = A ∩ {τ1 = n} ∈ Bn , EXτ2 IAn I(n ≤ τ2 ) = EXn IAn I(n ≤ τ2 ).

(14.0.2)

We can write EXn IAn I(n ≤ τ2 )

EXn IAn I(τ2 = n) + EXn IAn I(n + 1 ≤ τ2 )

=

EXτ2 IAn I(τ2 = n) + EXn IAn I(n + 1 ≤ τ2 ) � since {n + 1 ≤ τ2 } = {τ2 ≤ n} ∈ Bn , by martingale property =



c

EXτ IA I(τ2 = n) + EXn+1 IAn I(n + 1 ≤ τ2 ) � 2 n EXτ2 IAn I(τ2 = k) + EXm IAn I(m ≤ τ2 )

= �

by induction



=

n≤k<m

EXτ2 IAn I(n ≤ τ2 < m) + EXm IAn I(m ≤ τ2 ).

=

By (14.0.1), the last term � � �EXm IAn I(m ≤ τ2 )� ≤ E|Xm |I(m ≤ τ2 ) → 0 as m → ∞. Since Xτ2 IAn I(n ≤ τ2 ≤ m) → Xτ2 IAn I(n ≤ τ2 ) as m → ∞ and E|Xτ2 | < ∞, by dominated convergence theorem, EXτ2 IAn I(n ≤ τ2 < m) → EXτ2 IAn I(n ≤ τ2 ). This proves (14.0.2).

Theorem 32 (Doob’s inequality) If (X_n, B_n) is a submartingale then for Y_n = max_{1≤k≤n} X_k and M > 0,

P(Y_n ≥ M) ≤ (1/M) EX_n I(Y_n ≥ M) ≤ (1/M) EX_n^+.

Proof. Define a stopping time

τ_1 = min{k ≤ n : X_k ≥ M} if such k exists, and τ_1 = n otherwise.

Let τ2 = n so that τ1 ≤ τ2 . By Theorem 31, E(Xn |Bτ1 ) = E(Xτ2 |Bτ1 ) ≥ Xτ1 . Let us apply this to the set A = {Yn = max1≤k≤n Xn ≥ M } which belongs to Bτ1 because � � A ∩ {τ1 ≤ k} = max Xi ≥ M ∈ Bk . 1≤i≤k


(14.0.3)

On the event A, Xτ1 ≥ M and, therefore, EXn IA = EXτ2 IA ≥ EXτ1 IA ≥ M EIA = M P(A). On the other hand, EXn IA ≤ EXn+ and this finishes the proof. As a corollary we obtain the second Kolmogorov’s inequality. If (Xi ) are independent and EXi = 0 then � Sn = 1≤i≤n Xi is a martingale and Sn2 is a submartingale. Therefore, � � � � 1 1 � Var(Xk ). P max |Sk | ≥ M = P max Sk2 ≥ M 2 ≤ 2 ESn2 = 2 1≤k≤n 1≤k≤n M M 1≤k≤n
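A short simulation sketch (not in the notes; the walk length, number of trials and level M are arbitrary, and only numpy is assumed) checking the displayed maximal inequality for a simple symmetric random walk:

```python
# P(max_{k<=n} |S_k| >= M) <= Var(S_n) / M^2 for a simple random walk.
import numpy as np

rng = np.random.default_rng(6)
n, trials, M = 400, 20000, 40
steps = rng.choice([-1.0, 1.0], size=(trials, n))
S = np.cumsum(steps, axis=1)
lhs = (np.abs(S).max(axis=1) >= M).mean()
print(f"P(max |S_k| >= {M}) ~ {lhs:.4f}, bound Var(S_n)/M^2 = {n / M**2:.4f}")
```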

Exercises. �∞ 1. Show that for any random variable Y, E|Y p | = 0 ptp−1 P(|Y | ≥ t)dt. � 2. Let X, Y be two non-negative random variables such that for every t > 0, P(Y ≥ t) ≤ t−1 XI(Y ≥ t)dP. � For any p > 1, �f �p = ( |f |p dP)1/p and 1/p + 1/q = 1, show that �Y �p ≤ q�X�p . 3. Given a non-negative submartingale (Xn , Bn ), let Xn∗ := maxj≤n Xj and X ∗ := maxj≥1 Xj . Prove that for any p > 1 and 1/p + 1/q = 1, �X ∗ �p ≤ q supn �Xn �p . Hint: use exercise 2 and Doob’s maximal inequality. Doob’s upcrossing inequality. Let (Xn , Bn )n≥1 be a submartingale. Given two real numbers a < b we will define a sequence of stopping times (τn ) when Xn is crossing a downward and b upward as in figure 14.1. Namely, we define


Figure 14.1: Stopping times of level crossings.

τ_1 = min{n ≥ 1 : X_n ≤ a}, τ_2 = min{n > τ_1 : X_n ≥ b}

and, by induction, for k ≥ 2,

τ_{2k−1} = min{n > τ_{2k−2} : X_n ≤ a}, τ_{2k} = min{n > τ_{2k−1} : X_n ≥ b}.

Define ν(a, b, n) = max{k : τ_{2k} ≤ n} - the number of upward crossings of [a, b] before time n.

Theorem 33 (Doob’s upcrossing inequality) We have,

E ν(a, b, n) ≤ E(X_n − a)^+ / (b − a).   (14.0.4)

Proof. Since x → (x − a)^+ is an increasing convex function, Z_n = (X_n − a)^+ is also a submartingale. Clearly,

ν_X(a, b, n) = ν_Z(0, b − a, n),

which means that it is enough to prove (14.0.4) for nonnegative submartingales. From now on we can assume that 0 ≤ X_n and we would like to show that

E ν(0, b, n) ≤ EX_n / b.

Let us define a sequence of r.v.s � ηj =

1, 0,

τ2k−1 < j ≤ τ2k for some k otherwise,

i.e. ηj is the indicator of the event that at time j the process is crossing [0, b] upward. Define X0 = 0. Then bν(0, b, n) ≤

n �

ηj (Xj − Xj−1 ) =

j=1

n �

I(ηj = 1)(Xj − Xj−1 ).

j=1

The event ∈Bj−1

∈Bj−1

{ηj = 1} =



{τ2k−1

k

�� � �� �� � �� ��c

� < j ≤ τ2k } = τ2k−1 ≤ j − 1 τ2k ≤ j − 1 ∈ Bj−1 k

i.e. the fact that at time j we are crossing upward is determined completely by the sequence up to time j − 1. Then bEν(0, b, n)



n �

n � � � � � EE I(ηj = 1)(Xj − Xj−1 )�Bj−1 = EI(ηj = 1)E(Xj − Xj−1 |Bj−1 )

j=1

=

n �

j=1

EI(ηj = 1)(E(Xj |Bj−1 ) − Xj−1 ) ≤

j=1

n �

E(Xj − Xj−1 ) = EXn ,

j=1

where in the last inequality we used that (Xj , Bj ) is a submartingale, E(Xj |Bj−1 ) ≥ Xj−1 , which implies that I(ηj = 1)(E(Xj |Bj−1 ) − Xj−1 ) ≤ E(Xj |Bj−1 ) − Xj−1 . This finishes the proof.
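Below is a hedged numerical illustration, not part of the notes: it counts upcrossings of [a, b] for the nonnegative submartingale X_n = |S_n| (S_n a simple random walk) and compares the Monte Carlo average with the upcrossing bound. The choice of submartingale and of a, b, n is arbitrary; only numpy is assumed.

```python
# Count upcrossings of [a, b] and compare E nu(a,b,n) with E(X_n - a)^+ / (b - a).
import numpy as np

def upcrossings(path, a, b):
    count, below = 0, False
    for x in path:
        if not below and x <= a:
            below = True                 # completed a downcrossing to level a
        elif below and x >= b:
            count += 1                   # completed an upcrossing to level b
            below = False
    return count

rng = np.random.default_rng(7)
n, trials, a, b = 500, 5000, 2.0, 6.0
S = np.cumsum(rng.choice([-1.0, 1.0], size=(trials, n)), axis=1)
X = np.abs(S)                            # a nonnegative submartingale
nu = np.array([upcrossings(p, a, b) for p in X])
bound = np.maximum(X[:, -1] - a, 0).mean() / (b - a)
print(f"E nu(a,b,n) ~ {nu.mean():.3f}, Doob bound E(X_n - a)^+/(b-a) ~ {bound:.3f}")
```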


Section 15

Convergence of martingales.

Fundamental Wald’s identity.

We finally get to our main result about the convergence of martingales and submartingales. Theorem 34 Let (Xn , Bn )−∞


n

Bn then (Xn , Bn )−∞≤n<∞

+ 2. If supn EXn+ < ∞, then X+∞ = limn→∞ Xn exists a.s. and EX+∞ < ∞.

3. If (max(Xn , a))−∞ lim inf Xn = lim sup Xn ≥ b > a ≥ lim inf Xn , a
where the union is taken over all rational numbers a < b. In other words, P(Xn diverges) > 0 iff there exist rational a < b such that � � P lim sup Xn ≥ b > a ≥ lim inf Xn > 0. If we recall that ν(a, b, n) denotes the number of upcrossings of [a, b], this is equivalent to � � P lim ν(a, b, n) = ∞ > 0. n→∞

Proof of 1. Let us define a new sequence Y1 = X−n , Y2 = X−n+1 , . . . , Yn = X−1 .


By Doob’s inequality, E(Yn − a)+ E(X−1 − a)+ = < ∞. b−a b−a Since 0 ≤ ν(a, b, n) ↑ ν(a, b), by monotone convergence theorem, EνY (a, b, n) ≤

Eν(a, b) = lim Eν(a, b, n) < ∞. n→∞

Therefore, P(ν(a, b) = ∞) = 0 which means that the limit X−∞ = lim Xn n→−∞

exists. By Fatou’s lemma and submartingale property + + + EX−∞ ≤ lim inf EX−n ≤ EX−1 < ∞. n→−∞

It remains to show that (Xn , Bn )−∞≤n<∞ is a submartingale. First of all, X−∞ = limn→−∞ Xn is measurable on Bm for all m and, therefore, measurable on B−∞ = ∩Bm . Let us take a set A ∈ ∩Bn . We would like to show that for any m EX−∞ IA ≤ EXm IA . Since (Xn , Bn )−∞
is uniformly integrable for any a ∈ R. Since max(X−∞ , a) = lim max(Xn , a), n→−∞

by Lemma 30, E max(X−∞ , a)IA = lim E max(Xn , a)IA ≤ E max(Xm , a)IA n→−∞

for any m. Letting a → −∞, by monotone convergence theorem, EX−∞ IA ≤ EXm IA . Proof of 2. If ν(a, b, n) is a number of upcrossings of [a, b] by X1 , . . . , Xn then by Doob’s inequality (b − a)Eν(a, b, n) ≤ E(Xn − a)+ ≤ |a| + EXn+ ≤ K < ∞. Therefore, Eν(a, b) < ∞ for ν(a, b) = limn→∞ ν(a, b, n) and, as above, X+∞ = limn→+∞ Xn exists. Since all Xn are measurable on B+∞ so is X+∞ . Finally, by Fatou’s lemma + EX+∞ ≤ lim inf EXn+ < ∞. n→∞

Notice that another condition, supn E|Xn | < ∞, would similarly imply E|X+∞ | ≤ lim inf E|Xn | < ∞. n→∞

Proof of 3. By 2, the limit X+∞ exists. We want to show that for any set A ∈ Bm EXm IA ≤ EX+∞ IA . Let us denote a ∨ b = max(a, b). Since (Xn ∨ a) is uniformly integrable and Xn ∨ a → X+∞ ∨ a, by Lemma 30, E(X+∞ ∨ a)IA = lim E(Xn ∨ a)IA . n→∞

Since a function x → x ∨ a is convex and increasing, (Xn ∨ a) is also a submartingale and, thus, for n > m, E(Xm ∨ a)IA ≤ E(Xn ∨ a)IA . Therefore, E(Xm ∨ a)IA ≤ E(X+∞ ∨ a)IA and letting a → −∞, by monotone convergence theorem, we get that EXm IA ≤ EX+∞ IA . As a corollary we get the following. 60

Corollary 2 Martingale (Xn , Bn ) is right-closable iff it is uniformly integrable. To prove this, apply case 3 above to (Xn ) and (−Xn ) which are both submartingales. Theorem 35 (Levy’s convergence) Let (Ω, B, P) be a probability space and X be a real-valued random vari­ able on it. Given a sequence of σ-algebras B1 ⊆ . . . ⊆ Bn ⊆ . . . ⊆ B+∞ ⊆ B where B+∞ = σ

�

1≤n<∞

� Bn , we have Xn = E(X|Bn ) → E(X|B+∞ ) a.s.

Proof. (Xn , Bn )1≤n<∞ is a right-closable martingale since Xn = E(X|Bn ). Therefore, it is uniformly inte­ grable and by previous theorem the limit X+∞ = limn→∞ E(X|Bn ) exists. It remains to show that X+∞ = E(X|B+∞ ). For a set A in algebra



Bn , A ∈ Bm for some m and by Lemma 30

EX+∞ IA = lim EXn IA = EXm IA = E(E(X|Bm )IA ) = EXIA . n→∞

By uniqueness in the Caratheodory extension theorem, EX+∞ IA = EXIA for all A ∈ σ

�

n Bn



which

means that X+∞ = E(X|B+∞ ). Exercise. (Kolmogorov’s 0-1 law) Consider arbitrary random variables (Xi )i≥1 and consider σ-algebras Bn = σ(X1 , . . . , Xn ) and B∞ = σ((Xl )l≥1 ). A set A ∈ B∞ is a tail event if it is independent of Bn for any n. If we consider conditional expectations E(IA |Bn ) then using Levy’s convergence theorem and the fact that E(IA |Bn ) and IA are independent one can prove that P(A) = 0 or 1. This gives a martingale proof of improved Kolmogorov’s 0-1 law. Examples. 1. (Strong law of large numbers) Let (Xn )n≥1 be i.i.d. Let B−n = σ(Sn , Sn+1 , . . .), Sn = X1 + · · · + Xn . We showed before that

Sn = E(X1 |B−n ), n i.e. (Z−n , B−n ) is a�reverse martingale and, therefore, the limit Y = limn→+∞ Z−n exists. Y is measurable on σ-algebra B−∞ = n B−n . Since each event in B−∞ is symmetric, i.e. invariant under permutations of Xn ’s, by Savage-Hewitt 0 - 1 law, the probability of each event is 0 or 1, i.e. B−∞ consists of ∅ and Ω up to sets of measure zero. Therefore, Y is a constant a.s. and since (Zn )−∞≤n≤−1 is also a martingale, EY = EX1 . Therefore, Sn /n → EX1 a.s. 2. (Kolmogorov’s law of large numbers) Consider a sequence (Xn )n≥1 such that Z−n =

E(X_{n+1}|X_1, . . . , X_n) = 0. We do not assume independence. If a sequence (b_n) is such that b_n < b_{n+1}, lim_{n→∞} b_n = ∞ and

Σ_{n=1}^∞ EX_n²/b_n² < ∞,

then S_n/b_n → 0 a.s. Indeed, Y_n = Σ_{k≤n} (X_k/b_k) is a martingale and (Y_n) is uniformly integrable since

E|Y_n| I(|Y_n| > M) ≤ (1/M) E|Y_n|² ≤ (1/M) Σ_{k=1}^∞ EX_k²/b_k² → 0


as M → ∞. Therefore, the limit Y = lim Yn exists and it is an easy exercise to show that Sn /bn → 0 (Kronecker’s lemma). 3. (Polya’ urn scheme) Let us recall the Polya urn scheme from Section 5. Let us consider a sequence Yn =

#(blue balls after n iterations) . #(total after n iterations)

Yn is a martingale because given that at step n the numbers of blue and red balls are b and r, the expected number of balls at step n + 1 will be E(Yn+1 |Bn ) =

b b+c r b b · + · = = Yn . b+r b+r+c b+r b+r+c b+r

Since Yn is bounded, by martingale convergence theorem, the limit Y = limn→∞ Yn exists. What is the distribution of Y ? Let us consider a sequence � 1 blue at step i Xi = 0 red at step i � and let Sn = i≤n Xi . Clearly, b + Sn c Sn Yn = ≈ b + r + nc n as n → ∞ and, therefore, Sn /n → Y. The sequence (Xn ) is exchangeable and by de Finetti’s theorem in Section 5 we showed that � �� 1 �b r� n P(Sn = k) = xk (1 − x)n−k dβ , (x). k c c 0 For any function u ∈ C([0, 1]), � 1 n �S � � � k ��n� � 1 �b r� �b r� n k n−k Eu = u x (1 − x) dβ , (x) = Bn (x)dβ , (x), n n k c c c c 0 0 k=0

where Bn (x) is the Bernstein polynomial that approximates u(x) uniformly on [0, 1]. Therefore, � 1 �S � �b r� n Eu → u(x)dβ , (x) n c c 0 which means that

�S � �b r� n L →β , = L(Y ), n c c �b r� i.e. the limit Y has Beta distribution β c , c . Optional stopping for martingales revisited. Let τ be a stopping time. We would like to determine when EXτ = EX1 . As we saw above in the case of two stopping times, some kind of integrability assumptions are necessary. In this simpler case, the necessary conditions are clear from the proof. Lemma 31 We have EX1 = lim EXτ I(τ ≤ n) ⇐⇒ lim EXn I(τ ≥ n) = 0. n→∞

n→∞

Proof. We can write,

E X_τ I(τ ≤ n) = Σ_{1≤k≤n} E X_k I(τ = k)
   = Σ_{1≤k≤n} (E X_k I(τ ≥ k) − E X_k I(τ ≥ k + 1))
   (since {τ ≥ k + 1} = {τ ≤ k}^c ∈ B_k, by the martingale property E X_k I(τ ≥ k + 1) = E X_{k+1} I(τ ≥ k + 1))
   = Σ_{1≤k≤n} (E X_k I(τ ≥ k) − E X_{k+1} I(τ ≥ k + 1))
   = EX_1 − EX_{n+1} I(τ ≥ n + 1).

Example. Given 0 < p < 1, consider i.i.d. random variables (ξ_i)_{i≥1} such that P(ξ_i = 1) = p, P(ξ_i = −1) = 1 − p and consider a random walk X_0 = 0, X_{n+1} = X_n + ξ_{n+1}. Consider two integers a ≤ −1 and b ≥ 1 and define a stopping time τ = min{k ≥ 1 : X_k = a or b}. If q = 1 − p then Y_n = (q/p)^{X_n} is a martingale since

E(Y_{n+1}|B_n) = p (q/p)^{X_n + 1} + q (q/p)^{X_n − 1} = (q/p)^{X_n}.

It is easy to show that P(τ ≥ n) → 0 as n → ∞ and using the previous lemma,

1 = EY_0 = EY_τ = (q/p)^b P(X_τ = b) + (q/p)^a (1 − P(X_τ = b)).

If p ≠ 1/2 we can solve this:

P(X_τ = b) = ((q/p)^a − 1)/((q/p)^a − (q/p)^b).

If q = p = 1/2 then X_n is a martingale and

0 = b P(X_τ = b) + a (1 − P(X_τ = b)) =⇒ P(X_τ = b) = a/(a − b).

Exercise. Compute Eτ. Hint: for p �= 1/2, use that Xn − nEξ1 is a martingale; for p = 1/2, use that Xn2 − n is a martingale.
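A quick simulation (not part of the notes) of the exit probability just derived; the parameters p, a, b below are arbitrary illustrative choices, and only numpy is assumed.

```python
# Check P(X_tau = b) = ((q/p)^a - 1) / ((q/p)^a - (q/p)^b) by simulation.
import numpy as np

rng = np.random.default_rng(8)
p, a, b, trials = 0.45, -4, 6, 20000
q = 1 - p
hits_b = 0
for _ in range(trials):
    x = 0
    while a < x < b:
        x += 1 if rng.random() < p else -1
    hits_b += (x == b)
theory = ((q / p)**a - 1) / ((q / p)**a - (q / p)**b)
print(f"simulated {hits_b / trials:.4f}, formula {theory:.4f}")
```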

Fundamental Wald’s identity. Let (X_n)_{n≥1} be an i.i.d. sequence of random variables and suppose that the Laplace transform ϕ(λ) = E e^{λX_1} is defined on some nontrivial interval (λ_−, λ_+) containing 0. Since

E e^{λS_n} = (E e^{λX_1})^n = ϕ(λ)^n,

the sequence e^{λS_n}/ϕ(λ)^n is a martingale. Let τ be a stopping time. If

E [e^{λS_n}/ϕ(λ)^n I(τ ≥ n)] → 0   (15.0.1)

then by the above lemma we get

E [e^{λS_τ}/ϕ(λ)^τ I(τ < ∞)] = E e^{λX_1}/ϕ(λ) = 1,

which is called the fundamental Wald’s identity. In some cases one can use this to obtain the Laplace transform of a stopping time and, thus, its distribution.

Example. Let X_0 = 0, P(X_i = ±1) = 1/2, S_n = Σ_{k≤n} X_k. Given an integer z ≥ 1, let τ = min{k : S_k = −z or z}. Since ϕ(λ) = ch(λ) ≥ 1,

E [e^{λS_n}/ϕ(λ)^n I(τ ≥ n)] ≤ (e^{|λ|z}/ch(λ)^n) P(τ ≥ n)

and (15.0.1) is satisfied. Therefore,

1 = E [e^{λS_τ}/ϕ(λ)^τ] = e^{λz} E ch(λ)^{−τ} I(S_τ = z) + e^{−λz} E ch(λ)^{−τ} I(S_τ = −z) = ((e^{λz} + e^{−λz})/2) E ch(λ)^{−τ}

by symmetry. Therefore,

E ch(λ)^{−τ} = 1/ch(λz) and E e^{γτ} = 1/ch(z ch^{−1}(e^{−γ}))

by a change of variables e^γ = 1/ch(λ). For more general stopping times the condition (15.0.1) might not be easy to check. We will now show another approach that is helpful to verify the fundamental Wald’s identity. If P is the distribution of the X_i’s, let P_λ be the distribution with Radon-Nikodym derivative w.r.t. P given by

dP_λ/dP = e^{λx}/ϕ(λ).

This is, indeed, a density since

∫_R e^{λx}/ϕ(λ) dP = ϕ(λ)/ϕ(λ) = 1.

We will think of (X_n) as defined on the product space (R^∞, B^∞, P^∞). For example, a set {τ = n} ∈ σ(X_1, . . . , X_n) ⊆ B^n is a Borel set on R^n. We can write,

E [e^{λS_τ}/ϕ(λ)^τ I(τ < ∞)] = Σ_{n=1}^∞ E [e^{λS_n}/ϕ(λ)^n I(τ = n)]
   = Σ_{n=1}^∞ ∫_{τ=n} (e^{λ(x_1+···+x_n)}/ϕ(λ)^n) dP(x_1) . . . dP(x_n)
   = Σ_{n=1}^∞ ∫_{τ=n} dP_λ(x_1) . . . dP_λ(x_n) = P_λ(τ < ∞).

This means that we can think of (X_n) as having the distribution P_λ and to prove Wald’s identity we need to show that P_λ(τ < ∞) = 1.

Example. (Crossing a growing boundary) Suppose that we have a boundary given by a sequence (f(k)) that changes with time and a stopping time (crossing time): τ = min{k : S_k ≥ f(k)}. To verify (15.0.1), we need to show that P_λ(τ < ∞) = 1. Under the law P_λ, the random variables X_i have expectation

E_λ X_i = ∫ x e^{λx}/ϕ(λ) dP(x) = ϕ′(λ)/ϕ(λ).

By the strong law of large numbers

lim_{n→∞} S_n/n = ϕ′(λ)/ϕ(λ)

and, therefore, if the growth of the crossing boundary satisfies

lim sup_{n→∞} f(n)/n < ϕ′(λ)/ϕ(λ)

then, obviously, P_λ(τ < ∞) = 1.
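As a sanity check, here is a small simulation, not part of the notes, of the fundamental Wald identity for the symmetric ±1 walk stopped when it first hits −z or z; λ and z are arbitrary choices and only numpy is assumed.

```python
# Check E[ e^{lambda S_tau} / cosh(lambda)^tau ] = 1 by simulation.
import numpy as np

rng = np.random.default_rng(9)
z, lam, trials = 5, 0.3, 20000
vals = np.empty(trials)
for i in range(trials):
    s, t = 0, 0
    while abs(s) < z:
        s += 1 if rng.random() < 0.5 else -1
        t += 1
    vals[i] = np.exp(lam * s) / np.cosh(lam)**t
print(f"E[e^(lam S_tau)/cosh(lam)^tau] ~ {vals.mean():.4f} (should be close to 1)")
```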


Section 16

Convergence on metric spaces. Portmanteau Theorem. Lipschitz Functions. Let (S, d) be a metric space and B - a Borel σ-algebra generated by open sets. Let us recall that Pn → P weakly on B if � � f dPn → f dP for all f ∈ Cb (S) - real-valued bounded continuous functions on S. For a set A ⊆ S, we denote by A¯ the closure of A, intA - interior of A and ∂A = A¯ \ intA - boundary of A. A is called a continuity set of P if P(∂A) = 0. Theorem 36 (Portmanteau theorem) The following are equivalent. 1. Pn → P weakly. 2. For any open set U ⊆ S, lim inf n→∞ Pn (U ) ≥ P(U ). 3. For any closed set F ⊆ S, lim supn→∞ Pn (F ) ≤ P(F ). 4. For any continuity set A of P, limn→∞ Pn (A) = P(A). Proof. 1=⇒2. Let U be an open set and F = U c . Consider a sequence of functions in Cb (S) fm (s) = min(1, md(s, F )) such that fm (s) ↑ IU (s). (This is not necessarily true if U is not open.) Since Pn → P, � � � Pn (U ) ≥ fm dPn → fm dP as n → ∞ and lim inf Pn (U ) ≥ fm dP. n→∞

Letting m → ∞, by monotone convergence theorem,

lim inf_{n→∞} P_n(U) ≥ ∫ I_U dP = P(U).

2⇐⇒3. By taking complements.

2, 3=⇒4. Since intA is open and A¯ is closed and intA ⊆ A¯, by 2 and 3, P(intA) ≤ lim inf Pn (intA) ≤ lim sup Pn (A¯) ≤ P(A¯). n→∞

n→∞

If P(∂A) = 0 then P(A¯) = P(intA) = P(A) and, therefore, lim Pn (A) = P(A). 4=⇒1. Consider f ∈ Cb (S) and let Fy = {s ∈ S : f (s) = y} be a level set of f. There exist at most countably many y such that P(Fy ) > 0. Therefore, for any ε > 0 we can find a sequence a1 ≤ . . . ≤ aN such that max(ak+1 − ak ) ≤ ε, P(Fak ) = 0 for all k and the range of f is inside the interval (a1 , aN ). Let Bk = {s ∈ S : ak ≤ f (s) < ak+1 } and fε (s) =



ak I(s ∈ Bk ).

Since f is continuous, ∂Bk ⊆ Fak ∪ Fak+1 and P(∂Bk ) = 0. By 4, � � � � fε dPn = ak Pn (Bk ) → ak P(Bk ) = fε dP. k

k

Since, by construction, |fε (s) − f (s)| ≤ ε, letting ε → 0 proves that



f dPn →



f dP.

Lipschitz functions. For a function f : S → R, let us define a Lipschitz semi-norm by

||f||_L = sup_{x≠y} |f(x) − f(y)|/d(x, y).

Clearly, ||f ||L = 0 iff f is constant so ||f ||L is not a norm. Let us define a bounded Lipschitz norm by ||f ||BL = ||f ||L + ||f ||∞ , where ||f ||∞ = sups∈S |f (s)|. Let � � BL(S, d) = f : S → R : ||f ||BL < ∞ be a set of all bounded Lipschitz functions. Lemma 32 If f, g ∈ BL(S, d) then f g ∈ BL(S, d) and ||f g||BL ≤ ||f ||BL ||g||BL . Proof. First of all, ||f g||∞ ≤ ||f ||∞ ||g||∞ . We can write, |f (x)g(x) − f (y)g(y)|

≤ |f (x)(g(x) − g(y))| + |g(y)(f (x) − f (y))| ≤ ||f ||∞ ||g||L d(x, y) + ||g||∞ ||f ||L d(x, y)

and, therefore, ||f g||BL ≤ ||f ||∞ ||g||∞ + ||f ||∞ ||g||L + ||g||∞ ||f ||L ≤ ||f ||BL ||g||BL .

Let us recall the notations a ∨ b = max(a, b) and a ∧ b = min(a, b) and let ∗ = ∧ or ∨. Lemma 33 The following hold. 1. ||f1 ∗ · · · ∗ fk ||L ≤ max1≤i≤k ||fi ||L . 2. ||f1 ∗ · · · ∗ fk ||BL ≤ 2 max1≤i≤k ||fi ||BL . 66

Proof. Proof of 1. It is enough to consider k = 2. For specificity, take ∗ = ∨. Given x, y ∈ S, suppose that f1 ∨ f2 (x) ≥ f1 ∨ f2 (y) = f1 (y). Then � |f1 ∨ f2 (y) − f1 ∨ f2 (x)| = f1 ∨ f2 (x) − f1 ∨ f2 (y)

≤ ≤

f1 (x) − f1 (y), f2 (x) − f2 (y),

if f1 (x) ≥ f2 (x) otherwise

||f1 ||L ∨ ||f2 ||L d(x, y).

This finishes the proof of 1. Proof of 2. First of all, obviously, ||f1 ∗ · · · ∗ fk ||∞ ≤ max ||fi ||∞ . 1≤i≤k

Therefore, using 1, ||f1 ∗ · · · ∗ fk ||BL ≤ max ||fi ||∞ + max ||fi ||L ≤ 2 max ||fi ||BL . i

i

i

Theorem 37 (Extension theorem) Given a set A ⊆ S and a bounded Lipschitz function f ∈ BL(A, d) on A, there exists an extension h ∈ BL(S, d) such that f = h on A and ||h||BL = ||f ||BL . Proof. Let us first find an extension such that ||h||L = ||f ||L . We will start by extending f to one point x ∈ S \ A. The value y = h(x) must satisfy |y − f (s)| ≤ �f �L d(x, s) for all s ∈ A or, equivalently, inf (f (s) + ||f ||L d(x, s)) ≥ y ≥ sup(f (s) − ||f ||L d(x, s)).

s∈A

s∈A

Such y exists iff for all s1 , s2 ∈ A, f (s1 ) + ||f ||L d(x, s1 ) ≥ f (s2 ) − ||f ||L d(x, s2 ). This inequality is satisfied because by triangle inequality f (s2 ) − f (s1 ) ≤ ||f ||L d(s1 , s2 ) ≤ ||f ||L (d(s1 , x) + d(s2 , x)). It remains to apply Zorn’s lemma to show that f can be extended to the entire S. Define order by inclusion: f1 � f2 if f1 is defined on A1 , f2 - on A2 , A1 ⊆ A2 , f1 = f2 on A1 and �f1 �L = �f2 �L .  For any chain {fα }, f = fα � fα . By Zorn’s lemma there exists a maximal element h. It is defined on the entire S because, otherwise, we could extend to one more point. To extend preserving BL norm take h� = (h ∧ ||f ||∞ ) ∨ (−||f ||∞ ). By part 1 of previous lemma, it is easy to see that ||h� ||BL = ||f ||BL . Stone-Weierstrass Theorem. A set A ⊆ S is totally bounded if for any ε > 0 there exists a finite ε-cover of A, i.e. a set of points a1 , . . . , aN such that

A⊆ B(ai , ε), i≤N

where B(a, ε) = {y ∈ S : d(a, y) ≤ ε} is a ball of radius ε centered at a. Let us recall the following theorem from analysis. 67

Theorem 38 (Arzela-Ascoli) Let (S, d) be a compact metric space and let (C(S), d∞ ) be the space of con­ tinuous real-valued functions on S with uniform convergence metric d∞ (f, g) = sup |f (x) − g(x)|. x∈S

A subset F ⊆ C(S) is totally bounded in the d∞ metric iff F is equicontinuous and uniformly bounded. Remark. Equicontinuous means that for any ε > 0 there exists δ > 0 such that if d(x, y) ≤ δ then for all f ∈ F, |f(x) − f(y)| ≤ ε.

Theorem 39 (Stone-Weierstrass) Let (S, d) be a compact metric space and F ⊆ C(S) is such that

1. F is algebra, i.e. for all f, g ∈ F, c ∈ R, we have cf + g ∈ F, f g ∈ F. 2. F separates points, i.e. if x = � y ∈ S then there exists f ∈ F such that f (x) = � f (y). 3. F contains constants. Then F is dense in C(S). Corollary 3 If (S, d) is a compact space then BL(S, d) is dense in C(S). Proof. For F = BL(S, d) in the Stone-Weierstrass theorem, 3 is obvious, 1 follows from Lemma 32 and 2 follows from the extension Theorem 37, since a function defined on two points x = � y such that f (x) �= f (y) can be extended to the entire S. Proof of Theorem 39. Consider bounded f ∈ F, i.e. |f (x)| ≤ M. A function x → |x| defined on the interval [−M, M ] can be uniformly approximated by polynomials of x by the Weierstrass theorem on the real line or, for example, using Bernstein’s polynomials. Therefore, |f (x)| can be uniformly approximated by polynomials of f (x), and by properties 1 and 3, by functions in F. Therefore, if F¯ is the closure of F in d∞ ¯ Therefore, for any f, g ∈ F¯ we have norm then for any f ∈ F¯ its absolute value |f | ∈ F. min(f, g) =

1 1 1 1 (f + g) − |f − g| ∈ F¯ , max(f, g) = (f + g) + |f − g| ∈ F¯ . 2 2 2 2

(16.0.1)

Given any points x = � y and c, d ∈ R one can always find f ∈ F such that f (x) = c and f (y) = d. Indeed, by � g(y) and, as a result, a system of equations property 2 we can find g ∈ F such that g(x) = ag(x) + b = c, ag(y) + b = d has a solution a, b. Then the function f = ag + b satisfies the above and it is in F by 1. Take h ∈ C(S) and fix x. For any y let fy ∈ F be such that fy (x) = h(x), fy (y) = h(y). By continuity of fy , for any y ∈ S there exists an open neighborhood Uy of y such that fy (s) ≥ h(s) − ε for s ∈ Uy . Since (Uy ) is an open cover of the compact S, there exists a finite subcover Uy1 , . . . , UyN . Let us define a function f x (s) = max(fy1 (s), . . . , fyN (s)) ∈ F¯ by (16.0.1). By construction, it has the following properties: f x (x) = h(x), f x (s) ≥ h(s) − ε for all s ∈ S.


Again, by continuity of f x (s) there exists an open neighborhood Ux of x such that f x (s) ≤ h(s) + ε for s ∈ Ux . Take a finite subcover Ux1 , . . . , UxM and define � � h� (s) = min f x1 (s), . . . , f xM (s) ∈ F¯ by (16.0.1). By construction, h� (s) ≤ h(s) + ε and h� (s) ≥ h(s) − ε for all s ∈ S which means that d∞ (h� , h) ≤ ε. Since h� ∈ F¯ , this proves that F¯ is dense in C(S).

Corollary 4 If (S, d) is a compact space then C(S) is separable in d∞ . Remark. Recall that this fact was used in the proof of the Selection Theorem, which was proved for general metric spaces. Proof. By the above theorem, BL(S, d) is dense in C(S). For any integer n ≥ 1, the set {f : ||f ||BL ≤ n} is uniformly bounded and equicontinuous. By the Arzela-Ascoli theorem, it is totally bounded and, therefore, separable which can be seen by taking finite 1/m-covers for all m ≥ 1. The union

{||f ||BL ≤ n} = BL(S, d) is therefore separable in C(S) which is, as a result, also separable.


Section 17

Metrics for convergence of laws. Empirical measures. Levy-Prohorov metric. Consider a metric space (S, d). For a set A ⊆ S let us denote by Aε = {y ∈ S : d(x, y) < ε for some x ∈ A} its ε-neighborhood. Let B be a Borel σ-algebra on S. Definition. If P, Q are probability distributions on B then ρ(P, Q) = inf{ε > 0 : P(A) ≤ Q(Aε ) + ε for all A ∈ B} is called the Levy-Prohorov distance between P and Q. Lemma 34 ρ is a metric on the set of probability laws on B. Proof. 1. First, let us show that ρ(Q, P) = ρ(P, Q). Suppose that ρ(P, Q) > ε. Then there exists a set A such that P(A) > Q(Aε ) + ε. Taking complements gives Q(Aεc ) > P(Ac ) + ε ≥ P(Aεcε ) + ε, where the last inequality follows from the fact that Ac ⊇ Aεcε : a ∈ Aεcε =⇒ d(a, Aεc ) < ε =⇒ d(a, b) < ε for some b ∈ Aεc � � since b ∈ / Aε , d(b, A) ≥ ε =⇒ d(a, A) > 0 =⇒ a ∈ / A =⇒ a ∈ Ac . Therefore, for a set B = Aεc , Q(B) > P(B ε ) + ε. This means that ρ(Q, P) > ε and, therefore, ρ(Q, P) ≥ ρ(P, Q). By symmetry, ρ(Q, P) ≤ ρ(P, Q) and ρ(Q, P) = ρ(P, Q). 2. Next, let us show that if ρ(P, Q) = 0 then P = Q. For any set F and any n ≥ 1, 1

P(F ) ≤ Q(F n ) + 1

1 . n

If F is closed then F n ↓ F as n → ∞ and by continuity of measure �� 1 � P(F ) ≤ Q F n = Q(F ). Similarly, P(F ) ≥ Q(F ) and, therefore, P(F ) = Q(F ). 70

3. Finally, let us prove the triangle inequality ρ(P, R) ≤ ρ(P, Q) + ρ(Q, R). If ρ(P, Q) < x and ρ(Q, R) < y then for any set A, � � � � P(A) ≤ Q(Ax ) + x ≤ R (Ax )y + y + x ≤ R Ax+y + x + y, which means that ρ(P, R) ≤ x + y. Bounded Lipschitz metric. Given probability distributions P, Q on the metric space (S, d) we define a bounded Lipschitz distance between them by � � ��� � � � β(P, Q) = sup � f dP − f dQ� : ||f ||BL ≤ 1 . Lemma 35 β is a metric on the set of probability laws on B. Proof. β(P, Q) = β(Q, P) and the triangle inequality are obvious. It remains to prove that β(P, Q) = 0 implies P = Q. Given a closed set F, the sequence of functions fm (x)� = md(x, F ) ∧ 1 converges fm ↑ IU , � where U = F c . Obviously, ||fm ||BL ≤ m + 1 and, therefore, fm dP = fm dQ. Letting m → ∞ proves that P(U ) = Q(U ). The law P on (S, d) is tight if for any ε > 0 there exists a compact K ⊆ S such that P(S \ K) ≤ ε. Theorem 40 (Ulam) If (S, d) is separable then for any law P on B there exists a closed totally bounded set K ⊆ S such that P(S \ K) ≤ ε. If (S, d) is complete and separable then K is compact and, therefore, every law is tight. � ∞ ¯ � 1 ¯ Proof. Consider a sequence {x1 , x2 , . . .} that is dense in S. For any m ≥ 1, S = i=1 B xi , m , where B denotes a closed ball, and by continuity of measure, for large enough n(m), n(m) � � ��

ε

¯ xi , 1 ≤ m .

P S\ B m 2 i=1

If we take K=

� � n(m)

� ¯ xi , 1 B m i=1

m≥1

then P(S \ K) ≤

� ε = ε. 2m

m≥1

K is closed and totally bounded by construction. If S is complete, K is compact.

Theorem 41 Suppose that either (S, d) is separable or P is tight. Then the following are equivalent.

1. P_n → P.

2. For all f ∈ BL(S, d), ∫ f dP_n → ∫ f dP.

3. β(P_n, P) → 0.

4. ρ(P_n, P) → 0.

Proof. 1=⇒2. Obvious.
3=⇒4. In fact, we will prove that
ρ(Pn , P) ≤ 2 √β(Pn , P).    (17.0.1)
Given a Borel set A ⊆ S, consider a function f (x) = 0 ∨ (1 − d(x, A)/ε) such that IA ≤ f ≤ IAε . Obviously, ||f ||BL ≤ 1 + ε−1 and we can write
Pn (A) ≤ ∫ f dPn = ∫ f dP + ( ∫ f dPn − ∫ f dP )
≤ P(Aε ) + (1 + ε−1 ) sup{ |∫ f dPn − ∫ f dP| : ||f ||BL ≤ 1 }
= P(Aε ) + (1 + ε−1 )β(Pn , P) ≤ P(Aδ ) + δ,
where δ = max(ε, (1 + ε−1 )β(Pn , P)). This implies that ρ(Pn , P) ≤ δ. Since ε is arbitrary we can minimize δ = δ(ε) over ε. If we take ε = √β then δ = max(√β, β + √β) = β + √β and
β ≤ 1 =⇒ ρ ≤ 2√β;   β ≥ 1 =⇒ ρ ≤ 1 ≤ 2√β.
4=⇒1. Suppose that ρ(Pn , P) → 0, which means that there exists a sequence εn ↓ 0 such that Pn (A) ≤ P(Aεn ) + εn for all measurable A ⊆ S. If A is closed, then ∩_{n≥1} Aεn = A and, by continuity of measure,
lim sup_{n→∞} Pn (A) ≤ lim sup_{n→∞} ( P(Aεn ) + εn ) = P(A).

By the portmanteau theorem, Pn → P.
2=⇒3. If P is tight, let K be a compact such that P(S \ K) ≤ ε. If (S, d) is separable, by Ulam's theorem, let K be a closed totally bounded set such that P(S \ K) ≤ ε. If we consider a function f (x) = 0 ∨ (1 − d(x, K)/ε) with ||f ||BL ≤ 1 + 1/ε then
Pn (K ε ) ≥ ∫ f dPn → ∫ f dP ≥ P(K) ≥ 1 − ε,
which implies that for n large enough, Pn (K ε ) ≥ 1 − 2ε. This means that all Pn are essentially concentrated on K ε . Let
B = { f : ||f ||BL(S,d) ≤ 1 },   BK = { f |K : f ∈ B } ⊆ C(K),
where f |K denotes the restriction of f to K. If K is compact then, by the Arzela-Ascoli theorem, BK is totally bounded with respect to d∞ . If K is totally bounded then we can isometrically identify functions in BK with their unique extensions to the completion K′ of K and, by the Arzela-Ascoli theorem for the compact K′ , BK is again totally bounded with respect to d∞ . In any case, given ε > 0, we can find f1 , . . . , fk ∈ B such that for all f ∈ B
sup_{x∈K} |f (x) − fj (x)| ≤ ε for some j ≤ k.
This uniform approximation can also be extended to K ε . Namely, for any x ∈ K ε take y ∈ K such that d(x, y) ≤ ε. Then
|f (x) − fj (x)| ≤ |f (x) − f (y)| + |f (y) − fj (y)| + |fj (y) − fj (x)| ≤ ||f ||L d(x, y) + ε + ||fj ||L d(x, y) ≤ 3ε.
Therefore, for any f ∈ B,
| ∫ f dPn − ∫ f dP | ≤ | ∫_{K ε} f dPn − ∫_{K ε} f dP | + ||f ||∞ ( Pn ((K ε )c ) + P((K ε )c ) )
≤ | ∫_{K ε} f dPn − ∫_{K ε} f dP | + 2ε + ε
≤ | ∫_{K ε} fj dPn − ∫_{K ε} fj dP | + 3ε + 3ε + 2ε + ε
≤ | ∫ fj dPn − ∫ fj dP | + 3ε + 3ε + 3ε + 2ε + ε
≤ max_{1≤j≤k} | ∫ fj dPn − ∫ fj dP | + 12ε.
Finally,
β(Pn , P) = sup_{f ∈B} | ∫ f dPn − ∫ f dP | ≤ max_{1≤j≤k} | ∫ fj dPn − ∫ fj dP | + 12ε
and, using assumption 2, lim sup_{n→∞} β(Pn , P) ≤ 12ε. Letting ε → 0 finishes the proof.

Convergence of empirical measures. Let (Ω, P) be a probability space and X1 , X2 , . . . : Ω → S be an i.i.d. sequence of random variables with values in a metric space (S, d). Let µ be the law of Xi on S. Let us define the random empirical measures µn on the Borel σ-algebra B on S by
µn (A)(ω) = (1/n) Σ_{i=1}^{n} I(Xi (ω) ∈ A),   A ∈ B.

By the strong law of large numbers, for any f ∈ Cb (S),
∫ f dµn = (1/n) Σ_{i=1}^{n} f (Xi ) → Ef (X1 ) = ∫ f dµ a.s.

However, the set of measure zero where this convergence is violated depends on f and it is not obvious that the convergence holds for all f ∈ Cb (S) with probability one.
Theorem 42 (Varadarajan) Let (S, d) be a separable metric space. Then µn converges to µ weakly almost surely,
P( ω : µn (·)(ω) → µ weakly ) = 1.
Proof. Since (S, d) is separable, by Theorem 2.8.2 in R.A.P., there exists a metric e on S such that (S, e) is totally bounded and e and d define the same topology, i.e. e(sn , s) → 0 if and only if d(sn , s) → 0. This, of course, means that Cb (S, d) = Cb (S, e) and weak convergence of measures does not change. If (T, e) is the completion of (S, e) then (T, e) is compact. By the Arzela-Ascoli theorem, BL(T, e) is separable with respect to the d∞ norm and, therefore, BL(S, e) is also separable. Let (fm ) be a dense subset of BL(S, e). Then, by the strong law of large numbers,
∫ fm dµn = (1/n) Σ_{i=1}^{n} fm (Xi ) → Efm (X1 ) = ∫ fm dµ a.s.
Therefore, on the set of probability one, ∫ fm dµn → ∫ fm dµ for all m ≥ 1. Since (fm ) is dense in BL(S, e), on the same set of probability one, ∫ f dµn → ∫ f dµ for all f ∈ BL(S, e). Since (S, e) is separable, the previous theorem implies that µn → µ weakly.
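To make Varadarajan's theorem concrete, here is a small numerical illustration (a sketch only; the distribution, sample sizes and test functions are arbitrary choices, not taken from the text): we draw i.i.d. samples from a law µ on R and check that ∫ f dµn approaches ∫ f dµ for a few bounded Lipschitz test functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Law mu: standard exponential on R (an arbitrary choice for illustration).
def sample_mu(n):
    return rng.exponential(1.0, size=n)

# A few bounded Lipschitz test functions.
tests = {
    "sin(x)":    np.sin,
    "min(x,1)":  lambda x: np.minimum(x, 1.0),
    "1/(1+x^2)": lambda x: 1.0 / (1.0 + x**2),
}

# "True" integrals int f dmu, approximated here by one very large sample.
big = sample_mu(2_000_000)
true_vals = {name: f(big).mean() for name, f in tests.items()}

for n in [10, 100, 1000, 10000]:
    x = sample_mu(n)  # X_1, ..., X_n
    # The empirical measure mu_n puts mass 1/n at each X_i, so
    # int f dmu_n is just the sample average of f(X_i).
    errs = [abs(f(x).mean() - true_vals[name]) for name, f in tests.items()]
    print(n, ["%.4f" % e for e in errs])  # the errors shrink as n grows
```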


Section 18

Convergence and uniform tightness.

In this section, we will make several connections between convergence of measures and uniform tightness on general metric spaces, which are similar to the results in the Euclidean setting. First, we will show that, in some sense, uniform tightness is necessary for convergence of laws. Theorem 43 If Pn → P0 on S and each Pn is tight for n ≥ 0, then (Pn )n≥0 is uniformly tight.

Proof. Since Pn → P0 and P0 is tight, by Theorem 41, the Levy-Prohorov metric ρ(Pn , P0 ) → 0. Given ε > 0, let us take a compact K such that P0 (K) > 1 − ε. By definition of ρ,
1 − ε < P0 (K) ≤ Pn ( K^{ρ(Pn ,P0 )+1/n} ) + ρ(Pn , P0 ) + 1/n
and, therefore,
a(n) = inf{ δ > 0 : Pn (K^δ ) > 1 − ε } → 0.
By regularity of the measure Pn , any measurable set A can be approximated by its closed subset F . Since Pn is tight, we can choose a compact of measure close to one, and intersecting it with the closed subset F, we can approximate any set A by its compact subset. Therefore, there exists a compact Kn ⊆ K^{2a(n)} such that Pn (Kn ) > 1 − ε. Let
L = K ∪ ( ∪_{n≥1} Kn ).
Then Pn (L) ≥ Pn (Kn ) > 1 − ε. It remains to show that L is compact. Consider a sequence (xn ) in L. There are two possibilities. First, if there exists an infinite subsequence (xn(k) ) that belongs to one of the compacts Kj then it has a converging subsubsequence in Kj and as a result in L. If not, then there exists a subsequence (xn(k) ) such that xn(k) ∈ Km(k) and m(k) → ∞ as k → ∞. Since Km(k) ⊆ K^{2a(m(k))} there exists yk ∈ K such that d(xn(k) , yk ) ≤ 2a(m(k)). Since K is compact, the sequence yk ∈ K has a converging subsequence yk(r) → y ∈ K which implies that d(xn(k(r)) , y) → 0, i.e. xn(k(r)) → y ∈ L. Therefore, L is compact.
We already know from the Selection Theorem in Section 8 that any uniformly tight sequence of laws on any metric space has a converging subsequence. Under additional assumptions on (S, d) we can complement the Selection Theorem and make some connections to the metrics defined in the previous section.
Theorem 44 Let (S, d) be a complete separable metric space and A be a subset of probability laws on S. Then the following are equivalent.


1. A is uniformly tight.
2. For any sequence Pn ∈ A there exists a converging subsequence Pn(k) → P where P is a law on S.
3. A has compact closure in the space of probability laws equipped with the Levy-Prohorov or bounded Lipschitz metrics ρ or β.
4. A is totally bounded with respect to ρ or β.
Remark. Implications 1 =⇒ 2 =⇒ 3 =⇒ 4 hold without the completeness assumption and the only implication where completeness will be used is 4 =⇒ 1.
Proof. 1=⇒2. Any sequence Pn ∈ A is uniformly tight and, by the selection theorem, there exists a converging subsequence.
2=⇒3. Since (S, d) is separable, by Theorem 41, Pn → P if and only if ρ(Pn , P) or β(Pn , P) → 0. Every sequence in the closure A¯ can be approximated by a sequence in A. That sequence has a converging subsequence that, obviously, converges to an element in A¯, which means that the closure of A is compact.
3=⇒4. Compact sets are totally bounded and, therefore, if the closure A¯ is compact, the set A is totally bounded.
4=⇒1. Since ρ ≤ 2√β, we will only deal with ρ. For any ε > 0, there exists a finite subset B ⊆ A such that A ⊆ B ε . Since (S, d) is complete and separable, by Ulam's theorem, for each P ∈ B there exists a compact KP such that P(KP ) > 1 − ε. Therefore,
KB = ∪_{P∈B} KP is a compact and P(KB ) > 1 − ε for all P ∈ B.
For any ε > 0, let F be a finite set such that KB ⊆ F ε (here we will denote by F ε the closed ε-neighborhood of F ). Since A ⊆ B ε , for any Q ∈ A there exists P ∈ B such that ρ(Q, P) < ε and, therefore,
1 − ε ≤ P(KB ) ≤ P(F ε ) ≤ Q(F 2ε ) + ε.
Thus, 1 − 2ε ≤ Q(F 2ε ) for all Q ∈ A. Given δ > 0, take εm = δ/2^{m+1} and find Fm as above, i.e.
1 − δ/2^m ≤ Q( Fm^{δ/2^m} ).
Then Q( ∩_{m≥1} Fm^{δ/2^m} ) ≥ 1 − Σ_{m≥1} δ/2^m = 1 − δ. Finally, L = ∩_{m≥1} Fm^{δ/2^m} is compact because it is closed and totally bounded by construction, and S is complete.

Corollary 5 (Prohorov) The set of laws on a complete separable metric space is complete with respect to the metrics ρ or β.
Proof. If a sequence of laws is Cauchy w.r.t. ρ or β then it is totally bounded and, by the previous theorem, it has a converging subsequence. Obviously, a Cauchy sequence then converges to the same limit.
Finally, let us state as a result the idea which appeared in Lemma 19 in Section 9.
Lemma 36 Suppose that (Pn ) is uniformly tight on a metric space (S, d). Suppose that all converging subsequences (Pn(k) ) converge to the same limit, i.e. if Pn(k) → P0 then P0 is independent of (n(k)). Then Pn → P0 .
Proof. Any subsequence (Pn(k) ) is uniformly tight and, by the selection theorem, it has a converging subsubsequence (Pn(k(r)) ) which has to converge to P0 . Lemma 13 in Section 8 finishes the proof.
This will be very useful when proving convergence of laws on metric spaces, such as C([0, 1]), for example. If we can prove that (Pn ) is uniformly tight and, assuming that a subsequence converges, can identify the unique limit, then the sequence Pn must converge to the same limit.

Section 19

Strassen’s Theorem. Relationships between metrics. Metric for convergence in probability. Let (Ω, B, P) be a probability space, (S, d) - a metric space and X, Y : Ω → S - random variables with values in S. The quantity α(X, Y ) = inf{ε ≥ 0 : P(d(X, Y ) > ε) ≤ ε}

is called the Ky Fan metric on the set L0 (Ω, S) of equivalence classes of such random variables, where two r.v.s are equivalent if they are equal a.s. If we take a sequence εk ↓ α = α(X, Y ) then P(d(X, Y ) > εk ) ≤ εk and since I(d(X, Y ) > εk ) ↑ I(d(X, Y ) > α), by the monotone convergence theorem, P(d(X, Y ) > α) ≤ α. Thus, the infimum in the definition of α(X, Y ) is attained.
Lemma 37 α is a metric on L0 (Ω, S) which metrizes convergence in probability.
Proof. First of all, clearly, α(X, Y ) = 0 iff X = Y almost surely. To prove the triangle inequality,
P( d(X, Z) > α(X, Y ) + α(Y, Z) ) ≤ P( d(X, Y ) > α(X, Y ) ) + P( d(Y, Z) > α(Y, Z) ) ≤ α(X, Y ) + α(Y, Z),
so that α(X, Z) ≤ α(X, Y ) + α(Y, Z). This proves that α is a metric. Next, if αn = α(Xn , X) → 0 then for any ε > 0 and large enough n such that αn < ε,
P(d(Xn , X) > ε) ≤ P(d(Xn , X) > αn ) ≤ αn → 0.
Conversely, if Xn → X in probability then for any m ≥ 1 and large enough n ≥ n(m),
P( d(Xn , X) > 1/m ) ≤ 1/m,
which means that αn ≤ 1/m so that αn → 0.
Lemma 38 For X, Y ∈ L0 (Ω, S), the Levy-Prohorov metric ρ satisfies ρ(L(X), L(Y )) ≤ α(X, Y ).

Proof. Take ε > α(X, Y ) so that P(d(X, Y ) ≥ ε) ≤ ε. For any set A ⊆ S, P(X ∈ A) = P(X ∈ A, d(X, Y ) < ε) + P(X ∈ A, d(X, Y ) ≥ ε) ≤ P(Y ∈ Aε ) + ε which means that ρ(L(X), L(Y )) ≤ ε. Letting ε ↓ α(X, Y ) proves the result. We will now prove that, in some sense, the opposite is also true. Let (S, d) be a metric space and P, Q be probability laws on S. Suppose that these laws are close in the Levy-Prohorov metric ρ. Can we construct random variables s1 and s2 , with laws P and Q, that are define on the same probability space and are close to each other in the Ky Fan metric α? We will construct a distribution on the product space S × S such that the coordinates s1 and s2 have marginal distributions P and Q and the distribution is concentrated in the neighborhood of the diagonal s1 = s2 , where s1 and s2 are close in metric d, and the size of the neighborhood is controlled by ρ(P, Q). Consider two sets X and Y. Given a subset K ⊆ X × Y and A ⊆ X we define a K-image of A by AK = {y ∈ Y : ∃x ∈ A, (x, y) ∈ K}. A K-matching f of X into Y is a one-to-one function f : X → Y such that (x, f (x)) ∈ K. We will need the following well known matching theorem. Theorem 45 If X, Y are finite and for all A ⊆ X, card(AK ) ≥ card(A)

(19.0.1)

then there exists a K-matching f of X into Y . Proof. We will prove the result by induction on m = card(X). The case of m = 1 is obvious. For each x ∈ X there exists y ∈ Y such that (x, y) ∈ K. If there is a matching f of X \{x} into Y \{y} then defining f (x) = y extends f to X. If not, then since card(X \{x}) < m, by induction assumption, condition (19.0.1) is violated, i.e. there exists a set A ⊆ X \ {x} such that card(AK \ {y}) < card(A). But because we also know that card(AK ) ≥ card(A) this implies that card(AK ) = card(A). Since card(A) < m, by induction there exists a matching of A onto AK . If there is a matching of X \ A into Y \ AK we can combine it with a matching of A and AK . If not, again by induction assumption, there exists D ∈ X \ A such that card(DK \ AK ) < card(D). But then � � card (A ∪ D)K = card(DK \ AK ) + card(AK ) < card(D) + card(A) = card(D ∪ A), which contradicts the assumption (19.0.1).
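The matching theorem above is a form of Hall's theorem, and for finite sets the proof can be made algorithmic with the standard augmenting-path construction. The sketch below (the sets and the relation K are purely hypothetical examples, not from the text) finds a K-matching whenever one exists, so condition (19.0.1) can be verified simply by running it.

```python
# A sketch of finding a K-matching via augmenting paths (Hall / Koenig style).
# K is given as adjacency lists: K[x] = set of y's with (x, y) in K.
def k_matching(X, K):
    match_of_y = {}                      # y -> x currently matched to y

    def try_augment(x, visited):
        for y in K.get(x, ()):           # candidates y with (x, y) in K
            if y in visited:
                continue
            visited.add(y)
            # y is free, or its current partner can be re-matched elsewhere
            if y not in match_of_y or try_augment(match_of_y[y], visited):
                match_of_y[y] = x
                return True
        return False

    for x in X:
        if not try_augment(x, set()):
            return None                  # condition (19.0.1) fails for some A containing x
    return {x: y for y, x in match_of_y.items()}

# Tiny hypothetical example: X = {0, 1, 2}, Y = {'a', 'b', 'c'}.
K = {0: {'a', 'b'}, 1: {'a'}, 2: {'b', 'c'}}
print(k_matching([0, 1, 2], K))          # e.g. {0: 'b', 1: 'a', 2: 'c'}
```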

Theorem 46 (Strassen) Suppose that (S, d) is a separable metric space and α, β > 0. Suppose that laws P and Q are such that for all measurable sets F ⊆ S,
P(F ) ≤ Q(F α ) + β.    (19.0.2)
Then for any ε > 0 there exist two non-negative measures η, γ on S × S such that
1. µ = η + γ is a law on S × S with marginals P and Q.
2. η(d(x, y) > α + ε) = 0.
3. γ(S × S) ≤ β + ε.
4. µ is a finite sum of product measures.

Remark. Condition (19.0.2) is a relaxation of the definition of the Levy-Prohorov metric: one can take any α, β > ρ(P, Q). Conditions 1 - 3 mean that we can construct a measure µ on S × S such that the coordinates x, y have marginal distributions P, Q and are concentrated within distance α + ε of each other (condition 2), except for a set of measure at most β + ε (condition 3).
Proof. The proof will proceed in several steps.
Case A. We will start with the simplest case which is, however, at the core of everything else. Given small ε > 0, take n ≥ 1 such that nε > 1. Suppose that laws P, Q are uniform on finite subsets M, N ⊆ S of equal cardinality, card(M ) = card(N ) = n,
P(x) = Q(y) = 1/n < ε,   x ∈ M, y ∈ N.
Using condition (19.0.2), we would like to match as many points from M and N as possible, but only points that are within distance α from each other. To use the matching theorem, we will introduce some auxiliary sets U and V that are not too big, with size controlled by the parameter β, and the union of these sets with M and N satisfies a certain matching condition. Take an integer k such that βn ≤ k < (β + ε)n. Let us take sets U and V such that k = card(U ) = card(V ) and U, V are disjoint from M, N . Define X = M ∪ U, Y = N ∪ V. Let us define a subset K ⊆ X × Y such that (x, y) ∈ K if and only if one of the following holds:
1. x ∈ U,
2. y ∈ V,
3. d(x, y) ≤ α if x ∈ M, y ∈ N.
This means that the small auxiliary sets can be matched with any points, but only close points, d(x, y) ≤ α, can be matched in the main sets M and N. Consider a set A ⊆ X with cardinality card(A) = r. If A ⊄ M then by 1, AK = Y and card(AK ) ≥ r. Suppose now that A ⊆ M and we would like to show that again card(AK ) ≥ r. By (19.0.2),
r/n = P(A) ≤ Q(Aα ) + β = (1/n) card(Aα ∩ N ) + β ≤ (1/n) card(AK ∩ N ) + β
since, by 3, Aα ∩ N ⊆ AK . Therefore,
r = card(A) ≤ nβ + card(AK ∩ N ) ≤ k + card(AK ∩ N ) = card(AK ),
since k = card(V ) and AK = V ∪ (AK ∩ N ). By the matching theorem, there exists a K-matching f of X into Y . Let T = {x ∈ M : f (x) ∈ N }, i.e. close points, d(x, y) ≤ α, from M that are matched with points in N . Clearly, card(T ) ≥ n − k and for x ∈ T, by 3, d(x, f (x)) ≤ α. For x ∈ M \ T, redefine f (x) to match x with arbitrary points in N that are not matched with points in T. This defines a matching of M onto N. We define measures η and γ by
η = (1/n) Σ_{x∈T} δ(x, f (x)),   γ = (1/n) Σ_{x∈M \T} δ(x, f (x)),

and let µ = η + γ. First of all, obviously, µ has marginals P and Q because each point in M or N appears in the sum η + γ only once with weight 1/n. Also,
η(d(x, f (x)) > α) = 0,   γ(S × S) ≤ card(M \ T )/n ≤ k/n < β + ε.    (19.0.3)

Finally, both η and γ are finite sums of point masses which are product measures of point masses. Case B. Suppose now that P and Q are concentrated on finitely many points with rational probabilities. Then we can artificially split all points into ”smaller” points of equal probabilities as follows. Let n be such that nε > 1 and nP(x), nQ(x) ∈ J = {1, 2, . . . , n}. Define a discrete metric on J by f (i, j) = ε I(i = � j) and define a metric on S × J by � � e (x, i), (y, j) = d(x, y) + f (i, j). Define a measure P� on S × J as follows. If P(x) =

j n

then

� � 1 P� (x, i) = for i = 1, . . . , j. n Define Q� similarly. Let us check that laws P� , Q� satisfy the assumptions of Case A. Given a set F ⊆ S × J, define F1 = {x ∈ S : (x, j) ∈ F for some j}. Using (19.0.2), P� (F ) ≤ P(F1 ) ≤ Q(F1α ) + β ≤ Q� (F α+ε ) + β, because f (i, j) ≤ ε. By Case A in (19.0.3), we can construct µ� = η � + γ � with marginals P� and Q� such that � � � � η � e((x, i), (y, j)) > α + ε = 0, γ � (S × J) × (S × J) < β + ε. Let µ, η, γ be the projections of µ� , η � , γ � back onto S × S by the map ((x, i), (y, j)) → (x, y). Then, clearly, µ = η + γ, µ has marginals P and Q and γ(S × S) < β + ε. Finally, since � � e (x, i), (y, j) = d(x, y) + f (i, j) ≥ d(x, y), we get � � � � η d(x, y) > α + ε ≤ η � e((x, i), (y, j)) > α + ε = 0. Case C. (General case) Let P, Q be the laws on a separable metric space (S, d). Let A be a maximal set such that for all x, y ∈ A, d(x, y) ≥ ε. The set A is countable, A = {xi }i≥1 , because S is separable, and since A is maximal, for all x ∈ S there exists y ∈ A such that d(x, y) < ε. Such set A is usually called an ε-packing. Let us create a partition of S using ε-balls around {xi } : B1 = {x ∈ S : d(x, x1 ) < ε}, B2 = {d(x, x2 ) < ε} \ B1 and, iteratively for k ≥ 2, Bk = {d(x, xk ) < ε} \ (B1 ∪ · · · ∪ Bk−1 ). {Bk }k≥1 is a partition of S. Let us discretize measures P and Q by projecting them onto {xi }i≥1 : P� (xk ) = P(Bk ), Q� (xk ) = Q(Bk ). Consider any set F ⊆ S. For any point x ∈ F, if x ∈ Bk then d(x, xk ) < ε, i.e. xk ∈ F ε and, therefore, P(F ) ≤ P� (F ε ). Also, if xk ∈ F then Bk ⊆ F ε and, therefore, P� (F ) ≤ P(F ε ). To apply Case B, we need to approximate P� by a measure on a finite number of points with rationals probabilities. For large enough n ≥ 1, let P�� (xk ) =

�nP� (xk )� . n


Clearly, as n → ∞, P�� (xk ) ↑ P� (xk ). Since only a finite number of points carry non-zero weights P�� (xk ) > 0, let x0 be one of the other points in the sequence {xk }. Let us assign to it a probability � P�� (x0 ) = 1 − P�� (xk ). k≥1

If we take n large enough so that P�� (x0 ) < ε/2 then � |P�� (xk ) − P� (xk )| ≤ ε. k≥0

All the relations above also hold true for Q, Q� and Q�� that are defined similarly. We can write for F ⊆ S P�� (F ) ≤ P� (F ) + ε ≤ P(F ε ) + ε ≤ Q(F ε+α ) + β + ε ≤ Q� (F α+2ε ) + β + ε ≤ Q�� (F α+2ε ) + β + 2ε. By Case B, there exists a decomposition µ�� = η �� + γ �� on S × S with marginals P�� and Q�� such that � � η �� d(x, y) > α + 3ε = 0, γ �� (S × S) ≤ β + 3ε. Let us also assume that the points (x0 , xi ) and (xi , x0 ) for i ≥ 0 are included in the support of γ �� . Since the total weight of these points is at most ε, the total weight of γ �� does no increase much: γ �� (S × S) ≤ β + 5ε. It remains to redistribute these measures from sequence {xi }i≥0 to S in a way that recovers marginal distributions P and Q and so that not much accuracy is lost. Define a sequence of measures on S by Pi (C) =

P(CBi ) if P(Bi ) > 0 and Pi (C) = 0 otherwise P(Bi )

and define Qi similarly. The measures Pi and Qi are concentrated on Bi . Define � η= η �� (xi , xj )(Pi × Qj ) i,j≥1

The marginals of η satisfy
u(C) = η(C × S) = Σ_{i,j≥1} η′′(xi , xj ) Pi (C) ≤ Σ_{i≥1} η′′(xi , S) Pi (C) ≤ Σ_{i≥1} P′′(xi ) Pi (C) ≤ Σ_{i≥1} P′(xi ) Pi (C) = Σ_{i≥1} P(Bi ) Pi (C) = P(C)

and, similarly, v(C) = η(S × C) ≤ Q(C). Since η �� (xi , xj ) = 0 unless d(xi , xj ) ≤ α + 3ε, the measure � η= η �� (xi , xj )(Pi × Qj ) i,j≥1

is concentrated on the set {d(x, y) ≤ α + 5ε} because for x ∈ Bi , y ∈ Bj , d(x, y) ≤ d(x, xi ) + d(xi , xj ) + d(xj , y) ≤ ε + α + 3ε + ε = α + 5ε. If u(S) = v(S) = 1 then η(S × S) = 1 and η has marginals P and Q so we can take γ = 0. Otherwise, take t = 1 − u(S) and define 1 γ = (P − u) × (Q − v). t 80

It is easy to check that µ = η + γ has marginals P and Q. Also, γ(S × S) = t = 1 − η(S × S) = 1 − η �� (S × S) = γ �� (S × S) ≤ β + 5ε.

Relationships between metrics. The following relationship between Ky Fan and Levy-Prohorov metrics is an immediate consequence of Strassen’s theorem. We already saw that ρ(L(X), L(Y )) ≤ α(X, Y ). Theorem 47 If (S, d) is a separable metric space and P, Q are laws on S then for any ε > 0 there exist random variables X and Y with distributions L(X) = P and L(Y ) = Q such that α(X, Y ) ≤ ρ(P, Q) + ε. If P and Q are tight, one can take ε = 0. Proof. Let us take α = β = ρ(P, Q). Then, by definition of the Levy-Prohorov metric, for any ε > 0 and for any set A, P(A) ≤ Q(Aρ+ε ) + ρ + ε. By Strassen’s theorem, there exists a measure µ on S × S with marginals P, Q such that � � µ d(x, y) > ρ + 2ε ≤ ρ + 2ε.

(19.0.4)

Therefore, if X and Y are the coordinates of S × S, i.e. X, Y : S × S → S, X(x, y) = x, Y (x, y) = y, then by definition of the Ky Fan metric, α(X, Y ) ≤ ρ + 2ε. If P and Q are tight then there exists a compact K such that P(K), Q(K) ≥ 1 − δ. For ε = 1/n find µn as in (19.0.4). Since µn has marginals P and Q, µn (K × K) ≥ 1 − 2δ, which means that (µn )n≥1 are uniformly tight. By selection theorem, there exists a convergent subsequence µn(k) → µ. Obviously, µ has marginals P and Q. Since by construction, � 2� 2 µn d(x, y) > ρ + ≤ρ+ n n and {d(x, y) > ρ + 2/n} is an open set on S × S, by portmanteau theorem, � � 2� 2 � µ d(x, y) > ρ + ≤ lim inf µn(k) d(x, y) > ρ + ≤ ρ. k→∞ n n(k) Letting n → ∞ we get µ(d(x, y) > ρ) ≤ ρ and, therefore, α(X, Y ) ≤ ρ. This also implies the relationship between the Bounded Lipschitz metric β and Levy-Prohorov metric ρ. Lemma 39 If (S, d) is a separable metric space then � 1 β(P, Q) ≤ ρ(P, Q) ≤ 2 β(P, Q). 2 Proof. We already proved the second inequality. To prove the first one, given ε > 0 take random variables X and Y such that α(X, Y ) ≤ ρ + ε. Consider a bounded Lipschitz function f , ||f ||BL < ∞. Then � �� � � � � f dP − f dQ� = |Ef (X) − Ef (Y )| ≤ E|f (X) − f (Y )| � � ≤ �f �L (ρ + ε) + 2�f �∞ P d(X, Y ) > ρ + ε ≤

�f �L (ρ + ε) + 2�f �∞ (ρ + ε) ≤ 2�f �BL (ρ + ε).

Thus, β(P, Q) ≤ 2(ρ(P, Q) + ε) and letting ε → 0 finishes the proof.
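As a quick illustration of the Ky Fan metric (a sketch only; the distributions below are arbitrary, hypothetical choices, not from the text): given a joint sample of a pair (X, Y ), α(X, Y ) can be estimated directly from the empirical distribution of the distances d(X, Y ), since α is the smallest ε with P(d(X, Y ) > ε) ≤ ε, and Lemma 38 then bounds ρ(L(X), L(Y )) by this quantity.

```python
import numpy as np

def ky_fan(dists):
    """Empirical alpha = inf{eps >= 0 : P(d(X,Y) > eps) <= eps} from a sample of d(X,Y)."""
    d = np.sort(np.asarray(dists))
    n = len(d)
    # tail[k] = empirical P(d(X,Y) > d[k]), computed from the sorted sample
    tail = 1.0 - np.searchsorted(d, d, side="right") / n
    ok = np.nonzero(tail <= d)[0]
    return d[ok[0]] if len(ok) else d[-1]

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100_000)           # X ~ N(0,1), an illustrative choice
for noise in [0.001, 0.01, 0.1, 0.5]:
    y = x + rng.normal(0.0, noise, size=x.size)  # Y = X + small perturbation
    a = ky_fan(np.abs(x - y))
    print(noise, round(a, 4))  # alpha shrinks with the perturbation; rho(L(X), L(Y)) <= alpha
```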


Section 20

Kantorovich-Rubinstein Theorem. Let (S, d) be a separable metric space. Denote by P1 (S) the set of all laws on S such that for some z ∈ S (equivalently, for all z ∈ S), � d(x, z)P(x) < ∞. S

Let us denote by � � M (P, Q) = µ : µ is a law on S × S with marginals P and Q . Definition. For P, Q ∈ P1 (S), the quantity �� � W (P, Q) = inf d(x, y)dµ(x, y) : µ ∈ M (P, Q) is called the Wasserstein distance between P and Q. A measure µ ∈ M (P, Q) represents a transportation between measures P and Q. We can think of the conditional distribution µ(y|x) as a way to redistribute the mass in the neighborhood of a point x so that the distribution P will be redistributed to the distribution Q. If the distance d(x, y) represents the cost of moving x to y then the Wasserstein distance gives the optimal total cost of transporting P to Q. Given any two laws P and Q on S, let us define � � ��� � � � γ(P, Q) = sup � f dP − f dQ� : ||f ||L ≤ 1 and

� �� � md (P, Q) = sup f dP + gdQ : f, g ∈ C(S), f (x) + g(y) < d(x, y) .
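For real-valued distributions with the cost d(x, y) = |x − y|, the Wasserstein distance between two empirical measures with the same number of atoms is attained by the monotone (sorted) coupling, which makes W easy to compute from samples. The sketch below (the two example distributions are arbitrary illustrative choices, not part of the text) computes W this way and checks the inequality |∫ f dP − ∫ f dQ| ≤ W for a couple of 1-Lipschitz test functions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(0.0, 1.0, size=n)   # sample from P (illustrative choice)
y = rng.normal(0.5, 2.0, size=n)   # sample from Q (illustrative choice)

# W(P_n, Q_n) for empirical measures with n atoms each and cost |x - y|:
# the order-preserving coupling (i-th smallest with i-th smallest) is optimal.
w1 = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Any 1-Lipschitz f gives |int f dP - int f dQ| <= W(P, Q).
for name, f in [("f(x) = x", lambda t: t), ("f(x) = |x|", np.abs)]:
    gap = abs(f(x).mean() - f(y).mean())
    print(name, round(gap, 4), "<=", round(w1, 4))
```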

Lemma 40 We have γ(P, Q) = md (P, Q).
Proof. Given a function f such that ||f ||L ≤ 1 let us take a small ε > 0 and g(y) = −f (y) − ε. Then
f (x) + g(y) = f (x) − f (y) − ε ≤ d(x, y) − ε < d(x, y)
and
∫ f dP + ∫ g dQ = ∫ f dP − ∫ f dQ − ε.
Combining with the choice of −f (x) and g(y) = f (y) − ε we get
| ∫ f dP − ∫ f dQ | ≤ sup{ ∫ f dP + ∫ g dQ : f (x) + g(y) < d(x, y) } + ε,
which, of course, proves that
γ(P, Q) ≤ sup{ ∫ f dP + ∫ g dQ : f (x) + g(y) < d(x, y) }.
Let us now consider functions f, g such that f (x) + g(y) < d(x, y). Define
e(x) = inf_y ( d(x, y) − g(y) ) = − sup_y ( g(y) − d(x, y) ).
Clearly, f (x) ≤ e(x) ≤ d(x, x) − g(x) = −g(x) and, therefore,
∫ f dP + ∫ g dQ ≤ ∫ e dP − ∫ e dQ.
Function e satisfies
e(x) − e(x′ ) = sup_y ( g(y) − d(x′ , y) ) − sup_y ( g(y) − d(x, y) ) ≤ sup_y ( d(x, y) − d(x′ , y) ) ≤ d(x, x′ )

which means that ||e||L = 1. This finishes the proof. We will need the following version of the Hahn-Banach theorem. Theorem 48 (Hahn-Banach) Let V be a normed vector space, E - a linear subspace of V and U - an open convex set in V such that U ∩ E �= ∅. If r : E → R is a linear non-zero functional on E then there exists a linear functional ρ : V → R such that ρ|E = r and supU ρ(x) = supU ∩E r(x). Proof. Let t = sup{r(x) : x ∈ U ∩ E} and let B = {x ∈ E : r(x) > t}. Since B is convex and U ∩ B = ∅, the Hahn-Banach separation theorem implies that there exists a linear functional q : V → R such that supU q(x) ≤ inf B q(x). For any x0 ∈ U ∩ E let F = {x ∈ E : q(x) = q(x0 )}. Since q(x0 ) < inf B q(x), F ∩ B = ∅. This means that the hyperplanes {x ∈ E : q(x) = q(x0 )} and {x ∈ E : r(x) = t} in the subspace E are parallel and this implies that q(x) = αr(x) on E for some α = � 0. Let ρ = q/α. Then r = ρ|E and sup ρ(x) = U

1 1 sup q(x) ≤ inf q(x) = inf r(x) = t = sup r(x). B α U α B U ∩E

Since r = ρ|E , this finishes the proof.

Theorem 49 If S is a compact metric space then W (P, Q) = md (P, Q) for P, Q ∈ P1 (S). Proof. Consider a vector space V = C(S × S) equipped with � · �∞ norm and let � � U = f ∈ V : f (x, y) < d(x, y) . Obviously, U is convex and open because S × S is compact and any continuous function on a compact achieves its maximum. Consider a linear subspace E of V defined by � � E = φ ∈ V : φ(x, y) = f (x) + g(y) so that � � U ∩ E = f (x) + g(y) < d(x, y) .


Define a linear functional r on E by � r(φ) =

� f dP +

gdQ if φ = f (x) + g(y).

By the above Hahn-Banach theorem, r can be extended to ρ : V → R such that ρ|E = r and sup ρ(φ) = sup r(φ) = md (P, Q). U

U ∩E

Let us look at the properties of this functional. First of all, if a(x, y) ≥ 0 then ρ(a) ≥ 0. Indeed, for any c ≥ 0 U � d(x, y) − c · a(x, y) − ε < d(x, y) and, therefore, for all c ≥ 0 ρ(d − ca − ε) = ρ(d) − cρ(a) − ρ(ε) ≤ sup ρ < ∞. U

This can hold only if ρ(a) ≥ 0. This implies that if φ1 ≤ φ2 then ρ(φ1 ) ≤ ρ(φ2 ). For any function φ, both −φ, φ ≤ �φ�∞ · 1 and, by monotonicity of ρ, |ρ(φ)| ≤ ||φ||∞ ρ(1) = ||φ||∞ . Since S × S is compact and ρ is a continuous functional on (C(S × S), � · �∞ ), by the Reisz representation theorem there exists a unique measure µ on the Borel σ-algebra on S × S such that � ρ(f ) = f (x, y)dµ(x, y). Since ρ|E = r, �

� (f (x) + g(y))dµ(x, y) =

� f dP +

gdQ

which implies that µ ∈ M (P, Q). We have �� � � md (P, Q) = sup ρ(φ) = sup f (x, y)dµ(x, y) : f (x, y) < d(x, y) = d(x, y)dµ(x, y) ≥ W (P, Q). U

The opposite inequality is easy because for any f, g such that f (x) + g(y) < d(x, y) and any ν ∈ M (P, Q), � � � � f dP + gdQ = (f (x) + g(y))dν(x, y) ≤ d(x, y)dν(x, y). (20.0.1) This finishes the proof and, moreover, it shows that the infimum in the definition of W is achieved on µ. Remark. Notice that in the proof of this theorem we never used the fact that d is a metric. Theorem holds for any d ∈ C(S × S) under the corresponding integrability assumptions. For example, one can consider loss functions of the type d(x, y)p for p > 1, which are not necessarily metrics. However, in Lemma 40, the fact that d is a metric was essential. Our next goal will be to show that W = γ on separable and not necessarily compact metric spaces. We start with the following. Lemma 41 If (S, d) is a separable metric space then W and γ are metrics on P1 (S). Proof. Since for a bounded Lipschitz metric β we have β(P, Q) ≤ γ(P, Q), γ is also a metric because if γ(P, Q) = 0 then β(P, Q) = 0 and, therefore, P = Q. As in (20.0.1), it should be obvious that γ(P, Q) = md (P, Q) ≤ W (P, Q) and if W (P, Q) = 0 then γ(P, Q) = 0 and P = Q. Symmetry of W is obvious. It remains to show that W (P, Q) satisfies the triangle inequality. The idea will be rather simple, but to have well-defined 84

conditional distributions we will need to approximate distributions on S × S with given marginals by a more regular disributions with the same marginals. Let us first explain the main idea. Consider three laws P, Q, T on S and let µ ∈ M (P, Q) and ν ∈ M (Q, T) be such that � � d(x, y)dµ(x, y) ≤ W (P, Q) + ε and d(y, z)dν(y, z) ≤ W (Q, T) + ε. Let us generate a distribution γ on S ×S ×S with marginals P, Q and T and marginals on pairs of coordinates (x, y) and (y, z) given by µ and ν by ”gluing” µ and ν in the following way. Let us generate y from distribution Q and, given y, generate x and z according to conditional distributions µ(x|y) and ν(z|y) independently of each other, i.e. γ(x, z|y) = µ(x|y) × ν(z|y). Obviously, by construction, (x, y) has distribution µ and (y, z) has distribution ν. Therefore, the marginals of x and z are P and T which means that the pair (x, z) has distribution η ∈ M (P, T). Finally, � � � � W (P, T) ≤ d(x, z)dη(x, z) = d(x, z)dγ(x, y, z) ≤ d(x, y)dγ + d(y, z)dγ � � = d(x, y)dµ + d(y, z)dν ≤ W (P, Q) + W (Q, T) + 2ε. Letting ε → 0 proves the triangle inequality for W . It remains to explain how the conditional distributions can be well defined. Let us modify µ by ’discretizing’ it without losing much in the transportation cost integral. Given ε > 0, consider a partition (Sn )n≥1 of S such that diameter(Sn ) < ε for all n. This can be done as in the proof of Strassen’s theorem, Case C. On each box Sn × Sm let µ1nm (C) =

µ((C ∩ Sn ) × Sm ) µ(Sn × (C ∩ Sm )) , µ2nm (C) = µ(Sn × Sm ) µ(Sn × Sm )

be the marginal distributions of the conditional distribution of µ on Sn × Sm . Define � µ� = µ(Sn × Sm ) µ1nm × µ2nm . n,m

In this construction, locally on each small box Sn × Sm , measure µ is replaced by the product measure with the same marginals. Let us compute the marginals of µ� . Given a set C ⊆ S, � µ� (C × S) = µ(Sn × Sm ) µ1nm (C) × µ2nm (S) n,m

=



µ((C ∩ Sn ) × Sm ) =

n,m

� n

µ((C ∩ Sn ) × S) =



P(C ∩ Sn ) = P(C).

n

Similarly, µ� (S × C) = Q(C), so µ� has the same marginals as µ, µ� ∈ M (P, Q). It should be obvious that transportation cost integral does not change much by replacing µ with µ� . One can visualize this by looking at what happens locally on each small box Sn × Sm . Let (Xn , Ym ) be a random pair with distribution µ restricted to Sn × Sm so that � 1 Ed(Xn , Ym ) = d(x, y)dµ(x, y). µ(Sn × Sm ) Sn ×Sm Let Ym� be an independent copy of Ym , also independent of Xn , i.e. the joint distribution of (Xn , Ym� ) is µ1nm × µ2nm and � 1 Ed(Xn , Ym� ) = d(x, y)d(µnm × µ2nm )(x, y). Sn ×Sm

Then

� d(x, y)dµ(x, y) =



µ(Sn × Sm )Ed(Xn , Ym ),

n,m




d(x, y)dµ� (x, y) =



µ(Sn × Sm )Ed(Xn , Ym� ).

n,m

Finally,

d(Ym , Ym� )

≤ diam(Sm ) ≤ ε and these two integrals differ by at most ε. Therefore, � d(x, y)dµ� (x, y) ≤ W (P, Q) + 2ε.

Similarly, we can define ν� =



1 2 ν(Sn × Sm ) νnm × νnm

n,m

such that



d(x, y)dν � (x, y) ≤ W (Q, T) + 2ε.

We will now show that this special simple form of the distributions µ� (x, y), ν � (y, z) ensures that the condi­ tional distributions of x and z given y are well defined. Let Qm be the restriction of Q to Sm , � Qm (C) = Q(C ∩ Sm ) = µ(Sn × Sm ) µ2nm (C). n

Obviously, if Qm (C) = 0 then µ2nm (C) = 0 for all n, which means that µ2nm are absolutely continuous with respect to Qm and the Radon-Nikodym derivatives fnm (y) =

� dµ2nm (y) exist and µ(Sn × Sm )fnm (y) = 1 a.s. for y ∈ Sm . dQm n

Let us define a conditional distribution of x given y by � µ� (A|y) = µ(Sn × Sm )fnm (y)µ1nm (A). n,m

Notice that for any A ∈ B, µ� (A|y) is measurable in y and µ� (A|y) is a probability distribution on B, Q-a.s. over y because � µ� (S|y) = µ(Sn × Sm )fnm (y) = 1 a.s. n,m

Let us check that for Borel sets A, B ∈ B, µ� (A × B) =



µ� (A|y)dQ(y).

B

Indeed, since fnm (y) = 0 for y �∈ Sm , � � � µ� (A|y)dQ(y) = µ(Sn × Sm )µ1nm (A) fnm (y)dQ(y) B

B

n,m

=



µ(Sn × Sm )µ1nm (A)



fnm (y)dQm (y) B

n,m

=



µ(Sn × Sm )µ1nm (A)µ2nm (B) = µ� (A × B).

n,m

Conditional distribution ν � (·|y) can be defined similarly. Next lemma shows that on a separable metric space any law with the ”first moment”, i.e. P ∈ P1 (S), can be approximated in metrics W and γ by laws concentrated on finite sets.


Lemma 42 If (S, d) is separable and P ∈ P1 (S) then there exists a sequence of laws Pn such that Pn (Fn ) = 1 for some finite sets Fn and W (Pn , P), γ(Pn , P) → 0. Proof. For each n ≥ 1, let (Snj )j≥1 be a partition of S such that diam(Snj ) ≤ 1/n. Take a point xnj ∈ Snj in each set Snj and for k ≥ 1 define a function � xnj , if x ∈ Snj for j ≤ k, fnk (x) = xn1 , if x ∈ Snj for j > k. We have, � � � d(x, fnk (x))dP(x) =

1� d(x, fnk (x))dP(x) ≤ P(Snj ) + n Snj

j≥1



j≤k

d(x, xn1 )dP(x) ≤ S\(Sn1 ∪···∪Snk )

2 n

� for k large enough because P ∈ P1 (S), i.e. d(x, xn1 )dP(x) < ∞, and the set S \ (Sn1 ∪ · · · ∪ Snk ) ↓ ∅. Let µn be the image on S × S of the measure P under the map x → (fnk (x), x) so that µn ∈ M (Pn , P) for some Pn concentrated on the set of points {xn1 , . . . , xnk }. Finally, � � 2 W (Pn , P) ≤ d(x, y)dµn (x, y) = d(fnk (x), x)dP(x) ≤ . n Since γ(Pn , P) ≤ W (Pn , P), this finishes the proof. We are finally ready to extend Theorem 49 to separable metric spaces. Theorem 50 (Kantorovich-Rubinstein) If (S, d) is a separable metric space then for any two distributions P, Q ∈ P1 (S) we have W (P, Q) = γ(P, Q). Proof. By previous lemma, we can approximate P and Q by Pn and Qn concentrated on finite (hence, compact) sets. By Theorem 49, W (Pn , Qn ) = γ(Pn , Qn ). Finally, since both W, γ are metrics, W (P, Q)

≤ W (P, Pn ) + W (Pn , Qn ) + W (Qn , Q) = W (P, Pn ) + γ(Pn , Qn ) + W (Qn , Q)
≤ W (P, Pn ) + W (Qn , Q) + γ(Pn , P) + γ(Qn , Q) + γ(P, Q).
Letting n → ∞ proves that W (P, Q) ≤ γ(P, Q).

Wasserstein distance Wp (P, Q). Given p ≥ 1, let us define the Wasserstein distance Wp (P, Q) on Pp (Rn ) = {P : ∫ |x|p dP(x) < ∞} corresponding to the cost function d(x, y) = |x − y|p by
Wp (P, Q)^p := inf{ ∫ |x − y|p dµ(x, y) : µ ∈ M (P, Q) } = sup{ ∫ f dP + ∫ g dQ : f (x) + g(y) < |x − y|p }.    (20.0.2)
Even though for p > 1 the function d(x, y) is not a metric, equality in (20.0.2) for compactly supported measures P and Q follows from the proof of Theorem 49, which does not require that d is a metric. Then one can easily extend (20.0.2) to the entire space Rn . Moreover, Wp is a metric on Pp (Rn ), which can be shown the same way as in Lemma 41. Namely, given nearly optimal µ ∈ M (P, Q) and ν ∈ M (Q, T) we can construct (X, Y, Z) with marginals P, Q, T such that (X, Y ) ∼ µ and (Y, Z) ∼ ν and, therefore,
Wp (P, T) ≤ (E|X − Z|p )^{1/p} ≤ (E|X − Y |p )^{1/p} + (E|Y − Z|p )^{1/p} ≤ (Wp (P, Q)^p + ε)^{1/p} + (Wp (Q, T)^p + ε)^{1/p} .
Let ε ↓ 0.


Section 21

Prekopa-Leindler inequality, entropy and concentration. In this section we will make several connections between the Kantorovich-Rubinstein theorem and other classical objects. Let us start with the following classical inequality. Theorem 51 (Prekopa-Leindler) Consider nonnegative integrable functions w, u, v : Rn → [0, ∞) such that for some λ ∈ (0, 1), w(λx + (1 − λ)y) ≥ u(x)λ v(y)1−λ for all x, y ∈ Rn . Then, � wdx ≥

��

�λ �� �1−λ udx vdx .

Proof. The proof will proceed by induction on n. Let us first show the induction step. Suppose the statement holds for n and we would like to show it for n + 1. By assumption, for any x, y ∈ Rn and a, b ∈ R w(λx + (1 − λ)y, λa + (1 − λ)b) ≥ u(x, a)λ v(y, b)1−λ . Let us fix a and b and consider functions w1 (x) = w(x, λa + (1 − λ)b), u1 (x) = u(x, a), v1 (x) = v(x, b) on Rn that satisfy

w1 (λx + (1 − λ)y) ≥ u1 (x)λ v1 (y)1−λ .

By induction assumption, � w1 dx ≥ Rn

��

�λ �� u1 dx

Rn

v1 dx

�1−λ

.

Rn

These integrals still depend on a and b and we can define � � w2 (λa + (1 − λ)b) = w1 dx = Rn

w(x, λa + (1 − λ)b)dx Rn

and, similarly, � u2 (a) =

� u1 (x, a)dx, v2 (b) =

Rn

v1 (x, b)dx Rn

so that w2 (λa + (1 − λ)b) ≥ u2 (a)λ v2 (b)1−λ .


These functions are defined on R and, by induction assumption, � � �� �λ �� �1−λ �� w2 ds ≥ u2 ds v2 ds =⇒ wdz ≥ R

R

Rn+1

R

udz

�λ ��

Rn+1

vdz

�1−λ

,

Rn+1

which finishes the proof of the induction step. It remains to prove the case n = 1. Let us show two different proofs. 1. One approach is based on the Brunn-Minkowski inequality on the real line which says that, if γ is the Lebesgue measure and A, B are Borel sets on R, then γ(λA + (1 − λ)B) ≥ λγ(A) + (1 − λ)γ(B), where A+B is the set addition, i.e. A+B = {a+b : a ∈ A, b ∈ B}. We can also assume that u, v, w : R → [0, 1] because the inequality is homogeneous to scaling. We have {w ≥ a} ⊇ λ{u ≥ a} + (1 − λ){v ≥ a} because if u(x) ≥ a and v(y) ≥ a then, by assumption, w(λx + (1 − λ)y) ≥ u(x)λ v(y)1−λ ≥ aλ a1−λ = a. The Brunn-Minkowski inequality implies that γ(w ≥ a) ≥ λγ(u ≥ a) + (1 − λ)γ(v ≥ a). Finally, �

1

� � w(z)dz

R

R

=

γ(w ≥ x)dx

0 1

0

� 1 γ(u ≥ x)dx + (1 − λ) γ(v ≥ x)dx 0 0 � � �� �λ �� �1−λ λ u(z)dz + (1 − λ) v(z)dz ≥ u(z)dz v(z)dz . �



1

� I(x ≤ w(z))dxdz =

= λ

R

R

R

R

� � 2. Another approach is based on the transportation of measure. We can assume that u = v = 1 by rescaling u v w u → � , v → � , w → � λ � 1−λ . u v ( u) ( v) � Then we need to show that w ≥ 1. Without loss of generality, let us assume that u, v ≥ 0 are smooth and strictly positive, since one can easily reduce to this case. Define x(t), y(t) for 0 ≤ t ≤ 1 by �

x(t)



y(t)

u(s)ds = t,

v(s)ds = t.

−∞

−∞

Then u(x(t))x� (t) = 1, u(y(t))y � (t) = 1 and the derivatives x� (t), y � (t) > 0. Define z(t) = λx(t) + (1 − λ)y(t). Then �

+∞

� w(s)ds =

−∞

1

� w(z(s))dz(s) =

0

1

w(λx(s) + (1 − λ)y(s))z � (s)ds.

0

By arithmetic-geometric mean inequality z � (s) = λx� (s) + (1 − λ)y � (s) ≥ (x� (s))λ (y � (s))1−λ


and, by assumption, w(λx(s) + (1 − λ)y(s)) ≥ u(x(s))λ v(y(s))1−λ . Therefore, � w(s)ds ≥

� 1�

� �λ � �1−λ u(x(s))x� (s) v(y(s))y � (s) ds =

0

1

1ds = 1.

0

This finishes the proof of theorem. Entropy and the Kullback-Leibler divergence. Consider a probability measure P on Rn and a nonneg­ ative measurable function u : Rn → [0, ∞). Definition (Entropy)We define the entropy of u with respect to P by � � � EntP (u) = u log udP − udP · log udP. One can give a different representation of entropy by � �� � EntP (u) = sup uvdP : ev dP ≤ 1 .

(21.0.1)

� Indeed, if we consider a convex set V = {v : ev dP ≤ 1} then the above supremum is obviously a solution of the following saddle point problem: � �� � L(v, λ) = uvdP − λ ev dP − 1 → sup inf . v

λ≥0

The functional L is linear in λ and concave in v. Therefore, by the minimax theorem, a saddle point solution exists and sup inf = inf sup . The integral � � � uvdP − λ ev dP = (uv − λev )dP can be maximized pointwise by taking v such that u = λev . Then � � u L(v, λ) = u log dP − udP + λ λ � � and maximizing over λ gives λ = u and v = log(u/ u). This proves (21.0.1). Suppose now that a law Q is absolutely continuous with respect to P and denote its Radon-Nikodym derivative by u=

dQ . dP

(21.0.2)

Definition (Kullback-Leibler divergence) The quantity � � dQ D(Q||P) := log u dQ = log dQ dP is called the Kullback-Leibler divergence between P and Q. Clearly, D(Q||P) = EntP (u), since � � � � dQ dQ dQ dQ dQ EntP (u) = log · dP − dP · log dP = log dQ. dP dP dP dP dP The variational characterization (21.0.1) implies that � � � if ev dP ≤ 1 then vdQ = uvdP ≤ D(Q||P ). 90

(21.0.3)

Transportation inequality for log-concave measures. Suppose that a probability distribution P on Rn has the Lebesgue density e−V (x) where V (x) is strictly convex in the following sense: tV (x) + (1 − t)V (y) − V (tx + (1 − t)y) ≥ Cp (1 − t + o(1 − t))|x − y|p

(21.0.4)

as t → 1 for some p ≥ 2 and Cp > 0. Example. One example of the distribution that satisfies (21.0.4) is the non-degenerate normal distri­ bution N (0, C) that corresponds to 1 V (x) = (C −1 x, x) + const 2 for some covariance matrix C, det C �= 0. If we denote A = C −1 /2 then t(Ax, x) + (1 − t)(Ay, y) − (A(tx + (1 − t)y), (tx + (1 − t)y)) 1 = t(1 − t)(A(x − y), (x − y)) ≥ t(1 − t)|x − y|2 , 2λmax (C)

(21.0.5)

where λmax (C) is the largest eigenvalue of C. Thus, (21.0.4) holds with p = 2 and Cp = 1/(2λmax (C)). Let us prove the following useful inequality for the Wasserstein distance. Theorem 52 If P satisfies (21.0.4) and Q is absolutely continuous w.r.t. P then Wp (Q, P)p ≤

1 D(Q�P). Cp

Proof. Take functions f, g ∈ C(Rn ) such that f (x) + g(y) ≤

1 Cp (1 − t + o(1 − t))|x − y|p . t(1 − t)

Then, by (21.0.4), f (x) + g(y) ≤

� 1 � tV (x) + (1 − t)V (y) − V (tx + (1 − t)y) t(1 − t)

and t(1 − t)f (x) − tV (x) + t(1 − t)g(y) − (1 − t)V (y) ≤ −V (tx + (1 − t)y). This implies that w(tx + (1 − t)y) ≥ u(x)t v(y)1−t for u(x) = e(1−t)f (x)−V (x) , v(y) = etg(y)−V (y) and w(z) = e−V (z) . By the Prekopa-Leindler inequality, �� �t �� �1−t � (1−t)f (x)−V (x) tg(x)−V (x) e dx e dx ≤ e−V (x) dx and since e−V is the density of P we get 1 �� �� �t �� �1−t �� � 1−t � 1t (1−t)f tg (1−t)f etg dP ≤ 1. e dP e dP ≤ 1 and e dP It is a simple calculus exercise to show that lim

��

s→0

� 1s R esf dP = e f dP ,

91

and, therefore, letting t → 1 proves that �

p

if f (x) + g(y) ≤ Cp |x − y| If we denote v = g +



then

eg dP · e

R

f dP

≤ 1.

� f dP then the last inequality is ev dP ≤ 1 and (21.0.3) implies that � � � vdQ = f dP + gdQ ≤ D(Q||P).

Finally, using the Kantorovich-Rubinstein theorem, (20.0.2), we get �� � 1 p p Wp (Q, P) = inf Cp |x − y| dµ(x, y) : µ ∈ M (P, Q) Cp �� � � 1 1 p = sup f dP + gdQ : f (x) + g(y) ≤ Cp |x − y| ≤ D(Q||P) Cp Cp and this finishes the proof. Concentration of Gaussian measure. Applying this result to the example before Theorem 52 gives that for the non-degenerate Gaussian distribution P = N (0, C), � W2 (P, Q) ≤ 2λmax (C)D(Q||P). (21.0.6) Given a measurable set A ⊆ Rn with P(A) > 0, define the conditional distribution PA by PA (C) =

P(CA) . P(A)

Then, obviously, the Radon-Nikodym derivative dPA 1 = IA dP P(A) and the Kullback-Leibler divergence � D(PA ||P) =

log A

1 1 dPA = log . P(A) P(A)

Since W2 is a metric, for any two Borel sets A and B W2 (PA , PB ) ≤ W2 (PA , P) + W2 (PB , P) ≤



2λmax (C)





1 log + P(A)

� log

1 � . P(B)

Suppose that the sets A and B are apart from each other by a distance t, i.e. d(A, B) ≥ t > 0. Then any two points in the support of measures PA and PB are at a distance at least t from each other and the transportation distance W2 (PA , PB ) ≥ t. Therefore, � � � � � 1 1 � 1 ≤ 4λmax (C) log t ≤ W2 (PA , PB ) ≤ 2λmax (C) log + log . P(A) P(B) P(A)P(B) Therefore, P(B) ≤

� � 1 t2 exp − . P(A) 4λmax (C)

In particular, if B = {x : d(x, A) ≥ t} then � � P d(x, A) ≥ t ≤

� � t2 1 exp − . P(A) 4λmax (C) 92

If the set A is not too small, e.g. P(A) ≥ 1/2, this implies that � � � P d(x, A) ≥ t ≤ 2 exp −

t2



4λmax (C)

.

This shows that the Gaussian measure is exponentially concentrated near any ”large enough” set. The constant 1/4 in the exponent is not optimal and can be replaced by 1/2; this is just an example of application of the above ideas. The optimal result is the famous Gaussian isoperimetry, if P(A) = P(B) for some half-space B then P(At ) ≥ P(B t ). Gaussian concentration via the Prekopa-Leindler inequality. If we denote c = 1/λmax (C) then setting t = 1/2 in (21.0.5), �x + y� c V (x) + V (y) − 2V ≥ |x − y|2 . 2 4 Given a function f on Rn let us define its infimum-convolution by � � c g(y) = inf f (x) + |x − y |2 . x 4 Then, for all x and y, g(y) − f (x) ≤

�x + y� c |x − y|2 ≤ V (x) + V (y) − 2V . 2 4

(21.0.7)

If we define u(x) = e−f (x)−V (x) , v(y) = eg(y)−V (y) , w(z) = e−V (z) then (21.0.7) implies that �x + y�

≥ u(x)1/2 v(y)1/2 . 2 The Prekopa-Leindler inequality with λ = 1/2 implies that � � eg dP e−f dP ≤ 1. w

(21.0.8)

Given a measurable set A, let f be equal to 0 on A and +∞ on the complement of A. Then g(y) =

c d(x, A)2 4

and (21.0.8) implies �

�c � 1

exp d(x, A)2 dP(x) ≤ .

4 P(A)

By Chebyshev’s inequality, � � P d(x, A) ≥ t ≤

� ct2 � � � 1 1 t2 exp − = exp − . P(A) P(A) 4 4λmax (C)

Trivial metric and total variation. Definition A total variation distance between probability measure P and Q on a measurable space (S, B) is defined by TV(P, Q) = sup |P(A) − Q(A)|. A∈B

Using the Hahn-Jordan decomposition, we can represent a signed measure µ = P − Q as µ = µ+ − µ− such that for some set D ∈ B and for any set E ∈ B, µ+ (E) = µ(ED) ≥ 0 and µ− (E) = −µ(EDc ) ≥ 0. 93

Therefore, for any A ∈ B, P(A) − Q(A) = µ+ (A) − µ− (A) = µ+ (AD) − µ− (ADc ) which makes it obvious that sup |P(A) − Q(A)| = µ+ (D). A∈B

Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-Rubinstein theorem. Let us start with the following simple observation. � Lemma 43 If f is a measurable function on S such that |f | ≤ 1 and f dP = 0 then for any λ ∈ R, � 2 eλf dP ≤ eλ /2 . Proof. Since (1 + f )/2, (1 − f )/2 ∈ [0, 1] and λf =

1+f 1−f λ+ (−λ), 2 2

by convexity of ex we get eλf ≤

1 + f λ 1 − f −λ e + e = ch(λ) + f sh(λ). 2 2

Therefore, �

eλf dP ≤ ch(λ) ≤ eλ

2

/2

,

where the last inequality is easy to see by Taylor’s expansion. Let us now consider a trivial metric on S given by d(x, y) = I(x = � y).

(21.0.9)

Then a 1-Lipschitz function f w.r.t. d, �f �L ≤ 1, is defined by the condition that for all x, y ∈ S, |f (x) − f (y)| ≤ 1.

(21.0.10)

Formally, the Kantorovich-Rubinstein theorem in this case would state that �� � W (P, Q) := inf I(x = � y)dµ(x, y) : µ ∈ M (P, Q) � � ��� � � � = sup � f dQ − f dP� : �f �L ≤ 1 =: γ(P, Q). However, since any uncountable set S is not separable w.r.t. a trivial metric d, we can not apply the Kantorovich-Rubinstein theorem directly. In this case one can use the Hahn-Jordan decomposition to show that γ coincides with the total variation distance, γ(P, Q) = TV(P, Q) and it is easy to construct a measure µ ∈ M (P, Q) explicitly that witnesses the above equality. We leave this as an exercise. Thus, for the trivial metric d, W (P, Q) = γ(P, Q) = TV(P, Q). We have the following analogue of the KL divergence bound. Theorem 53 If Q is absolutely continuous w.r.t. P then � TV(P, Q) ≤ 2D(Q||P). 94

� � Proof. Take f such that (21.0.10) holds. If we define g(x) = f (x)− f dP then, clearly, |g | ≤ 1 and gdP = 0. The above lemma implies that for any λ ∈ R, � R 2 eλf −λ f dP−λ /2 dP ≤ 1. The variational characterization of entropy (21.0.3) implies that � � λ f dQ − λ f dP − λ2 /2 ≤ D(Q||P) and for λ > 0 we get �

� f dQ −

f dP ≤

λ 1 + D(Q||P). λ 2

Minimizing the right hand side over λ > 0, we get � � � f dQ − f dP ≤ 2D(Q||P). Applying this to f and −f yields the result.

95

Section 22

Stochastic Processes. Brownian Motion. We have developed a general theory of convergence of laws on (separable) metric spaces and in the following two sections we will look at some specific examples of convergence on the spaces of continuous functions (C[0, 1], � · �∞ ) and (C(R+ ), d), where d is a metric metrizing uniform convergence on compacts. These examples will describe a certain central limit theorem type results on these spaces and in this section we will define the corresponding limiting Gaussian laws, namely, the Brownian motion and Brownian bridge. We will start with basic definitions and basic regularity results in the presence of continuity. Given a set T and a probability space (Ω, F, P), a stochastic process is a function Xt (ω) = X(t, ω) : T × Ω → R such that for each t ∈ T , Xt : Ω → R is a random variable, i.e. a measurable function. In other words, a stochastic process is a collection of random variables Xt indexed by a set T. A stochastic process is often defined by specifying finite dimensional (f.d.) distributions PF = L({Xt }t∈F ) for all finite subsets F ⊆ T. Kolmogorov’s theorem then guarantees the existence of a probability space on which the process is defined, under the natural consistency condition � � F1 ⊆ F2 ⇒ PF1 = PF2 � F . R

1

One can also think of a process as a function on Ω with values in RT = {f : T → R}, because for a fixed ω ∈ Ω, Xt (ω) ∈ RT is a (random) function of t. In Kolmogorov’s theorem, given a family of consistent f.d. distributions, a process was defined on the probability space (RT , BT ), where BT is the cylindrical σ-algebra generated by the algebra of cylinders B×RT \F for Borel sets B in RF and all finite F. When T is uncountable, some very natural sets such as � �

sup Xt > 1 = {Xt > 1} t∈T

t∈T

might be not measurable on BT . However, in our examples we will deal with continuous processes that possess additional regularity properties. Definition. If (T, d) is a metric space then a process Xt is called sample continuous if for all ω ∈ Ω, Xt (ω) ∈ C(T, d) - the space of continuous function on (T, d). The process Xt is called continuous in probability if Xt → Xt0 in probability whenever t → t0 . Example. Let T = [0, 1], (Ω, P) = ([0, 1], λ) where λ is the Lebesgue measure. Let Xt (ω) = I(t = ω) and Xt� (ω) = 0. F.d. distributions of these processes are the same because for any fixed t ∈ [0, 1], P(Xt = 0) = P(Xt� = 0) = 1. However, P(Xt is continuous) = 0 but for Xt� this probability is 1.

96

Definition. Let (T, d) be a metric space. The process Xt is measurable if Xt (ω) : T × Ω → R is jointly measurable on the product space (T, B) × (Ω, F), where B is the Borel σ-algebra on T. Lemma 44 If (T, d) is a separable metric space and Xt is sample continuous then Xt is measurable. Proof. Let (Sj )j≥1 be a measurable partition of T such that diam(Sj ) ≤ take a point tj ∈ Sj and define Xtn (ω) = Xtj (ω) for t ∈ Sj .

1 n.

For each non-empty Sj , let us

Xtn (ω) is, obviously, measurable on T × Ω because for any Borel set A on R, � � � � (ω, t) : Xtn (ω) ∈ A = ω : Xtj (ω) ∈ A × Sj . j≥1

Xt (w) is sample continuous and, therefore, Xtn (ω) → Xt (ω) for all (ω, t). Hence, X is also measurable. If (T, d) is a compact metric space and Xt is a sample continuous process indexed by T then we can think of Xt as an element of the metric space of continuous functions (C(T, d), || · ||∞ ), rather then simply an element of RT . We can define measurable events on this space in two different ways. On one hand, we have the natural Borel σ-algebra B on C(T ) generated by the open (or closed) balls Bg (ε) = {f ∈ C(T ) : ||f − g||∞ < ε}. On the other hand, if we think of C(T ) as a subspace of RT , we can consider a σ-algebra � � ST = B ∩ C(T ) : B ∈ BT which is the intersection of the cylindrical σ-algebra BT with C(T ). It turns out that these two definitions coincide. An important implication of this is that the law of any sample continuous process Xt on (T, d) is completely determined by its finite dimensional distributions. Lemma 45 If (T, d) is a separable metric space then B = ST . Proof. Let us first show that ST ⊆ B. Any element of the cylindrical algebra that generates the cylindrical σ-algebra BT is given by B × RT \F for a finite F ⊆ T and for some Borel set B ⊆ RF . Then

� � � � �� � B × RT \F C(T ) = x ∈ C(T ) : (xt )t∈F ∈ B = πF (x) ∈ B

where πF : C(T ) → RF is the finite dimensional projection such that πF (x) = (xt )t∈F . Projection πF is, measurable on the Borel σ-algebra B generated by obviously, continuous in the || · ||∞ norm and, therefore, � � open sets in the � · �∞ norm. This implies that πF (x) ∈ B ∈ B and, thus, ST ⊆ B. Let us now show that B ⊆ ST . Let T � be a countable dense subset of T . Then, by continuity, any closed ε-ball in C(T ) can be written as �� � � � f ∈ C(T ) : ||f − g||∞ ≤ ε = f ∈ C(T ) : |f (t) − g(t)| ≤ ε ∈ ST t∈T �

and this finished the proof.

In the remainder of the section we will define two specific sample continuous stochastic processes.

97

Brownian motion. Brownian motion is a sample continuous process Xt on T = R+ such that (a) the distribution of Xt is centered Gaussian for each t ≥ 0; (b) X0 = 0 and EX12 = 1; (c) if t < s then Xt and Xs − Xt are independent and L(Xs − Xt ) = L(Xs−t ). If we denote σ 2 (t) = Var(Xt ) then these properties imply �t� 1 σ 2 (nt) = nσ 2 (t), σ 2 = σ 2 (t) and σ 2 (qt) = qσ 2 (t) m m for all rational q. Since σ 2 (1) = 1, σ 2 (q) = q for all rational q and, by sample continuity, σ 2 (t) = t for all t ≥ 0. Therefore, for s < t, EXs Xt = EXs (Xs + (Xt − Xs )) = s = min(t, s). As a result, we can give an equivalent definition. Definition. Brownian motion is a sample continuous centered Gaussian process Xt for t ∈ [0, ∞) with the covariance Cov(Xt , Xs ) = min(t, s). Without the requirement of sample continuity, the existence of such process follows from Kolmogorov’s theorem since all finite dimensional distributions are consistent by construction. However, we still need to prove that there exists a sample continuous version of the process. We start with a simple estimate. Lemma 46 If f (c) = N (0, 1)(c, ∞) is the tail probability of the standard normal distribution then f (c) ≤ e− Proof. We have

for all c > 0.

� ∞ 1 x − x2 1 1 − c2 e dx ≤ √ e 2 dx = √ e 2. 2π c c 2π c c √ √ If c > 1/ 2π then f (c) ≤ exp(−c2 /2). If c ≤ 1/ 2π then a simpler estimate gives the result � 1 � 1 �2 � c2 1 f (c) ≤ f (0) = ≤ exp − √ ≤ e− 2 . 2 2 2π 1 f (c) = √ 2π





c2 2

2

− x2

Theorem 54 There exists a continuous version of the Brownian motion. Proof. It is obviously enough to define Xt on the interval [0, 1]. Given a process Xt that has f.d. distributions of the Brownian process but is not necessarily continuous, let us define for n ≥ 1, Vk = X k+1 −X n 2

k 2n

for k = 0, . . . , 2n − 1.

The variance Var(Vk ) = 1/2n and, by the above lemma, � � � 2n−1 � 1� 1� P max |Vk | ≥ 2 ≤ 2n P |V1 | ≥ 2 ≤ 2n+1 exp − 4 . k n n n The right hand side is summable over n ≥ 1 and, by the Borel-Cantelli lemma, �� � 1� P max |Vk | ≥ 2 i.o. = 0. (22.0.1) k n �∞ t �n t Given t ∈ [0, 1] and its dyadic decomposition t = j=1 2jj for tj ∈ {0, 1}, let us define t(n) = j=1 2jj so that Xt(n) − Xt(n−1) ∈ {0} ∪ {Vk : k = 0, . . . , 2n − 1}. Then, the sequence Xt(n) = 0 +



(Xt(j) − Xt(j−1) )

1≤j≤n

98

converges almost surely to some limit Zt because by (22.0.1) with probability one |Xt(n) − Xt(n−1) | ≤ n−2 for large enough (random) n ≥ n0 (ω). By construction, Zt = Xt on the dense subset of all dyadic t ∈ [0, 1]. If we can prove that Zt is sample continuous then all f.d. distributions of Zt and Xt will coincide, which means that Zt is a continuous version of the Brownian motion. Take any t, s ∈ [0, 1] such that |t − s| ≤ 2−n . If t(n) = 2kn and s(n) = 2mn , then |k − m| ∈ {0, 1}. As a result, |Xt(n) − Xs(n) | is either equal to 0 or one of the increments |Vk | and, by (22.0.1), |Xt(n) − Xs(n) | ≤ n−2 for large enough n. Finally, |Zt − Zs |

≤ ≤

|Zt − Xt(n) | + |Xt(n) − Xs(n) | + |Xs(n) − Zs | � 1 � 1 1 c + 2+ ≤ 2 2 l n l n l≥n

l≥n

which proves the continuity of Zt . On the event in (22.0.1) of probability zero we set Zt = 0. Definition. A sample continuous centered Gaussian process Bt for t ∈ [0, 1] is called a Brownian bridge if EBt Bs = s(1 − t) for s < t. Such process exists because if Xt is a Brownian motion then Bt = Xt − tX1 is a Brownian bridge, since for s < t, EBs Bt = E(Xt − tX1 )(Xs − sX1 ) = s − st − ts + st = s(1 − t). Notice that B0 = B1 = 0.

99

Section 23

Donsker Invariance Principle. In this section we show how the Brownian motion Wt arises in a classical central limit theorem on the space of continuous functions on R+ . When working with continuous processes defined on R+ , such as the Brownian motion, the metric � · �∞ on C(R+ ) is too strong. A more appropriate metric d can be defined by dn (f, g) = sup |f (t) − g(t)| and d(f, g) = 0≤t≤n

� 1 dn (f, g) . 2n 1 + dn (f, g)

n≥1

It is obvious that d(fj , f ) → 0 if and only if dn (fj , f ) → 0 for all n ≥ 1, i.e. d metrizes uniform convergence on compacts. (C(R+ ), d) is also a complete separable space, since any sequence is Cauchy in d if and only if it is Cauchy for each dn . When proving uniform tightness of laws on (C(R+ ), d), we will need a characterization of compacts via the Arzela-Ascoli theorem, which in this case can be formulated as follows. For a function x ∈ C[0, T ], its modulus of continuity is defined by � � mT (x, δ) = sup |xa − xb | : |a − b| ≤ δ, a, b ∈ [0, T ] . Theorem 55 (Arzela-Ascoli) A set K is compact in (C(R+ ), d) if and only if K is closed, uniformly bounded and equicontinuous on each interval [0, n]. In other words, sup |x0 | < ∞ and lim sup mT (x, δ) = 0 for all T > 0. δ →0 x∈K

x∈K

Here is the main result about the uniform tightness of laws on (C(R+ ), d), which is simply a translation of the Arzela-Ascoli theorem into probabilistic language. Theorem 56 The sequence of laws (Pn )n≥1 on (C(R+ ), d) is uniformly tight if and only if lim sup Pn (|x0 | > λ) = 0

λ→+∞ n≥1

(23.0.1)

and lim sup Pn (mT (x, δ) > ε) = 0 δ ↓0 n≥1

(23.0.2)

for any T > 0 and any ε > 0. Proof. =⇒ . For any γ > 0, there exists a compact K such that Pn (K) > 1 − γ for all n ≥ 1. By the Arzela-Ascoli theorem, |x0 | ≤ λ for some λ > 0 and for all x ∈ K and, therefore, sup Pn (|x0 | > λ) ≤ sup Pn (K c ) ≤ γ. n

n

Also, by equicontinuity, for any ε > 0 there exists δ0 > 0 such that for δ < δ0 and for all x ∈ K we have mT (x, δ) < ε. Therefore, sup Pn (mT (x, δ) > ε) ≤ sup Pn (K c ) ≤ γ. n

n

100

⇐= . Given γ > 0, find λT > 0 such that sup Pn (|x0 | > λT ) ≤ n

γ 2T +1

.

For each k ≥ 1, find δk > 0 such that � γ 1� ≤ T +k+1 . sup Pn mT (x, δk ) > 2 k n Define a set

� � 1 AT = x : |x0 | ≤ λT , mT (x, δk ) ≤ for all k ≥ 1 . k

Then for all n ≥ 1, Pn (AT ) ≥ 1 − By the Arzela-Ascoli theorem, the set A =



γ 2T +1



� k

γ γ =1− T. 2 2T +k+1

T ≥1 AT is compact on (C(R+ ), d) and for all n ≥ 1,

Pn (A) ≥ 1 −

� γ = 1 − γ. 2T

T ≥1

This proves that the sequence (Pn ) is uniformly tight. Of course, for the uniform tightness on (C[0, 1], � · �∞ ) we only need the second condition (23.0.2) for T = 1. Also, it will be convenient to slightly relax (23.0.2) and replace it with lim lim sup Pn (mT (x, δ) > ε) = 0.

δ →0 n→∞

(23.0.3)

Indeed, given γ > 0, find δ0 , n0 such that for δ < δ0 and n > n0 , Pn (mT (x, δ) > ε) < γ. For each n ≤ n0 we can find δn such that for δ < δn , Pn (mT (x, δ) > ε) < γ because mT (x, δ) → 0 as δ → 0 for all x ∈ C(R+ ). Therefore, if δ < min(δ0 , δ1 , . . . , δn0 ) then Pn (mT (x, δ) > ε) < γ for all n ≥ 1. Donsker invariance principle. Let us now give a classical example of convergence on (C(R+ ), d) to the Brownian motion Wt . Consider a sequence (Xi )i≥1 of i.i.d. random variables such that EXi = 0 and σ 2 = EXi2 < ∞. Let us consider a continuous partial sum process on [0, ∞) defined by � X�nt�+1 1 Wtn = √ Xi + (nt − �nt�) √ , nσ nσ i≤�nt�

where �nt� is the integer part of nt, �nt� ≤ nt < �nt� + 1. Since the last term in Wtn is of order n−1/2 , for simplicity of notations, we will simply write 1 � Wtn = √ Xi nσ i≤nt

and treat nt as an integer. By the central limit theorem, √ 1 � 1 � √ Xi = t √ Xi → N (0, t). nσ ntσ i≤nt

i≤nt

101

Given t < s, we can represent 1 Wsn = Wtn + √ nσ



Xi

nt
and since Wtn and Wsn − Wtn are independent, it should be obvious that the f.d. distributions of Wtn converge to the f.d. distributions of the Brownian motion Wt . By Lemma 45, this identifies Wt as the unique possible � � limit of Wtn and, if we can show that the sequence of laws (L(Wtn ))n≥1 is uniformly tight on C[0, ∞), d , Lemma 36 in Section 18 will imply that Wtn → Wt weakly. Since W0n = 0, we only need to show equicontinuity (23.0.3). Let us write the modulus of continuity as � 1 � � 1 � � � � � � � mT (W n , δ) = sup Xi � ≤ max Xi �. �√ �√ 0≤k≤nT,0<j≤nδ nσ nσ |t−s|≤δ,t,s∈[0,T ] ns
k
If instead of maximizing over all 0 ≤ k ≤ nT, we maximize over k = lnδ for 0 ≤ l ≤ m − 1, m := T /δ, i.e. in increments of nδ, then it is easy to check that the maximum will decrease by at most a factor of 3, because the second maximum over 0 < j ≤ nδ is taken over intervals of the same size nδ. As a consequence, if mT (W n , δ) > ε then one of the events � 1 � ε� � � � � max � √ Xi � > 0<j≤nδ 3 nσ lnδ
must occur for some 0 ≤ l ≤ m − 1. Since the number of these events is m = T /δ, � 1 � � ε� � � � � � P mT (W n , δ) > ε ≤ m P max � √ Xi � > . 0<j≤nδ 3 nσ 0
Kolmogorov’s inequality, Theorem 11 in Section 6, implies that if Sn = X1 + · · · + Xn and max P(|Sn − Sj | > α) ≤ p < 1

0<j≤n

then � � P max |Sj | > 2α ≤ 0<j≤n

√ If we take α = ε nσ/6 then, by Chebyshev’s inequality,

1 P(|Sn | > α). 1−p

�� � � 1 √ � 62 δnσ 2 P � Xi � > ε nσ ≤ 2 2 = 36δε−2 6 ε nσ j
−2

and, therefore, if 36δε < 1, � � � 1 √ � � � ε√ � � �−1 ��� � � � � P max � Xi � > ε nσ ≤ 1 − 36δε−2 P � Xi � > nσ . 0<j≤nδ 3 6 0
0
Finally, using (23.0.4) and the central limit theorem, � ε√ � �� � � �−1 � � lim sup P(mT (Wtn , δ) > ε) ≤ m 1 − 36δε−2 lim sup P � Xi � > nσ 6 n→∞ n→∞ 0 ε) = 0,

for all T > 0 and ε > 0 and, thus,

δ →0 n→∞ Wtn → Wt weakly

in (C[0, ∞), d).

102

(23.0.4)

Section 24

Empirical process and Kolmogorov’s chaining. Empirical process and the Kolmogorov-Smirnov test. In this sections we show how the Brownian bridge Bt arises in another central limit theorem on the space of continuous functions on [0, 1]. Let us start with a motivating example from statistics. Suppose that x1 , . . . , xn are i.i.d. �nuniform random variables on [0, 1]. By the law of large numbers, for any t ∈ [0, 1], the empirical c.d.f. n−1 i=1 I(xi ≤ t) converges to the true c.d.f. P(x1 ≤ t) = t almost surely and, moreover, by the CLT, Xtn =

n � √ �1 � n I(xi ≤ t) − t → N (0, t(1 − t)). n i=1

The stochastic process Xtn is called the empirical process. The covariance of this process, EXtn Xsn = E(I(x1 ≤ t) − t)(I(x1 ≤ s) − s) = s − ts − ts + ts = s(1 − t), is the same as the covariance of the Brownian bridge and, by the multivariate CLT, finite dimensional distributions of the empirical process converge to f.d. distributions of the Brownian bridge, � � � � L (Xtn )t∈F → L (Bt )t∈F . (24.0.1) However, we would like to show the convergence of Xtn to Bt in some stronger sense that would imply weak convergence of continuous functions of the process on the space (C[0, 1], � · �∞ ). The Kolmogorov-Smirnov test in statistics provides one possible motivation. �n Suppose that i.i.d. (Xi )i≥1 have continuous distribution with c.d.f. F (t) = P(X1 ≤ t). Let Fn (t) = n−1 i=1 I(Xi ≤ t) be the empirical c.d.f. It is easy to see the equality in distribution √ d sup n|Fn (t) − F (t)| = sup |Xtn | t∈R

t∈[0,1]

because F (Xi ) have uniform distribution on [0, 1]. In order to test whether (Xi )i≥1 come from the distribution with c.d.f. F, the statisticians need to know the distribution of the above supremum or, as approximation, the distribution of its limit. Equation (24.0.1) suggests that L(sup |Xtn |) → L(sup |Bt |). t

(24.0.2)

t

Since Bt is sample continuous, its distribution is the law on the metric space (C[0, 1], � · �∞ ). Even though

Xtn is not continuous, its jumps are of order n−1/2 so it has a ”close” continuous version Ytn . Since � · �∞ is a

continuous functional on C[0, 1], (24.0.2) would hold if we can prove weak convergence L(Ytn )→L(Bt ) on the space (C[0, 1], � · �∞ ). Lemma 36 in Section 18 shows that we only need to prove uniform tightness of L(Ytn ) 103

because, by Lemma 45, (24.0.1) already identifies the law of the Brownian motion as the unique possible limit. Thus, we need to address the question of uniform tightness of (L(Xtn )) on the complete separable space (C[0, 1], || · ||∞ ) or equivalently, by the result of the previous section, the equicontinuity of Xtn , � � lim lim sup P m(X n , δ) > ε = 0. δ →0 n→∞

By Chebyshev’s inequality, � � 1 P m(X n , δ) > ε ≤ Em(X n , δ) ε n and we need to learn how to control Em(X , δ). The modulus of continuity of X n can be written as n

m(X , δ) = sup |t−s|≤δ

|Xtn



Xsn |

= =

n �1 � � √ � � n sup � I(s < xi ≤ t) − (t − s)� n |t−s|≤δ i=1 n � � � √ �1 � n sup � f (xi ) − Ef �, f ∈F n i=1

where we introduced the class of functions � � F = f (x) = I(s < x ≤ t) : |t − s| < δ .

(24.0.3)

(24.0.4)

We will develop one approach to control the expectation of (24.0.3) for general classes of functions F and we will only use the specific definition (24.0.4) at the very end. This will be done in several steps. Symmetrization. At the first step, we will replace the empirical process (24.0.3) by a symmetrized version, called Rademacher process, that will be easier to control. Let x�1 , . . . , x�n be independent copies of x1 , . . . , xn and let ε1 , . . . , εn be i.i.d. Rademacher random variables, such that P(εi = 1) = P(εi = −1) = 1/2. Let us define n n 1� 1� Pn f = f (xi ) and P�n f = f (x�i ). n i=1 n i=1 Notice that EP�n f = Ef. Consider the random variables n �1 � � � � � � Z = sup �Pn f − Ef � and R = sup � εi f (xi )�. f ∈F f ∈F n i=1

Then, using Jensen’s inequality and then triangle inequality, we can write � � � � � � � � EZ = E sup �Pn f − Ef � = E sup �Pn f − EP�n f � f ∈F

f ∈F

n n �1 � � �1 � � � � � � ≤ E sup � (f (xi ) − f (x�i ))� = E sup � εi (f (xi ) − f (x�i ))� f ∈F n i=1 f ∈F n i=1 n n �1 � � �1 � � � � � � ≤ E sup � εi f (xi )� + E sup � εi f (x�i )� = 2ER. f ∈F n i=1 f ∈F n i=1

Equality in the second line holds because switching xi ↔ x�i arbitrarily does not change the expectation, so the equality holds for any fixed (εi ) and, therefore, for any random (εi ). �n Hoeffding’s inequality. The first step to control the supremum in R is to control the sum i=1 εi f (xi ) for a fixed function f. Consider an arbitrary sequence a1 , . . . , an ∈ R. Then the following holds. Theorem 57 (Hoeffding) For t ≥ 0, n �� � � � t2 P εi ai ≥ t ≤ exp − �n . 2 i=1 a2i i=1

104

Proof. Given λ > 0, by Chebyshev’s inequality, n n n �� � � � � � � � P εi ai ≥ t ≤ e−λt E exp λ εi ai = e−λt E exp λεi ai . i=1

i=1

Using the inequality (ex + e−x )/2 ≤ ex

2

/2

i=1

, we get

� λ2 a2 � � � eλai + e−λai i ≤ exp . E exp λεi ai = 2 2 Hence, P

n ��

n � � λ2 � 2 � εi ai ≥ t ≤ exp −λt + a 2 i=1 i i=1

and minimizing over λ > 0 gives the result. Covering numbers, Kolmogorov’s chaining and Dudley’s entropy integral. To control ER for general classes of functions F, we will need to use some measures of complexity of F. First, we will show how to control the Rademacher process R conditionally on x1 , . . . , xn . Definition. Suppose that (F, d) is a totally bounded metric space. For any u > 0, a u-packing number of F with respect to d is defined by � � D(F, u, d) = max card Fu ⊆ F : d(f, g) > u for all f, g ∈ Fu and a u-covering number is defined by � � N (D, u, d) = min card Fu ⊆ F : ∀f ∈ F ∃ g ∈ Fu such that d(f, g) ≤ u . � Both packing and covering numbers measure how many points are needed to approximate any element in the set F within distance u. It is a simple exercise to show that N (F, u, d) ≤ D(F, u, d) ≤ N (F, u/2, d) and, in this sense, packing and covering numbers are closely related. Let F be a subset of the cube [−1, 1]n equipped with a scaled Euclidean metric d(f, g) =

n �1/2 �1 � (fi − gi )2 . n i=1

Consider the following Rademacher process on F , n

1 � R(f ) = √ εi fi . n i=1 Then we have the following version of the classical Kolmogorov’s chaining lemma. Theorem 58 (Kolmogorov’s chaining) For any u > 0, � d(0,f ) � √ � P ∀f ∈ F, R(f ) ≤ 29/2 log1/2 D(F, ε, d)dε + 27/2 d(0, f ) u ≥ 1 − e−u . 0

Proof. Without loss of generality, assume that 0 ∈ F . Define a sequence of subsets {0} = F0 ⊆ F1 . . . ⊆ Fj ⊆ . . . ⊆ F such that Fj satisfies 105

1. ∀f, g ∈ Fj , d(f, g) > 2−j , 2. ∀f ∈ F we can find g ∈ Fj such that d(f, g) ≤ 2−j . F0 obviously satisfies these properties for j = 0. To construct Fj+1 given Fj : • Start with Fj+1 := Fj . • If possible, find f ∈ F such that d(f, g) > 2−(j+1) for all g ∈ Fj+1 . • Let Fj+1 := Fj+1 ∪ {f } and repeat until you cannot find such f . Define projection πj : F → Fj as follows: for f ∈ F find g ∈ Fj with d(f, g) ≤ 2−j and set πj (f ) = g. Any f ∈ F can be decomposed into the telescopic series f

= =

π0 (f ) + (π1 (f ) − π0 (f )) + (π2 (f ) − π1 (f )) + . . . ∞ � (πj (f ) − πj−1 (f )). j=1

Moreover, d(πj−1 (f ), πj (f )) ≤ d(πj−1 (f ), f ) + d(f, πj (f )) ≤ 2−(j−1) + 2−j = 3 · 2−j ≤ 2−j+2 . As a result, the jth term in the telescopic series for any f ∈ F belongs to a finite set of possible links � � Lj−1,j = f − g : f ∈ Fj , g ∈ Fj−1 , d(f, g) ≤ 2−j+2 . Since R(f ) is linear, R(f ) =

∞ �

R(πj (f ) − πj−1 (f )).

j=1

We first show how to control R on the set of all links. Assume that � ∈ Lj−1,j . By Hoeffding’s inequality, n � � � � � � 1 � t2 t2 P R(�) = √ εi �i ≥ t ≤ exp − −1 �n 2 ≤ exp − . 2 · 2−2j+4 n i=1 2n i=1 �i

If |F | denotes the cardinality of the set F then |Lj−1,j | ≤ |Fj−1 | · |Fj | ≤ |Fj |2 and, therefore, � � � P ∀� ∈ Lj−1,j , R(�) ≤ t ≥ 1 − |Fj |2 exp −

t2 2−2j+5



=1−

1 −u e |Fj |2

after making a change of variables � �1/2 √ t = 2−2j+5 (4 log |Fj | + u) ≤ 27/2 2−j log1/2 |Fj | + 25/2 2−j u. Hence,

� √ � P ∀� ∈ Lj−1,j , R(�) ≤ 27/2 2−j log1/2 |Fj | + 25/2 2−j u ≥ 1 − 106

1 −u e . |Fj |2

If Fj−1 = Fj then we can define πj−1 (f ) = πj (f ) and, since in this case Lj−1,j = {0}, there is no need to control these links. Therefore, we can assume that |Fj−1 | < |Fj | and taking a union bound for all steps, � √ � P ∀j ≥ 1 ∀� ∈ Lj−1,j , R(�) ≤ 27/2 2−j log1/2 |Fj | + 25/2 2−j u ≥1−

∞ � j=1

∞ � 1 −u 1 e ≥ 1 − e−u = 1 − (π 2 /6 − 1)e−u ≥ 1 − e−u . 2 |Fj |2 (j + 1) j=1

Given f ∈ F, let integer k be such that 2−(k+1) < d(0, f ) ≤ 2−k . Then in the above construction we can assume that π0 (f ) = . . . = πk (f ) = 0, i.e. we will project f on 0 if possible. Then with probability at least 1 − e−u , R(f )

=



∞ �

R(πj (f ) − πj−1 (f ))

j=k+1 ∞ � �

√ � 27/2 2−j log1/2 |Fj | + 25/2 2−j u

j=k+1 7/2

≤ 2

∞ �

√ 2−j log1/2 D(F, 2−j , d) + 25/2 2−k u.

j=k+1

Note that 2−k < 2d(f, 0) and 25/2 2−k < 27/2 d(f, 0). Finally, since packing numbers D(F, ε, d) are decreasing in ε, we can write (see figure 24) 9/2

2

∞ �

−(j+1)

2

log

1/2

−j

D(F, 2

, d) ≤ 2

9/2



2−(k+1)

log1/2 D(F, ε, d)dε

0

j=k+1

≤ 29/2



d(0,f )

log1/2 D(F, ε, d)dε

(24.0.5)

0

since 2−(k+1) < d(0, f ). This finishes the proof.

1/2

log D

!("+1)

!("+2)

2

2

The integral in (24.0.5) is called Dudley’s entropy integral. We would like to apply the bound of the above theorem to n � 1 � � √ � � nR = sup� √ εi f (xi )� n f ∈F i=1 for a class of functions F in (24.0.4). Suppose that x1 , . . . , xn ∈ [0, 1] are fixed and let �� � �� � �� F = fi 1≤i≤n = I s < xi ≤ t : |t − s| ≤ δ and t, s ∈ [0, 1] ⊆ {0, 1}n . 1≤i≤n

Then the following holds. 107

Lemma 47 N (F, u, d) ≤ Ku−4 for some absolute K > 0 independent of the points x1 , . . . , xn . Proof. We can assume that x1 ≤ . . . ≤ xn . Then the class F consists of all vectors of the type (0 . . . 1 . . . 1 . . . 0), i.e. the coordinates equal to 1 come in blocks. Given u, let Fu be a subset of such vectors with blocks of 1’s starting and ending at the coordinates k�nu�. Given any vector f ∈ F, let us approximate it by a vector in f � ∈ Fu by choosing the closest starting and ending coordinates for the blocks of 1’s. The number of different coordinates will be bounded by 2�nu� and, therefore, the distance between f and f � will be bounded by � √ d(f, f � ) ≤ 2n−1 �nu� ≤ 2u. √ −2 −2 The cardinality √ of Fu is, obviously, of order u . This proves that N (F, 2u, d) ≤ Ku . Making the change of variables 2u → u proves the result. To apply the Kolmogorov chaining bound to this class F let us make a simple observation that if a random 2 variable X ≥ 0 satisfies P(X ≥ a + bt) ≤ Ke−t for all t ≥ 0 then � ∞ � ∞ � ∞ t2 EX = P(X ≥ t)dt ≤ a + P(X ≥ a + t)dt ≤ a + K e− b2 dt ≤ a + Kb ≤ K(a + b). 0

0

0

Theorem 58 then implies that n � 1 � � �� Dn � � K � � Eε sup� √ εi fi � ≤ K log du + Dn u n i=1 F 0

(24.0.6)

where Eε is the expectation with respect to (εi ) only and Dn2 = sup d(0, f )2 F

n n 1� 1� f (xi )2 = sup I(s < xi ≤ t) F n i=1 |t−s|≤δ n i=1 � � n n �1 � � 1� � � = sup � I(xi ≤ t) − I(xi ≤ s)� . � n i=1 |t−s|≤δ � n i=1

=

sup

Since the integral on the right hand side of (24.0.6) is concave in Dn , by Jensen’s inequality, n � 1 � � �� EDn � � K � � E sup� √ εi f (xi )� ≤ K log du + EDn . u n i=1 F 0 By the symmetrization inequality, this finally proves that �� EDn � � K n Em(X , δ) ≤ K log du + EDn . u 0 The strong law of large numbers easily implies that � n � �1 � � � � sup � I(xi ≤ t) − t� → 0 a.s. � � n t∈[0,1] i=1 √ and, therefore, Dn2 → δ a.s. and EDn → δ. This implies that √ �� δ � √ � K n lim sup Em(Xt , δ) ≤ K log du + δ . u n→∞ 0 The right-hand side goes to zero as δ → 0 and this finishes the proof of equicontinuity of X n . As a result, for any continuous function Φ on (C[0, 1], � · �∞ ) the distibution of Φ(Xtn ) converges to the distribution of Φ(Bt ). For example, �1 � � √ � � n sup � I(xi ≤ t) − t� → sup |Bt | 0≤t≤1 n 0≤t≤1 in distribution. We will find the distribution of the right hand side in the next section. 108

Section 25

Markov property of Brownian motion. Reflection principles. We showed that the empirical process converges to Brownian bridge on (C([0, 1]), � · �∞ ). As a result, the distribution of a continuous function of the process will also converge, for example, sup |Xtn |→ sup |Bt |

0≤t≤1

0≤t≤1

weakly. We will compute the distribution of this supremum in Theorem 60 below but first we will start with a simpler example to illustrate the so called strong Markov property of the Brownian motion. Given a process Wt on (C[0, ∞), d), let Ft = σ(Ws ; s ≤ t). A random variable τ is called a stopping time if {τ ≤ t} ∈ Ft for all t ≥ 0. For example, a hitting time τc = inf{t > 0, Wt = c}, c > 0, is a stopping time because, by sample continuity, �

{τc ≤ t} = {Wr > q} q
where the intersection and union are over rational numbers q, r. If Wt is the Brownian motion then strong Markov property of Wt states, informally, that the increment process Wτ +t − Wτ after the stopping time is independent of the σ-algebra Fτ generated by Wt up to the stopping time τ and, moreover, Wτ +t − Wτ has the same distribution as Wt . This property is very similar to the property of stopping times for sums of i.i.d. random variables, in Section 7. However, to avoid subtle measure theoretic considerations, we will simply approximate arbitrary stopping times by dyadic stopping times for which Markov property can be used more directly, by summing over all possible values. If τ is a stopping time then τn =

�2n τ � + 1 2n

is also a stopping time. Indeed, if k k+1 k+1 ≤τ < then τn = n n 2 2 2n and, therefore, for any t ≥ 0, if

l 2n

≤t<

l+1 2n

then � � l {τn ≤ t} = τ < n = 2

109

q
{τ ≤ q} ∈ Ft .

By construction, τn ↓ τ and, by continuity, Wτn → Wτ a.s. Let us demonstrate how to use Markov property for these dyadic approximations in the computation of the following probability, P(sup Wt ≥ c) = P(τc ≤ b) t≤b

for c > 0. For dyadic approximation τn of τc , we can write indep. of Fk/2n

in Fk/2n

P(τn ≤ b, Wb − Wτn ≥ 0)

=

� �� � � �� � � �� P τn = k/2n ≤ b, Wb − Wk/2n ≥ 0 k≥0

=

� 1 1 � � k P τn = n ≤ b = P(τn ≤ b). 2 2 2 k≥0

Letting n → ∞ and applying the portmanteau theorem, P(τc ≤ b, Wb − Wτc ≥ 0) =

1 P(τc ≤ b), 2

since both sets {τc = b} and {Wb − Wτc = 0} are the sets of continuity because {τc = b} ⊆ {Wb = c} and {Wb − Wτc = 0} ⊆ {Wb = c} and P(Wb = c) = 0. Finally, this implies that P(Wb ≥ c) = P(τc ≤ b, Wb − Wτc ≥ 0) =

� 1 1 � P(τc ≤ b) = P sup Wt ≥ c 2 2 t≤b

and, therefore, � � � P sup Wt ≥ c = P(τc ≤ b) = 2N (0, b)(c, ∞) = 2 t≤b

The p.d.f. of τc satisfies

∞ √

c/ b

x2 1 √ e− 2 dx. 2π

(25.0.1)

c2 1 c fτc (b) = √ e− 2b · 3/2 = O(b−3/2 ) b 2π

as b → +∞, which means that Eτc = ∞. Reflection principles. If xt is the Brownian motion then yt = xt − tx1 is the Brownian bridge for t ∈ [0, 1]. The next lemma shows that we can think of the Brownian bridge as the Brownian motion conditioned to be equal to zero at time t = 1 (pinned down Brownian motion). Lemma 48 Conditional distribution of xt given |x1 | < ε converges to the law of yt , � � � L xt � |x1 | < ε → L(yt ) as ε ↓ 0. Proof. Notice that yt = xt − tx1 is independent of x1 because their covariance Eyt x1 = Ext x1 − tEx21 = t − t = 0. Therefore, the Brownian motion can be written as a sum xt = yt + tx1 of the Brownian bridge � �and inde­� pendent process tx1 . Therefore, if we define a random variable ηε with distribution L(ηε ) = L x1 �|x1 | < ε independent of yt then � � � L xt � |x1 | < ε = L(yt + tηε ) → L(yt ). as ε ↓ 0. 110

2b

A b!"

B !"

Figure 25.1: Reflecting the Brownian motion.

Theorem 59 If yt is the Brownian bridge then for all b > 0, � � 2 P sup yt ≥ b = e−2b . t∈[0,1]

Proof. Since yt = xt − tx1 and x1 are independent, we can write P(∃t : yt = b) =

P(∃t : xt − tx1 = b, |x1 | < ε) P(∃t : xt = b + tx1 , |x1 | < ε) = . P(|x1 | < ε) P(|x1 | < ε)

We can estimate the numerator from below and above by � � � � � � P ∃t : xt > b + ε, |x1 | < ε ≤ P ∃t : xt = b + tx1 , |x1 | < ε ≤ P ∃t : xt ≥ b − ε, |x1 | < ε . Let us first analyze the upper bound. If we define a hitting time τ = inf{t : xt = b − ε} then xτ = b − ε and � � � � � � P ∃t : xt ≥ b − ε, |x1 | < ε = P τ ≤ 1, |x1 | < ε = P τ ≤ 1, x1 − xτ ∈ (−b, −b + 2ε) . For dyadic approximation as above P(τn ≤ 1, x1 − xτn ∈ (−b, −b + 2ε)) = = =

� � � k P τn = n ≤ 1, x1 − xk/2n ∈ (−b, −b + 2ε) 2 k≥0 � � � � � k P τn = n ≤ 1 P x1 − xk/2n ∈ (−b, −b + 2ε) 2 k≥0 � � � � � k P τn = n ≤ 1 P x1 − xk/2n ∈ (b − 2ε, b) 2 k≥0

=

P(τn ≤ 1, x1 − xτn ∈ (b − 2ε, b))

where in the third line we used the fact that the distribution of x1 − xk/2n is symmetric around zero and, thus, we ”reflected” the Brownian motion after stopping time τ as in figure 25.1. Therefore, in the limit n → ∞ we get � � � � � � P ∃t : xt ≥ b − ε, |x1 | < ε = P τ ≤ 1, x1 − xτ ∈ (b − 2ε, b) = P x1 ∈ (2b − 3ε, 2b − ε) because the fact that x1 ∈ (2b − 3ε, 2b − ε) automatically implies that τ ≤ 1 for b > 0 and ε small enough. Finally, this proves that 2 P(x1 ∈ (2b − 3ε, 2b − ε)) → e−2b P(∃t : xt = b) ≤ P(x1 ∈ (−ε, ε)) as ε → 0. The lower bound can be analyzed similarly.

111

Theorem 60 (Kolmogorov-Smirnov) If yt is the Brownian bridge then for all b > 0, � � � 2 2 P sup |yt | ≥ b = 2 (−1)n−1 e−2n b . 0≤t≤1

n≥1

Proof. For n ≥ 1, consider an event � � An = ∃t1 < · · · < tn ≤ 1 : ytj = (−1)j−1 b and let τb and τ−b be the hitting times of b and −b. By symmetry of the distribution of the process yt , � � � � P sup |yt | ≥ b = P τb or τ−b ≤ 1 = 2P(A1 , τb < τ−b ). 0≤t≤1

Again, by symmetry, P(An , τb < τ−b ) = P(An ) − P(An , τ−b < τb ) = P(An ) − P(An+1 , τb < τ−b ). By induction, P(A1 , τb < τ−b ) = P(A1 ) − P(A2 ) + . . . + (−1)n−1 P(An , τb < τ−b ). As in Theorem 59, reflecting the Brownian motion each time we hit b or −b, one can show that � � P x1 ∈ (2nb − ε, 2nb + ε) 2 2 2 1 � � P(An ) = lim = e− 2 (2nb) = e−2n b ε→ 0 P x1 ∈ (−ε, ε) and this finishes the proof. Given a, b > 0, let us compute the probability that a Brownian bridge crosses one of the levels −a or b. Theorem 61 (Two-sided boundary) If a, b > 0 then � � �� 2 2 2 2 P(∃t : yt = −a or b) = e−2(na+(n+1)b) + e−2((n+1)a+nb) − 2e−2n (a+b) . n≥0

(25.0.2)

n≥1

Proof. We have P(∃t : yt = −a or b) = P(∃t : yt = −a, τ−a < τb ) + P(∃t : yt = b, τb < τ−a ). If we introduce the events � � Bn = ∃t1 < . . . < tn : yt1 = b, yt2 = −a, . . . and � � An = ∃t1 < . . . < tn : yt1 = −a, yt2 = b, . . . then, as in the previous theorem, P(Bn , τb < τ−a ) = P(Bn ) − P(Bn , τ−a < τb ) = P(Bn ) − P(An+1 , τ−a < τb ) and, similarly, P(An , τ−a < τb ) = P(An ) − P(Bn+1 , τb < τ−a ). By induction, P(∃t : yt = −a or b) =

∞ �

(−1)n−1 (P(An ) + P(Bn )).

n=1

Probabilities of the events An and Bn can be computed using the reflection principle as above, P(A2n ) = P(B2n ) = e−2n

2

(a+b)2

2

, P(B2n+1 ) = e−2(na+(n+1)b) , P(A2n+1 ) = e−2((n+1)a+nb)

and this finishes the proof. If X = − inf yt and Y = sup yt then the spread of the process yt is ξ = X + Y. 112

2

Theorem 62 (Distribution of the spread) For any t > 0, � � � 2 2 P ξ ≤t =1− (8n2 t2 − 2)e−2n t . n≥1

Proof. First of all, (25.0.2) gives the joint c.d.f. of (X, Y ) because F (a, b) = P(X < a, Y < b) = P(−a < inf yt , sup yt < b) = 1 − P(∃t : yt = −a or b). If f (a, b) = ∂ 2 F/∂a∂b is the joint p.d.f. of (X, Y ) then the c.d.f of the spread X + Y is � � P Y +X ≤t =

� t�

t−a

f (a, b) db da. 0 0

The inner integral is �

t−a

f (a, b)db = 0

∂F ∂F (a, t − a) − (a, 0). ∂a ∂a

Since ∂F (a, b) = ∂a +



� � 2 4n na + (n + 1)b e−2(na+(n+1)b)

n≥0



� � 2 4(n + 1) (n + 1)a + nb e−2((n+1)a+nb)

n≥0





8n2 (a + b) e−2n

2

(a+b)2

,

n≥1

plugging in the values b = t − a and b = 0 gives � t−a � � � 2 2 2 2 f (a, b)db = 4n((n + 1)t − a)e−2((n+1)t−a) + 4(n + 1)(nt + a)e−2(nt+a) − 8n2 te−2n t . 0

n≥0

n≥0

n≥1

Integrating over a ∈ [0, t], � � � � � � 2 2 2 2 2 2 P Y + X ≤ t = (2n + 1) e−2n t − e−2(n+1) t − 8n2 t2 e−2n t n≥0

=

1+2

n≥1



e

−2n2 t2



n≥1

� n≥1

and this finishes the proof.

113

2 2 −2n2 t2

8n t e

,

Section 26

Laws of Brownian motion at stopping times. Skorohod’s imbedding. Let Wt be the Brownian motion.

Theorem 63 If τ is a stopping time such that Eτ < ∞ then EWτ = 0 and EWτ2 = Eτ.

Proof. Let us start with the case when a stopping time τ takes finite number of values τ ∈ {t1 , . . . , tn }. If Ftj = σ{Wt ; t ≤ tj } then (Wtj , Ftj ) is a martingale since E(Wtj |Ftj−1 ) = E(Wtj − Wtj−1 + Wtj−1 |Ftj−1 ) = Wtj−1 . By optional stopping theorem for martingales, EWτ = EWt1 = 0. Next, let us prove that EWτ2 = Eτ by induction on n. If n = 1 then τ = t1 and EWτ2 = EWt21 = t1 = Eτ. To make an induction step from n − 1 to n, define a stopping time α = τ ∧ tn−1 and write EWτ2 = E(Wα + Wτ − Wα )2 = EWα2 + E(Wτ − Wα )2 + 2EWα (Wτ − Wα ). First of all, by induction assumption, EWα2 = Eα. Moreover, τ �= α only if τ = tn in which case α = tn−1 . The event {τ = tn } = {τ ≤ tn−1 }c ∈ Ftn−1 and, therefore, EWα (Wτ − Wα ) = EWtn−1 (Wtn − Wtn−1 )I(τ = tn ) = 0. Similarly, E(Wτ − Wα )2 = EE(I(τ = tn )(Wtn − Wtn−1 )2 |Ftn−1 ) = (tn − tn−1 )P(τ = tn ). Therefore, EWτ2 = Eα + (tn − tn−1 )P(τ = tn ) = Eτ and this finishes the proof of the induction step. Next, let us consider the case of a uniformly bounded stopping time τ ≤ M < ∞. In the previous lecture we defined a dyadic approximation τn =

�2n τ � + 1 2n

114

which is also a stopping time, τn ↓ τ, and by sample continuity Wτn → Wτ a.s. Since (τn ) are uniformly bounded, Eτn → Eτ. To prove that EWτ2n → EWτ2 we need to show that the sequence (Wτ2n ) is uniformly integrable. Notice that τn < 2M and, therefore, τn takes possible values of the type k/2n for k ≤ k0 = �2n (2M )�. Since the sequence � � W1/2n , . . . , Wk0 /2n , W2M is a martingale, adapted to a corresponding sequence of Ft , and τn and 2M are two stopping times such that τn < 2M, by Optional Stopping Theorem 31, Wτn = E(W2M |Fτn ). By Jensen’s inequality, 4 Wτ4n ≤ E(W24M |Fτn ), EWτ4n ≤ EW2M = 6M.

and uniform integrability follows by H¨older’s and Chebyshev’s inequalities, EWτ2n I(|Wτn | > N ) ≤ (EWτ4n )1/2 (P(|Wτn | > N ))1/2 ≤

6M →0 N2

as N → ∞, uniformly over n. This proves that EWτ2n → EWτ2 . Since τn takes finite number of values, by the previous case, EWτ2n = Eτn and letting n → ∞ proves EWτ2 = Eτ.

(26.0.1)

Before we consider the general case, let us notice that for two bounded stopping times τ ≤ ρ ≤ M one can similarly show that E(Wρ − Wτ )Wτ = 0. (26.0.2) Namely, one can approximate the stopping times by dyadic stopping times and using that by the optional stopping theorem (Wτn , Fτn ), (Wρn , Fρn ) is a martingale, � E(Wρn − Wτn )Wτn = EWτn E(Wρn |Fτn ) − Wτn ) = 0. Finally, we consider the general case. Let us define τ (n) = min(τ, n). For m ≤ n, τ (m) ≤ τ (n) and E(Wτ (n) − Wτ (m) )2 = EWτ2(n) − EWτ2(m) − 2EWτ (m) (Wτ (n) − Wτ (m) ) = Eτ (n) − Eτ (m) using (26.0.1), (26.0.2) and the fact that τ (n), τ (m) are bounded stopping times. Since τ (n) ↑ τ, Fatou’s lemma and the monotone convergence theorem imply E(Wτ − Wτ (m) )2 ≤ lim inf (Eτ (n) − Eτ (m)) = Eτ − Eτ (m). n→∞

Letting m → ∞ shows that lim E(Wτ − Wτ (m) )2 = 0

m→∞

which means that EWτ2(m) → EWτ2 . Since EWτ2(m) = Eτ (m) by the previous case and Eτ (m) → Eτ by the monotone convergence theorem, this implies that EWτ2 = Eτ. Theorem 64 (Skorohod’s imbedding) Let Y be a random variable such that EY = 0 and EY 2 < ∞. There exists a stopping time τ < ∞ such that L(Wτ ) = L(Y ). Proof. Let us start with the simplest case when Y takes only two values, Y ∈ {−a, b} for a, b > 0. The condition EY = 0 determines the distribution of Y, pb + (1 − p)(−a) = 0 and p =

a . a+b

(26.0.3)

Let τ = inf{t > 0, Wt = −a or b} be a hitting time of the two-sided boundary −a, b. The tail probability of τ can be bounded by � � P(τ > n) ≤ P |Wj+1 − Wj | < a + b, 0 ≤ j ≤ n − 1 = P(|W1 | < a + b)n = γ n . 115

Therefore, Eτ < ∞ and by the previous theorem, EWτ = 0. Since Wτ ∈ {−a, b} we must have L(Wτ ) = L(Y ). Let us now consider the general case. If µ is the law of Y, let us define Y by the identity Y = Y (x) = x on its sample probability space (R, B, µ). Let us construct a sequence of σ-algebras B1 ⊆ B2 ⊆ . . . ⊆ B as follows. Let B1 be generated by the set (−∞, 0), i.e. � � B1 = ∅, R, (−∞, 0), [0, +∞) . Given Bj , let us define Bj+1 by splitting each finite interval [c, d) ∈ Bj into two intervals [c, (c + d)/2) and [(c + d)/2, d) and splitting infinite interval (−∞, −j) into (−∞, −(j + 1)) and [−(j + 1), −j) and similarly splitting [j, +∞) into [j, j + 1) and [j + 1, ∞). Consider a right-closed martingale Yj = E(Y |Bj ).  It is almost obvious that B = σ( Bj ), which we leave as an exercise. Then, by the Levy martingale conver­ gence, Lemma 35, Yj → E(Y |B) = Y a.s. Since Yj is measurable on Bj , it must be constant on each simple set [c, d) ∈ Bj . If Yj (x) = y for x ∈ [c, d) then, since Yj = E(Y |Bj ), � yµ([c, d)) = EYj I[c,d) = EY I[c,d) = xdµ(x) [c,d)

and

1 y= µ([c, d))

� xdµ(x).

(26.0.4)

[c,d)

Since in the σ-algebra Bj+1 the interval [c, d) is split into two intervals, the random variable Yj+1 can take only two values, say y1 < y2 , on the interval [c, d) and, since (Yj , Bj ) is a martingale, E(Yj+1 |Bj ) − Yj = 0.

(26.0.5)

We will define stopping times τn such that L(Wτn ) = L(Yn ) iteratively as follows. Since Y1 takes only two values −a and b, let τ1 = inf{t > 0, Wt = −a or b} and we proved above that L(Wτ1 ) = L(Y1 ). Given τj define τj+1 as follows: if Wτj = y for y in (26.0.4) then τj+1 = inf{t > τj , Wt = y1 or y2 }. Let us explain why L(Wτj ) = L(Yj ). First of all, by construction, Wτj takes the same values as Yj . If Cj is the σ-algebra generated by the disjoint sets {Wτj = y} for y as in (26.0.4), i.e. for possible values of Yj , then Wτj is Cj measurable, Cj ⊆ Cj+1 , Cj ⊆ Fτj and at each step simple sets in Cj are split in two, {Wτj = y} = {Wτj+1 = y1 } ∪ {Wτj+1 = y2 }. By Markov’s property of the Brownian motion and Theorem 63, E(Wτj+1 − Wτj |Fτj ) = 0 and, therefore, E(Wτj+1 |Cj ) − Wτj = 0. Since on each simple set {Wτj = y} in Cj , the random variable Wτj+1 takes only two values y1 and y2 , this equation allows us to compute the probabilities of these simple sets recursively as in (26.0.3), P(Wτj+1 = y2 ) =

y2 − y P(Wτj = y). y2 − y1

By (26.0.5), Yj ’s satisfy the same recursive equations and this proves that L(Wτn ) = L(Yn ). The sequence 116

τn is monotone, so it converges τn ↑ τ to some stopping time τ. Since Eτn = EWτ2n = EYn2 ≤ EY 2 < ∞, we have Eτ = lim Eτn ≤ EY 2 < ∞ and, therefore, τ < ∞ a.s. Then Wτn → Wτ a.s. by sample continuity and since L(Wτn ) = L(Yn ) → L(Y ), this proves that L(Wτ ) = L(Y ).

117

Section 27

Laws of the Iterated Logarithm. For convenience of notations let us denote �(t) = log log t. Theorem 65 (LIL) Let Wt be the Brownian motion and u(t) = lim sup t→∞



2t�(t). Then

Wt = 1. u(t)

Let us briefly describe the main idea that gives origin to the function u(t). For a > 1, consider a geometric sequence t = ak and take a look at the probabilities of the following events � � �W k Lu(ak ) � P Wak ≥ Lu(ak ) = P √ a ≥ √ ∼ ak ak ∼

� 1 L2 2ak �(ak ) � 1 1 � √ exp − 2 ak 2π L 2�(ak ) 2 � � L 1 1 1 √ � . 2π L 2�(ak ) k log a

(27.0.1)

This series will converge or diverge depending on whether L > 1 or L < 1. Even though these events are not independent in some sense they are ”almost independent” and the Borel-Cantelli lemma would imply that the upper limit of Wak behaves like u(ak ). Some technical work will complete this main idea. Let us start with the following. Lemma 49 For any ε > 0, � � |W − W | √ t s lim sup sup : s ≤ t ≤ (1 + ε)s ≤ 4 ε a.s. u(s) s→∞ Proof. Let ε, α > 0, tk = (1 + ε)k and Mk = αu(tk ). By symmetry, (25.0.1) and the Gaussian tail estimate in Lemma 46 � � � � P sup |Wt − Wtk | ≥ Mk ≤ 2P sup Wt ≥ Mk

tk ≤t≤tk+1

0≤t≤tk+1 −tk

= ≤

� 1 � Mk2 4N (0, tk+1 − tk )(Mk , ∞) ≤ 4 exp − 2 (tk+1 − tk ) � α2 2t �(t ) � � � αε2 1 k k 4 exp − =4 . k log(1 + ε) 2εtk

If α2 > ε, the sum of these probabilities converges and by the Borel-Cantelli lemma, for large enough k, sup tk ≤t≤tk+1

|Wt − Wtk | ≤ αu(tk ).

118

It is easy to see that for small enough ε, u(tk+1 )/u(tk ) < 1 + ε ≤ 2. If k is such that tk ≤ s ≤ tk+1 then, clearly, tk ≤ s ≤ t ≤ tk+2 and, therefore, for large enough k, |Wt − Ws | ≤ 2αu(tk ) + αu(tk+1 ) ≤ (2α + α(1 + ε))u(s) ≤ 4αu(s). Letting α →



ε over some sequence finishes the proof.

Proof of Theorem 65. For L = 1 + γ > 1, (27.0.1) and the Borel-Cantelli lemma imply that Wtk ≤ (1 + γ)u(tk ) for large enough k. If tk = (1 + ε)k then Lemma 49 implies that with probability one for large enough t (if tk ≤ t < tk+1 ) √ Wt Wtk u(tk ) Wt − Wtk u(tk ) = + ≤ (1 + γ) + 4 ε. u(t) u(tk ) u(t) u(tk ) u(t) Letting ε, γ → 0 over some sequences proves that with probability one lim sup t→∞

Wt ≤ 1.

u(t)

To prove that upper limit is equal to one we will use the Borel-Cantelli lemma for independent increments Wak − Wak−1 for large values of the parameter a > 1. If 0 < γ < 1 then, similarly to (27.0.1), � � � �(1−γ)2 1 1 1 � P Wak − Wak−1 ≥ (1 − γ)u(ak − ak−1 ) ∼ √ . 2π (1 − γ) 2�(ak − ak−1 ) log(ak − ak−1 ) The series diverges and, since these events are independent, they occur infinitely often with probability one. We already proved (by (27.0.1)) that for ε > 0 for large enough k, Wak /u(ak ) ≤ 1 + ε and, therefore, by symmetry Wak /u(ak ) ≥ −(1 + ε). This gives Wak u(ak )

u(ak − ak−1 ) Wak−1

+ u(ak ) u(ak )

u(ak − ak−1 ) u(ak−1 )

− (1 − γ) (1 + ε) u(ak ) u(ak )

� � (ak − ak−1 )�(ak − ak−1 ) ak−1 �(ak−1 ) (1 − γ) − (1 + ε) ak �(ak ) ak �(ak )

≥ (1 − γ) ≥ =

and Wt W k lim sup ≥ lim sup ak ≥ (1 − γ) t→∞ u(t) k→∞ u(a )

��

1� 1− − (1 + ε) a



1 . a

Letting γ → 0 and a → ∞ over some sequences proves that the upper limit is equal to one. The LIL for Brownian motion will imply the LIL for sums of independent random variables via Skorohod’s imbedding. Theorem 66 Suppose that Y1 , . . . , Yn are i.i.d. and EYi = 0, EYi2 = 1. If Sn = Y1 + . . . + Yn then lim sup √ n→∞

Sn = 1 a.s. 2n log log n L

Proof. Let us define a stopping time τ (1) such that Wτ (1) = Y1 . By Markov property, the increment of the process after stopping time is independent of the process before stopping time and has the law of the L Brownian motion. Therefore, we can define τ (2) such that Wτ (1)+τ (2) − Wτ (1) = Y2 and, by independence, 119

L

Wτ (1)+τ (2) = Y1 + Y2 and τ (1), τ (2) are i.i.d. By induction, we can define i.i.d. τ (1), . . . , τ (n) such that L

Sn = WT (n) where T (n) = τ (1) + . . . + τ (n). We have WT (n) − Wn Sn L WT (n) Wn = = + . u(n) u(n) u(n) u(n) By the LIL for the Brownian motion, lim sup n→∞

Wn = 1. u(n)

By the strong law of large numbers, T (n)/n → Eτ (1) = EY12 = 1 a.s. For any ε > 0, Lemma 49 implies that for large n √ |WT (n) − Wn | ≤4 ε u(n) and letting ε → 0 finishes the proof. LIL for Brownian motion also implies a local LIL: Wt lim sup � = 1. t→0 2t�(1/t) It is easy to check that if Wt is a Brownian motion then tW1/t is also the Brownian motion and the result follows by a change of variable t → 1/t. To check that tW1/t is a Brownian motion notice that for t < s, � � 1 1 EtW1/t sW1/s − tW1/t = st − t2 = t − t = 0 s t and

� �2 E tW1/t − sW1/s = t + s − 2t = s − t.

120

Related Documents