1 Expected Values of Random Variables

Often it is convenient to describe a random variable using some location measure. The most important location measure is the expected value (a.k.a. the mean or the weighted average). For a discrete random variable X its expected value, denoted E[X], is given by
\[ E[X] = \sum_k x_k \, p(x_k). \]
For a continuous random variable,
\[ E[X] = \int_{-\infty}^{\infty} x f(x) \, dx. \]
Remark: Strictly speaking, E[X] exists only if both E[X^+] and E[X^-] exist and are not both \(\infty\), where X^+ = max(X, 0) and X^- = max(-X, 0).

Example: Suppose X takes values {0, 1, 2, 3} with probabilities {1/8, 3/8, 3/8, 1/8}. Then E[X] = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 12/8 = 1.5.

Remark: E[X] need not be a possible outcome of X.

Frequency, or long-run average, interpretation of the expected value:

Example: Suppose a random variable X represents the profit associated with the production of some item that can be defective or non-defective. Suppose that the profit is −2 when the item is defective and 10 when the item is non-defective. Finally, assume that p(−2) = 0.1 and p(10) = 0.9. Then E[X] = −2(0.1) + 10(0.9) = 8.8. Suppose that a very large number n of items are produced, and let n(G) be the number of good items and n(D) the number of defective items. Then the average profit per item is
\[ -2 \, \frac{n(D)}{n} + 10 \, \frac{n(G)}{n}. \]
The frequency interpretation is that n(G)/n converges, in a sense to be defined later, to p(10) = 0.9. This convergence is known as the Law of Large Numbers.
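The long-run average interpretation can be checked by simulation. The sketch below draws a large number of independent profits from the pmf of the example above and compares the sample average with E[X] = 8.8 (the seed and sample size are arbitrary choices, not from the text):

```python
import random

# Monte Carlo check of the frequency interpretation for the profit
# example: profit is -2 with probability 0.1 and 10 with probability 0.9.
random.seed(0)

n = 100_000
profits = [10 if random.random() < 0.9 else -2 for _ in range(n)]

average_profit = sum(profits) / n
expected_value = -2 * 0.1 + 10 * 0.9   # = 8.8

print(average_profit)   # close to 8.8 for large n
print(expected_value)
```

For n this large the sample average typically lands within a few hundredths of 8.8, illustrating the convergence the Law of Large Numbers makes precise.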
1.1 Markov's Inequality
Proposition: Suppose X is a nonnegative random variable and c > 0. Then
\[ P(X \ge c) \le \frac{E[X]}{c}. \]
Notice that the inequality is non-trivial if c > E[X].

Proof (discrete case):
\[ E[X] = \sum_{k \ge 0} k \, p(k) \ge \sum_{k \ge c} k \, p(k) \ge c \sum_{k \ge c} p(k) = c \, P(X \ge c). \]
Alternative form: taking c = kE[X],
\[ P(X \ge kE[X]) \le \frac{1}{k}, \]
which is relevant for k ≥ 1.

Example: The expected life L of a machine is 10 years. Then
\[ P(L \ge 20) \le \frac{10}{20} = 0.5. \]
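Markov's inequality can be verified exactly on the small pmf from the first example ({0, 1, 2, 3} with probabilities {1/8, 3/8, 3/8, 1/8}); the loop over thresholds is my own illustration:

```python
from fractions import Fraction

# Exact check of Markov's inequality P(X >= c) <= E[X]/c on the
# {0,1,2,3} pmf from the expected-value example (E[X] = 3/2).
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

EX = sum(x * p for x, p in pmf.items())          # 3/2
for c in (1, 2, 3):
    tail = sum(p for x, p in pmf.items() if x >= c)
    assert tail <= EX / c
    print(c, tail, EX / c)
```

Note that for c = 1 the bound E[X]/c = 3/2 exceeds 1 and is trivially true, matching the remark that the inequality is only informative when c > E[X].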
1.2 Expected Value of Functions of Random Variables
Suppose X is a discrete random variable and G is a function of X. Then by definition
\[ E[G(X)] = \sum_k g_k \, p_G(g_k), \]
where p_G is the probability mass function of the random variable G(X). This requires computing p_G, the probability mass function of G(X).

Example: Suppose X takes values {−2, −1, 0, 1, 2} with probabilities {1/8, 1/4, 1/4, 1/4, 1/8} and Y = X^2. Then Y takes values {0, 1, 4} with probabilities {1/4, 1/2, 1/4}. Therefore E[Y] = E[X^2] = 0(1/4) + 1(1/2) + 4(1/4) = 1.5. Is there an easier way?

Proposition: If X is a discrete random variable and G is a function of X, then
\[ E[G(X)] = \sum_k G(x_k) \, p(x_k), \tag{1} \]
where p(x_k) = P(X = x_k).

Example: Applying formula (1),
\[ E[X^2] = (-2)^2 (1/8) + (-1)^2 (1/4) + 0^2 (1/4) + 1^2 (1/4) + 2^2 (1/8) = 4/8 + 1/4 + 0 + 1/4 + 4/8 = 1.5. \]
A similar formula exists for continuous random variables:
\[ E[G(X)] = \int_{-\infty}^{\infty} g(x) f(x) \, dx. \]
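The Y = X^2 example can be computed both ways to confirm they agree; the helper names (`pmf_X`, `pmf_G`) are mine, not from the text:

```python
from fractions import Fraction
from collections import defaultdict

# E[X^2] computed two ways: via the pmf of Y = G(X), and directly
# via formula (1) without building the intermediate pmf.
pmf_X = {-2: Fraction(1, 8), -1: Fraction(1, 4), 0: Fraction(1, 4),
         1: Fraction(1, 4), 2: Fraction(1, 8)}
G = lambda x: x * x

# Way 1: build p_G, the pmf of G(X), then apply the definition.
pmf_G = defaultdict(Fraction)
for x, p in pmf_X.items():
    pmf_G[G(x)] += p
E_way1 = sum(g * p for g, p in pmf_G.items())

# Way 2: formula (1), summing G(x) p(x) over the values of X.
E_way2 = sum(G(x) * p for x, p in pmf_X.items())

print(E_way1, E_way2)   # both 3/2
```

Both routes give 3/2, and the intermediate pmf {0: 1/4, 1: 1/2, 4: 1/4} matches the one stated in the example.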
1.3 Variance
Consider G(X) = (X − E[X])^2. Notice that G(X) is a random variable, so we can calculate its expected value.

Definition: E[(X − E[X])^2] is known as the variance of X and is written Var[X] or σ^2. For discrete random variables we have
\[ \mathrm{Var}[X] = E[G(X)] = \sum_k (x_k - E[X])^2 \, p(x_k), \]
and for continuous random variables we have
\[ \mathrm{Var}[X] = E[G(X)] = \int_{-\infty}^{\infty} (x - E[X])^2 f(x) \, dx. \tag{2} \]
Note that σ = +\(\sqrt{\mathrm{Var}[X]}\) is called the standard deviation of X.

Variance as a measure of the spread of a distribution:

Example: p(a) = p(−a) = 0.5. Then E[X] = 0, Var[X] = a^2, and σ_X = a. In this case the standard deviation gives us a good idea of how the distribution is spread around the mean.

Example: p(0) = 1 − 1/a, p(a) = 1/a. Then E[X] = a(1/a) = 1 and E[X^2] = a^2(1/a) = a, so Var[X] = a − 1. In this example the variance increases to infinity while the random variable becomes more and more concentrated at zero.
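The second example can be checked numerically for a few values of a; the helper `mean_and_var` is my own name for the discrete formulas above:

```python
from fractions import Fraction

# The spread example p(0) = 1 - 1/a, p(a) = 1/a: as a grows, the mass
# concentrates at 0 while the variance a - 1 grows without bound.
def mean_and_var(pmf):
    m = sum(x * p for x, p in pmf.items())
    return m, sum((x - m) ** 2 * p for x, p in pmf.items())

for a in (2, 10, 100):
    pmf = {0: 1 - Fraction(1, a), a: Fraction(1, a)}
    m, v = mean_and_var(pmf)
    print(a, m, v)   # mean is always 1, variance is a - 1
```

This makes the cautionary point concrete: a large variance here reflects one remote value of ever-smaller probability, not a widely spread distribution.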
2 Useful properties of Expected Values

• E[aX + b] = aE[X] + b
• Var[aX + b] = a^2 Var[X]
• Var[X] = E[X^2] − (E[X])^2.
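All three identities can be verified exactly on a small pmf; the choice a = 3, b = −2 and the pmf (reused from the first example) are illustrative:

```python
from fractions import Fraction

# Exact check of the three listed identities on the {0,1,2,3} pmf.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
E = lambda g: sum(g(x) * p for x, p in pmf.items())

a, b = Fraction(3), Fraction(-2)
EX = E(lambda x: x)                      # 3/2
var = E(lambda x: (x - EX) ** 2)         # 3/4

# E[aX + b] = a E[X] + b
assert E(lambda x: a * x + b) == a * EX + b
# Var[aX + b] = a^2 Var[X]
EY = E(lambda x: a * x + b)
assert E(lambda x: (a * x + b - EY) ** 2) == a ** 2 * var
# Var[X] = E[X^2] - (E[X])^2
assert var == E(lambda x: x * x) - EX ** 2
print("all three identities hold:", EX, var)
```

Using exact rationals avoids the floating-point round-off that could otherwise make equality checks fail.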
2.1 Tchebychev's Inequality
Proposition: Suppose µ = E[X] and σ^2 = Var[X] exist. Then
\[ P(|X - \mu| \ge a) \le \frac{\sigma^2}{a^2}. \]
Proof: Let Y = (X − µ)^2. Then Y ≥ 0, so Markov's inequality applies and
\[ P((X - \mu)^2 \ge a^2) \le \frac{E[(X - \mu)^2]}{a^2} = \frac{\sigma^2}{a^2}. \]
The result follows since P((X − µ)^2 ≥ a^2) = P(|X − µ| ≥ a).

Special case: taking a = kσ,
\[ P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}. \]
One-sided Tchebychev's inequality:
\[ P(X - \mu \ge k\sigma) \le \frac{1}{k^2 + 1} \quad \text{for all } k > 0, \]
and
\[ P(X - \mu \le -k\sigma) \le \frac{1}{k^2 + 1} \quad \text{for all } k > 0. \]
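The two-sided bound can be checked exactly on the same small pmf (µ = 3/2, σ^2 = 3/4); the threshold a = 3/2 is an arbitrary illustrative choice:

```python
from fractions import Fraction

# Exact two-sided Tchebychev check on the {0,1,2,3} pmf.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
mu = sum(x * p for x, p in pmf.items())                    # 3/2
var = sum((x - mu) ** 2 * p for x, p in pmf.items())       # 3/4

a = Fraction(3, 2)
tail = sum(p for x, p in pmf.items() if abs(x - mu) >= a)
print(tail, var / a ** 2)   # 1/4 <= 1/3
assert tail <= var / a ** 2
```

Here the true tail probability is 1/4 against a bound of 1/3; Tchebychev is valid but, as is typical, not tight.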
3 Expectations of Functions of Random Variables
Let X and Y be two random variables, and let Z = g(X, Y) where g is a real-valued function. Then Z is also a random variable and we may be interested in computing E[Z]. In principle we can do this by first computing the probability mass function (in the discrete case) or the density function (in the continuous case) of Z and then using the definition of expectation. Fortunately, there is no need to do this, as the following formulas allow a direct calculation of E[Z]. In the discrete case,
\[ E[g(X, Y)] = \sum_x \sum_y g(x, y) \, p_{X,Y}(x, y). \]
In the continuous case,
\[ E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) \, f_{X,Y}(x, y) \, dx \, dy. \]
If g(X, Y) = aX + bY, the above formulas lead to E[aX + bY] = aE[X] + bE[Y].
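The discrete formula and the linearity consequence can be illustrated on a small joint pmf; the uniform joint distribution and the constants a = 2, b = 5 are my own example, not from the text:

```python
from fractions import Fraction

# Two-variable discrete expectation: X, Y are indicators of heads on
# two independent fair coin tosses (joint pmf uniform on {0,1}^2).
joint = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}

def E(g):
    # Double sum of g(x, y) p_{X,Y}(x, y) over the joint support.
    return sum(g(x, y) * p for (x, y), p in joint.items())

# Check E[aX + bY] = a E[X] + b E[Y].
a, b = 2, 5
lhs = E(lambda x, y: a * x + b * y)
rhs = a * E(lambda x, y: x) + b * E(lambda x, y: y)
print(lhs, rhs)   # both 7/2
```

Note that linearity holds regardless of any dependence between X and Y; independence plays no role in this identity.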
3.1 Covariance and Independence
One very important function is g(X, Y) = (X − E[X])(Y − E[Y]), the product of the deviations of the random variables from their means. The expectation of this random variable is called the covariance of X and Y and is denoted
\[ \mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]. \]
It can be shown that |Cov(X, Y)| ≤ σ_x σ_y. The quantity
\[ \rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} \]
is known as the correlation coefficient. Notice that |ρ| ≤ 1.

Example: If X is the number of heads in the first two tosses of a fair coin and Y is the number of heads in the first three tosses, then E[X] = 1, E[Y] = 1.5, Var[X] = 0.5, Var[Y] = 0.75, Cov(X, Y) = 0.5, and ρ ≈ 0.82. It is easy to see that Cov(aX + bY, X) = a Cov(X, X) + b Cov(Y, X).
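The numbers in the coin example can be reproduced by enumerating the eight equally likely toss sequences; variable names here are mine:

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# X = heads in the first two tosses, Y = heads in the first three
# tosses of a fair coin; enumerate all 8 outcomes, each of prob. 1/8.
outcomes = list(product((0, 1), repeat=3))
p = Fraction(1, 8)

EX = sum((t[0] + t[1]) * p for t in outcomes)            # 1
EY = sum((t[0] + t[1] + t[2]) * p for t in outcomes)     # 3/2
cov = sum((t[0] + t[1] - EX) * (t[0] + t[1] + t[2] - EY) * p
          for t in outcomes)                             # 1/2
var_X = sum((t[0] + t[1] - EX) ** 2 * p for t in outcomes)
var_Y = sum((t[0] + t[1] + t[2] - EY) ** 2 * p for t in outcomes)

rho = float(cov) / sqrt(float(var_X) * float(var_Y))
print(EX, EY, var_X, var_Y, cov, round(rho, 2))
```

The enumeration confirms E[X] = 1, E[Y] = 1.5, Var[X] = 0.5, Var[Y] = 0.75, Cov(X, Y) = 0.5 and ρ ≈ 0.82 as stated.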
3.2 Independent Random Variables
Earlier we defined conditional probabilities p(x|y) = p(x, y)/p_Y(y) in the discrete case and f(x|y) = f(x, y)/f_Y(y) in the continuous case. Just as we defined independent events, we can define independent random variables as those for which the conditional probability is equal to the unconditional probability. In other words, knowing Y = y does not change the distribution of X, so p(x|y) = p(x) for all y such that p_Y(y) > 0 and, similarly, f(x|y) = f(x) for all y such that f_Y(y) > 0. Just as we did with events, we can use a symmetric definition and say that random variables X and Y are independent if p(x, y) = p_X(x)p_Y(y) in the discrete case and f(x, y) = f_X(x)f_Y(y) in the continuous case. The above seems to indicate that we need a separate definition of independence for discrete and continuous random variables, but this is not so, as the following definition encompasses both cases: X and Y are said to be independent if
\[ F(x, y) = F_X(x) F_Y(y), \]
where F(x, y) = P(X ≤ x, Y ≤ y) is the joint cumulative distribution function and F_X(x) = P(X ≤ x) and F_Y(y) = P(Y ≤ y) are the marginal cumulative distribution functions.

Independence is a very important concept. It is easy to see that independent random variables are uncorrelated, i.e., have zero correlation. To see this, notice that
\[
\begin{aligned}
\mathrm{Cov}(X, Y) &= \sum_x \sum_y (x - E[X])(y - E[Y]) \, p(x, y) \\
&= \sum_x \sum_y (x - E[X])(y - E[Y]) \, p_X(x) \, p_Y(y) \\
&= \sum_x (x - E[X]) p_X(x) \sum_y (y - E[Y]) p_Y(y) \\
&= (E[X] - E[X])(E[Y] - E[Y]) = 0.
\end{aligned}
\]
However, uncorrelated random variables are not necessarily independent, as the following example shows.

Example: Toss two coins. Let X_i be the number of heads in the i-th toss. Let D = X_1 − X_2 and S = X_1 + X_2. Are D and S independent? No, because if D = 0 then S ≠ 1. Notice that E[D] = 0; consequently Cov(S, D) = E[SD] − E[S]E[D] = E[SD]. Now SD = X_1^2 − X_2^2 = X_1 − X_2 = D (since X_i^2 = X_i for 0–1 variables), so E[SD] = E[D] = 0 and S and D are uncorrelated.
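The two claims of the example, zero covariance but no independence, can both be verified by enumerating the four equally likely outcomes; the probability comparison at the end is my way of exhibiting the dependence:

```python
from fractions import Fraction
from itertools import product

# S = X1 + X2 and D = X1 - X2 for two fair coin tosses:
# uncorrelated, yet dependent (D = 0 forces S != 1).
p = Fraction(1, 4)
pairs = list(product((0, 1), repeat=2))

ES = sum((x1 + x2) * p for x1, x2 in pairs)
ED = sum((x1 - x2) * p for x1, x2 in pairs)
cov = sum((x1 + x2 - ES) * (x1 - x2 - ED) * p for x1, x2 in pairs)
print(cov)                      # 0: uncorrelated

# Dependence: P(S = 1, D = 0) differs from P(S = 1) P(D = 0).
p_s1_d0 = sum(p for x1, x2 in pairs if x1 + x2 == 1 and x1 - x2 == 0)
p_s1 = sum(p for x1, x2 in pairs if x1 + x2 == 1)
p_d0 = sum(p for x1, x2 in pairs if x1 - x2 == 0)
print(p_s1_d0, p_s1 * p_d0)    # 0 vs 1/4: not independent
```

The joint probability P(S = 1, D = 0) is 0 while the product of the marginals is 1/4, so the factorization required for independence fails even though the covariance vanishes.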
4 Variance of Linear Combinations of Random Variables
Notice that when Y is replaced by X in the covariance we get the expected squared deviation of X from its mean; that is, Cov(X, X) = Var[X]. The variance of aX + bY can be written as
\[ \mathrm{Var}(aX + bY) = \mathrm{Cov}(aX + bY, aX + bY) = a^2 \mathrm{Var}(X) + 2ab \, \mathrm{Cov}(X, Y) + b^2 \mathrm{Var}(Y). \]
Example:
\[ \mathrm{Var}[X - Y] = \mathrm{Var}[X] - 2 \, \mathrm{Cov}(X, Y) + \mathrm{Var}[Y] = \sigma_x^2 - 2\rho\sigma_x\sigma_y + \sigma_y^2. \]
Example: What if σ_x = σ_y = σ? Then Var[X − Y] = 2σ^2(1 − ρ).

Example: What if, in addition, ρ = 1? Then Var[X − Y] = 0, so X − Y is deterministic, i.e., Y = a + X for some constant a.
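The variance formula can be checked on the earlier coin example (X = heads in the first two tosses, Y = heads in the first three); the choice a = 1, b = −1 reproduces Var[X − Y], and the helper `E` is my own shorthand:

```python
from fractions import Fraction
from itertools import product

# Check Var(aX + bY) = a^2 Var X + 2ab Cov(X, Y) + b^2 Var Y on the
# coin example, by direct enumeration of the 8 toss sequences.
p = Fraction(1, 8)
outcomes = list(product((0, 1), repeat=3))

def E(g):
    return sum(g(t[0] + t[1], t[0] + t[1] + t[2]) * p for t in outcomes)

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
var_X = E(lambda x, y: (x - EX) ** 2)
var_Y = E(lambda x, y: (y - EY) ** 2)
cov = E(lambda x, y: (x - EX) * (y - EY))

a, b = 1, -1                                  # gives Var[X - Y]
mean = E(lambda x, y: a * x + b * y)
lhs = E(lambda x, y: (a * x + b * y - mean) ** 2)
rhs = a ** 2 * var_X + 2 * a * b * cov + b ** 2 * var_Y
print(lhs, rhs)   # both 1/4
```

The direct computation and the formula agree: Var[X − Y] = 1/2 − 2(1/2) + 3/4 = 1/4, which makes sense since X − Y is minus the indicator of heads on the third toss.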