FACULTY OF MATHEMATICAL STUDIES
MATHEMATICS FOR PART I ENGINEERING
Lectures
MODULE 25

STATISTICS II

1. Mean and standard error of sample data
2. Binomial distribution
3. Normal distribution
4. Sampling
5. Confidence intervals for means
6. Hypothesis testing

1. Mean and standard error of sample data

Two different random variables can be measured for the same object: e.g. the height and weight of a person. Both these random variables have distributions, mean values and variances. However, taller people are usually heavier than shorter people, so these two variables are dependent. Variables can also be independent:

independent events → P(A ∩ B) = P(A) P(B)
independent discrete random variables → P(X = x_i ∩ Y = y_j) = P(X = x_i) P(Y = y_j)

The latter defines a joint distribution for the two random variables.

Ex 1. A new plant at a manufacturing site has to be installed and then commissioned. The times required for the two steps depend upon different random factors, and can therefore be regarded as independent. Based on past experience the respective distributions for X (installation time) and Y (commissioning time), both in days, are

P(X = 3) = 0.2, P(X = 4) = 0.5, P(X = 5) = 0.3
P(Y = 2) = 0.4, P(Y = 3) = 0.6

Find the joint distribution for X and Y, and the probability that the total time will not exceed 6 days.

Since the factors are independent, P(X = x_i ∩ Y = y_j) = P(X = x_i) P(Y = y_j). Thus the joint probability table is

Y \ X       3           4           5
  2     0.2 × 0.4   0.5 × 0.4   0.3 × 0.4
  3     0.2 × 0.6   0.5 × 0.6   0.3 × 0.6

i.e.

Y \ X     3      4      5
  2     0.08   0.20   0.12
  3     0.12   0.30   0.18

Note that the column and row totals give the individual distributions for X and Y.

For the second part of the question,

P(X + Y ≤ 6) = P(X = 3 ∩ Y = 2) + P(X = 3 ∩ Y = 3) + P(X = 4 ∩ Y = 2):

no other combinations are possible. The joint probabilities can then be read from the table, giving

P(X + Y ≤ 6) = 0.08 + 0.12 + 0.20 = 0.40.

It would be easy to calculate P(X + Y = w_i) for each w_i, and then evaluate E(X + Y) using the expression E(X + Y) = Σ_{w_i} w_i P(X + Y = w_i). However, we can use a more general result. Given that E(X) = Σ_{x_i} x_i P(X = x_i), and that E(Y) satisfies a similar formula, it can be shown that for independent variables

E(X + Y) = E(X) + E(Y)   and   Var(X + Y) = Var(X) + Var(Y).
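These results are easy to check numerically. The following sketch (in Python; the names pX, pY and joint are illustrative, not part of the module) tabulates the joint distribution of Ex 1 and confirms both P(X + Y ≤ 6) and the addition rule for expectations:

```python
from itertools import product

# Marginal distributions from Ex 1 (times in days)
pX = {3: 0.2, 4: 0.5, 5: 0.3}   # installation time X
pY = {2: 0.4, 3: 0.6}           # commissioning time Y

# Joint distribution under independence: P(X = x and Y = y) = P(X = x) P(Y = y)
joint = {(x, y): pX[x] * pY[y] for x, y in product(pX, pY)}

# Probability that the total time does not exceed 6 days
print(sum(p for (x, y), p in joint.items() if x + y <= 6))  # 0.40

# E(X + Y) computed directly, and via E(X) + E(Y)
E_direct = sum((x + y) * p for (x, y), p in joint.items())
E_sum = sum(x * p for x, p in pX.items()) + sum(y * p for y, p in pY.items())
print(E_direct, E_sum)  # both 6.7
```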

Given some data we usually do not know the exact distribution, but it is useful to estimate the mean and variance from the data.

Def. For a sample {X_1, X_2, ..., X_n} of data the sample average is defined by X̄ = (1/n) Σ_{i=1}^{n} X_i.

Def. For sample data {X_1, X_2, ..., X_n} the sample variance is defined by S_X² = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)².

[Note that n − 1 is used in the denominator in S_X² because the differences X_i − X̄ sum to zero and so are not independent.]

Ex 2. A die was tossed six times producing the following results: 6, 2, 4, 2, 1, 5. Find the sample average and the sample variance.

X̄ = (1/6)(6 + 2 + 4 + 2 + 1 + 5) = 20/6 = 10/3 ≈ 3.33

S_X² = (1/(6−1)) [(6 − 10/3)² + (2 − 10/3)² + (4 − 10/3)² + (2 − 10/3)² + (1 − 10/3)² + (5 − 10/3)²]
     = (1/5) [(8/3)² + (−4/3)² + (2/3)² + (−4/3)² + (−7/3)² + (5/3)²]
     = (1/45)(64 + 16 + 4 + 16 + 49 + 25) = 174/45 = 58/15 ≈ 3.87   →   S_X ≈ 1.97

[For an unbiased die it can be shown that the theoretical results for X̄ and S_X for a large number of tosses are 3.5 and 1.708 respectively.]
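The standard library reproduces the Ex 2 values directly; statistics.variance and statistics.stdev use the same n − 1 denominator as S_X². A minimal check:

```python
import statistics

tosses = [6, 2, 4, 2, 1, 5]

print(statistics.mean(tosses))      # sample average, 3.33...
print(statistics.variance(tosses))  # sample variance (n - 1 denominator), 3.87...
print(statistics.stdev(tosses))     # sample standard deviation, 1.97...
```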

2. Binomial distribution

Specifying the exact distribution of a random variable requires a lot of information. Good estimates of the mean, variance etc. can be obtained from data, and probability distributions can often be determined by formulae using the estimated values of the parameters.

Consider the simple coin-tossing experiment where only two outcomes are possible: success (say 1) or failure (say 0). A Bernoulli trial is a single observation of a random variable X, say, that can take the values 1 or 0: suppose P(X = 1) = p, P(X = 0) = 1 − p. Then

E(X) = 1(p) + 0(1 − p) = p,

and

Var(X) = E[(X − µ_X)²] = E[(X − p)²] = (0 − p)² P(X = 0) + (1 − p)² P(X = 1)
       = p²(1 − p) + (1 − p)² p = p(1 − p)(p + (1 − p)) = p(1 − p).

Now let {X_1, ..., X_n} denote n independent Bernoulli trials, each with success probability p. Then the number of successes is Y = X_1 + X_2 + ... + X_n. Suppose Y = k, where 0 ≤ k ≤ n; then k of the X_i values equal 1 and n − k equal 0. The probability of k 1's at first, with n − k 0's following, is p^k (1 − p)^(n−k), since the outcomes are independent. The number of ways of distributing k successes among n trials is the binomial coefficient

C(n, k) = n! / ((n − k)! k!),

hence

P(Y = k) = C(n, k) p^k (1 − p)^(n−k).

This is the general form of the binomial distribution, and it leads to

mean = np,   variance = np(1 − p).

Ex 3. If on average 1 in 20 of a certain type of column fails under loading, what is the probability that among 16 such columns at most 2 will fail?

Given P(column failing) = P(F) = 1/20 = 0.05 and P(column not failing) = P(F̄) = 19/20 = 0.95, therefore

P(0 F) = (0.95)^16 = 0.44013,
P(1 F) = C(16, 1) (0.05)^1 (0.95)^15 = (16)(0.05)(0.46329) = 0.37063,
P(2 F) = C(16, 2) (0.05)^2 (0.95)^14 = (120)(0.0025)(0.48767) = 0.14630,

→ P(at most 2 fail) = 0.44013 + 0.37063 + 0.14630 = 0.9571.
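Ex 3 can be checked with the binomial formula coded directly; math.comb is the standard-library binomial coefficient C(n, k). A short sketch:

```python
from math import comb

n, p = 16, 0.05  # 16 columns, failure probability 1/20

def binom_pmf(k: int) -> float:
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(at most 2 fail) = P(Y = 0) + P(Y = 1) + P(Y = 2)
print(sum(binom_pmf(k) for k in range(3)))  # ~0.9571
```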

3. Normal distribution

This occurs very frequently in practice.

Def. A continuous random variable X has a normal distribution (or Gaussian distribution) with mean µ_X and variance σ_X² if the probability density function satisfies

f_X(x) = (1/(σ_X √(2π))) exp(−(x − µ_X)²/(2σ_X²))   (−∞ < x < ∞).

Write X ∼ N(µ_X, σ_X²) to represent a random variable with normal distribution which has mean µ_X and variance σ_X² (see figure 3).

[figure 3: probability density functions f_X of N(0, 1) and N(0, 9), i.e. µ_X = 0 with σ_X = 1 and σ_X = 3, plotted over −6 ≤ x ≤ 6]

The standard normal distribution has mean 0 and variance 1, and leads to

Def. The standard normal cumulative distribution is Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−x²/2} dx.

Φ(z) is usually tabulated, and a summary of values appears on the Formula Sheet. Note that Φ(z) denotes the area under the probability density curve to the left of x = z.

Suppose that X is a normal variable with mean µ_X and variance σ_X²; then it can be shown that Z = (X − µ_X)/σ_X is also a normal variable, with mean 0 and variance 1. (The latter result is very useful in applications.)
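When no table is to hand, Φ can be evaluated from the error function, since Φ(z) = ½(1 + erf(z/√2)). A minimal sketch (the helper name Phi is just illustrative):

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Values that recur in this module
print(round(Phi(1.96), 4))  # 0.975
print(round(Phi(2.00), 4))  # 0.9772
print(round(Phi(2.58), 4))  # 0.9951
```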

Ex 4. The burning time X of an experimental rocket is a random variable having (approximately) a normal distribution with mean 600 s and standard deviation 25 s. Find the probability that such a rocket will burn for (a) less than 550 s, (b) more than 637.5 s.

(a) Given µ_X = 600 and σ_X = 25, therefore

P(X < 550) = P((X − 600)/25 < (550 − 600)/25) = P(Z < −2),

where Z = (X − 600)/25. From symmetry (look at figure 3), and then using the results on the Formula Sheet,

P(Z < −2) = P(Z > 2) = 1 − P(Z ≤ 2) = 1 − Φ(2) = 1 − 0.9772 = 0.0228.

(b) P(X > 637.5) = P((X − 600)/25 > (637.5 − 600)/25) = P(Z > 1.5)
               = 1 − P(Z ≤ 1.5) = 1 − Φ(1.5) = 1 − 0.9332 = 0.0668.
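A quick numerical check of Ex 4, standardising exactly as above (the erf-based Φ is restated so the snippet stands alone):

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 600.0, 25.0  # burning time mean and standard deviation, seconds

# (a) P(X < 550) = Phi(-2)
print(Phi((550.0 - mu) / sigma))        # ~0.0228

# (b) P(X > 637.5) = 1 - Phi(1.5)
print(1.0 - Phi((637.5 - mu) / sigma))  # ~0.0668
```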

When n > 20 it can be shown that the normal distribution with mean np and variance np(1 − p) provides a very accurate approximation to the binomial distribution.

4. Sampling

One of the major problems in statistics, estimating the properties of a large population from the properties of a sample of individuals chosen from that population, is considered in this section.

Select at random a sample of n observations X_1, X_2, ..., X_n taken from a population. From these n observations you can calculate the values of a number of statistical quantities, for example the sample mean X̄. If you choose another random sample of size n from the same population, a different value of the statistic will, in general, result. In fact, if repeated random samples are taken, you can regard the statistic itself as a random variable, and its distribution is called the sampling distribution of the statistic.

For example, consider the distribution of heights of all adult men in England, which is known to conform very closely to the normal curve. Take a large number of samples of size four, drawn at random from the population, and calculate the mean height of each sample. How will these mean heights be distributed? We find that they are also normally distributed, about the same mean as the original distribution. However, a random sample of four is likely to include men both above and below average height, and so the mean of the sample will deviate from the true mean less than a single observation will. This important general result can be stated as follows:

If random samples of size n are taken from a distribution whose mean is µ_X and whose standard deviation is σ_X, then the sample means form a distribution with mean µ_X and standard deviation σ_X̄ = σ_X/√n.

Note that the theorem holds for all distributions of the parent population. However, if the parent distribution is normal then it can be shown that the sampling distribution of the sample mean is also normal. The standard deviation of the sample mean, σ_X̄ defined above, is usually called the standard error of the sample mean.

Let us now present three worked examples.

Ex 5. A random sample is drawn from a population with a known standard deviation of 2.0. Find the standard error of the sample mean if the sample is of size (i) 9, (ii) 100. What sample size would give a standard error equal to 0.5?

Using the result stated earlier,

(i) standard error = σ_X/√n = 2/√9 = 0.667, to 3 decimal places;
(ii) standard error = σ_X/√n = 2/√100 = 0.2.

If the standard error equals 0.5, then 2/√n = 0.5. Squaring implies that 4/n = 0.25, so n = 16 (i.e. the sample size is 16).
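The arithmetic of Ex 5 in code form (a sketch; ceil performs the round-up to a whole number of measurements):

```python
from math import ceil, sqrt

sigma = 2.0  # known population standard deviation

def standard_error(n: int) -> float:
    """Standard error of the sample mean for a sample of size n."""
    return sigma / sqrt(n)

print(standard_error(9))    # 0.667 to 3 decimal places
print(standard_error(100))  # 0.2

# Smallest n with sigma / sqrt(n) <= 0.5
print(ceil((sigma / 0.5) ** 2))  # 16
```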

Ex 6. The diameters of shafts made by a certain manufacturing process are known to be normally distributed with mean 2.500 cm and standard deviation 0.009 cm. What is the distribution of the sample mean diameter of nine such shafts selected at random? Calculate the percentage of such sample means which can be expected to exceed 2.506 cm.

Since the process is normal, we know that the sampling distribution of the sample mean will also be normal, with the same mean, 2.500 cm, but with a standard error (or standard deviation) σ_X̄ = 0.009/√9 = 0.003 cm. In order to calculate the probability that the sample mean is bigger than 2.506, i.e. X̄ > 2.506, we standardise in the usual way by putting Z = (X̄ − 2.500)/0.003, and then

P(X̄ > 2.506) = P((X̄ − 2.500)/0.003 > (2.506 − 2.500)/0.003) = P(Z > 2.0)
             = 1 − P(Z ≤ 2.0) = 1 − Φ(2.0) = 1 − 0.9772 = 0.0228,

using the Formula Sheet. Hence, 2.28% of the sample means can be expected to exceed 2.506 cm.

Ex 7. What is the probability that an observed value of a normally distributed random variable lies within one standard deviation of the mean?

The given normally distributed random variable X has mean µ_X and standard deviation σ_X, i.e. X ∼ N(µ_X, σ_X²). We need to calculate P(µ_X − σ_X ≤ X ≤ µ_X + σ_X). Define Z = (X − µ_X)/σ_X; then Z ∼ N(0, 1). It follows that

P(µ_X − σ_X ≤ X ≤ µ_X + σ_X) = P(((µ_X − σ_X) − µ_X)/σ_X ≤ Z ≤ ((µ_X + σ_X) − µ_X)/σ_X)
    = P(−1 ≤ Z ≤ 1) = 2 P(0 ≤ Z ≤ 1), by symmetry,
    = 2(Φ(1) − Φ(0)) = 2(0.8413 − 0.5000) = 2(0.3413) = 0.6826.

It was stated above that when the parent distribution is normal then the sampling distribution of the sample mean is also normal. When the parent distribution is not normal, the following theorem (a surprising result?) applies:

Central limit theorem. If a random sample of size n (n ≥ 30) is taken from ANY distribution with mean µ_X and standard deviation σ_X, then the sampling distribution of X̄ is approximately normal with mean µ_X and standard deviation σ_X/√n, the approximation improving as n increases.

Ex 8. It is known that a particular make of light bulb has an average life of 800 hrs with a standard deviation of 48 hrs. Find the probability that a random sample of 144 bulbs will have an average life of less than 790 hrs.

Since the number of bulbs in the sample is large, the sample mean will be normally distributed with mean µ_X̄ = 800 and standard error σ_X̄ = 48/√144 = 4. Put Z = (X̄ − 800)/4; then

P(X̄ < 790) = P((X̄ − 800)/4 < (790 − 800)/4)
           = P(Z < −2.5) = P(Z > 2.5), by symmetry,
           = 1 − P(Z ≤ 2.5) = 1 − 0.9938 = 0.0062.
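Ex 8 in code (self-contained, with the same erf-based Φ):

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma, n = 800.0, 48.0, 144   # bulb life in hours, sample size

std_error = sigma / sqrt(n)           # 48 / 12 = 4
print(Phi((790.0 - mu) / std_error))  # P(sample mean < 790) ~ 0.0062
```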

To conclude this section, the main results concerning the distribution of the sample mean X̄ are summarised. Consider a parent population with mean µ_X and standard deviation σ_X. From this population take a random sample of size n with sample mean X̄ and standard error σ_X/√n. Define Z = (X̄ − µ_X)/(σ_X/√n); then

(i) for all n — the distribution of Z is N(0, 1) if the distribution of the parent population is normal;
(ii) n < 30 — the distribution of Z is approximately N(0, 1) if the distribution of the parent population is approximately normal;
(iii) n ≥ 30 — the distribution of Z is a good approximation to N(0, 1) for all distributions of the parent population.

5. Confidence intervals for means

Choose a sample at random from a population; then the mean, X̄, of the sample is said to provide a point estimator of the population mean µ_X. The accuracy of this estimate is measured by the confidence interval, the interval within which you can be reasonably sure the value of the population mean µ_X lies. One usually calculates k by specifying that the probability that the interval (X̄ − k, X̄ + k) contains the population mean is 0.95 or 0.99. For example, if k is calculated so that

P(X̄ − k ≤ µ_X ≤ X̄ + k) = 0.95,

then the interval is called the 95% confidence interval: if the interval were calculated for very many samples, then 95 out of 100 intervals would contain µ_X. On replacing 95 by 99 you obtain the definition of a 99% confidence interval.

To proceed further, assume that the standard deviation σ_X is known and that Z = (X̄ − µ_X)/(σ_X/√n) is distributed with the standard normal distribution N(0, 1) (the conditions for this to hold were stated at the end of section 4). The results on the Formula Sheet show that 95% of the standard normal distribution lies between −1.96 and 1.96. Hence

P(−1.96 ≤ Z ≤ 1.96) = 0.95.

Rearranging the inequalities inside the bracket to obtain conditions on µ_X yields

Z = (X̄ − µ_X)/(σ_X/√n) ≤ 1.96  →  X̄ − µ_X ≤ 1.96 σ_X/√n  →  µ_X ≥ X̄ − 1.96 σ_X/√n;
Z = (X̄ − µ_X)/(σ_X/√n) ≥ −1.96  →  X̄ − µ_X ≥ −1.96 σ_X/√n  →  µ_X ≤ X̄ + 1.96 σ_X/√n.

It follows that the earlier expression can be re-written as

P(X̄ − 1.96 σ_X/√n ≤ µ_X ≤ X̄ + 1.96 σ_X/√n) = 0.95,

and hence the interval

X̄ − 1.96 σ_X/√n ≤ µ_X ≤ X̄ + 1.96 σ_X/√n

is the 95% confidence interval for µ_X. Similarly,

X̄ − 2.58 σ_X/√n ≤ µ_X ≤ X̄ + 2.58 σ_X/√n

is the 99% confidence interval for µ_X.
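A sketch of the interval computation (the sample figures in the usage lines are made up for illustration; they are not from the module):

```python
from math import sqrt

def confidence_interval(x_bar: float, sigma: float, n: int,
                        z: float = 1.96) -> tuple[float, float]:
    """Confidence interval for the mean when the population sigma is known.

    z = 1.96 gives the 95% interval, z = 2.58 the 99% interval.
    """
    half_width = z * sigma / sqrt(n)
    return (x_bar - half_width, x_bar + half_width)

# Hypothetical sample: mean 10.4 from 25 observations, sigma known to be 1.5
print(confidence_interval(10.4, 1.5, 25))           # 95%: (9.812, 10.988)
print(confidence_interval(10.4, 1.5, 25, z=2.58))   # 99%: (9.626, 11.174)
```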


Ex 9. The percentage of copper in a certain chemical is to be estimated by taking a series of measurements on randomly chosen small quantities of the chemical and using the sample mean to estimate the true percentage. From previous experience, individual measurements of this type are known to have a standard deviation of 2.0%. How many measurements must be made so that the standard error of the estimate is less than 0.3%? If the sample mean of 45 measurements is found to be 12.91%, give a 95% confidence interval for the true percentage, ω.

Assume that n measurements are made. The standard error of the sample mean is (2/√n)%. The required precision needs 2/√n < 0.3, i.e. n > (2/0.3)² = 4/0.09 = 44.4. Since n must be an integer, at least 45 measurements are necessary for the required precision.

With a sample of 45 measurements, you can use the central limit theorem and take the sample mean percentage W to be distributed normally with mean ω and standard error 2/√45. Hence, if ω is the true percentage, it follows that Z = (W − ω)/(2/√45) is distributed as N(0, 1). Since 95% of the area under the standard normal curve lies between Z = −1.96 and Z = 1.96,

P(−1.96 ≤ (W − ω)/(2/√45) ≤ 1.96) = 0.95.

Re-arranging, we obtain

P(W − 1.96(2/√45) ≤ ω ≤ W + 1.96(2/√45)) = 0.95.

Hence, the 95% confidence interval for the true percentage is

(12.91 − 1.96(0.298), 12.91 + 1.96(0.298)) = (12.33, 13.49).

To complete this section we define the sample variance.

Def. Given a sample of n observations X_1, X_2, ..., X_n, the sample variance, s², is given by s² = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)², where X̄ denotes the sample mean.

In our discussion of confidence intervals for the mean it was assumed that the population variance σ_X² was known. What happens when it is not known? For samples of size n > 30, a good estimate of σ_X² is obtained by calculating the sample variance s² and using this value. (For small samples, n < 30, the t-distribution is needed; it is not considered in this module.)

6. Hypothesis testing

An assumption made about a population is called a statistical hypothesis. From information contained in a random sample we try to decide whether or not the hypothesis is true: if evidence from the sample is inconsistent with the hypothesis, the hypothesis is rejected; if the evidence is consistent with the hypothesis, the hypothesis is accepted.

The hypothesis being tested is called the null hypothesis (usually denoted by H_0): it either specifies a particular value of the population parameter or specifies that two or more parameters are equal. A contrary assumption is called the alternative hypothesis (usually denoted by H_1), which normally specifies a range of values for the parameter. A common example of the null hypothesis is H_0: µ_X = µ_0. Then three alternative hypotheses are

(i) H_1: µ_X > µ_0,
(ii) H_1: µ_X < µ_0,
(iii) H_1: µ_X ≠ µ_0.

Types (i) and (ii) are said to be one-sided (or one-tailed, see figure 6b); type (iii) is two-sided (or two-tailed, see figure 6a). The result of a test is a decision to choose H_0 or H_1. This decision is subject to uncertainty, and two types of error are possible:

(i) a type I error occurs when we reject H_0 on the basis of the test although it happens to be true. The probability of this happening is called the level of significance of the test, and it is prescribed before testing; the most commonly chosen values are 5% and 1%.

(ii) a type II error occurs when you accept the null hypothesis on the basis of the test although it happens to be false.

The above ideas are now applied to determine whether or not the mean, X̄, of a sample is consistent with a specified population mean µ_0. The null hypothesis is H_0: µ_X = µ_0, and a suitable statistic to use is Z = (X̄ − µ_0)/(σ_X/√n), where σ_X is the standard deviation of the population and n is the size of the sample. We find the range of values of Z for which the null hypothesis would be accepted, known as the acceptance region for the test; it depends on the pre-determined significance level and the choice of H_1. The corresponding range of values of Z for which H_0 is rejected (i.e. not accepted) is called the rejection region.

Ex 10. A standard process produces yarn with mean breaking strength 15.8 kg and standard deviation 1.9 kg. A modification is introduced and a sample of 30 lengths of yarn produced by the new process is tested to see if the breaking strength has changed. The sample mean breaking strength is 16.5 kg. Assuming the standard deviation is unchanged, is it correct to say that there is no change in the mean breaking strength?

Here H_0: µ_X = µ_0 and H_1: µ_X ≠ µ_0, where µ_0 = 15.8 and µ_X is the mean breaking strength for the new process. If H_0 is true (i.e. µ_X = µ_0), then Z = (X̄ − µ_0)/(σ_X/√n) has approximately the N(0, 1) distribution, where X̄ is the mean breaking strength of the 30 sample values and n = 30.

At the 5% significance level there is a rejection region of 2.5% in each tail, as shown in figure 6a (since, under H_0, P(Z < −1.96) = P(Z > 1.96) = 1 − P(Z ≤ 1.96) = 1 − Φ(1.96) = 0.025, i.e. 2.5%). This is an example of a two-sided test leading to a two-tailed rejection region.

[figure 6a: two-tailed test at the 5% significance level; acceptance region −1.96 ≤ Z ≤ 1.96, with a rejection region of 2½% in each tail beyond ±1.96]

The test is therefore: accept H_0 if −1.96 ≤ Z ≤ 1.96, otherwise reject. From the data, Z = (16.5 − 15.8)/(1.9/√30) = 2.018. Hence, H_0 is rejected at the 5% significance level: i.e. the evidence suggests that there IS a change in the mean breaking strength.
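The whole two-sided test of Ex 10 is a few lines of code; a minimal sketch:

```python
from math import sqrt

# Ex 10 data: H0 says the mean breaking strength is unchanged at mu0
mu0, sigma = 15.8, 1.9   # old mean and (unchanged) standard deviation, kg
x_bar, n = 16.5, 30      # sample mean and sample size

z = (x_bar - mu0) / (sigma / sqrt(n))
print(z)  # ~2.018

# Two-sided test at the 5% level: reject H0 if |Z| > 1.96
print("reject H0" if abs(z) > 1.96 else "accept H0")  # reject H0
```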

Let us now consider a slightly differently worded question. Suppose the modification was specifically designed so as to increase the strength of the yarn. In this case

H_0: µ_X = µ_0,   H_1: µ_X > µ_0,

and H_0 is rejected if the value of Z is unreasonably large. In this situation the test is one-sided, and the acceptance and (one-tailed) rejection regions at the 5% significance level are shown below.

[figure 6b: one-tailed test at the 5% significance level; acceptance region Z ≤ 1.64, rejection region of 5% in the tail beyond 1.64]

At the 5% significance level the test is: accept H_0 if Z ≤ 1.64, otherwise reject. From the earlier work Z = 2.018, and again the null hypothesis is rejected.

[Compare the two diagrams above, which illustrate the statement that the rejection region for a test depends on the form of both the alternative hypothesis and the significance level.]
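The critical values 1.64 and 1.96 quoted above can be recovered without tables by inverting the erf-based Φ with a bisection search; a sketch (z_critical is an illustrative helper, not a library routine):

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_critical(tail_prob: float) -> float:
    """z such that P(Z > z) = tail_prob, found by bisection on [0, 10]."""
    lo, hi = 0.0, 10.0
    for _ in range(60):  # 60 halvings give far more precision than needed
        mid = (lo + hi) / 2.0
        if 1.0 - Phi(mid) > tail_prob:
            lo = mid
        else:
            hi = mid
    return hi

print(round(z_critical(0.05), 2))   # 1.64  (one-tailed, 5% level)
print(round(z_critical(0.025), 2))  # 1.96  (two-tailed, 2.5% in each tail)
```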
