Introduction to Hypothesis Testing, Power Analysis and Sample Size Calculations

Hypothesis testing is one of the most commonly used statistical techniques. Most often, scientists are interested in estimating the difference between two populations and use the sample means as the statistics of interest; for this reason, the normal distribution is the one most often used to calculate the probabilities of interest. Drawing conclusions about populations from data is termed inference: scientists wish to draw inferences about the populations they study from the data they have collected. Power calculations are also important when we fail to reject the null hypothesis of no effect or no significant difference. This matters in the regulatory community, where failing to reject the null hypothesis of no effect is unconvincing unless accompanied by a power analysis showing that, if there were an effect, the sample size was large enough to detect it. This lecture explores the basic concepts behind power analysis under the normal assumption. Power and sample size calculations can be more complicated when using other distributions, but the basic idea is the same.
1. Distribution of the Sample Mean
Most hypothesis testing is conducted using the sample mean as the statistic of interest, to estimate the true population mean. Consider a sample of size $n$ of random variables $X_1, X_2, \dots, X_n$, with $E\{X_i\} = \mu$ and $\mathrm{Var}\{X_i\} = \sigma^2$ for all $i$. Let
\[
\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}.
\]
Then
\[
E\{\bar{x}\} = E\left\{\frac{\sum_{i=1}^{n} X_i}{n}\right\} = \frac{1}{n}\, n\mu = \mu
\]
and
\[
\mathrm{Var}\{\bar{x}\} = \mathrm{Var}\left\{\frac{\sum_{i=1}^{n} X_i}{n}\right\} = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.
\]
If $X_i \sim N(\mu, \sigma^2)$, then we know from earlier results that $\bar{x} \sim N(\mu, \sigma^2/n)$. Additionally, even if the data do not come from a normal distribution,
\[
\lim_{n \to \infty} P\left\{\frac{\sqrt{n}\,(\bar{x} - \mu)}{\sigma} \le x\right\} = \Phi(x).
\]
Hence, even if our data are not normal, for a large enough sample size we can calculate probabilities for $\bar{x}$ by applying the Central Limit Theorem, and our answers will be close enough.
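As a quick numerical illustration of this result (not part of the original lecture), the following sketch simulates repeated samples from a skewed population and checks that the sample means behave approximately like $N(\mu, \sigma^2/n)$. The choice of population, sample size, and variable names is ours.

```python
import numpy as np

# Illustration: even for a skewed (exponential) population, the sampling
# distribution of the mean is approximately N(mu, sigma^2 / n).
rng = np.random.default_rng(0)
n, n_reps = 50, 10_000               # sample size and number of repeated samples
mu, sigma = 2.0, 2.0                 # an Exponential(scale=2) has mean 2 and sd 2

# Draw n_reps samples of size n and compute each sample's mean
xbar = rng.exponential(scale=2.0, size=(n_reps, n)).mean(axis=1)

print("mean of x-bar:", round(float(xbar.mean()), 3))       # close to mu = 2
print("sd of x-bar:  ", round(float(xbar.std(ddof=1)), 3))  # close to sigma/sqrt(n) = 0.283
```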
2. Hypothesis Testing
Hypothesis testing is a formal statistical procedure that attempts to answer the question “Is the observed sample consistent with a given hypothesis?” This boils down to calculating the probability of the sample given the hypothesis, $P\{X \mid H\}$. To set up the procedure, scientists propose what is called a null hypothesis. The null hypothesis is usually of the form: these data were generated by strictly random processes, with no underlying mechanism. The null hypothesis is always the opposite of the hypothesis we are actually interested in. The scientist then sets up a hypothesis test to compare the null hypothesis to the mechanistic, scientific hypothesis consistent with their scientific theory.

Example 1. Examples of null hypotheses are:
1. no difference in the response of patients to a drug versus a placebo,
2. no difference in the contamination level of well water near a paper mill and a well some distance away,
3. no difference between the leukemia rate in Woburn, Massachusetts and the national average, and
4. the concentration of mercury in the groundwater is below the regulatory limit.

A hypothesis test is usually represented as follows:
\[
H_0 : \mu = \mu_0 \quad \text{vs.} \quad H_a : \mu \ne \mu_0.
\]
The null and alternative hypotheses should be specified before the test is conducted and before the data are observed. The investigators also need to specify a value of $P\{X \mid H_0\}$ at which they will reject $H_0$. The idea is that if the data are quite unlikely under the null hypothesis, then we conclude that they are inconsistent with the null, and hence accept the alternative. Notice that the null and the alternative are mutually exclusive and exhaustive; that is, one or the other must be true, but it is impossible that both are. The probability that we reject the null when it is true is denoted $\alpha$ and is called the size of the test; its complement $1 - \alpha$ is called the confidence level of the test, and the terms size and significance level are often used interchangeably. Note that for $P\{X \mid H_0\}$ small enough (no greater than $\alpha$), we reject $H_0$. Hence
\[
\alpha = P\{\text{we reject } H_0 \text{ when } H_0 \text{ is true}\},
\]
also called the probability of a Type I error. The probability of a Type II error is
\[
\beta = P\{\text{we fail to reject } H_0 \text{ when } H_0 \text{ is false}\}.
\]

Table 1: Possible Results from a Hypothesis Test

                          Truth
    Test result      H_0                H_a
    H_0               OK                 Type II error
    H_a               Type I error       OK
The values of the test statistic that are as extreme as or more extreme than the critical value constitute the rejection region. Let’s begin with an example. Say that regulators desire a high certainty that emissions are below 5 parts per billion (ppb) for a particular contaminant, and the regulatory limit is 8 parts per billion. They may conduct the following test:
\[
H_0 : \mu \le 5\ \text{ppb} \quad \text{versus} \quad H_a : \mu > 5\ \text{ppb}.
\]
At what value will we reject $H_0$? Say we would like to reject the null hypothesis at the 95% confidence level. This means we wish to fix the probability of falsely rejecting $H_0$ (a Type I error) at no greater than 5%. Here, under $H_0$, $\mu$ can be fixed at $\mu = 5$ without altering the size of the test. Now we need to find the rejection region, i.e. the value of $\bar{x}$ at which we can reject $H_0$ at 95% confidence. We need to find a $c$ such that
\[
P\{\bar{x} > c \mid \mu = 5\} = 0.05. \tag{2.1}
\]
Now, since the standard deviation is taken to be 3 and the sample size is 5, we can standardize $\bar{x}$ under $H_0$ so that it has a standard normal distribution (a mean of 0 and a standard deviation of 1), and then make use of the standard normal probability charts. We have
\[
P\{\bar{x} > c \mid H_0\}
= P\left\{\frac{\bar{x} - 5}{3/\sqrt{5}} > \frac{c - 5}{3/\sqrt{5}}\right\}
= P\left\{z > \frac{c - 5}{3/\sqrt{5}}\right\}, \tag{2.2}
\]
where $z$ is a standard normal random variable. Then, from our probability charts, we know that
\[
\frac{c - 5}{3/\sqrt{5}} = 1.645. \tag{2.3}
\]
Solving for $c$, we find that we reject $H_0$ when $\bar{x} \ge 7.2$.
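For readers who prefer to check this numerically, here is a small sketch of the same rejection-region calculation using SciPy. It is an illustration we have added (not part of the original lecture), and the variable names are ours.

```python
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 5.0, 3.0, 5, 0.05     # null mean, sd, sample size, test size

# Rejection region: reject H0 when xbar >= c, with
# c = mu0 + z_{1 - alpha} * sigma / sqrt(n)
z = norm.ppf(1 - alpha)                      # approx. 1.645
c = mu0 + z * sigma / np.sqrt(n)
print(round(c, 2))                           # approx. 7.21, i.e. reject H0 when xbar >= 7.2
```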
3. Power Calculations
Let’s continue with our example. In the event that the managers fail to reject $H_0$, that is, they conclude that there is insufficient evidence that emissions are above 5 ppb, they and their stakeholders may want to ask the question: “Was there sufficient information in our sample (i.e., is the lack of evidence due to insufficient sample size) to have detected a difference of 3 ppb?” Hence, they need to calculate the power of the test when $\mu = 8$ ppb. The power of a test is defined as the probability that we correctly reject the null hypothesis, given that a particular alternative is true. Power can also be written as
\[
\text{power} = 1 - P\{\text{we do not reject } H_0 \text{ when } H_0 \text{ is false}\} = 1 - P\{\text{Type II error}\}.
\]
In order to calculate power, we need to specify an alternative, and we require an estimate of the variability of the statistic used to conduct the test; in most cases this statistic is the sample mean. The standard deviation of the sample mean is given by the sample standard deviation divided by the square root of the sample size:
\[
\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}. \tag{3.1}
\]
Continuing with the example, say we wish to calculate the power of the above test for a normal sample of size 5 and with known standard deviation 3. As with the size calculation, for the purposes of this power calculation, under $H_a$, $\mu$ can be fixed at $\mu = 8$. The test statistic is the sample mean, and we reject the null hypothesis when $\bar{x}$ exceeds the critical value. This value can be easily calculated using elementary statistics, since we have assumed that the sample mean is normally distributed; that is, if we were to repeat the experiment a large number of times and calculate the mean each time, the resulting sample of means would show a normal distribution. We know that the rejection region is $\bar{x} \ge 7.2$. We can calculate the power of this test at some alternative, say $\mu = 8$.
We need
\[
P\{\bar{x} > 7.2 \mid \mu = 8\}
= P\left\{\frac{\bar{x} - 8}{3/\sqrt{5}} > \frac{7.2 - 8}{3/\sqrt{5}}\right\}
= P\{z > -0.5963\} = 0.7245. \tag{3.2}
\]
The power of this test at the specified alternative is then 0.7245. Alternatively, we can say that the probability of a Type II error, that is, the probability that we fail to reject the null when the true mean is 8, is $1 - 0.7245 = 0.2755$. We can conduct a full power analysis by plotting the power at a wide variety of alternatives, or distances from $\mu_0$, assuming that the standard deviation remains constant across all concentrations.
[Figure: Power curve for n = 5, sigma = 3, plotting power against the alternative mean concentration in parts per billion.]
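The power value in (3.2) and the curve above can be reproduced with a short script. This sketch is ours (not part of the original lecture); it simply evaluates the normal tail probability at each alternative, and the function and variable names are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu0, sigma, n, alpha = 5.0, 3.0, 5, 0.05
se = sigma / np.sqrt(n)                        # standard error of the mean
c = mu0 + norm.ppf(1 - alpha) * se             # critical value, approx. 7.2

def power(mu_a):
    """P(xbar > c) when the true mean is mu_a."""
    return norm.sf((c - mu_a) / se)            # sf(x) = 1 - cdf(x)

print(round(float(power(8.0)), 4))             # approx. 0.72, close to the hand calculation

# Power curve over a range of alternatives
alts = np.linspace(5, 14, 200)
plt.plot(alts, power(alts))
plt.xlabel("Alternative mean concentration (ppb)")
plt.ylabel("Power")
plt.title("Power curve for n = 5, sigma = 3")
plt.show()
```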
4. Sample Size Calculations
Even better than performing a power analysis after an experiment has been conducted is to perform it before any data are collected. Careful experimental design can save untold hours and dollars from being wasted. As Quinn and Keough point out, too often a “post hoc” power calculation reveals nothing more than our inability to design a decent experiment. If we have any reasonable estimate of the variability and a scientifically justifiable and interesting alternative, or even a range of alternatives, we can estimate beforehand whether or not the experiment is worth doing given the limitations on our time and budget.

Say we would like to set the probability of a Type I error at no greater than 5% and of a Type II error at no greater than 10%; what sample size would we need for the test shown above? We saw that we reject $H_0$ at
\[
\bar{x} \ge z_{1-\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_0.
\]
Now consider the desired power. We repeat the same process as we did above for the $\alpha$ level, but this time solving for the critical value using the $z$ value corresponding to the desired power:
\[
\bar{x} \ge z_{\beta}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_a.
\]
Now recall that $z_{\beta} = -z_{1-\beta}$. Setting the two expressions for $\bar{x}$ equal to one another, we have
\[
z_{1-\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_0 = -z_{1-\beta}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_a.
\]
Letting $1 - \alpha$ be the confidence level we desire and $1 - \beta$ the power, with $z_{1-\alpha}$ and $z_{1-\beta}$ the corresponding $z$ values, and solving for $n$ in the above equation yields
\[
n = \sigma^2 \left[\frac{z_{1-\alpha} + z_{1-\beta}}{\mu_a - \mu_0}\right]^2. \tag{4.1}
\]
So, for the example above, from a standard normal probability chart we have $z_{1-\alpha} = 1.645$ and $z_{1-\beta} = 1.282$. For this test, $\mu_a = 8$ and $\mu_0 = 5$, yielding
\[
n = 9\left[\frac{1.645 + 1.282}{3}\right]^2 = 8.56. \tag{4.2}
\]
Rounding up to the next whole number, we need a sample of size 9 to achieve the desired confidence and power for this experiment.
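The same calculation is easy to wrap in a small function; the sketch below is ours (not part of the original lecture), and the function name and arguments are ours, chosen to mirror equation (4.1).

```python
import math
from scipy.stats import norm

def sample_size(mu0, mu_a, sigma, alpha=0.05, beta=0.10):
    """n = sigma^2 * ((z_{1-alpha} + z_{1-beta}) / (mu_a - mu0))^2, rounded up."""
    z_a = norm.ppf(1 - alpha)    # approx. 1.645
    z_b = norm.ppf(1 - beta)     # approx. 1.282
    n = sigma**2 * ((z_a + z_b) / (mu_a - mu0))**2
    return math.ceil(n)          # round up to the next whole observation

print(sample_size(mu0=5.0, mu_a=8.0, sigma=3.0))   # 8.57 before rounding, so n = 9
```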
Of course, we will often have no preliminary data from which to estimate a standard deviation. In this case, we must use a conservative “best guess” for the variance. In practice, we may also not know exactly what constitutes a scientifically meaningful alternative. However, as practitioners of science we should be working to move our community towards more careful planning of experiments and more careful thinking about our questions before we begin the experiment.
5. References
1. Pagano, M. and K. Gauvreau, 1993. Principles of Biostatistics. Duxbury Press, Belmont, California.
2. Quinn, Gerry P. and Michael J. Keough, 2002. Experimental Design and Data Analysis for Biologists. Cambridge University Press, Cambridge.