RESEARCH METHODOLOGY
LESSON 17: PRINCIPLES OF STATISTICAL INFERENCE AND CONFIDENCE INTERVALS
In this chapter we shall learn to apply techniques of statistical inference to business situations. I will first introduce the basic idea underlying statistical inference and hypothesis testing. Our focus will be more on understanding key concepts intuitively and less on formulas and calculations, which can be done easily on computers.
To solve problems such as this, we have to learn how to use the characteristics of a sample to test an assumption about the population from which the sample is drawn. This, in effect, is the process of statistical inference.
A business manager in a typical managerial situation needs to determine whether results based on samples can be generalized to a population. Different management situations require different statistical techniques to carry out tests regarding the applicability of sample statistics to a population.
The issue facing the manager in the above example is:
• Could a sample of 100 aluminum sheets with an average thickness of .048 inches have come from a population with an average thickness of .04 inches?
• Does the sample estimate of thickness differ from the population value because of sampling error, or because our fundamental assumption about the mean thickness of the underlying population is not correct?
• Suppose he believes it to be the first case and accepts the consignment. What risk does he run that the consignment is flawed and does not conform to the quality standard of .04 inches?
By the end of this unit you should be able to:
• Understand the nature of statistical inference
• Understand types of statistical inference
• Understand the theory behind statistical inference
• Apply sampling theory concepts to confidence intervals and estimation
Nature of Statistical Inference
What is Statistical Inference?
We have seen that all managerial and business situations involve decision making with incomplete information. When a particular finding emerges from data analysis, the manager asks whether the empirical finding represents the true picture or has occurred as a result of a sampling accident. Statistical inference is the process by which we generalize from sample results to the population from which the sample has been drawn. In other words, we extend the knowledge obtained from a random sample, which is only a small part of the population, to the whole population.
Where do we use Statistical Inference?
Let us think of a typical managerial situation. Imagine you are a purchase manager. Your basic problem is to ensure that a consignment of aluminum sheets supplied to you corresponds to the required specification of .04 inch thickness. How do you go about ensuring this? One way would be to accept blindly whatever your supplier claims. Another option would be to audit each and every item. Clearly this would be both very time consuming and expensive, and would result in an unacceptably low level of productivity. A third option is for the manager to choose a random sample of 100 aluminum sheets and measure their thickness. He finds, for example, that the sheets in the sample have an average thickness of .048 inches. On the basis of past experience with the supplier, the manager believes that the sheets come from a population with a standard deviation of .004 inches. On the basis of this data he has to decide whether to accept or reject a consignment of 10,000 sheets.
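As a rough numerical illustration of the purchase manager's problem, the sketch below measures how far the sample mean lies from the specified mean in units of the standard error σ/√n, a quantity explained later in this lesson. It assumes that the population standard deviation of .004 inches is correct; the function and variable names are illustrative only.

```python
import math

def z_for_sample_mean(sample_mean, hypothesized_mean, population_sd, n):
    """Distance of the sample mean from the hypothesized mean, in standard errors."""
    standard_error = population_sd / math.sqrt(n)  # sigma / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Figures from the aluminum sheet example in the text.
z = z_for_sample_mean(sample_mean=0.048, hypothesized_mean=0.04,
                      population_sd=0.004, n=100)
print(round(z, 2))
```

Under these assumptions the sample mean lies about 20 standard errors above .04 inches, a gap that sampling error alone would be very unlikely to produce.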
This is just one example of a very typical managerial situation where the principles of statistical inference can be used to resolve the manager’s dilemma.
Types of Statistical Inference
Broadly, statistical inference falls into two major categories: estimation and hypothesis testing. The two are really two sides of the same coin and can be regarded as different aspects of the same technique. Below I briefly explain each of them.
1. Estimation
This is concerned with how we use sample statistics to estimate population parameters. An estimate need not be based on statistical data. All managers make quick estimates based on incomplete information, gut feel and intuition. Thus an estimate of sales for the next quarter can be based on gut feel or on an analysis of past sales data for the quarter. Both represent estimates. The difference between an estimate based on intuition and one based on a random sample is that, for the latter, the principles of probability allow us to calculate how much of the error in the estimate is attributable to sampling variation. The sample mean, for example, can be used as an estimate of the population mean. Similarly, the percentage of occurrence of an attribute or event in a sample can be used to estimate the population proportion. To explain the concept a little more clearly we can look at a few examples of estimation:
• University departments make estimates of next year’s enrollments on the basis of last year’s enrollments in the same courses.
• Credit managers make estimates about whether a purchaser will pay his bills on the basis of the past behaviour of customers with similar characteristics or their past repayment record.
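The two point estimates mentioned above, the sample mean and the sample proportion, can be sketched as follows. The data values are hypothetical and are included only to make the calculation concrete.

```python
from statistics import mean

# Hypothetical sample of past quarterly sales figures (illustrative values only).
sample_sales = [120, 135, 128, 142, 130]

# Hypothetical record of whether each sampled customer paid on time.
paid_on_time = [True, True, False, True, True, False, True, True]

# Point estimate of the population mean.
mean_estimate = mean(sample_sales)

# Point estimate of the population proportion.
proportion_estimate = sum(paid_on_time) / len(paid_on_time)

print(mean_estimate, proportion_estimate)
```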
© Copy Right: Rai University
11.556
If we find a difference between two samples, we would like to know whether this is a “real” difference (i.e., it is present in the population) or just a “chance” difference (i.e., it could simply be the result of random sampling error). Hypothesis testing begins with an assumption, called a hypothesis, that we make about a population parameter. We then collect sample data and calculate sample statistics, such as the mean and standard deviation, to decide how likely it is that our hypothesized population parameter is correct. Essentially the process involves judging whether the difference between a sample value and the assumed population value is significant or not. The smaller the difference, the greater the chance that our hypothesized value for the mean is correct. Some examples of real world situations where we might want to test hypotheses:
• A random sample of 100 South Indian families finds that they consume more of a particular brand of Assam tea per family than a random sample of 100 North Indian families. It could be that the observed difference was caused by sampling accident and that there is actually no difference between the two populations. However, if the results are not caused by pure sampling fluctuations, then the firm has a case for taking further marketing action based on the sample finding.
• Colgate Palmolive has decided that a new TV ad campaign can only be justified if more than 55% of viewers see the ads. The company asks a marketing research agency to carry out a survey to assess viewership. The agency comes back with an ad penetration of 50% for a random sample of 1000. It is now the company’s problem to assess whether the sample viewing proportion is consistent with the hypothesized level of viewership that the company desires, i.e. 55%. Can the difference between the two proportions be attributed to sampling error, or is the ad’s true viewership actually lower?
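The ad viewership question can be made concrete with a short sketch. It is not a method stated in the text: it assumes the usual normal approximation for a sample proportion, with the standard error computed under the hypothesized proportion of 55%. The function name is illustrative.

```python
import math
from statistics import NormalDist

def proportion_z(sample_p, hypothesized_p, n):
    """z statistic for a sample proportion against a hypothesized proportion."""
    # Standard error of the proportion if the hypothesized value were true.
    se = math.sqrt(hypothesized_p * (1 - hypothesized_p) / n)
    return (sample_p - hypothesized_p) / se

# Figures from the example: 50% observed, 55% hypothesized, sample of 1000.
z = proportion_z(sample_p=0.50, hypothesized_p=0.55, n=1000)

# Probability of a sample proportion this low or lower if viewership were 55%.
p_lower = NormalDist().cdf(z)

print(round(z, 2), p_lower)
```

A z value this far below zero suggests, under these assumptions, that sampling error alone is an unlikely explanation for the shortfall.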
We can take many such samples from a population and calculate their means and standard deviations. Given the existence of sampling variation, there is likely to be some variability in the different estimates of the mean and standard deviation. This can be explained best with the help of an example. Suppose there is a store which sells CDs. We assume it has a regular customer base. A random sample of 100 customers is taken and we find that the sample mean age of customers is 42 years, with a standard deviation of 5 years. However, this is only one of the possible samples which could have been taken. A second, different sample might have had a mean of 45 years and a standard deviation of 6 years. To change a sample we need only change one of the customers. We would expect samples taken from the same population to generate similar, if not identical, sample means. If we take repeated samples such that all possible samples are taken, we obtain a sampling distribution of means.
What does this Distribution Look Like?
Logically we can conceive that there is only one sample which contains the youngest 100 customers, and it will have the lowest sample mean. Similarly there will be a few samples containing the youngest 99 customers. These samples will have means which are slightly higher than the lowest mean. A somewhat larger number of samples will contain the youngest 98 customers, and so on. The majority of the samples will contain a cross section of all age groups, and therefore there will be a clustering of sample means around what is likely to be the true population mean. The distribution of sample means will look like the normal distribution, as shown in Figure 1 below.
Sampling distribution of sample mean values
In the next section we shall look at the theory behind statistical inference. The basis of inference remains the same irrespective of whether the managerial objective is to obtain a point or interval estimate of a population parameter or to test whether a particular hypothesis is supported by sample data.
Activities
1. What is statistical inference? What are the different types of inference?
Figure 1
2. Why do decision makers measure samples rather than entire populations? What are the disadvantages of sampling?
This result follows from the Central Limit Theorem: if we take random samples of size n from a population, the distribution of sample means approaches a normal probability distribution. The larger n is, the closer the approximation.
Theory behind Statistical Inference
We now look at the underlying theoretical basis of statistical inference. The underlying basis of statistical inference is the theory of sampling distributions.
We do not actually know what form our population distribution takes: it could be normal or it could be skewed. However it doesn’t matter, as the sampling distribution will approximate a normal distribution as long as sufficiently large samples are taken.
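The claim that the sampling distribution of the mean becomes more concentrated and closer to normal as n grows, even for a skewed population, can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the sample sizes and the number of repeats are all assumptions, not values from the text.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed population (exponential, mean about 10).
population = [random.expovariate(1 / 10) for _ in range(20_000)]

def sample_means(pop, n, repeats=2_000):
    """Return the means of `repeats` random samples of size n drawn from pop."""
    return [sum(random.sample(pop, n)) / n for _ in range(repeats)]

results = {}
for n in (5, 30, 100):
    means = sample_means(population, n)
    # Centre and spread of the simulated sampling distribution.
    results[n] = (mean(means), pstdev(means))
    print(n, round(results[n][0], 2), round(results[n][1], 2))
```

If the reasoning in the text holds, the centre of the sample means should stay near the population mean while their spread shrinks as n increases.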
Now we shall briefly review some concepts, which have been dealt with in more detail in the earlier chapter on sampling.
Normal Distribution
We now look briefly at some of the key characteristics of the normal distribution.
What is a Sample?
A sample is a representative subset of the underlying population. For each sample that is taken from a population we can calculate various sample statistics such as the mean and variance.
The normal sampling distribution can be summarized by its two statistics:
2. Hypotheses Testing
• Mean µ
• Standard deviation
Logically we can see that the mean of the sampling distribution should equal the mean of the population. The standard deviation of the sampling distribution is given by σ/√n, where σ is the population standard deviation and n is the sample size. Thus the sampling distribution of the mean can be defined in terms of its mean and standard deviation. However, we should be clear that we are talking about three different sets of statistics:
                                    Mean    Standard Deviation
Sample                              x̄       s
Population                          µ       σ
Sampling distribution of the mean   µ       σ/√n
The population distribution and the sampling distribution of the mean are illustrated in Figure 2 below. As can be seen, the sampling distribution of the sample mean is far more concentrated than the population distribution. However, both distributions have the same mean µ.
From our earlier classes we know that, irrespective of the values of µ and σ, the total area under a normal curve is 1.00. Further, specific portions of the area under the normal curve lie within plus or minus any given number of standard deviations from the mean. These results are summarized below:
• Approximately 68% of all values in a normally distributed population lie within ±1 standard deviation of the mean. Approximately 16% of the area lies outside this range on either side of the population mean. This is illustrated in Figure 3.
• Approximately 95.5% of all values in a normally distributed population lie within ±2 standard deviations of the mean. Approximately 2.25% of the area lies outside this range on either side of the population mean. This is illustrated in Figure 4.
• Approximately 99.7% of all values in a normally distributed population lie within ±3 standard deviations of the mean. Only .15% of the area under the curve lies outside this range on either side of the mean. This is illustrated in Figure 5.
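The areas quoted above can be reproduced without a printed table. The following is a minimal sketch using the standard normal distribution from Python's standard library; the function name `area_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve within ±k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    inside = area_within(k)
    tail = (1 - inside) / 2  # area outside the range on one side of the mean
    print(k, round(inside, 4), round(tail, 4))
```

The printed values should agree with the rounded percentages in the list above.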
Figures 3, 4 and 5
Standard Normal Distribution
Figure 2
Application of sampling theory concepts to confidence intervals
Once we have calculated our sample mean, we need to know where it lies in the sampling distribution of the mean in relation to the true mean of the sampling distribution, which is also the population mean. It might be higher or lower than the population mean, or it might be identical with it. While we cannot know for certain where the sample mean lies in relation to the population mean, we can use probability to assess its likely position relative to the population mean.
However, we rarely need intervals involving only one, two or three standard deviations. Statistical tables provide the areas under the normal curve that are contained within any number of standard deviations (plus or minus) from the mean. We do this by constructing the standard normal distribution. All normal distributions with mean µ and standard deviation σ can be transformed into a standard normal distribution with µ = 0 and σ = 1. This transformation is done using the z statistic, where z = (x − µ)/σ. The distribution of the z statistic is the standard normal distribution, with mean µ = 0 and standard deviation σ = 1.
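The z transformation can be sketched in a few lines. The raw-scale mean of 50 is the value used in this lesson's illustration of the raw and standard scales; the standard deviation of 10 and the raw values tried are hypothetical.

```python
def standardize(x, mu, sigma):
    """Convert a raw value x to a z value: z = (x - mu) / sigma."""
    return (x - mu) / sigma

mu = 50     # raw-scale mean from the lesson's illustration
sigma = 10  # hypothetical standard deviation, assumed for this sketch

for x in (30, 50, 65):
    print(x, standardize(x, mu, sigma))
```

The raw mean maps to z = 0, and every other raw value is expressed as a number of standard deviations above or below the mean.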
With the normal table we can determine the probability that the sample mean x̄ lies within a certain distance from the population mean. The distance from the mean is defined in terms of the number of standard deviations away from the mean. How do we do this? This follows from the result that, irrespective of the shape of the normal curve, the area under the curve within one, two, three or any given number of standard deviations of the mean is the same for all normal curves. Therefore all intervals containing the same number of standard deviations from the mean contain the same proportion of the total area under the curve. Hence we need only one standard normal distribution.
Using the Standard normal probability distribution
The figure below shows the raw scale and the standard normal transformation. The standardized normal variable is z = (x − µ)/σ. As can be seen from the figure, z actually represents a transformation, or change in measurement scale, on the horizontal axis. Thus in the raw scale µ = 50. In the standard normal scale this value is transformed to µ = 0.
Figure 5
The standard normal probability table is organized in terms of z values. It gives areas for only half of the curve. Because the distribution is symmetric, values which hold for one half of the distribution also hold for the other.
So far we have tried to understand the theory of sampling underlying confidence intervals. We now turn to defining what exactly a confidence interval is.
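The table lookup described above can be sketched as follows. The first function mimics a one-sided z table by returning the area between the mean and z; the second uses symmetry to find the probability that a sample mean lies within a given distance of the population mean. The figures are assumptions for illustration: the standard deviation of 5 years from the CD store example is treated as the population value, with n = 100 and a distance of 1 year.

```python
import math
from statistics import NormalDist

def half_table_area(z):
    """Area between the mean (z = 0) and |z|, as a one-sided z table reports it."""
    return NormalDist().cdf(abs(z)) - 0.5

def prob_mean_within(distance, population_sd, n):
    """Probability that the sample mean lies within ±distance of the population mean."""
    standard_error = population_sd / math.sqrt(n)
    z = distance / standard_error
    return 2 * half_table_area(z)  # symmetry: double the one-sided area

# The same area applies on either side of the mean.
print(round(half_table_area(2.0), 4), round(half_table_area(-2.0), 4))

# Hypothetical use with the CD store figures.
print(round(prob_mean_within(distance=1.0, population_sd=5.0, n=100), 4))
```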