Hypothesis Test

Setting up and testing hypotheses is an essential part of statistical inference. In order to formulate such a test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved, for example, claiming that a new drug is better than the current drug for treatment of the same symptoms. In each problem considered, the question of interest is simplified into two competing claims / hypotheses between which we have a choice; the null hypothesis, denoted H0, against the alternative hypothesis, denoted H1. These two competing claims / hypotheses are not however treated on an equal basis: special consideration is given to the null hypothesis. We have two common situations:

The experiment has been carried out in an attempt to disprove or reject a particular hypothesis, the null hypothesis, thus we give that one priority so it cannot be rejected unless the evidence against it is sufficiently strong. For example, H0: there is no difference in taste between coke and diet coke against H1: there is a difference.

If one of the two hypotheses is 'simpler' we give it priority so that a more 'complicated' theory is not adopted unless there is sufficient evidence against the simpler one. For example, it is 'simpler' to claim that there is no difference in flavour between coke and diet coke than it is to say that there is a difference.

The hypotheses are often statements about population parameters like expected value and variance; for example H0 might be that the expected value of the height of ten year old boys in the Scottish population is not different from that of ten year old girls. A hypothesis might also be a statement about the distributional form of a characteristic of interest, for example that the height of ten year old boys is normally distributed within the Scottish population. The outcome of a hypothesis test is "Reject H0 in favour of H1" or "Do not reject H0".

Null Hypothesis

The null hypothesis, H0, represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug. We would write H0: there is no difference between the two drugs on average. We give special consideration to the null hypothesis. This is due to the fact that the null hypothesis relates to the statement being tested, whereas the alternative hypothesis relates to the statement to be accepted if / when the null is rejected. The final conclusion once the test has been carried out is always given in terms of the null hypothesis. We either "Reject H0 in favour of H1" or "Do not reject H0"; we never conclude "Reject H1", or even "Accept H1". If we conclude "Do not reject H0", this does not necessarily mean that the null hypothesis is true, it only suggests that there is not sufficient evidence against H0 in favour of H1. Rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.

Alternative Hypothesis

The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug. We would write H1: the two drugs have different effects, on average.
The alternative hypothesis might also be that the new drug is better, on average, than the current drug. In this case we would write H1: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis. We either "Reject H0 in favour of H1" or "Do not reject H0". We never conclude "Reject H1", or even "Accept H1". If we conclude "Do not reject H0", this does not necessarily mean that the null hypothesis is true, it only suggests that there is not sufficient evidence against H0 in favour of H1. Rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.

Type I Error

In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e. H0: there is no difference between the two drugs on average. A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no difference between them.

The following table gives a summary of possible results of any hypothesis test:

Truth \ Decision    Reject H0         Don't reject H0
H0                  Type I error      Right decision
H1                  Right decision    Type II error

A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error. The hypothesis test procedure is therefore adjusted so that there is a guaranteed 'low' probability of rejecting the null hypothesis wrongly; this probability is never 0. This probability of a type I error can be precisely computed as

P(type I error) = significance level = α

The exact probability of a type II error is generally unknown. If we do not reject the null hypothesis, it may still be false (a type II error) as the sample may not be big enough to identify the falseness of the null hypothesis (especially if the truth is very close to the hypothesis). For any given set of data, type I and type II errors are inversely related; the smaller the risk of one, the higher the risk of the other. A type I error can also be referred to as an error of the first kind.

Type II Error

In a hypothesis test, a type II error occurs when the null hypothesis, H0, is not rejected when it is in fact false. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e. H0: there is no difference between the two drugs on average. A type II error would occur if it was concluded that the two drugs produced the same effect, i.e. there is no difference between the two drugs on average, when in fact they produced different ones. A type II error is frequently due to sample sizes being too small. The probability of a type II error is generally unknown, but is symbolised by β and written

P(type II error) = β

A type II error can also be referred to as an error of the second kind.
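To make the trade-off between the two error types concrete, the sketch below simulates repeated two-sided z-tests in Python and estimates how often a true H0 is wrongly rejected (the type I error rate) and how often a false H0 is correctly rejected (the power, i.e. one minus the type II error probability). All of the numbers (n = 25, σ = 1, the 5% critical value 1.96 and the alternative mean 0.5) are illustrative assumptions, not values taken from the text.

```python
from random import gauss, seed
from statistics import mean

seed(1)
n, sigma, z_crit, reps = 25, 1.0, 1.96, 10_000   # assumed settings for a two-sided test at the 5% level

def rejects_h0(true_mu):
    sample = [gauss(true_mu, sigma) for _ in range(n)]
    z = (mean(sample) - 0) / (sigma / n ** 0.5)   # test statistic under H0: mu = 0
    return abs(z) > z_crit

type_i_rate = sum(rejects_h0(0.0) for _ in range(reps)) / reps   # close to alpha = 0.05
power = sum(rejects_h0(0.5) for _ in range(reps)) / reps         # 1 - P(type II error) when mu = 0.5
print(type_i_rate, power)
```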
Test Statistic

A test statistic is a quantity calculated from our sample of data. Its value is used to decide whether or not the null hypothesis should be rejected in our hypothesis test. The choice of a test statistic will depend on the assumed probability model and the hypotheses under question.

Critical Value(s)

The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected. The critical value for any hypothesis test depends on the significance level at which the test is carried out, and whether the test is one-sided or two-sided.

Critical Region

The critical region CR, or rejection region RR, is a set of values of the test statistic for which the null hypothesis is rejected in a hypothesis test. That is, the sample space for the test statistic is partitioned into two regions; one region (the critical region) will lead us to reject the null hypothesis H0, the other will not. So, if the observed value of the test statistic is a member of the critical region, we conclude "Reject H0"; if it is not a member of the critical region then we conclude "Do not reject H0".
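As a minimal illustration of these ideas, the following Python sketch finds the critical value for a two-sided z-test at an assumed 5% significance level and checks whether a hypothetical observed test statistic falls in the critical region; the numbers are invented for illustration only.

```python
from statistics import NormalDist

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for a two-sided test
z_observed = 2.31                                  # hypothetical value of the test statistic

in_critical_region = abs(z_observed) > z_critical
print("Reject H0" if in_critical_region else "Do not reject H0")
```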
Significance Level

The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis H0, if it is in fact true. It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error. That is, we want to make the significance level as small as possible in order to protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently making false claims. The significance level is usually denoted by α (alpha):

Significance Level = P(type I error) = α

Usually, the significance level is chosen to be 0.05 (or equivalently, 5%).

P-Value

The probability value (p-value) of a statistical hypothesis test is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis H0, is true. It is the probability of wrongly rejecting the null hypothesis if it is in fact true. It is equal to the significance level of the test for which we would only just reject the null hypothesis. The p-value is compared with the actual significance level of our test and, if it is smaller, the result is significant. That is, if the null hypothesis were to be rejected at the 5% significance level, this would be reported as "p < 0.05". Small p-values suggest that the null hypothesis is unlikely to be true. The smaller it is, the more convincing is the rejection of the null hypothesis. It indicates the strength of evidence for say, rejecting the null hypothesis H0, rather than simply concluding "Reject H0" or "Do not reject H0".

One-sided Test

A one-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution. In other words, the critical region for a one-sided test is the set of values less than the critical value of the test, or the set of values greater than the critical value of the test. A one-sided test is also referred to as a one-tailed test of significance. The choice between a one-sided and a two-sided test is determined by the purpose of the investigation or prior reasons for using a one-sided test. Suppose we wanted to test a manufacturer's claim that there are, on average, 50 matches in a box. We could set up the following hypotheses: H0: µ = 50, against H1: µ < 50 or H1: µ > 50. Either of these two alternative hypotheses would lead to a one-sided test. Presumably, we would want to test the null hypothesis against the first alternative hypothesis since it would be useful to know if there are likely to be fewer than 50 matches, on average, in a box (no one would complain if they get the correct number of matches in a box or more). Yet another alternative hypothesis could be tested against the same null, leading this time to a two-sided test: H0: µ = 50, against H1: µ not equal to 50. Here, nothing specific can be said about the average number of matches in a box; only that, if we could reject the null hypothesis in our test, we would know that the average number of matches in a box is likely to be less than or greater than 50.

Two-Sided Test

A two-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution. In other words, the critical region for a two-sided test is the set of values less than a first critical value of the test and the set of values greater than a second critical value of the test.
A two-sided test is also referred to as a two-tailed test of significance. The choice between a one-sided test and a two-sided test is determined by the purpose of the investigation or prior reasons for using a one-sided test. Suppose we wanted to test a manufacturer's claim that there are, on average, 50 matches in a box. We could set up the following hypotheses: H0: µ = 50, against H1: µ < 50 or H1: µ > 50. Either of these two alternative hypotheses would lead to a one-sided test. Presumably, we would want to test the null hypothesis against the first alternative hypothesis since it would be useful to know if there are likely to be fewer than 50 matches, on average, in a box (no one would complain if they get the correct number of matches in a box or more). Yet another alternative hypothesis could be tested against the same null, leading this time to a two-sided test: H0: µ = 50, against H1: µ not equal to 50. Here, nothing specific can be said about the average number of matches in a box; only that, if we could reject the null hypothesis in our test, we would know that the average number of matches in a box is likely to be less than or greater than 50.

Chi-Squared Test of Homogeneity

On occasion it might happen that there are several proportions in a sample of data to be tested simultaneously. An even more complex situation arises when the several populations have all been classified according to the same variable. We generally do not expect an equality of proportions for all the classes of all the populations. We do however, quite often need to test whether the proportions for each class are equal across all populations and whether this is true for each class. If this proves to be the case, we say the populations are homogeneous with respect to the variable of classification. The test used for this purpose is the Chi-Squared Test of Homogeneity, with hypotheses: H0: the populations are homogeneous with respect to the variable of classification, against H1: the populations are not homogeneous.
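As a rough sketch of how such a test might be carried out in practice, the example below uses the chi2_contingency function from SciPy (assumed to be available) on an entirely made-up table of counts, with one row per population and one column per class of the classification variable.

```python
from scipy.stats import chi2_contingency

# Made-up counts: three populations classified into the same three classes.
observed = [
    [30, 15, 5],   # population A
    [25, 20, 5],   # population B
    [20, 20, 10],  # population C
]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, df = {dof}, p-value = {p_value:.3f}")
# A small p-value (e.g. below 0.05) would lead us to reject H0 that the populations are homogeneous.
```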
Discrete Data

A set of data is said to be discrete if the values / observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Nominal Data

A set of data is said to be nominal if the values / observations belonging to it can be assigned a code in the form of a number where the numbers are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0, females as 1; marital status of an individual could be coded as Y if married, N if single.

Ordinal Data

A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data. The categories for an ordinal set of data have a natural order, for example, suppose a group of people were asked to taste varieties of biscuit and classify each biscuit on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A rating of 5 indicates more enjoyment than a rating of 4, for example, so such data are ordinal. However, the distinction between neighbouring points on the scale is not necessarily always the same. For instance, the difference in enjoyment expressed by giving a rating of 2 rather than 1 might be much less than the difference in enjoyment expressed by giving a rating of 4 rather than 3.

Interval Scale
An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but can not be meaningfully multiplied or divided. For example, the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The zero point, year 1 AD, is arbitrary; time did not begin then. Other examples of interval scales include the heights of tides, and the measurement of longitude.

Continuous Data

A set of data is said to be continuous if the values / observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example height, weight, temperature, the amount of sugar in an orange, the time required to run a mile.

Frequency Table

A frequency table is a way of summarising a set of data. It is a record of how often each value (or set of values) of the variable in question occurs. It may be enhanced by the addition of percentages that fall into each category. A frequency table is used to summarise categorical, nominal, and ordinal data. It may also be used to summarise continuous data once the data set has been divided up into sensible groups. When we have more than one categorical variable in our data set, a frequency table is sometimes called a contingency table because the figures found in the rows are contingent upon (dependent upon) those found in the columns.

Pie Chart

A pie chart is a way of summarising a set of categorical data. It is a circle which is divided into segments. Each segment represents a particular category. The area of each segment is proportional to the number of cases in that category.

Bar Chart

A bar chart is a way of summarising a set of categorical data. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It displays the data using a number of rectangles, of the same width, each of which represents a particular category. The length (and hence area) of each rectangle is proportional to the number of cases in the category it represents, for example, age group, religious affiliation.

Dot Plot

A dot plot is a way of summarising data, often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. For nominal or ordinal data, a dot plot is similar to a bar chart, with the bars replaced by a series of dots. Each dot represents a fixed number of individuals. For continuous data, the dot plot is similar to a histogram, with the rectangles replaced by dots. A dot plot can also help detect any unusual observations (outliers), or any gaps in the data set.
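A frequency table of the kind described above can be produced in a few lines of Python; the blood-group values below are made up for illustration.

```python
from collections import Counter

blood_groups = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "B"]   # hypothetical sample
freq = Counter(blood_groups)
total = len(blood_groups)
for group, count in freq.most_common():
    print(f"{group:>2}: {count:2d} ({100 * count / total:.0f}%)")    # count and percentage per category
```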
Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height. The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (>100 observations), when stem and leaf plots become tedious to construct. A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.

Scatterplot

A scatterplot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line. It gives a good visual picture of the relationship between the two variables, and aids the interpretation of the correlation coefficient or regression model.

Skewness

Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side. For skewed data, the usual measures of location will give different values, for example, mode < median < mean would indicate positive (or right) skewness. Positive (or right) skewness is more common than negative (or left) skewness. If there is evidence of skewness in the data, we can apply transformations, for example, taking logarithms of positively skewed data.

Sample Mean

The sample mean is an estimator available for estimating the population mean µ. It is a measure of location, commonly called the average, often symbolised x̄. Its value depends equally on all of the data which may include outliers. It may not appear representative of the central region for skewed data sets. It is especially useful as being representative of the whole sample for use in subsequent calculations.

Median

The median is the value halfway through the ordered data set, below and above which there lies an equal number of data values. It is generally a good descriptive measure of the location which works well for skewed data, or data with outliers. The median is the 0.5 quantile.

Mode

The mode is the most frequently occurring value in a set of discrete data. There can be more than one mode if two or more values are equally common.

Range

The range of a sample (or a data set) is a measure of the spread or the dispersion of the observations. It is the difference between the largest and the smallest observed value of some quantitative characteristic and is very easy to calculate. A great deal of information is ignored when computing the range since only the largest and the smallest data values are considered; the remaining data are ignored. The range value of a data set is greatly influenced by the presence of just one unusually large or small value in the sample (outlier).

Inter-Quartile Range

The inter-quartile range is a measure of the spread of or dispersion within a data set. It is calculated by taking the difference between the upper and the lower quartiles. The IQR is the width of an interval which contains the middle 50% of the sample, so it is smaller than the range and its value is less affected by outliers.

Quantiles

Quantiles are a set of 'cut points' that divide a sample of data into groups containing (as far as possible) equal numbers of observations.

Percentiles

Percentiles are values that divide a sample of data into one hundred groups containing (as far as possible) equal numbers of observations. For example, 30% of the data values lie below the 30th percentile.

Quartiles

Quartiles are values that divide a sample of data into four groups containing (as far as possible) equal numbers of observations. A data set has three quartiles. References to quartiles often relate to just the outer two, the upper and the lower quartiles; the second quartile being equal to the median.
The lower quartile is the data value a quarter way up through the ordered data set; the upper quartile is the data value a quarter way down through the ordered data set.

Sample Variance

Sample variance is a measure of the spread of or dispersion within a set of sample data. The sample variance is the sum of the squared deviations from their average divided by one less than the number of observations in the data set. For example, for n observations x1, x2, x3, ..., xn with sample mean x̄, the sample variance is

s² = ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / (n − 1)

Standard Deviation

Standard deviation is a measure of the spread or dispersion of a set of data. It is calculated by taking the square root of the variance and is symbolised by s.d. or s. In other words,

s = √( ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / (n − 1) )

The more widely the values are spread out, the larger the standard deviation. For example, say we have two separate lists of exam results from a class of 30 students; one ranges from 31% to 98%, the other from 82% to 93%, then the standard deviation would be larger for the results of the first exam.
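The following Python sketch computes the measures of location and spread defined above for a small made-up sample, using the standard library statistics module (which uses the n − 1 divisor for variance and standard deviation, as in the definitions given here).

```python
import statistics as st

data = [12, 15, 15, 18, 21, 22, 22, 22, 30, 45]   # illustrative sample

mean, median, mode = st.mean(data), st.median(data), st.mode(data)
data_range = max(data) - min(data)
q1, q2, q3 = st.quantiles(data, n=4)   # the three quartiles; q2 equals the median
iqr = q3 - q1                          # inter-quartile range
s2, s = st.variance(data), st.stdev(data)
print(mean, median, mode, data_range, iqr, round(s2, 2), round(s, 2))
```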
Pearson's Product Moment Correlation Coefficient

Pearson's product moment correlation coefficient, usually denoted by r, is one example of a correlation coefficient. It is a measure of the linear association between two variables that have been measured on interval or ratio scales, such as the relationship between height in inches and weight in pounds. However, it can be misleadingly small when there is a relationship between the variables but it is a non-linear one. There are procedures, based on r, for making inferences about the population correlation coefficient. However, these make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.

Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient is one example of a correlation coefficient. It is usually calculated on occasions when it is not convenient, economic, or even possible to give actual values to variables, but only to assign a rank order to instances of each variable. It may also be a better indicator that a relationship exists between two variables when the relationship is non-linear. Commonly used procedures, based on the Pearson's Product Moment Correlation Coefficient, for making inferences about the population correlation coefficient make the implicit assumption that the two variables are jointly normally distributed. When this assumption is not justified, a non-parametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate.

Regression Equation

A regression equation allows us to express the relationship between two (or more) variables algebraically. It indicates the nature of the relationship between two (or more) variables. In particular, it indicates the extent to which you can predict some variables by knowing others, or the extent to which some are associated with others.
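As a brief illustration of these three ideas together, the sketch below computes Pearson's r, Spearman's rank correlation and a simple regression equation for a small invented data set, assuming SciPy is available.

```python
from scipy import stats

# Hypothetical paired observations: height in inches and weight in pounds.
height = [63, 65, 66, 68, 70, 72, 74]
weight = [127, 135, 139, 152, 160, 172, 181]

r, _ = stats.pearsonr(height, weight)       # linear association
rho, _ = stats.spearmanr(height, weight)    # rank-based association
fit = stats.linregress(height, weight)      # regression equation: weight ≈ intercept + slope * height

print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
print(f"weight ≈ {fit.intercept:.1f} + {fit.slope:.2f} * height")
```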
Event

An event is any collection of outcomes of an experiment. Formally, any subset of the sample space is an event. Any event which consists of a single outcome in the sample space is called an elementary or simple event. Events which consist of more than one outcome are called compound events.

Relative Frequency

Relative frequency is another term for proportion; it is the value calculated by dividing the number of times an event occurs by the total number of times an experiment is carried out. The probability of an event can be thought of as its long-run relative frequency when the experiment is carried out many times. If an experiment is repeated n times, and event E occurs r times, then the relative frequency of the event E is defined to be rfn(E) = r/n.

Probability

A probability provides a quantitative description of the likely occurrence of a particular event. Probability is conventionally expressed on a scale from 0 to 1; a rare event has a probability close to 0, a very common event has a probability close to 1. The probability of an event has been defined as its long-run relative frequency. It has also been thought of as a personal degree of belief that a particular event will occur (subjective probability). In some experiments, all outcomes are equally likely. For example if you were to choose one winner in a raffle from a hat, all raffle ticket holders are equally likely to win, that is, they have the same probability of their ticket being chosen.

Statistical Inference

Statistical Inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

Population

A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about. In order to make any generalisations about a population, a sample, that is meant to be representative of the population, is often studied. For each population there are many possible samples. A sample statistic gives information about a corresponding population parameter. For example, the sample mean for a set of data would give information about the overall population mean. It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.

Sample

A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group. A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling. Also, before collecting the sample, it is important that the researcher carefully and completely defines the population, including a description of the members to be included.

Statistic

A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn. It is possible to draw more than one sample from the same population and the value of a statistic will in general vary from sample to sample. For example, the average value in a sample is a statistic. The average values in more than one sample, drawn from the same population, will not necessarily be equal. Statistics are often assigned Roman letters (e.g. m and s), whereas the equivalent unknown values in the population (parameters) are assigned Greek letters (e.g. µ and σ).

Random Sampling

Random sampling is a sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has a known, but possibly non-equal, chance of being included in the sample. By using random sampling, the likelihood of bias is reduced.

Simple Random Sampling

Simple random sampling is the basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection; i.e. each member of the population is equally likely to be chosen at any stage in the sampling process.

Stratified Sampling

There may often be factors which divide up the population into sub-populations (groups / strata) and we may expect the measurement of interest to vary among the different sub-populations. This has to be accounted for when we select a sample from the population in order that we obtain a sample that is representative of the population. This is achieved by stratified sampling. A stratified sample is obtained by taking samples from each stratum or sub-group of a population. When we sample a population with several strata, we generally require that the proportion of each stratum in the sample should be the same as in the population. Stratified sampling techniques are generally used when the population is heterogeneous, or dissimilar, where certain homogeneous, or similar, sub-populations can be isolated (strata). Simple random sampling is most appropriate when the entire population from which the sample is taken is homogeneous. Some reasons for using stratified sampling over simple random sampling are: the cost per observation in the survey may be reduced; estimates of the population parameters may be wanted for each sub-population; increased accuracy at given cost.
Suppose a farmer wishes to work out the average milk yield of each cow type in his herd, which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide up his herd into the four sub-groups and take samples from these.

Cluster Sampling

Cluster sampling is a sampling technique where the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample. Cluster sampling is typically used when the researcher cannot get a complete list of the members of a population they wish to study but can get a complete list of groups or 'clusters' of the population. It is also used when a random sample would produce a list of subjects so widely scattered that surveying them would prove to be far too expensive, for example, people who live in different postal districts in the UK. This sampling technique may well be more practical and/or economical than simple random sampling or stratified sampling. Suppose that the Department of Agriculture wishes to investigate the use of pesticides by farmers in England. A cluster sample could be taken by identifying the different counties in England as clusters. A sample of these counties (clusters) would then be chosen at random, so all farmers in those counties selected would be included in the sample. It can be seen here then that it is easier to visit several farmers in the same county than it is to travel to each farm in a random sample to observe the use of pesticides.

Quota Sampling

Quota sampling is a method of sampling widely used in opinion polling and market research. Interviewers are each given a quota of subjects of specified type to attempt to recruit. For example, an interviewer might be told to go out and select 20 adult men and 20 adult women, 10 teenage girls and 10 teenage boys so that they could interview them about their television viewing.
It suffers from a number of methodological flaws, the most basic of which is that the sample is not a random sample and therefore the sampling distributions of any statistics are unknown.
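To show the difference between simple random sampling and stratified sampling in code, the sketch below echoes the milk-yield example, using three of the breeds and made-up herd sizes; only the standard library random module is used.

```python
import random

random.seed(0)
# Hypothetical herd: cow identifiers grouped by breed (sizes are invented).
herd = {
    "Ayrshire": list(range(0, 60)),
    "Friesian": list(range(60, 90)),
    "Jersey": list(range(90, 100)),
}
population = [cow for cows in herd.values() for cow in cows]

# Simple random sample of 10: every cow has the same chance of selection.
srs = random.sample(population, 10)

# Stratified sample of 10: each breed is sampled in proportion to its share of the herd.
stratified = []
for breed, cows in herd.items():
    k = round(10 * len(cows) / len(population))
    stratified.extend(random.sample(cows, k))

print(sorted(srs))
print(sorted(stratified))
```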
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the natural and social sciences to the humanities, and to government and business. Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics comprise applied statistics. There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject. In applying statistics to a scientific, industrial, or societal problem, one begins with a process or population to be studied. This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period. It may instead be a process observed at various times; data collected about this kind of "population" constitute what is called a time series. For practical reasons, rather than compiling data about an entire population, one usually studies a chosen subset of the population, called a sample. Data are collected about the sample in an observational or experimental setting. The data are then subjected to statistical analysis, which serves two related purposes: description and inference. Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample. Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs. Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining. “… it is only the manipulation of uncertainty that interests us. We are not concerned with the matter that is uncertain. Thus we do not study the mechanism of rain; only whether it will rain.” The concept of correlation is particularly noteworthy. Statistical analysis of a data set may reveal that two variables (that is, two properties of the population under consideration) tend to vary together, as if they are connected. For example, a study of annual income and age of death among people might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated. However, one cannot immediately infer the existence of a causal relationship between the two variables (see Correlation does not imply causation). The correlated phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable. If the sample is representative of the population, then inferences and conclusions made from the sample can be extended to the population as a whole. A major problem lies in determining the extent to which the chosen sample is representative. 
Statistics offers methods to estimate and correct for randomness in the sample and in the data collection procedure, as well as methods for designing robust experiments in the first place. The fundamental mathematical concept employed in understanding such randomness is probability. Mathematical statistics (also called statistical theory) is the branch of applied mathematics that uses probability theory and analysis to examine the theoretical basis of statistics. The use of any statistical method is valid only when the system or population under consideration satisfies the basic mathematical assumptions of the method. Misuse of statistics can produce subtle but serious errors in description and interpretation: subtle in that even experienced professionals sometimes make such errors, and serious in that they may affect social policy, medical practice and the reliability of structures such as bridges and nuclear power plants. Even when statistics is correctly applied, the results can be difficult to interpret for a non-expert. For example, the statistical significance of a trend in the data, which measures the extent to which the trend could be caused by random variation in the sample, may not agree with one's intuitive sense of its significance. The set of basic statistical skills (and skepticism) needed by people to deal with information in their everyday lives is referred to as statistical literacy.

Under a widely used classification scheme for levels of measurement, the kinds of descriptive statistics and significance tests that are appropriate depend on the level of measurement of the variables concerned:

Level       Statistical measures                        Relation or operation                   Mathematical structure
nominal     mode                                        equality (=)                            set
ordinal     median                                      order (<)                               totally ordered set
interval    mean, standard deviation                    subtraction (-) and weighted average    affine line
ratio       geometric mean, coefficient of variation    addition (+) and multiplication (×)     field

Nominal measurement In this type of measurement, names are assigned to objects as labels. This assignment is performed by evaluating, by some procedure, the similarity of the to-be-measured instance to each of a set of named exemplars or category definitions. The name of the most similar named exemplar or definition in the set is the "value" assigned by nominal measurement to the given instance. If two instances have the same name associated with them, they belong to the same category, and that is the only significance that nominal measurements have. Variables that are measured only nominally are also called categorical variables.

Nominal numbers For practical data processing the names may be numerals, but in that case the numerical value of these numerals is irrelevant, and the concept is now sometimes referred to as nominal number. The only comparisons that can be made between variable values are equality and inequality. There are no "less than" or "greater than" relations among the classifying names, nor operations such as addition or subtraction. In social research, variables measured at a nominal level include gender, marital status, race, religious affiliation, political party affiliation, college major, and birthplace. Other examples include: geographical location in a country represented by that country's international telephone access code, or the make or model of a car.

Statistical measures The only kind of measure of central tendency is the mode; median and mean cannot be defined.
Statistical dispersion may be measured with various indices of qualitative variation, but no notion of standard deviation exists.

Ordinal measurement In this classification, the numbers assigned to objects represent the rank order (1st, 2nd, 3rd etc.) of the entities measured. The numbers are called ordinals. The variables are called ordinal variables or rank variables. Comparisons of greater and less can be made, in addition to equality and inequality. However, operations such as conventional addition and subtraction are still meaningless. Examples include the Mohs scale of mineral hardness; the results of a horse race, which say only which horses arrived first, second, third, etc. but no time intervals; and many measurements in psychology and other social sciences, for example attitudes like preference, conservatism or prejudice and social class.

Statistical measures The central tendency of an ordinally measured variable can be represented by its mode or its median, but mean cannot be defined. One can define quantiles, notably quartiles and percentiles, together with maximum and minimum, but no new measures of statistical dispersion beyond the nominal ones can be defined: one cannot define the range and interquartile range, since one cannot subtract quantities.

Interval measurement The numbers assigned to objects have all the features of ordinal measurements, and in addition equal differences between measurements represent equivalent intervals. That is, differences between arbitrary pairs of measurements can be meaningfully compared. Operations such as averaging and subtraction are therefore meaningful, but addition is not, and a zero point on the scale is arbitrary; negative values can be used. The formal mathematical
term is an affine space (in this case an affine line). Variables measured at the interval level are called interval variables, or sometimes scaled variables, as they have a notion of units of measurement, though the latter usage is not obvious and is not recommended. Ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out directly. But ratios of differences can be expressed; for example, one difference can be twice another. Examples of interval measures are the year date in many calendars, and temperature in Celsius scale or Fahrenheit scale; temperature in the Kelvin scale is a ratio measurement, however. Statistical measures The central tendency of a variable measured at the interval level can be represented by its mode, its median, or its arithmetic mean. Statistical dispersion can be measured in most of the usual ways, which just involved differences or averaging, such as range, interquartile range, and standard deviation. Since one cannot divide, one cannot define measures that require a ratio, such as studentized range or coefficient of variation. More subtly, while one can define moments about the origin, only central moments are useful, since the choice of origin is arbitrary and not meaningful. One can define standardized moments, since ratios of differences are meaningful, but one cannot define coefficient of variation, since the mean is a moment about the origin, unlike the standard deviation, which is (the square root of) a central moment. Ratio measurement The numbers assigned to objects have all the features of interval measurement and also have meaningful ratios between arbitrary pairs of numbers. Operations such as multiplication and division are therefore meaningful. The zero value on a ratio scale is non-arbitrary. Variables measured at the ratio level are called ratio variables. Most physical quantities, such as mass, length or energy are measured on ratio scales; so is temperature measured in kelvins, that is, relative to absolute zero. Social variables of ratio measure include age, length of residence in a given place, number of organizations belonged to or number of church attendances in a particular time. Statistical measures All statistical measures can be used for a variable measured at the ratio level, as all necessary mathematical operations are defined. The central tendency of a variable measured at the ratio level can be represented by, in addition to its mode, its median, or its arithmetic mean, also its geometric mean. In addition to the measures of statistical dispersion defined for interval variables, such as range and standard deviation, for ratio variables one can also define measures that require a ratio, such as studentized range or coefficient of variation. In a ratio variable, unlike in an interval variable, the moments about the origin are meaningful, since the origin is not arbitrary. "True measurement" The interval and ratio measurement levels are sometimes collectively called "true measurement", although it has been argued that this usage reflects a lack of understanding of the uses of ordinal measurement. Only ratio or interval scales can correctly be said to have units of measurement. In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the MCV or PMCC) (r) is a common measure of the correlation between two variables X and Y. When measured in a population the Pearson Product Moment correlation is designated by the Greek letter rho (ρ). 
When computed in a sample, it is designated by the letter "r" and is sometimes called "Pearson's r." Pearson's correlation reflects the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect positive linear relationship between variables. A correlation of -1 means that there is a perfect negative linear relationship between variables. A correlation of 0 means there is no linear relationship between the two variables. In practice, sample correlations are rarely exactly 0, +1 or -1; the sign of the computed value indicates whether the relationship is positive or negative, and its magnitude indicates the strength of the linear relationship.

Spearman's rank correlation coefficient, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of correlation – that is, it assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. Unlike the Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient does not require the assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales; it can be used for variables measured at the ordinal level. However, Spearman's rho does assume that subsequent ranks indicate equi-distant positions on the variable measured. For example, using Spearman's rho for Likert scales often used in psychology, sociology, biology and related disciplines assumes that the (psychologically) "felt distances" between scale points are the same for all adjacent points of the Likert scale used.

A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing just answers the question of "how well the findings fit the possibility that chance factors alone might be responsible."[1] This is done by asking and answering a hypothetical question. One use is deciding whether experimental results contain enough information to cast doubt on conventional wisdom.

As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects in a suitcase. We can then calculate how likely it is that the null hypothesis produces 10 counts per minute. If it is likely, for example if the null hypothesis predicts on average 9 counts per minute and a standard deviation of 1 count per minute, we say that the suitcase is compatible with the null hypothesis (which does not imply that there is no radioactive material; we simply cannot tell from this measurement); on the other hand, if the null hypothesis predicts for example 1 count per minute and a standard deviation of 1 count per minute, then the suitcase is not compatible with the null hypothesis and other factors are likely responsible for the measurements.

The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis is a conjecture that exists solely to be falsified by the sample. Statistical significance is a possible finding of the test - that the sample is unlikely to have occurred by chance given the truth of the null hypothesis. The name of the test describes its formulation and its possible outcome.
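A rough calculation for the suitcase example above can be sketched as follows, treating the count rate as approximately normal purely for illustration (real counts are discrete; the means and standard deviations are the ones assumed in the text).

```python
from statistics import NormalDist

observed = 10   # counts per minute measured from the suitcase

# Null hypothesis: ambient radiation only, predicting on average 9 counts per minute (sd 1).
p_upper = 1 - NormalDist(mu=9, sigma=1).cdf(observed)
print(f"P(10 or more | mean 9) = {p_upper:.3f}")    # fairly likely: compatible with H0

# If the null hypothesis instead predicted 1 count per minute (sd 1), 10 counts would be implausible.
p_upper_low = 1 - NormalDist(mu=1, sigma=1).cdf(observed)
print(f"P(10 or more | mean 1) = {p_upper_low:.2e}")
```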
One characteristic of the test is its crisp decision: reject or do not reject (which is not the same as accept). A calculated value is compared to a threshold. One may be faced with the problem of making a definite decision with respect to an uncertain hypothesis which is known only through its observable consequences. A statistical hypothesis test, or more briefly, hypothesis test, is an algorithm to choose between the alternatives (for or against the hypothesis) which minimizes certain risks.

The treatment described here is the commonly used frequentist treatment of hypothesis testing. From the Bayesian point of view, it is appropriate to treat hypothesis testing as a special case of normative decision theory (specifically a model selection problem) and it is possible to accumulate evidence in favor of (or against) a hypothesis using concepts such as likelihood ratios known as Bayes factors.

There are several preparations we make before we observe the data. The null hypothesis must be stated in mathematical/statistical terms that make it possible to calculate the probability of possible samples assuming the hypothesis is correct. For example: The mean response to treatment being tested is equal to the mean response to the placebo in the control group. Both responses have the normal distribution with this unknown mean and the same known standard deviation ... (value).

A test statistic must be chosen that will summarize the information in the sample that is relevant to the hypothesis. In the example given above, it might be the numerical difference between the two sample means, m1 − m2. The distribution of the test statistic is used to calculate the probability of sets of possible values (usually an interval or union of intervals). In this example, the difference between sample means would have a normal distribution with a standard deviation equal to the common standard deviation times the factor √(1/n1 + 1/n2), where n1 and n2 are the sample sizes.
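A minimal sketch of this test statistic, with invented summary figures for the treatment and placebo groups and an assumed known common standard deviation, might look as follows.

```python
from math import sqrt

m1, n1 = 5.2, 40   # treatment group: sample mean and sample size (invented)
m2, n2 = 4.6, 35   # placebo group: sample mean and sample size (invented)
sigma = 1.5        # assumed known common standard deviation

standard_error = sigma * sqrt(1 / n1 + 1 / n2)
z = (m1 - m2) / standard_error
print(f"z = {z:.2f}")   # compare against the critical value chosen for the test
```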
Among all the sets of possible values, we must choose one that we think represents the most extreme evidence against the hypothesis. That is called the critical region of the test statistic. The probability of the test statistic falling in the critical region when the null hypothesis is correct, is called the alpha value (or size) of the test. The probability that a sample falls in the critical region when the parameter is θ, where θ is for the alternative hypothesis, is called the power of the test at θ. The power function of a critical region is the function that maps θ to the power of θ. After the data are available, the test statistic is calculated and we determine whether it is inside the critical region. If the test statistic is inside the critical region, then our conclusion is one of the following: Reject the null hypothesis. (Therefore the critical region is sometimes called the rejection region, while its complement is the acceptance region.) An event of probability less than or equal to alpha has occurred. The researcher has to choose between these logical alternatives. In the example we would say: the observed response to treatment is statistically significant. If the test statistic is outside the critical region, the only conclusion is that there is not enough evidence to reject the null hypothesis. This is not the same as evidence in favor of the null hypothesis. That we cannot obtain using these arguments, since lack of evidence against a hypothesis is not evidence for it. On this basis, statistical research progresses by eliminating error, not by finding the truth. Simple hypothesis - Any hypothesis which specifies the population distribution completely. Composite hypothesis - Any hypothesis which does not specify the population distribution completely. Statistical test- A decision function that takes its values in the set of hypotheses. Region of acceptance - The set of values for which we fail to reject the null hypothesis. Region of rejection / Critical region - The set of values of the test statistic for which the null hypothesis is rejected. Power of a test (1-β) - The test's probability of correctly rejecting the null hypothesis. The complement of the false negative rate Size / Significance level of a test (α) - For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis. The false positive rate. For composite hypotheses this is the upper bound of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. Most powerful test - For a given size or significance level, the test with the greatest power. In statistics, a result is called 'statistically significant' if it is unlikely to have occurred by chance. "A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important or significant in the common meaning of the word. The significance level of a test is a traditional frequentist statistical hypothesis testing concept. In simple cases, it is defined as the probability of making a decision to reject the null hypothesis when the null hypothesis is actually true (a decision known as a Type I error, or "false positive determination"). The decision is often made using the p-value: if the p-value is less than the significance level, then the null hypothesis is rejected. The smaller the p-value, the more significant the result is said to be. 
In more complicated, but practically important cases, the significance level of a test is a probability such that the probablility of making a decision to reject the null hypothesis when the null hypothesis is actually true is no more than the stated probability. This allows for those applications where the probability of deciding to reject may be much smaller than the significance level for some sets of assumptions encompassed within the null hypothesis. The significance level is usually represented by the Greek symbol, α (alpha). Popular levels of significance are 5%, 1% and 0.1%. If a test of significance gives a p-value lower than the α-level, the null hypothesis is rejected. Such results are informally referred to as 'statistically significant'. For example, if someone argues that "there's only one chance in a thousand this could have happened by coincidence," a 0.1% level of statistical significance is being implied. The lower the significance level, the stronger the evidence. Different α-levels have different advantages and disadvantages. Smaller α-levels give greater confidence in the determination of significance, but run greater risks of failing to reject a false null hypothesis (a Type II error, or "false negative determination"), and so have less statistical power. The selection of an α-level inevitably involves a compromise between significance and power, and consequently between the Type I error and the Type II error. Fixed significance levels such as those mentioned above may be regarded as useful in exploratory data analyses. However, modern statistical advice is that, where the outcome of a test is essentially the final outcome of an experiment or other study, the p-value should be quoted explicitly. And, importantly, it should be quoted whether or not the p-value is judged to be significant. This is to allow maximum information to be transfered from a summary of the study into meta-analyses. In statistics, a null hypothesis (H0) is a hypothesis set up to be nullified or refuted in order to support an alternative hypothesis. When used, the null hypothesis is presumed true until statistical evidence, in the form of a hypothesis test, indicates otherwise — that is, when the researcher has a certain degree of confidence, usually 95% to 99%, that the data does not support the null hypothesis. It is possible for an experiment to fail to reject the null hypothesis. It is also possible that both the null hypothesis and the alternate hypothesis are rejected if there are more than those two possibilities. In scientific and medical applications, the null hypothesis plays a major role in testing the significance of differences in treatment and control groups. The assumption at the outset of the experiment is that no difference exists between the two groups (for the variable being compared): this is the null hypothesis in this instance. Other types of null hypothesis may be, for example, that: values in samples from a given population can be modelled using a certain family of statistical distributions. the variability of data in different groups is the same, although they may be centred around different values. The alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses whose likelihoods are compared by a statistical hypothesis test. Usually the alternative hypothesis is the possibility that an observed effect is genuine and the null hypothesis is the rival possibility that it has resulted from random chance. 
The classical (or frequentist) approach is to calculate the probability that the observed effect (or one more extreme) will occur if the null hypothesis is true. If this value (sometimes called the "p-value") is small then the result is called statistically significant and the null hypothesis is rejected in favour of the alternative hypothesis. If not, then the null hypothesis is not rejected. Incorrectly rejecting the null hypothesis is a Type I error; incorrectly failing to reject it is a Type II error.

The standard deviation of a probability distribution, random variable, or population or multiset of values is a measure of the spread of its values. The standard deviation is usually denoted with the letter σ (lower case sigma). It is defined as the square root of the variance. To understand standard deviation, keep in mind that variance is the average of the squared differences between data points and the mean. Variance is tabulated in units squared. Standard deviation, being the square root of that quantity, therefore measures the spread of data about the mean, measured in the same units as the data. Stated more formally, the standard deviation is the root mean square (RMS) deviation of values from their arithmetic mean. For example, in the population {4, 8}, the mean is 6 and the deviations from mean are {−2, 2}. Those deviations squared are {4, 4}, the average of which (the variance) is 4. Therefore, the standard deviation is 2. In this case 100% of the values in the population are within one standard deviation of the mean. The standard deviation is the most common measure of statistical dispersion, measuring how widely spread the values in a data set are. If many data points are close to the mean, then the standard deviation is small; if many data points are far from the mean, then the standard deviation is large. If all the data values are equal, then the standard deviation is zero. For a population, the standard deviation can be estimated by a modified standard deviation (s) of a sample, computed as the square root of the sum of squared deviations from the sample mean divided by n − 1 (the sample variance defined earlier).

In probability theory and statistics, the variance of a random variable, probability distribution, or sample is one measure of statistical dispersion, averaging the squared distance of its possible values from the expected value (mean). Whereas the mean is a way to describe the location of a distribution, the variance is a way to capture its scale or degree of being spread out. The unit of variance is the square of the unit of the original variable. The square root of the variance, called the standard deviation, has the same units as the original variable and can be easier to interpret for this reason.
The variance of a real-valued random variable is its second central moment, and it also happens to be its second cumulant. Just as some distributions do not have a mean, some do not have a variance as well. The mean exists whenever the variance exists, but not vice versa. Type I error, also known as an "error of the first kind", an α error, or a "false positive": the error of rejecting a null hypothesis when it is actually true. Plainly speaking, it occurs when we are observing a difference when in truth there is none. A false positive normally means that a test claims something to be positive, when that is not the case. For example, a pregnancy test with a positive result (indicating that the person taking the test is pregnant) has produced a false positive in the case where the person is not pregnant. Type II error, also known as an "error of the second kind", a β error, or a "false negative": the error of failing to reject a null hypothesis when the alternative hypothesis is the true state of nature. In other words, this is the error of failing to observe a difference when in truth there is one. This type of error can only occur when the statistician fails to reject the null hypothesis. In the example of a pregnancy test, a type II error occurs if the test reports false when the person is, in fact, pregnant. Type I errors (the "false positive"): the error of rejecting the null hypothesis given that it is actually true; e.g., A court finding a person guilty of a crime that they did not actually commit. Type II errors (the "false negative"): the error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., A court finding a person not guilty of a crime that they did actually commit. The Z-test is a statistical test used in inference which determines if the difference between a sample mean and the population mean is large enough to be statistically significant, that is, if it is unlikely to have occurred by chance. The Z-test is used primarily with standardized testing to determine if the test scores of a particular sample of test takers are within or outside of the standard performance of test takers. In order for the Z-test to be reliable, certain conditions must be met. The most important is that since the Z-test uses the population mean and population standard deviation, these must be known. The sample must be a simple random sample of the population. If the sample came from a different sampling method, a different formula must be used. It must also be known that the population varies normally (i.e., the sampling distribution of the probabilities of possible values fits a standard normal curve). If it is not known that the population varies normally, it suffices to have a sufficiently large sample, generally agreed to be ≥ 30 or 40. In statistics, a standard score is a dimensionless quantity derived by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation. This conversion process is called standardizing or normalizing. Standard scores are also called or z-values, z-scores, normal scores, and standardised variables. The standard score indicates how many standard deviations an observation is above or below the mean. It allows comparison of observations from different normal distributions, which is done frequently in research. The standard score is not the same as the z-factor used in the analysis of high-throughput screening data, but is sometimes confused with it.
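To tie the last two definitions together, the sketch below computes a standard score for a single observation and carries out a one-sample Z-test for a sample mean; every number is an assumption chosen for illustration (a known population mean of 100 and standard deviation of 15).

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 100, 15            # known population mean and standard deviation (assumed)

x = 130
z_score = (x - mu) / sigma     # standard score: 2 standard deviations above the mean

sample_mean, n = 104.5, 36     # summary of a simple random sample (invented)
z = (sample_mean - mu) / (sigma / sqrt(n))          # Z-test statistic
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
print(f"z-score = {z_score:.1f}, test z = {z:.2f}, p = {p_two_sided:.3f}")
```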