PRINCIPLES OF STATISTICS IPTHO ABDUL SUKUR KAMSIR
OUTLINE
• Introduction and Definitions
• Sampling
• External and Internal Validity
• Sources of error
• Normal distribution
• Standard error
• Binomial probabilities
• Poisson distribution
• Statistical tests
• Confidence intervals
What is statistics? A science in which inferences about specific random phenomena are made on the basis of a relatively limited sample. The word “statistics” can also mean the analytical tools used in this science, i.e. the figures calculated from the data collected.
Two Main Areas Mathematical statistics - is concerned with the development of new methods of statistical inference and requires strong mathematical knowledge. Applied statistics - applies the methods of mathematical statistics to specific subject areas; it is called BIOSTATISTICS when applied to biological or medical problems.
Definitions Statistics - Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions. Statistic - Characteristic or measure obtained from a sample, e.g. mean, variance, Chi-square statistic, t-test statistic etc.
Definitions Inferential Statistics - Generalizing from samples to populations using probabilities. Performing hypothesis testing, determining relationships between variables, and making predictions. Descriptive statistics refers to the process of organizing and summarising collected information (data) to study the properties of a variable
Definitions Population - All subjects possessing a common characteristic that is being studied. Sample - A subgroup or subset of the population. Parameter - Characteristic or measure obtained from a population. Statistic - Characteristic or measure obtained from a sample.
Definitions Variable - Characteristic or attribute that can assume different values, it is the fundamental element of statistical analysis; it is something measured/counted or identified
Variables Qualitative - Variables which assume nonnumerical values. Quantitative - Variables which assume numerical values. Discrete - Variables which assume a finite or countable number of possible values. Usually obtained by counting. Continuous - Variables which can assume any value within an interval, so the number of possible values is uncountably infinite. Usually obtained by measurement.
Variables (qualitative/categorical) Nominal Level - Level of measurement which classifies data into mutually exclusive, all-inclusive categories in which no order or ranking can be imposed on the data. Ordinal Level - Level of measurement which classifies data into categories that can be ranked. Differences between the ranks are not meaningful.
Variables (quantitative/numerical) Interval Level - Level of measurement which classifies data that can be ranked and differences are meaningful. However, there is no meaningful zero, so ratios are meaningless. (e.g. temperature in Celsius or Fahrenheit) Ratio Level - Level of measurement which classifies data that can be ranked, differences are meaningful, and there is a true zero. True ratios exist between the different units of measure. (e.g. temperature in Kelvin)
SAMPLING
• Random - Sampling in which the data are collected using chance methods or random numbers.
• Systematic - Sampling in which data are obtained by selecting every kth object.
• Stratified - Sampling in which the population is divided into groups (called strata) according to some characteristic. Each of these strata is then sampled using one of the other sampling techniques.
• Cluster - Sampling in which the population is divided into groups (usually geographically). Some of these groups are randomly selected, and then all of the elements in those groups are selected.
(All of the above methods have a known probability of selection.)
• Convenience - Sampling in which readily available data are used; the probability of being selected as a sample is unknown.
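The four probability sampling methods above can be sketched in a few lines of Python. This is only an illustration: the population of 100 subject IDs, the `district` label and all function names are made up for the example and do not come from the lecture.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 subjects, each tagged with a made-up district A-D.
population = [{"id": i, "district": "ABCD"[i % 4]} for i in range(100)]

def simple_random(pop, n):
    """Random sampling: every subject has the same chance of selection."""
    return rng.sample(pop, n)

def systematic(pop, k):
    """Systematic sampling: random start, then every kth subject."""
    start = rng.randrange(k)
    return pop[start::k]

def stratified(pop, key, n_per_stratum):
    """Stratified sampling: split into strata, then sample randomly within each."""
    strata = {}
    for subject in pop:
        strata.setdefault(subject[key], []).append(subject)
    return [s for group in strata.values() for s in rng.sample(group, n_per_stratum)]

def cluster(pop, key, n_clusters):
    """Cluster sampling: pick whole groups at random and keep everyone in them."""
    names = sorted({subject[key] for subject in pop})
    chosen = set(rng.sample(names, n_clusters))
    return [s for s in pop if s[key] in chosen]

print(len(simple_random(population, 10)), len(systematic(population, 10)))
print(len(stratified(population, "district", 5)), len(cluster(population, "district", 2)))
```

Note the difference between the last two: stratified sampling takes a few subjects from every group, whereas cluster sampling takes every subject from a few groups.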
Why sample?
• Cheaper than getting data from everyone!
• Internally valid study because: - easier to manage - standardised methods - easier to conduct - fewer people involved
• A good sampling method helps ensure external validity!
Beware when sampling
• Two samples have been taken at random from the same population
• By chance, sample 1 contains relatively large fish while sample 2 contains relatively small fish
• You might mistakenly conclude that the two samples came from very different populations
* Even a random sample may not be a good representative of the population from which it has been taken
• Samples selected at random from very different populations may not be different. • Simply by chance sample 1 and sample 2 are similar
• Even if two populations are very different, samples from each may be similar • The misleading impression – the populations are similar
* Natural variation among individuals within a sample may obscure any effect of an experimental treatment
• Two samples of equal-size fish were taken from the same population
• One group was fed a vitamin-supplemented diet for 300 days and the other was an untreated control group
• The supplemented diet caused a 10% increase in length, but this difference is small compared with the variation in growth among individuals, which may obscure any effect of treatment
Because of the natural variability among living species: A ‘true’ difference may not be apparent The effect of treatment may not be apparent after a clinical trial
How to solve this unavoidable problem in Life Sciences? Researchers need to know how to sample so that the sample is a good representation of the population. They also need a good understanding of experimental design, because a good design will take natural variation into account. They need to know how to minimise additional unwanted variation introduced by the experimental procedure itself. They need to take accurate and precise measurements to minimise other sources of error.
EXTERNAL VALIDITY External validity is the extent to which the results of a study are applicable to OTHER populations “Can my results be extrapolated to others?”
INTERNAL VALIDITY Internal validity is the extent to which the results of an investigation accurately reflect the true situation of the study population “The ability to measure what it sets out to measure” Avoids BIAS or SYSTEMATIC ERRORS
Sources of ‘errors’
Bias – a systematic error that can lead to a distortion of the results; “deviation from truth” Confounding
BIAS Selection bias (non-random sample, healthy worker effect etc.) Information bias (measurement inaccuracy, misclassification, recall bias, interviewer bias etc.) Performance bias may occur in multicentre studies
Confounding Mixing of the effect of an extraneous variable with the effects of the exposure and disease of interest For a variable to be considered a confounder, it must satisfy two conditions i.e. (a) has an association with the outcome of interest and (b) is also independently associated with the exposure (NOT a result of being exposed)
Confounding Occurs when the groups being compared differ with regard to important risk or prognostic factors other than the factor under investigation Certain study designs are prone to confounding, e.g. case-control studies
Mann JI et al (1968). Oral contraceptive use in older women and fatal myocardial infarction. Br Med J 2: 193-199
• 153 women with myocardial infarction (MI) and 178 controls
• Past exposure to oral contraceptives (OCP) was investigated
• The second table is stratified according to age
• Note that the OR became higher after stratification
• The confounder ‘age’ weakened the crude relationship between MI and OCP
All ages
             User   Non-user
Cases          39      114
Controls       24      154
O.R. = 2.2

Stratified by age
             Age <40             Age 40-44
             User   Non-user     User   Non-user
Cases          21       26         18       88
Controls       17       59          7       95
O.R.              2.8                  2.8
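The odds ratios in the tables can be checked with a short calculation. The sketch below uses only the counts shown on the slides; the function names are illustrative. The pooled Mantel-Haenszel estimate is an addition here (that test is named later in the test-selection guide, not on this slide) and is included only to show one common way of combining the strata.

```python
def odds_ratio(case_user, case_non, ctrl_user, ctrl_non):
    """Odds ratio of a 2x2 table: (a * d) / (b * c)."""
    return (case_user * ctrl_non) / (case_non * ctrl_user)

# Cell order: cases/user, cases/non-user, controls/user, controls/non-user.
crude = (39, 114, 24, 154)
strata = {"<40": (21, 26, 17, 59), "40-44": (18, 88, 7, 95)}

crude_or = odds_ratio(*crude)
stratum_ors = {age: odds_ratio(*cells) for age, cells in strata.items()}

def mantel_haenszel_or(tables):
    """Pooled odds ratio across strata: sum(a*d/n) / sum(b*c/n)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

adjusted_or = mantel_haenszel_or(strata.values())

print(f"crude OR    = {crude_or:.1f}")
for age, value in stratum_ors.items():
    print(f"age {age:<6} = {value:.1f}")
print(f"adjusted OR = {adjusted_or:.1f}")
```

Both stratum-specific odds ratios round to 2.8, above the crude value of 2.2, which is the pattern the slide attributes to confounding by age.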
COULD IT BE DUE TO CHANCE? Type I and Type II errors (will be explained in other lectures / slides)
Could errors have been introduced? Susceptibility (?differences in basic characteristics) Performance (e.g. differences in proficiencies of treatment) Detection (differences in measurement of outcome) Transfer (differential losses to follow-up)
The Normal Distribution Theoretical distribution with a bell-shaped curve Perfectly symmetrical about its centre (mean = median = mode) Standard deviation reflects the spread of individual observations; about 68% of the observations lie within 1 standard deviation of the mean. We can thus estimate the area under the curve for any value of the variable once we know the mean and standard deviation of the distribution
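As a quick check of the 68% figure, the area under a normal curve between two values can be computed from the mean and standard deviation alone. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `area_between` is made up for the example.

```python
from statistics import NormalDist

def area_between(low, high, mean, sd):
    """Area under the normal curve between low and high."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

# Standard normal curve: mean 0, standard deviation 1.
within_1_sd = area_between(-1, 1, mean=0, sd=1)
within_2_sd = area_between(-2, 2, mean=0, sd=1)

print(f"within 1 SD: {within_1_sd:.4f}")  # 0.6827
print(f"within 2 SD: {within_2_sd:.4f}")  # 0.9545
```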
Normal distribution curves
Normal distribution Many other distributions, e.g. the binomial and Poisson distributions, approximate the normal distribution under certain conditions The advantage of this normal approximation is that standard probability tables for the normal distribution can be used for binomial or Poisson problems.
Standard Error The spread of observations in one experiment yields a single mean and standard deviation Repeated sampling from the same population will result in an approximately normal distribution of means with a ‘grand’ mean (mean of means) and a spread called the STANDARD ERROR Standard error = standard deviation ÷ √n, where n is the sample size Sample size and variability of measurements determine the magnitude of the standard error Used to construct confidence intervals
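The idea of repeated sampling can be illustrated with a small simulation. The population mean and standard deviation (100 and 15), the sample size and the number of repeats are hypothetical values chosen for the sketch; the point is only that the spread of the sample means comes out close to standard deviation ÷ √n.

```python
import random
import statistics

rng = random.Random(0)            # fixed seed for reproducibility

POP_MEAN, POP_SD = 100.0, 15.0    # hypothetical population
n = 25                            # size of each sample
repeats = 5000                    # number of repeated samples

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))
    for _ in range(repeats)
]

grand_mean = statistics.fmean(sample_means)    # mean of means
observed_se = statistics.stdev(sample_means)   # spread of the means
theoretical_se = POP_SD / n ** 0.5             # 15 / sqrt(25) = 3.0

print(f"grand mean     = {grand_mean:.2f}")
print(f"observed SE    = {observed_se:.2f}")
print(f"theoretical SE = {theoretical_se:.2f}")
```

Quadrupling the sample size halves the standard error, which is why larger samples give narrower confidence intervals.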
Binomial probabilities The distribution curve has a mean (M) = np and standard deviation (S) = √(npq), where n is the number of trials, p is the probability of outcome A and q is the probability of outcome B. Refers to situations where there are 2 alternatives (success/failure; black/white; heads/tails; alive/dead etc.), i.e. p + q = 1 Used to determine whether results observed in trials/experiments could have occurred by chance
Normal approximation to the binomial probability distribution Conditions: large n (number of trials); p not close to 0 or 1 (the outcome is neither rare nor almost certain); np and nq both greater than 5
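To see how well the approximation works when these conditions hold, the exact binomial probability can be compared with the normal estimate. The values n = 100 and p = 0.3 are hypothetical, and the continuity correction (adding 0.5) is a standard refinement that is not mentioned on the slide.

```python
import math
from statistics import NormalDist

def binomial_cdf(r, n, p):
    """Exact probability of at most r successes in n trials."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r + 1))

n, p = 100, 0.3          # hypothetical trial; np = 30 and nq = 70, both > 5
q = 1 - p

mean = n * p                 # M = np
sd = math.sqrt(n * p * q)    # S = sqrt(npq)

exact = binomial_cdf(25, n, p)
# Normal estimate of P(X <= 25), with a continuity correction of 0.5.
approx = NormalDist(mean, sd).cdf(25 + 0.5)

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
```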
Example of a problem to be solved by using binomial theory / method: What is the probability of a given number of successes (As) in a sample of n trials? Formula to calculate the probability: $P(r) = {}^{n}C_{r}\, p^{r} (1-p)^{n-r}$, where r is the number of As (successes), n is the number of trials and ${}^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$ is known as the binomial coefficient. Description of binomial problem in a later lecture.
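The binomial formula translates directly into code. The sketch below assumes a hypothetical example of 10 tosses of a fair coin (n = 10, p = 0.5); the function name is illustrative.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials: nCr * p^r * (1-p)^(n-r)."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 10, 0.5                        # hypothetical: 10 tosses of a fair coin

prob_5_heads = binomial_pmf(5, n, p)  # 252 / 1024
mean = n * p                          # M = np
sd = math.sqrt(n * p * (1 - p))       # S = sqrt(npq)

print(f"P(5 heads) = {prob_5_heads:.4f}")  # 0.2461
print(f"mean = {mean}, sd = {sd:.3f}")
```

Summing `binomial_pmf(r, n, p)` over r = 0 to n gives 1, since exactly one of those outcomes must occur.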
Poisson Distribution Useful for calculating probabilities of rare events No a priori estimate can be made of the probability that the event will occur $\Pr(n) = \frac{e^{-m}\, m^{n}}{n!}$ where Pr(n) is the probability of occurrence of n such events, m is the mean number of events, e is a mathematical constant (approximately 2.718) and n! is the factorial of n
Mean and variance of a Poisson distribution = m, so the standard deviation = √m
Poisson Probability Distribution
Must meet 4 conditions, i.e.
• Discontinuous (discrete) data
• Chance of a result is SMALL
• Chance of a result is independent of previous results
• A large number of tests can be performed
Approaches normal with mean m and standard deviation √m for mean values > 30
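The Poisson formula can be evaluated directly, and doing so numerically confirms that the variance equals the mean m (so the standard deviation is √m). The mean of 2 events is a hypothetical value for the sketch.

```python
import math

def poisson_pmf(n, m):
    """Probability of exactly n events when the mean number of events is m."""
    return math.exp(-m) * m**n / math.factorial(n)

m = 2.0  # hypothetical mean number of events

# Probabilities for 0..59 events; the tail beyond that is negligible for m = 2.
probs = [poisson_pmf(n, m) for n in range(60)]

mean = sum(n * p for n, p in enumerate(probs))
variance = sum((n - mean) ** 2 * p for n, p in enumerate(probs))

print(f"P(0 events) = {probs[0]:.4f}")  # e^-2 = 0.1353
print(f"mean = {mean:.4f}, variance = {variance:.4f}, sd = {math.sqrt(variance):.4f}")
```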
Statistical tests Used for testing statistical significance To decide whether or not to reject your null hypothesis The type of test you use depends on the type of data as well as whether the data approximate the normal curve
General guide to selecting an appropriate statistical test in univariate analysis
Number of groups | Independent variable | Dependent variable | Parametric test | Non-parametric test
Two (independent) | Categorical (e.g. smokers and nonsmokers) | Categorical (e.g. CHD and no CHD) | - | Chi-square test; Fisher’s Exact test
Two (independent) | Categorical (e.g. smokers and nonsmokers) | Categorical (e.g. CHD and no CHD) | - | Mantel-Haenszel test (if a third variable is controlled for, e.g. age group)
Two (independent) | Categorical (e.g. smokers and nonsmokers) | Numerical (e.g. PEFR level) | Independent t test | Mann-Whitney test (Wilcoxon Rank Sum test)
Two (dependent) | Categorical (e.g. pre-intervention and post-intervention) | Categorical (e.g. behavioral changes) | - | McNemar test
Two (dependent) | Categorical (e.g. pre-treatment and post-treatment) | Numerical (e.g. blood pressure) | Paired t test | Wilcoxon Signed Rank test
More than two (independent) | Categorical (e.g. race) | Categorical (e.g. diabetic and nondiabetic) | - | Chi-square test
More than two (independent) | Categorical (e.g. race) | Numerical (e.g. blood sugar level) | One-way ANOVA | Kruskal-Wallis test
Two | Numerical (e.g. height) | Numerical (e.g. weight) | Pearson correlation & Simple Linear Regression | Spearman correlation
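As an illustration of the first row of the guide (two independent groups, categorical outcome), the chi-square statistic for a 2x2 table can be computed by hand. The sketch reuses the crude MI and oral contraceptive counts from the confounding slides (39, 114, 24, 154). It computes Pearson's chi-square without a continuity correction, which is a choice made for this example, and the function names are illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Rows: cases, controls. Columns: OCP user, non-user.
chi2 = chi_square_2x2(39, 114, 24, 154)
p = p_value_1df(chi2)

print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

With a p-value below 0.05, the null hypothesis of no association between oral contraceptive use and MI would be rejected at the 5% level for the crude table.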
Confidence Intervals (CI) Estimate the range within which the “true” population parameter is believed to lie, with a given level of confidence (95%, 99% or more) Parameters of interest are usually means, proportions, differences between means or proportions, regression coefficients, correlation coefficients, relative risks CIs are extremely useful in assessing the clinical significance of a given result 95% CI = sample estimate ± 1.96 × SE
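The 95% CI formula can be applied as follows. The sample values (mean 120, standard deviation 15, n = 100) are hypothetical, and the helper name is made up for the sketch. The multiplier is taken from the standard normal distribution, so it reproduces the 1.96 on the slide.

```python
from statistics import NormalDist

def confidence_interval(estimate, se, level=0.95):
    """Normal-based CI: estimate +/- z * SE (z is about 1.96 for 95%)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se

# Hypothetical sample: mean 120, standard deviation 15, n = 100.
mean, sd, n = 120.0, 15.0, 100
se = sd / n ** 0.5                     # standard error = 15 / 10 = 1.5

low, high = confidence_interval(mean, se)
print(f"95% CI = {low:.2f} to {high:.2f}")  # 117.06 to 122.94
```

Passing `level=0.99` gives the wider 99% interval, using a multiplier of about 2.58 instead of 1.96.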
THANK YOU