Data Analysis Florenda F. Cabatit RN MA Facilitator
DATA ANALYSIS Data analysis is the process by which information is rendered meaningful and intelligible (Polit and Hungler, 1995). It is the systematic organization and synthesis of research data and the testing of research hypotheses using those data (2004).
Statistical Analysis Quantitative analysis deals with numerical analysis of information. It is the manipulation of numeric data through statistical procedures for the purpose of describing phenomena or assessing the magnitude and reliability of relationships among them. Statistics is the scientific method used in quantitative analysis.
Statistics Statistics helps to:
• Organize data
• Summarize data
• Evaluate data
• Present data in an easily understood form.
Statistics Two branches of statistics:
• Descriptive statistics - statistics used to describe and summarize data
• Inferential statistics - statistics that permit inferences on whether relationships observed in a sample are likely to occur in the larger population.
Considerations in the choice of appropriate statistical methods
• The purpose of the research
• The level of measurement of the variables
• The number of groups/variables involved
• The type of groups being studied
Levels of Measurement Nominal - the lowest level
- involves assigning numbers to classify characteristics into categories
- numeric codes assigned in nominal measurement do not convey quantitative information.
- the numbers are merely symbols that represent different values.
- categories must be mutually exclusive and collectively exhaustive.
Ordinal Measurement This involves sorting objects on the basis of their relative standing or ranking on an attribute. The numbers are not arbitrary: they signify incremental values, but they do not tell us how much greater one level is than another.
Interval Measurement A measurement in which
an attribute of a variable is rank ordered on a scale that has equal distances between points on that scale.
Ratio Scale A quantitative measurement in which intervals are equal and there is a true zero point. It is the highest level of measurement. All arithmetic operations are permissible with this measurement (one can add, subtract, multiply, and divide numbers on this scale).
Descriptive Statistics Three characteristics to fully describe a set of data: • shape of the distribution of values • central tendency • variability
Review of Descriptive Stats. Descriptive statistics are used to present quantitative descriptions in a manageable form. This method works by reducing lots of data into a simpler summary. Examples:
• 37° Centigrade as average adult body temperature
• SU’s quality-point system
Univariate Analysis This is the examination across cases of one
variable at a time. Frequency distributions are used to group data. One may set up margins that allow us to group cases into categories. Examples include Age categories Price categories Temperature categories.
Distributions Two ways to describe a univariate distribution A table A graph (histogram, bar chart)
Distributions (cont.) Distributions may also be displayed using percentages. For example, one could use percentages to describe the following:
• Percentage of people under the poverty level
• Over a certain age
• Over a certain score on a standardized test
Distributions (cont.) A Frequency Distribution Table
Category    Percent
Under 35    9%
36-45       21%
46-55       45%
56-65       19%
66+         6%
Distributions (cont.) A Histogram
[Figure: histogram of Percent (axis from 0 to 45 in steps of 5) for the categories Under 35, 36-45, 46-55, 56-65, and 66+]
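The same distribution can be drawn as a rough text histogram. The sketch below is only an illustration: it reuses the percentages from the frequency distribution table, and the function name `text_histogram` is a made-up helper, not something from the lecture.

```python
# Percentages copied from the frequency distribution table.
distribution = {
    "Under 35": 9,
    "36-45": 21,
    "46-55": 45,
    "56-65": 19,
    "66+": 6,
}


def text_histogram(dist):
    """Return one line per category, drawing one '#' per percentage point."""
    width = max(len(label) for label in dist)
    return [f"{label:<{width}} | {'#' * pct} {pct}%" for label, pct in dist.items()]


for line in text_histogram(distribution):
    print(line)
```

The longest bar belongs to the 46-55 category, which matches the peak of the histogram.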
Central Tendency An estimate of the “center” of a
distribution Three different types of estimates: Mean Median Mode
Mean The most commonly used method of
describing central tendency. One basically totals all the results and then divides by the number of units or “n” of the sample. Example: The NCM 104 Quiz mean was determined by the sum of all the scores divided by the number of students taking the exam.
Median The median is the score found at the exact middle of the set. One must list all scores in numerical order and then locate the score in the center of the sample. Example: If there are 501 scores in the list, score #251 would be the median; if there are 500 scores, the median is the average of scores #250 and #251. Because the median is not pulled toward extreme values, it is useful when the data contain outliers.
Mode The mode is the most repeated score in the set of results. Let's take the set of scores: 15, 20, 21, 20, 36, 15, 25, 15. Again we first line up the scores: 15, 15, 15, 20, 20, 21, 25, 36. The score 15 is the most repeated score and is therefore labeled the mode.
Central Tendency If the distribution is normal (i.e., bell-
shaped), the mean, median and mode are all equal. In our analyses, we’ll use the mean.
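As a quick check, the three estimates can be computed for the set of scores used in the mode example. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen only for illustration.

```python
import statistics

# The set of scores from the mode example.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)      # total of all scores divided by n
median = statistics.median(scores)  # middle of the sorted list (average of the two middle scores when n is even)
mode = statistics.mode(scores)      # most repeated score

print(sorted(scores))
print("mean:", mean, "median:", median, "mode:", mode)
```

For this small set the three values differ (mean 20.875, median 20, mode 15), which suggests the scores are not normally distributed: the single high score of 36 pulls the mean upward.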
Dispersion Two types of estimates:
• Range
• Standard deviation
The standard deviation is the more accurate and detailed estimate, because a single outlier can greatly extend the range.
Range The range is used to identify the highest and lowest scores. Let's take the set of scores: 15, 20, 21, 20, 36, 15, 25, 15. The range would be 15-36, which means that 21 points separate the highest score from the lowest.
Standard Deviation The standard deviation is a
value that shows the relation that individual scores have to the mean of the sample. If scores are said to be standardized to a normal curve, there are several statistical manipulations that can be performed to analyze the data set.
Standard Dev. (cont.) Assumptions may be made about the percentage of scores as they deviate from the mean. If scores are normally distributed, one can assume that approximately 68% of the scores in the sample fall within one standard deviation of the mean. Approximately 95% of the scores would then fall within two standard deviations of the mean.
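These percentages can be checked against the standard normal curve. The sketch below assumes a perfectly normal distribution and uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
z = NormalDist()

# Area under the curve between -1 and +1 SD, and between -2 and +2 SD.
within_one_sd = z.cdf(1) - z.cdf(-1)
within_two_sd = z.cdf(2) - z.cdf(-2)

print(f"within 1 SD: {within_one_sd:.4f}")
print(f"within 2 SD: {within_two_sd:.4f}")
```

The two areas come out at roughly 0.68 and 0.95. Real samples only approximate these figures.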
Standard Dev. (cont.) To calculate the standard deviation, the deviations of all the scores from the mean are squared and summed, the sum is divided by the number of scores, and the square root of the result is taken. Squaring keeps positive and negative deviations from the mean from canceling each other out.
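Both dispersion estimates can be worked through for the set of scores used in the range example. This is a sketch only. Dividing by the number of scores, as described above, gives the population standard deviation. Many textbooks and software packages divide by n - 1 when the scores are a sample, so that variant is shown too as an addition to the lecture text.

```python
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]
n = len(scores)

# Range: distance between the highest and lowest score.
score_range = max(scores) - min(scores)

# Standard deviation, step by step.
mean = sum(scores) / n
squared_deviations = [(x - mean) ** 2 for x in scores]
sd_population = math.sqrt(sum(squared_deviations) / n)    # divide by n, as in the text
sd_sample = math.sqrt(sum(squared_deviations) / (n - 1))  # divide by n - 1 (sample estimate)

print("range:", score_range)
print(f"SD (divide by n):     {sd_population:.3f}")
print(f"SD (divide by n - 1): {sd_sample:.3f}")
```

The range is 21 and the standard deviation is about 6.6 (about 7.1 with the n - 1 divisor). Dropping the single score of 36 would shrink the range much more sharply than the standard deviation.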
RESEARCH QUESTION: DESCRIBE

LEVEL            TYPE OF DESCRIPTION   STATISTICAL TOOL
NOMINAL          Distribution          Frequency distribution, contingency table
                 Central tendency      Mode
ORDINAL          Distribution          Frequency distribution, contingency table, scatterplot
                 Central tendency      Mode, median
RATIO/INTERVAL   Distribution          Frequency distribution, contingency table, scatterplot
                 Central tendency      Mode, median, mean
                 Variability           Range, variance, standard deviation
Inferential Statistics
• It is based on the laws of probability.
• It provides a means for drawing conclusions about a population, given data from a sample.
• It estimates population parameters from sample statistics.
Inferential Statistics Statistical inference consists of two techniques:
1. Estimation of parameters
2. Hypothesis testing
Hypothesis Testing Statistical hypothesis testing provides objective criteria for deciding whether hypotheses are supported by empirical evidence. It is a process of disproof or rejection. Researchers seek to reject the null hypothesis through various statistical tests. Hypothesis testing uses samples to draw conclusions about relationships within the population.
Type I and Type II Errors Type I Error - researchers make a type I error when a true null hypothesis is rejected. Type II Error – researchers make a type II error when a false null hypothesis is accepted
Level of Significance This refers to the risk of making a type I error in a statistical analysis. The value selected beforehand signifies the risk or the probability of rejecting a true null hypothesis. The two most frequently used significance levels (referred to as alpha or α) are .05 and .01.
Level of Significance With .05 significance level, we are
accepting the risk that out of 100 samples drawn from a population, a true null hypothesis would be rejected only 5 times.
With a .01 level of significance, the risk of
a type I error is lower: in only 1 sample out of 100 would we erroneously reject the null hypothesis.
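The "5 out of 100" and "1 out of 100" readings can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the population is normal with mean 0 and standard deviation 1, the null hypothesis (mean = 0) is true by construction, each sample has 30 cases, and a two-tailed z test with known standard deviation is used.

```python
import random
from statistics import NormalDist


def type_i_error_rate(alpha, trials=2000, n=30, seed=1):
    """Share of samples in which a TRUE null hypothesis (mean = 0) is rejected."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which the null is true.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean / (1 / n ** 0.5)  # z test with known sigma = 1
        if abs(z) > critical_z:
            rejections += 1  # a true null was rejected: a type I error
    return rejections / trials


rate_05 = type_i_error_rate(0.05)
rate_01 = type_i_error_rate(0.01)
print(f"alpha .05 -> rejected in {rate_05:.1%} of samples")
print(f"alpha .01 -> rejected in {rate_01:.1%} of samples")
```

The simulated rates should land near 5% and 1%, though not exactly on them, since the simulation itself is based on random samples.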
Critical Region This refers to the area in the sampling distribution representing values that are “improbable” if the null hypothesis is true. It is defined by the level of significance
Statistical Tests Two-tailed test- this means that both ends or tails of the sampling distribution are used to determine improbable values. In one-tailed tests, the critical region of improbable values is entirely in one tail of the distribution-the tail corresponding to the direction of the hypothesis
[Figure: an example of the critical regions of a two-tailed test]
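To make the difference between one-tailed and two-tailed critical regions concrete, the sketch below computes the cutoff values on the standard normal curve. It assumes a z test; `critical_z` is a hypothetical helper name, and other tests (t, chi-square) would use their own distributions.

```python
from statistics import NormalDist


def critical_z(alpha, tails):
    """Cutoff on the standard normal curve beyond which values count as improbable."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha between both tails; a one-tailed test puts it all in one.
    return NormalDist().inv_cdf(1 - alpha / tails)


for alpha in (0.05, 0.01):
    print(f"alpha {alpha}: two-tailed +/-{critical_z(alpha, 2):.3f}, "
          f"one-tailed {critical_z(alpha, 1):.3f}")
```

At α = .05 the two-tailed cutoffs are about ±1.96, while the one-tailed cutoff is about 1.645, so a one-tailed test reaches the critical region with a smaller test statistic in the hypothesized direction.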
Types of Statistical Tests
Parametric Tests – a class of inferential statistical tests that involve: a. Assumptions about the distribution of the variables b. The estimation of a parameter c. The use of interval or ratio measures.
Statistical Tests Non-parametric Tests –statistical tests that do not estimate parameters - also called distribution-free statistics.
Steps in Hypothesis Testing
1. State the alternative hypothesis
2. State the null hypothesis
3. Establish the level of significance
4. Select a one-tailed or two-tailed test
5. Compute a test statistic
6. Calculate the degrees of freedom
7. Obtain a tabled value for the statistical test
8. Compare the test statistic with the tabled value.
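The eight steps can be traced with a one-sample t test on the set of scores used earlier in the lecture. This is a sketch under stated assumptions, not part of the original material: the null value of 25 is hypothetical and chosen only for illustration, and the tabled value 2.365 is the two-tailed critical t for α = .05 with 7 degrees of freedom from a standard t table.

```python
import math
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Steps 1-2: alternative hypothesis (mean != 25) and null hypothesis (mean = 25).
null_mean = 25  # hypothetical value, for illustration only
# Step 3: level of significance.
alpha = 0.05
# Step 4: two-tailed test, because the alternative hypothesis has no direction.

# Step 5: compute the test statistic (one-sample t).
n = len(scores)
standard_error = statistics.stdev(scores) / math.sqrt(n)
t_statistic = (statistics.mean(scores) - null_mean) / standard_error

# Step 6: degrees of freedom.
degrees_of_freedom = n - 1

# Step 7: tabled value (two-tailed, alpha = .05, df = 7), read from a t table.
tabled_t = 2.365

# Step 8: compare the test statistic with the tabled value.
reject_null = abs(t_statistic) > tabled_t

print(f"t = {t_statistic:.3f}, df = {degrees_of_freedom}, tabled t = {tabled_t}")
print("reject the null hypothesis" if reject_null else "do not reject the null hypothesis")
```

Here the absolute value of t (about 1.65) is smaller than the tabled value, so the null hypothesis is not rejected at the .05 level.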
The Decision Matrix

In reality, one of two situations holds:
• Null true, alternative false: there is no real program effect; there is no difference or gain; our theory is wrong.
• Null false, alternative true: there is a real program effect; there is a difference or gain; our theory is correct.

What we conclude, top row: accept null, reject alternative. We say there is no real program effect, there is no difference or gain, our theory is wrong.
• When the null is in fact true: 1-α, THE CONFIDENCE LEVEL. The odds of saying there is no effect or gain when in fact there is none (# of times out of 100 when there is no effect, we'll say there is none).
• When the null is in fact false: β, TYPE II ERROR. The odds of saying there is no effect or gain when in fact there is one (# of times out of 100 when there is an effect, we'll say there is none).

What we conclude, bottom row: reject null, accept alternative. We say there is a real program effect, there is a difference or gain, our theory is correct.
• When the null is in fact true: α, TYPE I ERROR. The odds of saying there is an effect or gain when in fact there is none (# of times out of 100 when there is no effect, we'll say there is one).
• When the null is in fact false: 1-β, POWER. The odds of saying there is an effect or gain when in fact there is one (# of times out of 100 when there is an effect, we'll say there is one).
Decision Matrix If you try to increase power, you increase the chance of winding up in the bottom row and of Type I error. If you try to decrease Type I errors, you increase the chance of winding up in the top row and of Type II error.
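The trade-off between Type I error and power can be put into numbers. The sketch below rests on assumptions chosen only for illustration: a one-tailed z test, a true effect of half a standard deviation (`effect_size=0.5`), and a sample of 25 cases. The function name is hypothetical.

```python
from statistics import NormalDist


def power_one_tailed_z(alpha, effect_size, n):
    """Power (1 - beta) of a one-tailed z test for a true effect measured in SD units."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)  # cutoff set by the level of significance
    # Chance that the test statistic exceeds the cutoff when the effect is real.
    return 1 - z.cdf(critical - effect_size * n ** 0.5)


power_05 = power_one_tailed_z(alpha=0.05, effect_size=0.5, n=25)
power_01 = power_one_tailed_z(alpha=0.01, effect_size=0.5, n=25)

print(f"alpha .05: power = {power_05:.3f}, beta = {1 - power_05:.3f}")
print(f"alpha .01: power = {power_01:.3f}, beta = {1 - power_01:.3f}")
```

Under these assumptions, tightening α from .05 to .01 lowers power from about .80 to about .57, so the risk of a Type II error (β) rises from about .20 to about .43.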