LOCATING VARIABLE VALUES, DESCRIBING DISTRIBUTIONS, AND MEASURES OF AVERAGES AND VARIATIONS
PART 1: PERCENTILES
* the outcome or score below which a given percentage of the distribution falls

$$P_i = L_p + \left[ \frac{(p_i)(N) - c_p}{f_p} \right] (W_i)$$
where
$P_i$ = score of the ith percentile
$L_p$ = true lower limit of the interval containing the ith percentile
$p_i$ = ith percentile written as a proportion
$N$ = total number of observations
$c_p$ = cumulative frequency up to but not including the interval containing $P_i$
$f_p$ = frequency in the interval containing the ith percentile
$W_i$ = width of the interval containing $P_i$; $U_p - L_p$, where $U_p$ and $L_p$ are the upper and lower true limits of the interval containing $P_i$

PART 2: MEASURES OF CENTRAL TENDENCY OR AVERAGE
Central tendency or average is a value that describes the typical outcome of a distribution of scores (or the typical value of a variable).
1. MODE
• For both discrete and continuous variables
• Value or category of the variable which has the largest frequency
2. MEDIAN
• For continuous variables
• Ordinarily defined as the value that divides an orderable distribution of values into two equal parts: those above the median and those below it
• Textbooks say that how the median is determined depends on whether the data are ungrouped or grouped (into intervals)
FOR UNGROUPED DATA
• For a distribution with an odd number of cases, the median is the middle value when the distribution is ordered from the lowest to the highest
• For a distribution with an even number of cases, the median is halfway between the two middle values, that is, the sum of the two middle values divided by two
FOR GROUPED DATA
• Some statistics textbooks recommend the 50th percentile as the median. However, Knoke and Bohrnstedt recommend that the median for grouped data be the value of the category in which the cumulative percentage equals 50.0%. In contrast, I (AB) assert that, given today’s computers, the notion of a median for grouped data is no longer relevant, because we can easily enter the raw data and get the median from the ungrouped variable values
3. MEAN
• This is the popular notion of an average, but it is only appropriate for continuous data. The mean is obtained by adding all the values of the variable and dividing by the number of cases, N or n. Here N is for the population and n is for the sample.
$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$
or
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
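The measures above can be tried out on a small set of scores. The following is a minimal sketch, not a definitive implementation: it uses Python's standard `statistics` module for the ungrouped mode, median, and mean, and a hypothetical helper `grouped_percentile` that applies the Part 1 formula to grouped data (its name, its argument layout, and all the numbers are made up for illustration).

```python
from statistics import mean, median, multimode


def grouped_percentile(p, intervals):
    """Part 1 formula: P_i = L_p + ((p_i * N - c_p) / f_p) * W_i.

    p is the percentile written as a proportion (0 < p <= 1).
    intervals is a list of (true_lower, true_upper, frequency) tuples,
    ordered from the lowest interval to the highest (an assumed layout).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be a proportion with 0 < p <= 1")
    n_total = sum(freq for _, _, freq in intervals)  # N
    target = p * n_total                             # (p_i)(N)
    cum_below = 0                                    # c_p
    for lower, upper, freq in intervals:
        if freq > 0 and cum_below + freq >= target:
            width = upper - lower                    # W_i = U_p - L_p
            return lower + (target - cum_below) / freq * width
        cum_below += freq
    raise ValueError("intervals must contain at least one observation")


# Ungrouped data (illustrative scores only).
scores = [2, 3, 3, 5, 7, 8, 8, 8, 10]
modes = multimode(scores)            # value(s) with the largest frequency
median_odd = median(scores)          # odd number of cases: the middle value
median_even = median([2, 3, 5, 7])   # even number: mean of the two middle values
x_bar = mean(scores)                 # sum of the values divided by n

# Grouped data: (true lower limit, true upper limit, frequency), N = 20.
intervals = [(0.5, 10.5, 4), (10.5, 20.5, 10), (20.5, 30.5, 6)]
p50 = grouped_percentile(0.50, intervals)

print(modes, median_odd, median_even, x_bar, p50)
```

For these made-up numbers the mode is 8, the medians are 7 and 4.0, the mean is 6, and the 50th percentile of the grouped data is 16.5, since 10.5 + ((0.50)(20) − 4)/10 × 10 = 16.5.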
PART 3: ADVANTAGES & DISADVANTAGES OF EACH MEASURE OF AVERAGE

MEAN
ADVANTAGES
• used for interval or ratio data
• affected by every score in a distribution, yet it is a stable measure of central tendency (when we draw several samples from a population)
• amenable to advanced math or statistical procedures
DISADVANTAGES
• Cannot be used on open-ended intervals or incomplete enumeration (in this case, mode or median is used)
• Fluctuation in one score can have a big impact if the distribution is small (affected by extreme scores)

MEDIAN
ADVANTAGES
• can be used for ordinal data
• unaffected by extreme scores (but affected by the size of the sample or population)
• can be useful for incomplete enumeration
DISADVANTAGES
• amenable to only a few math operations
• Less stable than the mean

MODE
ADVANTAGES
• useful for nominal data
• locates highest concentration of scores
• quickest estimate
• can be useful in an incomplete enumeration
DISADVANTAGES
• can be drastically affected by a single value or is the least stable measure
• cannot be used in math operations
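The point about extreme scores can be seen with a small numerical check. This is only an illustrative sketch with made-up scores: one extreme value is added to a small distribution, and the resulting shifts in the mean and the median are compared.

```python
from statistics import mean, median

# A small distribution (illustrative values only).
scores = [4, 5, 5, 6, 7]

# The same distribution with one extreme score added.
with_outlier = scores + [60]

# How far each measure moves because of the single extreme score.
mean_shift = mean(with_outlier) - mean(scores)
median_shift = median(with_outlier) - median(scores)

print("mean:", mean(scores), "->", mean(with_outlier))
print("median:", median(scores), "->", median(with_outlier))
```

Here the mean moves from 5.4 to 14.5 while the median moves only from 5 to 5.5, which is why the median is preferred when a small distribution contains extreme scores.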
PART 4: SKEW OF DISTRIBUTION
1. For continuous variables or ratio data, if the mean, median, and mode coincide, the distribution is symmetric, as in the “normal” distribution. If they do not coincide, the distribution is negatively or positively skewed: a relatively small number of observations lie far from the bulk of the values, and these low-frequency values form a tail (with the variable values on the x-axis and the frequency of occurrence on the y-axis).
2. When the tail is at the right, we have a positively skewed distribution. When the tail is at the left, we have a negatively skewed distribution. A positively skewed distribution has more categories above the median than below it, but these categories have low frequencies. A negatively skewed distribution has more categories below the median than above it, but these categories likewise have low frequencies. A skewed distribution is asymmetric.
3. NEGATIVELY SKEWED
[Figure: negatively skewed distribution; y-axis: frequencies; the mean, median (mdn), and mode are marked on the x-axis]
4. POSITIVELY SKEWED
[Figure: positively skewed distribution; y-axis: frequencies; the mean, median (mdn), and mode are marked on the x-axis]
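A rough way to check the direction of skew in raw data is to compare the mean with the median, since the mean is pulled toward the tail. The sketch below uses a hypothetical helper `skew_direction` and made-up values; it is a rule of thumb only, not a formal skewness coefficient.

```python
from statistics import mean, median, mode


def skew_direction(values):
    """Rule of thumb: the mean is pulled toward the tail of the distribution."""
    x_bar, mdn = mean(values), median(values)
    if x_bar > mdn:
        return "positive"   # tail at the right
    if x_bar < mdn:
        return "negative"   # tail at the left
    return "none indicated"


# Illustrative data: a few large values form a tail at the right.
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
# Mirror image of the same data: the tail is now at the left.
left_tailed = [15 - v for v in right_tailed]
# Mean, median, and mode all coincide at 3.
symmetric = [1, 2, 3, 3, 3, 4, 5]

for data in (right_tailed, left_tailed, symmetric):
    print(mode(data), median(data), mean(data), skew_direction(data))
```

In the right-tailed example the mode (2) is below the median (3), which is below the mean (4.5); in the mirrored example the order is reversed.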
PART 5: MEASURES OF VARIATION
A more complete description of a distribution must take account of variation, that is, how close the values of the variable are to the central tendency or mean.
1. RANGE
This is the difference between the largest and smallest outcomes (or values of a variable) in a distribution. However, this material recommends simply stating the lowest and highest values, which may be more descriptive than the range itself.
2. AVERAGE DEVIATION (AD) ABOUT THE MEAN
$$AD = \frac{\sum_{i=1}^{N} |d_i|}{N}$$
where $d_i = X_i - \bar{X}$ is the deviation of the ith value from the mean. The absolute values are needed because the signed deviations always sum to zero.
3. VARIANCE
SAMPLE VARIANCE:
$$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
POPULATION VARIANCE:
$$\sigma_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
4. STANDARD DEVIATION
SAMPLE STANDARD DEVIATION:
$$s_X = \sqrt{s_X^2}$$
POPULATION STANDARD DEVIATION:
$$\sigma_X = \sqrt{\sigma_X^2}$$
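The four measures of variation can be computed side by side. The following is a minimal sketch on made-up values, using Python's standard `statistics` module: `variance` and `stdev` divide by n − 1 (sample), while `pvariance` and `pstdev` divide by N (population). The variable names are illustrative only.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Illustrative values only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# 1. Range: state the lowest and highest values, and their difference.
lowest, highest = min(values), max(values)
value_range = highest - lowest

# 2. Average deviation about the mean: mean of the absolute deviations |d_i|.
x_bar = mean(values)
avg_deviation = sum(abs(x - x_bar) for x in values) / len(values)

# 3. Variance: sum of squared deviations over n - 1 (sample) or N (population).
sample_var = variance(values)
population_var = pvariance(values)

# 4. Standard deviation: square root of the corresponding variance.
sample_sd = stdev(values)
population_sd = pstdev(values)

print(lowest, highest, value_range, avg_deviation)
print(sample_var, population_var, sample_sd, population_sd)
```

For these values the mean is 5, the sum of squared deviations is 32, and so the population variance is 32/8 = 4 and the sample variance is 32/7, which is slightly larger because of the n − 1 divisor.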
PART 6: Z-SCORES
Supposedly, Z-scores are used to describe a distribution, especially when two values from two different populations or distributions are compared on the same scale. However, we can also conceive of the Z-score as a measure of distance from the mean, expressed in standard deviation units so that comparisons can be made across distributions. To obtain the Z-score of the ith value of a variable:
$$Z_i = \frac{X_i - \bar{X}}{s_X}$$
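As an illustration of comparing across distributions, the sketch below converts two made-up sets of exam scores to Z-scores using the sample standard deviation. The helper name `z_scores` and the data are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z_i = (X_i - mean) / s_X, using the sample standard deviation."""
    x_bar = mean(values)
    s_x = stdev(values)
    return [(x - x_bar) / s_x for x in values]


# Two exams scored on different scales (illustrative values only).
exam_a = [60, 70, 80, 90, 100]
exam_b = [10, 12, 14, 16, 18]

z_a = z_scores(exam_a)
z_b = z_scores(exam_b)

# A raw 90 on exam A versus a raw 18 on exam B, on the same scale.
z_90_on_a = z_a[exam_a.index(90)]
z_18_on_b = z_b[exam_b.index(18)]

print(round(z_90_on_a, 3), round(z_18_on_b, 3))
```

Although 90 is the larger raw score, 18 on exam B lies farther above its own mean in standard deviation units, so it has the larger Z-score.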