Describing a Set of Data with Numerical Measures
Lesson 2(b)
To present the important measures of center and to show how to compute the following:
▪ Mean
▪ Median
▪ Mode
A measure of center is a value at the
center or middle of a data set.
...Mean ...Median ...Mode
The (arithmetic) mean is generally the most important of all numerical descriptive measurements, and it is what most people call an average.
The arithmetic mean of a set of values is the number obtained by adding the values and dividing the total by the number of values. It is also referred to simply as the mean, and it will be used often throughout the remainder of the course.
The mean is denoted by \( \bar{x} \) (pronounced “x-bar”) if the data set is a sample from a larger population.
The mean is denoted by \( \mu \) (lowercase Greek mu) if all values of the population are used.
The Greek letter \( \Sigma \) (uppercase Greek sigma) indicates that the data values should be added.
Formula (ungrouped data)
Sample mean: \( \bar{x} = \dfrac{\sum x}{n} \)
Population mean: \( \mu = \dfrac{\sum x}{N} \)
Notation
\( \bar{x} \) denotes the mean of a set of sample values
\( \mu \) denotes the mean of all values in a population
\( \Sigma \) denotes the addition of a set of values
\( x \) is the variable usually used to represent the individual data values
\( n \) represents the number of values in a sample
\( N \) represents the number of values in a population
Example: Listed below are the volumes (in ounces) of the Coke in five different cans. Find the mean.
12.3 12.1 12.2 12.3 12.2
Solution: \( \bar{x} = \dfrac{12.3 + 12.1 + 12.2 + 12.3 + 12.2}{5} = \dfrac{61.1}{5} = 12.22 \) ounces
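The same calculation can be sketched in a few lines of Python. This is only an illustration of the definition of the mean; the function name `arithmetic_mean` is a made-up helper, not something from the lesson.

```python
# Volumes (in ounces) of the five cans from the example above.
volumes = [12.3, 12.1, 12.2, 12.3, 12.2]


def arithmetic_mean(values):
    """Add the values and divide the total by the number of values."""
    return sum(values) / len(values)


sample_mean = arithmetic_mean(volumes)
print(f"mean = {sample_mean:.2f} ounces")
```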
A disadvantage of the mean is that it is sensitive to every value, so one exceptional value can affect the mean dramatically.
The median largely overcomes this disadvantage.
The median of a data set is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude.
The median is often denoted by \( \tilde{x} \) (pronounced “x-tilde” or “x-curl”).
To find the median, first sort the values (arrange them in order), then follow one of these two procedures:
If the number of values is odd, the median is the number located in the exact middle of the list.
If the number of values is even, the median is found by computing the mean of the two middle numbers.
Example 1: Find the median of the following salaries (in millions of dollars) paid to female executives (based on data from Working Woman magazine):
6.72 3.46 3.60 6.44
Arranging the values in order gives
3.46 3.60 6.44 6.72
Since the number of values is even, the median is the mean of the two middle values: \( \tilde{x} = \dfrac{3.60 + 6.44}{2} = 5.02 \)
Then, the median is $5.02 million.
Example 2: Repeat Example 1, this time including another salary of $26.70 million. That is, find the median of the following salaries (in millions of dollars):
6.72 3.46 3.60 6.44 26.70
Arranging the values in order gives
3.46 3.60 6.44 6.72 26.70
Since the number of values is odd, the median is the value in the exact middle of the sorted list.
Then, the median is $6.44 million.
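The two-case procedure for the median can be sketched as follows. The helper name `median_of` is hypothetical; the code also prints the means so the effect of the extra $26.70 million salary on each measure can be compared.

```python
def median_of(values):
    """Sort the values, then take the middle one (odd count)
    or the mean of the two middle ones (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Salaries in millions of dollars, from the two examples above.
salaries_even = [6.72, 3.46, 3.60, 6.44]
salaries_odd = salaries_even + [26.70]

median_even = median_of(salaries_even)
median_odd = median_of(salaries_odd)
mean_even = sum(salaries_even) / len(salaries_even)
mean_odd = sum(salaries_odd) / len(salaries_odd)

print(f"4 salaries: median = {median_even:.2f}, mean = {mean_even:.3f}")
print(f"5 salaries: median = {median_odd:.2f}, mean = {mean_odd:.3f}")
```

The exceptional salary shifts the mean much more than it shifts the median, which is the disadvantage of the mean described earlier.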
The mode of a data set is the value that occurs most frequently.
The mode is often denoted by M.
When two values occur with the same greatest frequency, each one is a mode and the data is bimodal. When more than two values occur with the same greatest frequency, each is a mode and the data set is said to be multimodal. When no value is repeated, we say that there is no mode.
Find the modes of the following data sets.
1. 5 5 5 3 1 5 1 4 3 5
2. 1 2 2 2 3 4 5 6 6 6 7 9
3. 1 2 3 6 7 8 9 10
Solution
1. The number 5 is the mode because it is the value that occurs most often.
2. The numbers 2 and 6 are both modes because they occur with the same greatest frequency. This data set is bimodal.
3. There is no mode because no value is repeated.
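A minimal sketch of the mode rules, assuming the convention stated above that a data set with no repeated value has no mode. The function name `modes` is illustrative, and it returns an empty list for the "no mode" case.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency.
    An empty list means no value is repeated, so there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


set_1 = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5]
set_2 = [1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
set_3 = [1, 2, 3, 6, 7, 8, 9, 10]

for name, data in [("1", set_1), ("2", set_2), ("3", set_3)]:
    print(f"data set {name}: modes = {modes(data)}")
```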
The midrange is the value midway between the highest and the lowest values in the original data set. It is found using the formula
\( \text{midrange} = \dfrac{\text{highest value} + \text{lowest value}}{2} \)
Example: Find the midrange of the ages of people arrested on theft charges at the Dutchess County jail.
18 16 23 25 19 18 20 38
Solution: midrange \( = \dfrac{38 + 16}{2} = 27 \)
Summary: measures of center for the data set
2 2 2 20 34 45 210 (210 is an outlier)
Mean = 45; the average value
Median = 20; the middle value
Mode = 2; the value that occurs most often
Midrange = (2 + 210)/2 = 106; the value midway between the lowest and the highest values that occur in the data set
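As a cross-check, the four measures of center for the data set 2, 2, 2, 20, 34, 45, 210 can be computed with Python's standard `statistics` module. The midrange has no library function there, so it is computed directly from the minimum and maximum; the variable names are illustrative.

```python
import statistics

data = [2, 2, 2, 20, 34, 45, 210]  # 210 is the outlier

center_mean = statistics.mean(data)
center_median = statistics.median(data)
center_mode = statistics.mode(data)
center_midrange = (min(data) + max(data)) / 2

print("mean     =", center_mean)
print("median   =", center_median)
print("mode     =", center_mode)
print("midrange =", center_midrange)
```

The midrange depends only on the two extreme values, so the outlier 210 pulls it far above the median.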
A distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half.
A distribution of data is skewed if it is not symmetric and if it extends more to one side than the other.
Lopsided to the right = Skewed to the left = Negatively skewed
The mean and median are to the left of the mode. Although not always predictable, data with this type of distribution generally have the mean to the left of the median.
Lopsided to the left = Skewed to the right = Positively skewed
The mean and median are to the right of the mode. Although not always predictable, data with this type of distribution generally have the mean to the right of the median.
Grouped Data
When data are summarized in a frequency table, we do not know the exact values falling in a particular class. To make calculations possible, we pretend that within each class, all sample values are equal to the class midpoint.
Since each class midpoint is repeated a number of times equal to the class frequency, the sum of all sample values becomes \( \sum (f \cdot x) \), where f denotes the frequency and x represents the class midpoint. The total number of sample values is the sum of the frequencies, \( \sum f \).
\( \bar{x} = \dfrac{\sum (f \cdot x)}{\sum f} \)
Example from Lesson 2(a): Frequency Distribution
Class   Class limits   f    x (midpoint)   fx
1       28 - 34        1    31             31
2       35 - 41        4    38             152
3       42 - 48        10   45             450
4       49 - 55        9    52             468
5       56 - 62        9    59             531
6       63 - 69        4    66             264
7       70 - 76        1    73             73
8       77 - 83        2    80             160
Total                  40                  2129
\( \bar{x} = \dfrac{\sum (f \cdot x)}{\sum f} = \dfrac{2129}{40} = 53.225 \)
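The grouped-data mean can be sketched in Python from the class limits and frequencies in the table above. The tuple layout `(lower limit, upper limit, frequency)` and the variable names are assumptions made for this illustration.

```python
# (lower class limit, upper class limit, frequency) for each class.
classes = [
    (28, 34, 1), (35, 41, 4), (42, 48, 10), (49, 55, 9),
    (56, 62, 9), (63, 69, 4), (70, 76, 1), (77, 83, 2),
]

# Treat every value in a class as equal to the class midpoint.
total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
total_f = sum(f for _, _, f in classes)
grouped_mean = total_fx / total_f

print("sum of f*x =", total_fx)
print("sum of f   =", total_f)
print("mean       =", grouped_mean)
```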
Lesson 2(c)
To discuss the following key concepts:
Variation refers to the amount that values vary among themselves, and it can be measured with specific numbers.
Values that are relatively close together have lower measures of variation, and values that are spread farther apart have larger measures of variation.
The standard deviation, which is a particularly important measure of variation, can be computed.
The values of the standard deviation must be interpreted correctly.
Data sets may have the same center but look different because of the way the numbers spread out from the center.
[Figure: distributions with the same center but different range and unequal variability]
...Range ...Variance ...Standard Deviation ...Coefficient of Variation
The range is the difference between the largest observation and the smallest observation.
Its advantage is also its disadvantage: its simplicity. Because it is calculated from only two observations, it tells nothing about the other observations.
Population variance: \( \sigma^2 = \dfrac{\sum (x - \mu)^2}{N} \)
Sample variance: \( s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} \)
The population variance is represented by \( \sigma^2 \) (Greek letter sigma, squared).
To compute the sample variance \( s^2 \), begin by calculating the sample mean \( \bar{x} \), then compute the difference (also known as the deviation) between each observation and the mean. Square the deviations and sum them; finally, divide the sum of squared deviations by (n - 1).
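Example: consider the sample
8 4 9 11 3
The mean is \( \bar{x} = \dfrac{8 + 4 + 9 + 11 + 3}{5} = \dfrac{35}{5} = 7 \)
From each observation we determine the deviation from the mean (the difference between the value and the mean):
8 - 7 = 1    4 - 7 = -3    9 - 7 = 2    11 - 7 = 4    3 - 7 = -4
Squaring the deviations yields
(1)² = 1    (-3)² = 9    (2)² = 4    (4)² = 16    (-4)² = 16
Summing and dividing by (n - 1):
\( s^2 = \dfrac{1 + 9 + 4 + 16 + 16}{5 - 1} = \dfrac{46}{4} = 11.5 \)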
Formula
This is seldom used because of limited utility
The variance provides only a rough idea about the amount of variation in the data.
It is useful when comparing two or more sets of data.
Because the deviations from the mean are squared, the unit attached to the variance is also squared. This contributes to the problem of interpretation. The solution is the standard deviation.
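Example: The following are the numbers of summer jobs a sample of six students applied for. Find the mean and variance of these data.
17 15 23 7 9 13
The mean is \( \bar{x} = \dfrac{17 + 15 + 23 + 7 + 9 + 13}{6} = \dfrac{84}{6} = 14 \)
The sample variance is \( s^2 = \dfrac{(17-14)^2 + (15-14)^2 + (23-14)^2 + (7-14)^2 + (9-14)^2 + (13-14)^2}{6 - 1} = \dfrac{166}{5} = 33.2 \)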
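A short sketch of the sample variance for the summer jobs data, following the definition (sum of squared deviations divided by n - 1). The helper name `sample_variance` is hypothetical; `statistics.variance` from the standard library is used only as a cross-check.

```python
import statistics

jobs = [17, 15, 23, 7, 9, 13]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


jobs_mean = sum(jobs) / len(jobs)
jobs_variance = sample_variance(jobs)

print("mean     =", jobs_mean)
print("variance =", jobs_variance)
print("library  =", statistics.variance(jobs))
```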
The standard deviation of a set of sample values is a measure of variation of values about the mean.
Formula
Population standard deviation: \( \sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}} \)
(a) Sample standard deviation: \( s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} \)
(b) Shortcut formula for standard deviation: \( s = \sqrt{\dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} \)
Step 1: Find the mean of the values, \( \bar{x} \)
Step 2: Subtract the mean from each individual value to get a list of deviations of the form \( (x - \bar{x}) \)
Step 3: Square each of the differences obtained from Step 2
Step 4: Add all the squares obtained from Step 3 to get \( \sum (x - \bar{x})^2 \)
Step 5: Divide the total from Step 4 by the number (n - 1)
Step 6: Find the square root of the result of Step 5
Knowing the mean and standard deviation allows the statistician to extract useful bits of information. The information depends on the shape of the histogram. If the histogram is bell-shaped, the Empirical Rule is used:
Approximately 68% of all observations fall within one standard deviation of the mean
Approximately 95% of all observations fall within two standard deviations of the mean
Approximately 99.7% of all observations fall within three standard deviations of the mean
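Example: Calculate the variance and standard deviation for the five measurements given below.
5 7 1 2 4
Use the formulae \( s^2 = \dfrac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} \) and \( s = \sqrt{s^2} \)
Solution: using the shortcut formula
Table for simplified calculation of s² and s
xi       xi²
5        25
7        49
1        1
2        4
4        16
Σ = 19   Σ = 95
\( s^2 = \dfrac{95 - \frac{(19)^2}{5}}{5 - 1} = \dfrac{95 - 72.2}{4} = \dfrac{22.8}{4} = 5.7 \)
\( s = \sqrt{5.7} \approx 2.39 \)
Solution: using deviations from the mean, with \( \bar{x} = \dfrac{19}{5} = 3.8 \)
xi       xi - x̄     (xi - x̄)²
5        1.2        1.44
7        3.2        10.24
1        -2.8       7.84
2        -1.8       3.24
4        0.2        0.04
Σ = 19   0.0        22.80
\( s^2 = \dfrac{22.80}{5 - 1} = 5.7 \) and \( s = \sqrt{5.7} \approx 2.39 \)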
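Both routes to the standard deviation for the measurements 5, 7, 1, 2, 4 can be sketched in Python: the deviation method (Steps 1 to 6) and the shortcut based on the sums of x and x². The function names are illustrative, and the shortcut is written in the form (Σx² - (Σx)²/n)/(n - 1), which is one common way to state it.

```python
import math

measurements = [5, 7, 1, 2, 4]


def variance_by_deviations(values):
    """Steps 1-5: mean, deviations, squares, sum, divide by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_by_shortcut(values):
    """Shortcut: only the sum of x and the sum of x squared are needed."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


var_dev = variance_by_deviations(measurements)
var_short = variance_by_shortcut(measurements)
std_dev = math.sqrt(var_dev)  # Step 6: square root of the variance

print(f"variance (deviations) = {var_dev:.2f}")
print(f"variance (shortcut)   = {var_short:.2f}")
print(f"standard deviation    = {std_dev:.2f}")
```

The two methods agree up to floating-point rounding; the shortcut avoids computing each deviation, which is convenient for hand calculation.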
The coefficient of variation of a set of observations is the standard deviation of the observations divided by the mean.
Population: \( CV = \dfrac{\sigma}{\mu} \)
Sample: \( cv = \dfrac{s}{\bar{x}} \)
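A minimal sketch of the sample coefficient of variation, reusing the five measurements 5, 7, 1, 2, 4 from the worked standard deviation example. It relies on `statistics.stdev` (which divides by n - 1) and `statistics.mean`; the variable names are illustrative.

```python
import statistics

measurements = [5, 7, 1, 2, 4]

s = statistics.stdev(measurements)     # sample standard deviation
x_bar = statistics.mean(measurements)  # sample mean
cv = s / x_bar                         # coefficient of variation

print(f"s = {s:.3f}, mean = {x_bar}, cv = {cv:.3f}")
```

Because the standard deviation is divided by the mean, the result has no unit, which makes it usable for comparing variation across data sets measured on different scales.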
Calculate the variance of the following samples 9 3 7 4 7 5 4
Determine the variance and standard deviation of the following samples 12 6 22 31 23 15 13 15 17 21
Calculate the variance and standard deviation of the following samples
6.5 6.6 6.7 6.8 7.1 7.4 7.7 7.7 7.7 7.3
Lesson 2(d)
Measures of position provide information about the position of particular values relative to the entire data set.
Types: Median, Centiles (Percentile, Quartile, Decile), Z-score
A centile or centile point is defined as a specific point in a distribution which has a given percentage of the cases below it.
Centiles are widely used in educational circles in reporting the results of standardized tests.
Any Centile Point
\( C_p = LL + \left( \dfrac{pN - cf}{f_i} \right) i \)
where
LL = lower exact limit of the interval in which we are interpolating
N = number of cases
p = proportion corresponding to the desired centile
cf = cumulative frequency of cases below the interval in which we are interpolating
f_i = frequency of the interval in which we are interpolating
i = size of the class interval
Cumulative Relative Frequency Distribution
CI      LL   UL   f    Cum Rel Freq
1       28   34   1    0.025
2       35   41   4    0.125
3       42   48   10   0.375
4       49   55   9    0.600
5       56   62   9    0.825
6       63   69   4    0.925
7       70   76   1    0.950
8       77   83   2    1.000
Total             40
[Figure: Ogive (cumulative relative frequency curve) for classes 1 to 8, with cumulative percentages 2.50%, 12.50%, 37.50%, 60.00%, 82.50%, 92.50%, 95.00%, and 100%. The curve shows that 60% of the students who took the Geography Test got a score below 56 points; this point is marked C60.]
For example, the 60th centile (C60) is that point in a distribution which has 60% of the cases below it.
A frequency distribution of the scores of 376 boys on a test of mechanical ability is presented in the table below.
Cumulative Frequency and Percentage
CI      f     cf    cP
60-64   2     376   100
55-59   12    374   99.5
50-54   20    362   96.3
45-49   32    342   91.0
40-44   46    310   82.4
35-39   58    264   70.2
30-34   64    206   54.8
25-29   58    142   37.7
20-24   42    84    22.3
15-19   23    42    11.2
10-14   15    19    5.0
5-9     4     4     1.1
Total   376
To illustrate, consider C50.
By definition, C50 is the centile point that has 50% of the cases below it and 50% of the cases above it.
C50 is the midpoint of the distribution and is known as the median.
Hence, with N = 376, we are interested in finding the point in the distribution with 188 cases (50% of 376) below it and 188 cases above it.
Beginning from the bottom of the distribution, we add up the frequencies until we come as close as possible to 188 cases without exceeding it.
The 188th case falls in the class interval 30-34, above the interval 25-29. The exact lower limit of this interval, 29.5, has 142 cases below it. We need 46 more cases to reach 188 cases.
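We need, therefore, to interpolate within the interval 30-34, which contains 64 cases and has a class size of 5:
\( C_{50} = 29.5 + \left( \dfrac{188 - 142}{64} \right)(5) = 29.5 + 3.59 = 33.09 \)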
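We verify by coming down from the top: 170 cases lie above 34.5, the exact upper limit of the interval 30-34, so we need 18 more cases to reach 188, and \( 34.5 - \left( \dfrac{18}{64} \right)(5) = 34.5 - 1.41 = 33.09 \).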
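The interpolation for any centile point can be sketched in Python using the frequency distribution of the 376 boys. The function name `centile_point` and the tuple layout are assumptions made for this illustration; the code also assumes integer scores, so the exact lower limit of an interval is its lowest score minus 0.5 (29.5 for the interval 30-34, as in the text).

```python
# (lowest score, highest score, frequency), listed from the bottom up.
intervals = [
    (5, 9, 4), (10, 14, 15), (15, 19, 23), (20, 24, 42),
    (25, 29, 58), (30, 34, 64), (35, 39, 58), (40, 44, 46),
    (45, 49, 32), (50, 54, 20), (55, 59, 12), (60, 64, 2),
]


def centile_point(p, table):
    """Centile = LL + ((p*N - cf) / f_i) * i, interpolating inside the
    first interval whose cumulative frequency reaches p*N."""
    n = sum(f for _, _, f in table)
    target = p * n
    cf = 0  # cumulative frequency below the current interval
    for lowest, highest, f in table:
        if f > 0 and cf + f >= target:
            lower_exact_limit = lowest - 0.5
            width = highest - lowest + 1
            return lower_exact_limit + (target - cf) / f * width
        cf += f
    raise ValueError("p must be between 0 and 1")


c50 = centile_point(0.50, intervals)
print(f"C50 = {c50:.2f}")
```

Calling `centile_point(0.25, intervals)` or `centile_point(0.75, intervals)` in the same way would give the first and third quartiles of this distribution.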
Several of the centile points have special names:
C10: 1st Decile (D1)
C20: 2nd Decile (D2)
C25: 1st Quartile (Q1)
C50: Median
C75: 3rd Quartile (Q3)
[Figure: a distribution divided into four parts of 25% each by the lower quartile Q1, the median M, and the upper quartile Q3]
Example: Suppose you have been notified that your score of 610 on the Verbal Graduate Record Examination placed you at the 60th percentile in the distribution of scores. Where does your score of 610 stand in relation to the scores of others who took the examination?
Scoring at the 60th percentile means that 60% of all examination scores were lower than your score and 40% were higher.
[Figure: the distribution of scores divided into quarters of 25% each, with the 60th percentile marked; 60% of the scores lie below it and 40% lie above it]
Sample z-score: \( z = \dfrac{x - \bar{x}}{s} \)
A z-score measures the distance between an observation and the mean, measured in units of standard deviation. It is a valuable tool for determining whether the observation under consideration is likely to occur quite frequently or is somewhat unusual.
Example: Consider the sample of 10 measurements:
1 1 0 15 2 3 4 0 1 3
The measurement x = 15 appears to be unusually large. Calculate the z-score for this observation and state your conclusions.
x        x²
1        1
1        1
0        0
15       225
2        4
3        9
4        16
0        0
1        1
3        9
∑x = 30  ∑x² = 266
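The z-score for the suspected outlier is calculated as follows:
\( \bar{x} = \dfrac{\sum x}{n} = \dfrac{30}{10} = 3 \)
\( s = \sqrt{\dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\dfrac{266 - \frac{(30)^2}{10}}{9}} = \sqrt{19.56} \approx 4.42 \)
\( z = \dfrac{x - \bar{x}}{s} = \dfrac{15 - 3}{4.42} \approx 2.71 \)
The measurement x = 15 lies 2.71 standard deviations above the sample mean. Although the z-score does not exceed 3, it is close enough that you might suspect that x = 15 is an outlier. In any case, examine the sampling procedure to see whether x = 15 is a faulty observation.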
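The z-score for the suspected outlier can be checked with a short Python sketch. It assumes the sample formulas (standard deviation with n - 1 in the denominator); the helper name `z_score` is illustrative.

```python
import statistics

sample = [1, 1, 0, 15, 2, 3, 4, 0, 1, 3]


def z_score(x, values):
    """Distance of x from the sample mean, in sample standard deviations."""
    return (x - statistics.mean(values)) / statistics.stdev(values)


sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)
z_15 = z_score(15, sample)

print(f"mean = {sample_mean}, s = {sample_sd:.2f}, z = {z_15:.2f}")
```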