Mathematical Sciences Foundation
www.mathscifound.org
Copyright © Mathematical Sciences Foundation
Statistics
Statistics is the science of collecting, describing and interpreting data.
Descriptive Statistics
Descriptive statistics includes statistical methods for the collection, presentation and characterization of a set of data in order to describe the various features of that set of data. In general, methods of descriptive statistics include graphic methods and numerical measures. Bar charts, line graphs etc. comprise the graphic methods, whereas numerical measures include the measures of central tendency, dispersion and position.
Measures of Central Tendency/ Statistical Averages
Averages condense the information contained in a data set into a single number.
• This number is helpful in taking an overview of statistical data.
• This number is helpful in making comparisons between two or more data sets.
Characteristics of a Good Average
• It should be easy to calculate.
• It should be easy to comprehend.
• It should not be affected too much by fluctuations of the sample.
Measures of Central Tendency
• Mean
• Median
• Mode
Mean (Arithmetic Mean)
It is the sum of observations divided by the total number of observations. Mathematically,
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Arithmetic mean is affected by extreme values. Consider a situation where two samples differ in only one value.

                  Data Set 1    Data Set 2
                  6             6
                  10            10
                  5             38
                  7             7
                  4             4
                  8             8
Arithmetic Mean   6.667         12.167

Samples differing in one value
If we delete the extreme value 38 from Data Set 2, then the new arithmetic mean is 7.000.

Summary:
Data Set                                                        Arithmetic Mean
Data Set 1                                                      6.667
Data Set 2                                                      12.167
Data Set 2 after deleting the extreme value (6, 10, 7, 4, 8)    7.000

This gives motivation for another measure called the “Trimmed Mean”.
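The effect of the single extreme value can be checked with a short Python sketch. It uses the two data sets from the slide; the variable names are illustrative only.

```python
from statistics import mean

# The two samples from the slide; they differ only in the third value.
data_set_1 = [6, 10, 5, 7, 4, 8]
data_set_2 = [6, 10, 38, 7, 4, 8]

mean_1 = mean(data_set_1)
mean_2 = mean(data_set_2)

# Delete the extreme value 38 from Data Set 2 and recompute the mean.
data_set_2_without_38 = [x for x in data_set_2 if x != 38]
mean_2_without_38 = mean(data_set_2_without_38)

print(round(mean_1, 3), round(mean_2, 3), round(mean_2_without_38, 3))
```

One value out of six is enough to almost double the mean, which is why a measure that discards the tails can be useful.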
Trimmed Mean
It is the mean taken by excluding a percentage of data points from the top and bottom tails of a data set.
Note: Trimmed Mean should be calculated when one wishes to exclude outlying data from the analysis.
Median
It is the value of the data that occupies the middle position when the data is arranged in increasing or decreasing order.
Median
Consider a data set of size n. Arrange the data in increasing or decreasing order. The median is calculated in the following way:
If n is odd: the median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ term.
If n is even: the median is the mean of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ terms.
For example, suppose we need to find the median of the data set 20, 13, 16, 17, 11, 19, 12, 18.
Ranked data: 11, 12, 13, 16, 17, 18, 19, 20
No. of terms in data = 8
Median = mean of the $\left(\frac{8}{2}\right)^{\text{th}}$ and $\left(\frac{8}{2}+1\right)^{\text{th}}$ terms $= \frac{16 + 17}{2} = 16.5$
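The odd/even rule can be written out directly. The following Python sketch is one possible rendering; the function name `median_by_position` is made up for this illustration.

```python
def median_by_position(values):
    """Median via the positional rule: middle term, or mean of the two middle terms."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th term (1-based), i.e. index (n + 1) // 2 - 1
        return ordered[(n + 1) // 2 - 1]
    # n even: mean of the (n / 2)-th and (n / 2 + 1)-th terms (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


example = [20, 13, 16, 17, 11, 19, 12, 18]
result = median_by_position(example)
print(result)
```

For the eight values of the worked example, the two middle terms are 16 and 17.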
Note: The median is the number in the middle of an ordered set of numbers (observations); that is, half the numbers have values that are greater than the median, and half have values that are less.
Median is not affected by extreme values.
Example: Consider two sets of data
Data set 1: 6, 7, 8, 9, 9, 10
Data set 2: 6, 7, 8, 9, 9, 1100
In both cases the median is 8.5.
Mode
It is the value which occurs most frequently in a set of observations.
Characteristic of Mode
• Mode is not affected by extreme values.
Limitation of Mode
• Sometimes the mode may not be a true representative of the central value of a data set. For example, in 2, 3, 4, 5, 6, 10, 10 the mode is 10, which is the largest value rather than a central one.
Comparison: Mean, Median and Mode
The mean and median of a data set are unique, whereas a data set can have more than one mode.
Example: Consider the data set 1, 1, 1, 2, 2, 2, 3, 4, 5. The mean is 2.333 and the median is 2, but there are two modes, namely 1 and 2.
Comparison: Mean, Median and Mode
Consider the following data: 100, 100, 100, 421, 422, 423, 424, 425.
Mean = 301.875
Median = 421.5
Mode = 100
For such data the median is the best measure of central tendency among the three.
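Both comparison examples can be reproduced with Python's standard `statistics` module. This is only a sketch for checking the numbers; `multimode` is used so that every most frequent value is reported, not just the first.

```python
from statistics import mean, median, multimode

# Second comparison example: three low values pull the mean well below the median.
data = [100, 100, 100, 421, 422, 423, 424, 425]
summary = {
    "mean": mean(data),
    "median": median(data),
    "modes": multimode(data),
}

# First comparison example: mean and median are unique, but there are two modes.
two_mode_data = [1, 1, 1, 2, 2, 2, 3, 4, 5]
two_modes = multimode(two_mode_data)

print(summary, two_modes)
```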
Averages in Open Office Calc
What is Open Office Calc?
A powerful spreadsheet program that
• performs numerical computations
• can organize/summarize huge data sets
• carries out advanced statistical and financial analysis by solving complicated mathematical models
How to access Open Office Calc
Applications → Office → OpenOffice.org Spreadsheet
A First Look at Open Office Calc
[Screenshot: the Calc window, with the Name Box and the Input Line labelled.]
Points to note
• Rows are numbered as 1, 2, 3, …
• Columns are marked as A, B, C, …
• The Name Box always displays the currently selected cell.
Open Office Calc as a desk calculator
You can perform simple operations like addition, multiplication and division. Simply select a cell and enter the expression in the Input Line. Remember to begin the expression with an equals (=) sign. To compute 5+7 you have to enter =5+7.
Various Arithmetic Operations

Operation         Symbol
Addition          +
Subtraction       -
Multiplication    *
Division          /
Raise to power    ^
Averages using Calc functions
AVERAGE
Calculates the arithmetic mean of numeric arguments.
Syntax: AVERAGE(number1, number2, ...)
number1, number2, ... are numeric arguments for which you want the average.
Example: Refer to the worksheet “Averages”.
Remarks: AVERAGE
• The arguments must either be numbers, arrays or references that contain numbers.
• If an array or reference argument contains text or empty cells, those values are ignored; however, cells with the value zero are included.
• Arguments that contain TRUE evaluate as 1; arguments that contain FALSE evaluate as 0 (zero).
AVERAGEA
Calculates the arithmetic mean of the values in the list of arguments. In addition to numbers, text and logical values such as TRUE and FALSE are also included in the calculation.
Syntax: AVERAGEA(value1, value2, ...)
value1, value2, ... are arguments for which you want the average.
Example: Refer to the worksheet “AverageA”.
Remarks: AVERAGEA
• The arguments must be numbers, arrays or references.
• Array or reference arguments that contain text evaluate as 0 (zero). If the calculation should not include text values in the average, use the AVERAGE function.
• Arguments that contain TRUE evaluate as 1; arguments that contain FALSE evaluate as 0 (zero).
TRIMMEAN
Returns the mean of the interior of a data set.
Syntax: TRIMMEAN(array, alpha)
array is the array or range of values to trim and average.
alpha is the fraction of data points to exclude from the calculation. For example, if alpha = 0.2, 4 points are trimmed from a data set of 20 points (2 from the top and 2 from the bottom of the set).
Example: Refer to the worksheet “Trimmean”.
Remarks: TRIMMEAN
• If alpha < 0 or alpha > 1, TRIMMEAN returns an error value.
• TRIMMEAN rounds the number of excluded data points down to the nearest multiple of 2. If alpha = 0.1, 10 percent of 30 data points equals 3 points. For symmetry, TRIMMEAN excludes a single value from the top and a single value from the bottom of the data set.
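The trimming rule described in the remarks can be sketched in Python as follows. This is an illustration of the rule as stated on the slide, not Calc's actual implementation. The function name `trimmed_mean` is hypothetical, and rejecting a fraction of exactly 1 (which would leave nothing to average) is a choice made for this sketch.

```python
import math


def trimmed_mean(values, alpha):
    """Mean after dropping a fraction alpha of the points, split evenly over both tails."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must satisfy 0 <= alpha < 1 in this sketch")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("trimmed mean of an empty data set is undefined")
    # Total number of points to exclude, rounded down to a multiple of 2,
    # so that the same number is removed from the top and from the bottom.
    per_tail = math.floor(n * alpha) // 2
    kept = ordered[per_tail:n - per_tail]
    return sum(kept) / len(kept)


# 20 points, fraction 0.2: 4 points are trimmed (2 from each tail).
twenty_points = list(range(1, 21))
# 30 points, fraction 0.1: 3 is rounded down to 2, so 1 point per tail.
thirty_points = list(range(1, 31))

print(trimmed_mean(twenty_points, 0.2), trimmed_mean(thirty_points, 0.1))
```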
MEDIAN
Returns the median of the given numbers.
Syntax: MEDIAN(number1, number2, ...)
number1, number2, ... are numerical arguments for which you want the median.
Example: Refer to the worksheet “Averages”.
MODE
Returns the most frequently occurring value in an array or range of data.
Syntax: MODE(number1, number2, ...)
number1, number2, ... are arguments for which you want to calculate the mode.
Example: Refer to the worksheet “Averages”.
Measures of Dispersion
Let’s look at an example of three data sets:

              Observations            Mean
Data set 1    7    8    10   11   9    9
Data set 2    4    6    9    12   14   9
Data set 3    2    5    9    13   16   9
[Figure: the three data sets plotted on number lines running from 2 to 16.]
To capture the sense of the data, we need to measure the central location as well as the spread. This is carried out by the various measures of dispersion. The numerical values of the various measures of dispersion describe the amount of spread, or variability, in the data: these measures give large values for data which is more spread out and small values for data which is less spread out.
Characteristics of an Ideal Measure of Dispersion
• It should be easy to calculate and easy to understand.
• It should be affected as little as possible by fluctuations of sampling.
Common Measures of Dispersion
• RANGE
• MEAN DEVIATION
• VARIANCE
• STANDARD DEVIATION
Range
Range is the difference between the largest and the smallest value in the data. It can be determined by:
Range = Highest value - Lowest value
It gives a quick measurement of the spread.
Limitations of Range
It does not measure the spread of the majority of the data; it only measures the spread between the highest and lowest values.
[Figure: two distributions plotted on a vertical axis from 0 to 600.]
Range in both these distributions is the same, i.e. 300.
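The limitation of range can be illustrated numerically. The two data sets below are hypothetical, invented for this sketch and not the data behind the slide's charts; they are chosen so that both have a range of 300 while one is far more tightly clustered than the other.

```python
from statistics import pstdev

# Hypothetical data: same lowest and highest values, different spread in between.
clustered = [200, 340, 350, 350, 360, 500]
spread_out = [200, 200, 250, 450, 500, 500]

range_clustered = max(clustered) - min(clustered)
range_spread_out = max(spread_out) - min(spread_out)

# Population standard deviation reacts to every point, not only the two extremes.
sd_clustered = pstdev(clustered)
sd_spread_out = pstdev(spread_out)

print(range_clustered, range_spread_out)
print(round(sd_clustered, 1), round(sd_spread_out, 1))
```

The identical ranges hide the difference that the standard deviation picks up.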
Deviations from a Central Value
One way to measure the spread of a data set is to measure the distance of each data point $x_i$ from a central value, say $A$ (which could be the mean, median or mode). We define the deviation of $x_i$ from $A$ to be $x_i - A$.
Note: The sum of the deviations about the mean is zero, and consequently the mean deviation about the mean is also zero, which is not a useful statistic. One way to remove this neutralizing effect is to ignore the signs of the deviations, that is, to take their absolute values.
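The claim that deviations about the mean always sum to zero follows in one line from the definition of the mean, writing $N$ for the number of data points and $\bar{x}$ for their mean:

$$\sum_{i=1}^{N} (x_i - \bar{x}) = \sum_{i=1}^{N} x_i - N\bar{x} = N\bar{x} - N\bar{x} = 0$$

For instance, for the data 7, 8, 10, 11, 9 with mean 9, the deviations are $-2, -1, 1, 2, 0$, which sum to zero.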
Mean Absolute Deviation
Mean absolute deviation is the mean of the absolute values of the deviations from the mean of the data, i.e.
$$\text{Mean absolute deviation} = \frac{\sum_{i=1}^{N} |x_i - \bar{x}|}{N}$$
where $\bar{x}$ is the mean of the data.
Variance
The mean of the squares of the deviations about the mean is called the variance, i.e.
$$\text{variance} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$$
where $\bar{x}$ is the mean and $N$ is the size of the population.
Standard Deviation
The positive square root of the variance is called the standard deviation, i.e.
$$\text{standard deviation} = \sqrt{\text{variance}}$$
or
$$\text{standard deviation} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$$
where $\bar{x}$ is the mean and $N$ is the size of the population.
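As a rough check of the three definitions, the following Python sketch applies them to the three data sets with mean 9 introduced at the start of this section. The function names are illustrative; all three use the population form (division by N).

```python
from math import sqrt


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    n = len(values)
    centre = sum(values) / n
    return sum(abs(x - centre) for x in values) / n


def population_variance(values):
    """Mean of the squared deviations from the mean (divides by N)."""
    n = len(values)
    centre = sum(values) / n
    return sum((x - centre) ** 2 for x in values) / n


def population_std_dev(values):
    """Positive square root of the population variance."""
    return sqrt(population_variance(values))


data_sets = {
    "Data set 1": [7, 8, 10, 11, 9],
    "Data set 2": [4, 6, 9, 12, 14],
    "Data set 3": [2, 5, 9, 13, 16],
}

for name, values in data_sets.items():
    print(
        name,
        round(mean_absolute_deviation(values), 3),
        round(population_variance(values), 3),
        round(population_std_dev(values), 3),
    )
```

All three data sets share the same mean, yet every measure grows from Data set 1 to Data set 3, matching the visual impression of increasing spread.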
Measures of dispersion using Calc functions
Range
There is no built-in function to calculate range directly. We can calculate range by taking the difference of the maximum value and the minimum value of the data set. The following formula can be used to calculate range:
= MAX(value1, value2, …) - MIN(value1, value2, …)
Example: Refer to the worksheet “Dispersion”.
AVEDEV
Returns the average of the absolute deviations of data points from their mean.
Syntax: AVEDEV(number1, number2, ...)
number1, number2, ... are 1 to 30 arguments for which you want the average of the absolute deviations.
Example: Refer to the worksheet “Dispersion”.
VARP
Calculates variance based on the entire population.
Syntax: VARP(number1, number2, ...)
number1, number2, ... are 1 to 30 number arguments corresponding to a population.
Example: Refer to the worksheet “Dispersion”.
VARPA
Calculates variance based on the entire population. In addition to numbers, text and logical values such as TRUE and FALSE are included in the calculation.
Syntax: VARPA(value1, value2, ...)
value1, value2, ... are 1 to 30 value arguments corresponding to a population.
STDEVP
Calculates standard deviation based on the entire population given as arguments.
Syntax: STDEVP(number1, number2, ...)
number1, number2, ... are 1 to 30 number arguments corresponding to a population.
Example: Refer to the worksheet “Dispersion”.
MEASURES OF POSITION
PERCENTILE
Consider the data set $x_1, x_2, \ldots, x_n$. Percentiles are the numbers which divide the ordered data set into 100 equal-sized data subsets. For any data set, there are 99 percentiles, denoted by $P_1, P_2, \ldots, P_{99}$.
For instance, $P_2$, the second percentile, is a number such that at most 2% of the data points are less than it and at most 98% of the data points are greater than it.
How to find percentile of a data set?
Suppose $x_1, x_2, \ldots, x_{101}$ is a data set arranged in increasing order, i.e., $x_1 \le x_2 \le \cdots \le x_{100} \le x_{101}$.
Here $P_1 = x_2$ because at most 1% of the data points are less than $x_2$ and at most 99% of the data points are more than $x_2$.
$P_{20} = x_{21}$ because at most 20% of the data points are less than $x_{21}$ and at most 80% of the data points are more than $x_{21}$.
How to find percentile of a data set?
Suppose $x_1, x_2, \ldots, x_{10}$ is a data set arranged in increasing order, i.e., $x_1 \le x_2 \le \cdots \le x_{10}$.
Here we do not have data points that can divide the data set into 100 equal parts. In such a situation, percentiles are calculated in the following way:
[Figure: the points $x_1, x_2, \ldots, x_{10}$ on a line, with each of the 9 intervals between consecutive points labelled 11.1%.]
Here we have 9 intervals. The complete data constitutes 100%. We distribute this 100% over 9 intervals so that each interval contains $\frac{100\%}{9} \approx 11.1\%$.
Hence, $x_2 = P_{11.1}$, $x_3 = P_{22.2}$, $x_4 = P_{33.3}$, ...
Suppose we want to find $P_{20}$. As $x_2 = P_{11.1}$ and $x_3 = P_{22.2}$, $P_{20}$ lies between $x_2$ and $x_3$. To find the exact value of $P_{20}$, we follow these steps:
Step 1: Count the number of intervals between the data points. If there are $n$ data points, then there will be $n - 1$ intervals. In the above example there are 10 - 1 = 9 intervals.
Step 2: To find $P_m$ we calculate the number
$$p = \frac{n - 1}{100} \times m$$
and write $p$ as the sum of its integer part $i$ and fractional part $f$: $p = i + f$.
In our example, we wish to find $P_{20}$. Hence,
$$p = \frac{10 - 1}{100} \times 20 = 1.8 = 1 + 0.8$$
Thus, $i = 1$, $f = 0.8$.
Step 3: The $m^{\text{th}}$ percentile $P_m$ is given by
$$P_m = x_{i+1} + f\,(x_{i+2} - x_{i+1})$$
Thus, in our example
$$P_{20} = x_{1+1} + 0.8\,(x_{1+2} - x_{1+1}) = x_2 + 0.8\,(x_3 - x_2)$$
Example: Find the 20th percentile of the data set 12, 13, 15, 18, 19, 20, 23, 24, 29.
Step 1: There are 9 data points. Thus the number of intervals = 9 - 1 = 8.
Step 2: To calculate $P_{20}$ we find the number
$$p = \frac{9 - 1}{100} \times 20 = 1.6 = 1 + 0.6$$
Thus, $i = 1$ and $f = 0.6$.
Example: Find the 20th percentile of the data set 12, 13, 15, 18, 19, 20, 23, 24, 29 (continued).
Step 3: Thus $P_{20}$ is given by
$$P_{20} = x_{1+1} + 0.6\,(x_{1+2} - x_{1+1}) = x_2 + 0.6\,(x_3 - x_2) = 13 + 0.6\,(15 - 13) = 14.2$$
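Steps 1 to 3 translate into a short Python function. This is a sketch of the procedure as described in the slides; the name `percentile` and the handling of m = 0 and m = 100 (where no further interpolation is possible) are assumptions of this illustration.

```python
def percentile(values, m):
    """m-th percentile by linear interpolation over the n - 1 intervals."""
    if not 0 <= m <= 100:
        raise ValueError("m must lie between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    # Step 2: p = (n - 1) * m / 100, split into integer part i and fraction f.
    p = (n - 1) * m / 100
    i = int(p)
    f = p - i
    if i + 1 >= n:
        # m = 100: there is no x_(i+2); the percentile is the largest value.
        return ordered[-1]
    # Step 3: P_m = x_(i+1) + f * (x_(i+2) - x_(i+1)).
    # With 0-based indexing, x_(i+1) is ordered[i] and x_(i+2) is ordered[i + 1].
    return ordered[i] + f * (ordered[i + 1] - ordered[i])


data = [12, 13, 15, 18, 19, 20, 23, 24, 29]
p20 = percentile(data, 20)
print(round(p20, 3))
```

With 101 ordered points the same function returns the 21st point for m = 20, which agrees with the earlier 101-point example.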
Quartile
Consider the 25th, 50th and 75th percentiles, i.e., $P_{25}$, $P_{50}$ and $P_{75}$. These percentiles divide the ordered data set into four equal parts. These percentiles are known as Quartiles.
$P_{25}$ is known as the first quartile and is denoted by $Q_1$.
$P_{50}$ is known as the second quartile and is denoted by $Q_2$. It is also equal to the median.
$P_{75}$ is known as the third quartile and is denoted by $Q_3$.
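The quartiles of the nine-point example data set can be checked in Python. The sketch below uses `statistics.quantiles` from the standard library with `method="inclusive"`; as far as can be told from its documentation, that method interpolates at position (n - 1)·k/4, which is the same rule as the percentile steps above, but treat that correspondence as an assumption of this sketch.

```python
from statistics import median, quantiles

data = [12, 13, 15, 18, 19, 20, 23, 24, 29]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == median(data))  # the second quartile coincides with the median
```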
Percentiles using Open Office Calc functions
PERCENTILE
Returns the kth percentile of values in a range.
Syntax: PERCENTILE(data, alpha)
data is the range of data.
alpha is the percentile value in the range 0…1, inclusive.
Note: For the 1st percentile, alpha = 0.01; for the 15th percentile, alpha = 0.15; and so on.
Example: Refer to the worksheet “Percentile”.
QUARTILE
Returns the quartile of a data set.
Syntax: QUARTILE(array, quart)
array is the array or cell range of numeric values for which the quartile value is to be calculated.
quart indicates the quartile to be calculated.
Note: For the 1st quartile, quart = 1; for the 2nd quartile, quart = 2; and for the 3rd quartile, quart = 3.
Example: Refer to the worksheet “Percentile”.
Histogram
A histogram is a graphical display based on the frequency table.
FREQUENCY function in OPEN OFFICE CALC

Class Interval    Classes
<=7000            7000
7000-7500         7500
7500-8000         8000
8000-8500         8500
8500-9000         9000
9000-9500         9500
9500-10000        10000
10000-10500       10500
10500-11000       11000
11000-11500       11500
11500-12000       12000
12000-12500       12500

The FREQUENCY function can be used to construct a frequency table.
FREQUENCY function (cont..)
[Screenshots: steps 1 to 4 of applying the FREQUENCY function. Step 3: Select the data range and class range.]
FREQUENCY function (cont..)

Class Interval    Classes    Frequency
<=7000            7000       2
7000-7500         7500       1
7500-8000         8000       3
8000-8500         8500       0
8500-9000         9000       12
9000-9500         9500       15
9500-10000        10000      23
10000-10500       10500      20
10500-11000       11000      15
11000-11500       11500      4
11500-12000       12000      3
12000-12500       12500      2
[Figure: Bar graph of frequency table. Horizontal axis: classes from 7000 to 12500; vertical axis: frequency from 0 to 25.]