HS550: Statistical Methods (3-0-2-4) (Feb-June Semester, 2019)
Course instructor: Shyamasree Dasgupta Office: A1-303; Phone: 1905267122 Drop me an email anytime at
[email protected] Indian Institute of Technology Mandi
2/28/2019
Module 1: Representation of Data and Descriptive Statistics [Week 1-3 (7 lectures)]*
• How to represent the data that you have: tables and charts?
• Who is the “one” representative value of the dataset?
• What is the average deviation from the representative value?
• If you have data on two variables, how to check their relationship?
* No classes on 20th and 22nd Feb
Books (Also consult Basic Statistics by Nagar and Das)
Let’s appreciate the need for “quantity” as well as “quality”
Comparison between these two events is possible only when both quantitative and qualitative information is available
Jallianwala Bagh:
• The British Indian Army killed a huge number of civilians in Jallianwala Bagh
• Around 400 people were killed
• Troops opened fire on a group of unarmed, nonviolent protesters and pilgrims
• It happened in the year 1919
Thuggees:
• Thuggees killed a huge number of civilians in various parts of India
• Around 2 million individuals were killed
• It was a part of the criminal activities of the Thuggees and was related to robbery
• It happened over a period of 600 years (1290-1870)
• The distinction between quantitative and qualitative information, as it is often articulated, is misleading.
• Both are equally important to know whether an event/a finding is typical or atypical.
Tracing the history of data representation
Mortality Table of John Graunt (1661)
Overcoming the problem of “can’t see the forest for the trees”
Scottish imports/exports by W. Playfair (1786)
1859: Florence Nightingale’s polar area diagram
Not the numbers but the arrangement of the numbers tells the story!
Types of Data
• Nominal variable (or attribute-based variable): Pass/Fail? A category
• Ordinal variable: 1st/2nd/3rd/...../last but one/last? A rank
• Cardinal variable: height? 5.5 ft. A number
Observe Data_1 carefully and identify the variables as Nominal, Ordinal or Cardinal. Also observe that all the cardinal variables can be converted to ordinal variables.
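The conversion of a cardinal variable into an ordinal one can be sketched as below. The names and heights are made-up values for illustration (they are not taken from Data_1), and letting tied values share a rank is only one of several possible conventions.

```python
# Hypothetical heights in ft (illustrative values, not from Data_1)
heights = {"A": 5.5, "B": 6.1, "C": 4.9, "D": 5.5}

def to_ranks(values):
    """Rank from tallest (1st) downwards; equal values share the same rank."""
    distinct = sorted(set(values.values()), reverse=True)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return {name: rank_of[v] for name, v in values.items()}

ranks = to_ranks(heights)
for name in heights:
    print(name, heights[name], "ft -> rank", ranks[name])
```

Note that the ranks keep the order of the heights but drop the distances between them, which is why the conversion only works in one direction.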
What are the heights of the students in IIT Mandi?
Height of Roll no. ..... is 5.2 ft
Height of Roll no. ..... is 4.9 ft
and so on....
When data is in a raw format, the first task is to arrange it in a meaningful manner. You may lose some of the details while doing it, but that’s fine!
Arrangement makes life easy!
• Height (in ft) of 30 students: 5.2, 5.9, 4.9, 5.6, 6.1, 4.9, 5.5, 5.8, 5.7, 6.0, 5.0, 6.2, 5.7, 4.8, 5.8, 5.6, 5.7, 6.0, 4.8, 5.7, 5.9, 5.4, 5.2, 4.8, 5.4, 5.2, 5.2, 5.4, 5.7, 5.7
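A minimal sketch of one such arrangement: sort the 30 heights and tally how often each value occurs. The third value is read as 4.9, which is an assumption about the typo in the list; the choice of a simple tally (rather than class intervals) is also just one option.

```python
from collections import Counter

# Heights (in ft) of the 30 students, third value read as 4.9
heights = [5.2, 5.9, 4.9, 5.6, 6.1, 4.9, 5.5, 5.8, 5.7, 6.0,
           5.0, 6.2, 5.7, 4.8, 5.8, 5.6, 5.7, 6.0, 4.8, 5.7,
           5.9, 5.4, 5.2, 4.8, 5.4, 5.2, 5.2, 5.4, 5.7, 5.7]

freq = Counter(heights)              # height -> number of students
for h in sorted(freq):               # arrange from shortest to tallest
    print(f"{h:.1f} ft: {'*' * freq[h]} ({freq[h]})")
```

The individual roll numbers are lost in the tally, but the shape of the data becomes visible at a glance.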
Tabular representation of data
A table is prepared to represent the summary of the data. The table that you want to create out of any raw data should depend on your research objective. The same data can be tabulated in various ways to answer the particular research question that you are trying to address. Further, ask yourself the following questions before you proceed to create any table.
• Tables based on Cardinal data? Here you are the one to construct class intervals, which will act as categories
• Tables based on Nominal data? Here you know your categories
• A table for one variable? This is rather simple!
• A table for more than one variable? Think carefully how you would like to create subgroups of a variable!
Representing one variable in a table
Table 1: Distribution of households according to ownership of agricultural land (Based on Data_1)
Category      Classes        Classes     Midpoint  No. of households  % of
              (in hectare)   (in acre)   (xi)      (fi)               households
Landless      0              0           0         11                 46%
Marginal      <1             <2.5        1.25      2                  8%
Small         1-2            2.5-5       3.75      2                  8%
Semi-medium   2-4            5-10        7.5       0                  0%
Medium        4-10           10-30       20        2                  8%
Large         >10            >30         45        7                  29%
Total                                              24                 100%
Observe that there is a logic behind defining the classes in such a manner. [Note: In India, the following categories of landholdings are generally used: Marginal: <1 ha; Small: 1.01–2 ha; Semi-medium: 2–4 ha; Medium: 4–10 ha; Large: >10 ha. However, to use these categories as your classes, you have to convert the landholding from acre to hectare (ha), where 1 ha = 2.5 acre (approx.)]
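The conversion and classification described in the note can be sketched as follows. The landholding values are hypothetical, the function name is made up for illustration, and treating each upper class boundary as inclusive is an assumption (the note does not say where a holding of exactly 2 ha or 4 ha belongs).

```python
ACRES_PER_HECTARE = 2.5  # approximation used in the note

def landholding_class(acres):
    """Convert acres to hectares and return the landholding category.
    Assumption: upper class boundaries are inclusive."""
    hectares = acres / ACRES_PER_HECTARE
    if hectares == 0:
        return "Landless"
    if hectares <= 1:
        return "Marginal"
    if hectares <= 2:
        return "Small"
    if hectares <= 4:
        return "Semi-medium"
    if hectares <= 10:
        return "Medium"
    return "Large"

# Hypothetical landholdings in acres (not taken from Data_1)
sample = [0, 1.5, 4, 8, 20, 45]
labels = [landholding_class(a) for a in sample]
for a, label in zip(sample, labels):
    print(f"{a} acre -> {label}")
```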
Representing 2 variables in one table
Table 2: Distribution of households according to their castes (mentioned as 'category') in various villages (Based on Data_1)
Village \ Caste     SC   ST  OBC  Gen  Total
Paschim Malipur      0    0    0    9      9
Sherpara             0    0    1    6      7
Madanpur             0    0    0    3      3
Gopalpur             3    0    0    1      4
Purushia             0    1    0    1      2
Total                3    1    1   20     25
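A two-variable table like Table 2 is a cross-tabulation: each household is counted once in the cell for its (village, caste) pair. The sketch below rebuilds one record per household from the counts in Table 2 (the raw Data_1 file is not reproduced here, so this reconstruction is an assumption) and then tallies cells and margins with the standard library.

```python
from collections import Counter

# One (village, caste) pair per household, rebuilt from the counts in Table 2
records = (
    [("Paschim Malipur", "Gen")] * 9
    + [("Sherpara", "OBC")] * 1 + [("Sherpara", "Gen")] * 6
    + [("Madanpur", "Gen")] * 3
    + [("Gopalpur", "SC")] * 3 + [("Gopalpur", "Gen")] * 1
    + [("Purushia", "ST")] * 1 + [("Purushia", "Gen")] * 1
)

cell = Counter(records)                        # counts per (village, caste)
row_total = Counter(v for v, _ in records)     # households per village
col_total = Counter(c for _, c in records)     # households per caste

castes = ["SC", "ST", "OBC", "Gen"]
villages = ["Paschim Malipur", "Sherpara", "Madanpur", "Gopalpur", "Purushia"]

print(f"{'Village':<16}" + "".join(f"{c:>5}" for c in castes) + f"{'Total':>7}")
for v in villages:
    counts = "".join(f"{cell[(v, c)]:>5}" for c in castes)
    print(f"{v:<16}{counts}{row_total[v]:>7}")
totals = "".join(f"{col_total[c]:>5}" for c in castes)
print(f"{'Total':<16}{totals}{len(records):>7}")
```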
Table 3: Distribution of households according to caste, religion and monthly expenditure
Parts of a table: Title (the heading of the table), Caption (the column headings), Stub (the row headings), Body of the Table (the cell entries)
Expenditure           SC            ST           OBC           Gen          Total
(in Rs.)           H   M   T     H   M   T     H   M   T     H   M   T     H   M   T
<5000              2   0   2     0   0   0     0   1   1     5   5  10     7   6  13
5000-10000         1   0   1     0   0   0     0   0   0     4   2   6     5   2   7
10000-15000        0   0   0     1   0   1     0   0   0     1   1   2     2   1   3
>15000             0   0   0     0   0   0     0   0   0     0   1   1     0   1   1
Total              3   0   3     1   0   1     0   1   1    10   9  19    14  10  24
Observe that there is a column that displays the total number of households under each category. H: Hindu, M: Muslim, T: Total. Since one data point is missing under the variable monthly expenditure, the total count is 24 rather than 25.
Frequency Distribution
Table 1: Distribution of households according to ownership of agricultural land
Category   Classes    Midpoint  No. of households  Cumulative frequency (Fi)  Relative frequency  Freq. density
           (in acre)  (xi)      (fi)               < type       > type        (fi/N)              (fi/class length)
Landless   0          0         11                 11           24            0.46                -
Marginal   0-2.5      1.25      2                  13           13            0.08                0.8
Small      2.5-5      3.75      2                  15           11            0.08                0.8
Semi-med   5-10       7.5       0                  15           9             0.00                0
Medium     10-30      20        2                  17           9             0.08                0.1
Large      30-60      45        7                  24           7             0.29                0.23
Total                           N=24                                          1
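The derived columns of the frequency distribution can be recomputed from the class limits and frequencies alone. The sketch below is one way to do it; the variable names are illustrative, and the landless class is given no frequency density because its class length is zero.

```python
from itertools import accumulate

# (category, lower limit, upper limit) in acres, with frequencies fi
classes = [("Landless", 0, 0), ("Marginal", 0, 2.5), ("Small", 2.5, 5),
           ("Semi-med", 5, 10), ("Medium", 10, 30), ("Large", 30, 60)]
freq = [11, 2, 2, 0, 2, 7]
N = sum(freq)

less_than = list(accumulate(freq))                            # "<" type cumulative frequency
more_than = [N - c + f for c, f in zip(less_than, freq)]      # ">" type cumulative frequency
relative = [round(f / N, 2) for f in freq]                    # fi / N
density = [f / (hi - lo) if hi > lo else None                 # fi / class length
           for (_, lo, hi), f in zip(classes, freq)]

for (name, lo, hi), f, lt, mt, r, d in zip(classes, freq, less_than, more_than, relative, density):
    shown = "-" if d is None else f"{d:.2f}"
    print(f"{name:<10}{f:>4}{lt:>4}{mt:>4}{r:>6.2f}{shown:>6}")
```

The two cumulative columns answer different questions: the "<" type counts households up to and including a class, while the ">" type counts households in that class or above.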
Diagrammatic representation of data
Figure 1: Distribution of households according to ownership of agricultural land
[Two panels: a bar/column diagram of the no. of households in each landholding class (Landless, Marginal, Small, Semi-medium, Medium, Large), and a pie chart of the percentage share of each class.]
Figure 2: Distribution of households according to their castes in various villages
[Panels: a bar diagram and stacked bar diagrams (one in counts, one scaled to 100%) of the no. of households in each village (Paschim Malipur, Sherpara, Madanpur, Gopalpur, Purushia), with one series per caste (SC, ST, OBC, Gen).]
Figure 3: Distribution of households according to caste, religion and monthly expenditure
[Stacked bar diagram: no. of households for each caste (SC, ST, OBC, Gen) and religion (H, M, T), stacked by monthly expenditure class (<5000, 5000-10000, 10000-15000, >15000).]
• Histogram is another way of representation
  – isto-s: ‘mast’/something set upright
  – gram-ma: something written/graphics
  – Histogram
Not really!! This is a column diagram. Also remember, the term ‘histogram’ was coined by the statistician Karl Pearson while talking about the geometry of statistics (1892).
• Have a careful look at the monthly expenditure data
• Identify the highest value and the lowest value
• The highest is 18000 and the lowest is 2000 (in INR)
• So, consider the range 2000 INR – 18000 INR
• Bin the range into a series of intervals (contiguous and non-overlapping) and identify the frequency corresponding to each interval
• Bins may start below the lowest value and end above the highest value
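The binning step can be sketched as below. The expenditure values are hypothetical (only the lowest, 2000, and the highest, 18000, come from the slide), the function name is made up, and counting a value that falls exactly on a boundary in the higher bin is an assumed convention.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in [start + k*width, start + (k+1)*width) for k = 0..n_bins-1."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if not 0 <= k < n_bins:
            raise ValueError(f"{v} falls outside the bins")
        counts[k] += 1
    return counts

# Hypothetical monthly expenditures in INR
spend = [2000, 4500, 5000, 7000, 9999, 12000, 18000]

# Bins 0-5000, 5000-10000, 10000-15000, 15000-20000 cover the range 2000-18000
counts = bin_counts(spend, start=0, width=5000, n_bins=4)
print(counts)
```

Starting the first bin at 0 and ending the last at 20000 keeps the class limits round while still covering every observation.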
Frequency Table
Class Interval           Midpoint  Frequency  Freq. Density
(expenditure in INR)     (xi)      (fi)
0-5000                   2500      13         0.0026
5000-10000               7500      7          0.0014
10000-15000              12500     3          0.0006
15000-20000              17500     1          0.0002
Total                              24         0.0048
[Histogram with frequency curve. Horizontal axis: monthly expenditure in INR (class midpoints 2500, 7500, 12500, 17500). Vertical axis: labelled "Proportion of households", scaled from 0.0000 to 0.0050.]
1. Widths are proportional to class lengths
2. Heights are proportional to frequency density
3. The area of each bar represents the frequency
4. Notice, histogram is appropriate even when the class intervals are unequal
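The link between width, height and area can be checked numerically. In the sketch below the last two expenditure classes are merged into one class, 10000-20000 with frequency 3 + 1 = 4, purely to create an unequal class interval for illustration; this merged class is an assumption and does not appear in the frequency table.

```python
# (lower limit, upper limit, frequency); the third class is the illustrative merge
classes = [(0, 5000, 13), (5000, 10000, 7), (10000, 20000, 4)]

bars = []
for lo, hi, f in classes:
    width = hi - lo           # class length
    height = f / width        # frequency density
    bars.append((width, height))

areas = [w * h for w, h in bars]   # width x height recovers the frequency
for (lo, hi, f), (w, h), a in zip(classes, bars, areas):
    print(f"{lo}-{hi}: height {h:.4f}, area {a:.1f}, frequency {f}")
```

Because the height is divided by the class length, the wider merged class gets a lower bar, and its area still equals its frequency. Plotting raw frequencies as heights would overstate wide classes.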
Central tendency and dispersion
Type of data          Measure of central tendency   Measure of dispersion
Cardinal              Mean                          Standard deviation
Ordinal               Median                        Quartile deviation
Nominal/Attributes    Mode                          Range
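As a rough illustration, the measures in the table can be computed with Python's standard `statistics` module on the 30 student heights from the earlier slide (third value read as 4.9). Heights are cardinal, so all three pairs can be computed on them here. Two choices below are assumptions rather than anything the slides specify: the standard deviation divides by N (`pstdev`), and the quartiles use the module's default method.

```python
import statistics as st

# Heights (in ft) of the 30 students, third value read as 4.9
heights = [5.2, 5.9, 4.9, 5.6, 6.1, 4.9, 5.5, 5.8, 5.7, 6.0,
           5.0, 6.2, 5.7, 4.8, 5.8, 5.6, 5.7, 6.0, 4.8, 5.7,
           5.9, 5.4, 5.2, 4.8, 5.4, 5.2, 5.2, 5.4, 5.7, 5.7]

mean = st.mean(heights)
sd = st.pstdev(heights)                    # divides by N; st.stdev divides by N - 1

median = st.median(heights)
q1, _, q3 = st.quantiles(heights, n=4)     # quartiles, default method
quartile_deviation = (q3 - q1) / 2

mode = st.mode(heights)
value_range = max(heights) - min(heights)

print(f"mean {mean:.2f}, standard deviation {sd:.2f}")
print(f"median {median:.2f}, quartile deviation {quartile_deviation:.2f}")
print(f"mode {mode:.1f}, range {value_range:.1f}")
```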
Correlation: Chalk and Talk!
Food for thought…
1. The household that spends INR 18000 has 14 family members, whereas the household that spends INR 7000 has only 3 family members!! How do you take this information into account?
2. A histogram can be drawn with unequal class intervals. Why? Try creating a histogram based on the data on landholding.
3. How do you calculate the appropriate central value and dispersion for the variables in the data set?
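One possible starting point for the first question, offered as a suggestion rather than as the intended answer, is to compare expenditure per family member instead of expenditure per household. The labels in the sketch are made up for illustration; the figures are the ones quoted in the question.

```python
# (monthly expenditure in INR, number of family members)
households = {
    "spends INR 18000": (18000, 14),
    "spends INR 7000": (7000, 3),
}

per_capita = {name: spend / members
              for name, (spend, members) in households.items()}

for name, value in per_capita.items():
    print(f"Household that {name}: about INR {value:.0f} per member")
```

On a per-member basis the ordering of the two households reverses, which suggests that household size may matter when interpreting the expenditure data.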
Questions?
Cartoon courtesy: The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith