UNDERSTANDING DATA
Dr. Rohit Vishal Kumar
Reader, Department of Marketing
Xavier Institute of Social Service
PO Box No 7, Purulia Road
Ranchi – 834001, Jharkhand, India
Email:
[email protected]
(C) Rohit Vishal Kumar
Presented at WBUT 25-May-09
What is Data?
• Observations of a set of variables
• The lowest level of abstraction from which information is derived
• Each discipline has evolved its own method of classifying data
• Two broad classes of data, based on source:
  – Primary Data: data collected first-hand from a primary source
  – Secondary Data: data collected from a secondary source, that is, already gathered by someone else
Classification :: Statistics
• Categorical Data
  – The objects are grouped into categories based on some qualitative trait
  – The resulting data are merely labels or categories
  – Example:
    • Hair Color: Brown / Black / Red
    • Opinion on Smoking: Favor / Neutral / Against
• Measurement Data
  – The objects are “measured” on some quantitative trait
  – The resulting data are a set of numbers
  – Example:
    • Age of the students
    • JEMAT score
    • Number of students not attending class
Categorical Data
• Nominal Data
  – A type of categorical data in which numbers act only as labels, without any numerical meaning
  – Example:
    • Male: 1
    • Female: 2
• Ordinal Data
  – A type of categorical data in which numbers act as a guide to the level of importance (rank order) of the objects
  – Example:
    • Mild
    • Moderate
    • Severe
Measurement Data
• Discrete Data
  – Only certain values are possible
  – There are gaps between the possible values
  – Generated through the process of counting
  – Example:
    • Number of students in the class
    • Number of employees absent from work
• Continuous Data
  – Any value within an interval is possible, given a suitable measuring device
  – Theoretically, the number can be accurate to any desired number of decimal places
  – Generated through the process of measurement
  – Example:
    • Height in cm
    • Time to complete the assignment
Classification :: Scaling Theory
• Nominal Data (Order: No / Distance: No / Origin: No)
  – A type of categorical data in which numbers act only as labels, without any numerical meaning
  – Example:
    • Male: 1
    • Female: 2
• Ordinal Data (Order: Yes / Distance: No / Origin: No)
  – A type of categorical data in which numbers act as a guide to the level of importance (rank order) of the objects
  – Example:
    • Mild
    • Moderate
    • Severe
Classification :: Scaling Theory
• Interval Data (Order: Yes / Distance: Yes / Origin: No)
  – Quantitative data that do not have a real zero point
  – Allows comparison within the scale, but values cannot be compared outside the scale
  – Widely used in social research, although many researchers are not clear about what an interval scale is
  – Example:
    • Definitely Will Buy / Probably Will Buy / May or May Not Buy / Probably Will Not Buy / Definitely Will Not Buy
• Ratio Data (Order: Yes / Distance: Yes / Origin: Yes)
  – Quantitative data that have a real zero point
  – Allows conversion to another scale while preserving relative magnitudes
  – Example:
    • Distance in km
Why understand Data?
• The type of analysis depends on the type of data you have collected
• The general guideline is as follows (a “+” means: in addition to everything listed for the scales before it):
  – Nominal Data: Mode, Chi-Square
  – Ordinal Data: + Median / Percentiles
  – Interval Data: + Mean / SD / Correlation / Regression / ANOVA
  – Ratio Data: + Geometric Mean / Harmonic Mean / Coefficient of Variation / Logarithms
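The cumulative guideline (each scale inherits the statistics of the scales before it) can be sketched as a small lookup. This is only an illustration in Python; the names `SCALES`, `ADDED_STATISTICS` and `permissible_statistics` are hypothetical and do not come from the presentation.

```python
# Hypothetical helper: which statistics are admissible for each scale.
# Each scale inherits everything allowed for the scales listed before it.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

ADDED_STATISTICS = {
    "nominal": ["mode", "chi-square"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "standard deviation", "correlation",
                 "regression", "ANOVA"],
    "ratio": ["geometric mean", "harmonic mean",
              "coefficient of variation", "logarithms"],
}


def permissible_statistics(scale):
    """Return all statistics usable at `scale`, including those of lower scales."""
    level = SCALES.index(scale.lower())  # raises ValueError for an unknown scale
    allowed = []
    for name in SCALES[: level + 1]:
        allowed.extend(ADDED_STATISTICS[name])
    return allowed


print(permissible_statistics("ordinal"))
```

For ordinal data the helper returns the nominal statistics plus median and percentiles, but not the mean, which first becomes admissible at the interval level.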
Some Points to Remember
• Tend to use interval scales
• Data need not be comparable with other studies
• Data has to make sense in your context
• Students fail to understand the importance of data
  – Wrong Approach
    • “I have collected the data… now what do I do?”
  – Right Approach
    • “What data do I need? Why do I need it? Where will I get it? How will I analyse it to get the answer?”
Descriptive Statistics :: A Quick Review
Measures of Central Tendency
• Central tendency is “loosely” defined as the location of the center of a distribution of data
• Three basic measures:
  – Arithmetic Mean
  – Median
  – Mode
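As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The `scores` list is made-up data used only for this sketch.

```python
import statistics

# Made-up scores, used only for illustration.
scores = [52, 55, 55, 60, 63, 68, 74]

mean = statistics.mean(scores)      # sum of the values / number of values
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)
```

In this small right-skewed sample the mode is the smallest of the three and the mean the largest.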
Arithmetic Mean
• Advantages:
  – Easy to compute
  – Based on every value in the set of observations
  – Rigidly defined by a mathematical formula
  – It is relatively reliable
  – It represents the “center of gravity” of the data
• Disadvantages:
  – Unduly affected by extremely small and/or large values
  – Cannot be calculated for data with open-ended classes
  – Is a good measure only when the distribution is fairly symmetric
Median
• Advantages:
  – Refers to the “middle value” of the distribution
  – It is a “positional measure”
  – Useful in the case of open-ended classes
  – Not seriously affected by extreme values
  – Most appropriate for dealing with qualitative rank data
  – Has a series of related positional measures such as Quartiles, Deciles and Percentiles
• Disadvantages:
  – It does not take every value into consideration
  – It is not capable of algebraic treatment
  – It is erratic if the number of items is small
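The claim that the median is not seriously affected by extreme values, while the mean is, can be checked with a short sketch. The `incomes` figures are made-up numbers chosen only for illustration.

```python
import statistics

incomes = [20, 22, 25, 27, 30]   # made-up values
with_outlier = incomes + [500]   # the same data plus one extreme observation

# How far does each measure move when the extreme value is added?
mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)

print(mean_shift, median_shift)
```

A single extreme value drags the mean far from the bulk of the data, while the median moves by only one unit.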
Mode
• Advantages:
  – It is the most typical or representative value of a distribution
  – Not unduly affected by extreme values
  – It can be used to describe qualitative phenomena
• Disadvantages:
  – A distribution may have no mode, or may have more than one
  – Not capable of algebraic treatment
  – It is not rigidly defined for calculation
Relation Between the 3 Measures
• In a moderately skewed distribution, approximately: Mode = 3 Median – 2 Mean
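A minimal sketch of how the empirical relation can be used to approximate the mode when only the mean and median are known. The function name `empirical_mode` and the input figures are hypothetical, and the result is only a rough estimate that assumes moderate skewness.

```python
def empirical_mode(mean, median):
    """Approximate the mode from Mode = 3 Median - 2 Mean (moderate skew only)."""
    return 3 * median - 2 * mean


# Made-up summary figures: the mean lies above the median (positive skew).
estimate = empirical_mode(mean=50.0, median=48.0)
print(estimate)
```

With the mean above the median, the estimated mode falls below both, which is the usual ordering for a positively skewed distribution.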
Measures of Dispersion
• Dispersion is defined as the degree to which data tend to spread about a central value
• Four absolute measures, each with a corresponding relative measure:
  – Range → Coefficient of Range
  – Quartile Deviation → Coefficient of Quartile Deviation
  – Mean Absolute Deviation → Coefficient of MAD
  – Standard Deviation → Coefficient of Variation
• Range and QD are positional measures of dispersion
• MAD and SD are calculated measures of dispersion
Range
• Range = Largest value – Smallest value
• Coefficient of Range = (Largest – Smallest) / (Largest + Smallest)
• Advantages:
  – Simplest to understand and compute
• Disadvantages:
  – Not based on each and every item in the data
  – Does not take into account the shape of the distribution
  – Cannot be computed in the case of open-ended classes
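Taking the usual textbook definitions (range = largest value minus smallest value, coefficient of range = their difference divided by their sum), a small sketch follows. The function names and the data are hypothetical, and the coefficient assumes positive-valued data.

```python
def value_range(data):
    """Absolute measure: largest value minus smallest value."""
    return max(data) - min(data)


def coefficient_of_range(data):
    """Relative measure: (largest - smallest) / (largest + smallest)."""
    largest, smallest = max(data), min(data)
    return (largest - smallest) / (largest + smallest)


marks = [10, 15, 20, 40]  # made-up data
print(value_range(marks), coefficient_of_range(marks))
```

Only the two extreme observations enter the calculation, which is why the range ignores the shape of the distribution.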
Quartile Deviation
• Inter Quartile Range (IQR) = Q3 – Q1
• Quartile Deviation (Semi IQR) = (Q3 – Q1) / 2
• Coefficient of QD = (Q3 – Q1) / (Q3 + Q1)
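A sketch of the three quantities using the standard definitions IQR = Q3 – Q1, QD = IQR / 2 and Coefficient of QD = (Q3 – Q1) / (Q3 + Q1). It relies on `statistics.quantiles` with its default "exclusive" method; other quartile conventions can give slightly different values. The data are made up.

```python
import statistics

# Made-up data with 11 observations.
values = [12, 15, 18, 20, 22, 25, 27, 30, 33, 38, 45]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)

iqr = q3 - q1                             # inter quartile range
quartile_deviation = iqr / 2              # semi inter quartile range
coefficient_of_qd = (q3 - q1) / (q3 + q1)

print(q1, q3, iqr, quartile_deviation, coefficient_of_qd)
```

Only the middle half of the data, between Q1 and Q3, influences these figures.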
Quartile Deviation
• Advantages:
  – Can measure variation in open-ended distributions
  – It is extremely useful in the case of erratic or badly skewed data
  – It is not affected by extreme values
• Disadvantages:
  – Ignores 50% of the data
  – Is not capable of mathematical manipulation
  – Is not, strictly, a measure of dispersion about a central value: it effectively shows only the distance between two positional points
Mean Absolute Deviation
• Mean Absolute Deviation (MAD) defined as: MAD = Σ |x – A| / n, where A is the mean or the median and n is the number of observations
• Coefficient of MAD defined as: MAD / Median or MAD / Mean
• Advantages:
  – Simple to understand and compute
  – Based on each and every item in the data
  – Less affected by extreme values than other measures
• Disadvantage:
  – It is not capable of further mathematical treatment
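A minimal sketch of the mean absolute deviation, taken here as the average absolute distance of each value from a chosen center (the mean by default, or the median). The function name `mean_absolute_deviation` and the data are hypothetical.

```python
import statistics


def mean_absolute_deviation(data, center=None):
    """Average absolute distance of the values from `center` (default: the mean)."""
    if center is None:
        center = statistics.mean(data)
    return sum(abs(x - center) for x in data) / len(data)


observations = [2, 4, 6, 8, 10]  # made-up data
mad = mean_absolute_deviation(observations)
coefficient_of_mad = mad / statistics.mean(observations)

print(mad, coefficient_of_mad)
```

Passing `center=statistics.median(observations)` gives the variant measured about the median.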
Standard Deviation
• Defined as “Root Mean Squared Deviation from Mean”: SD = √( Σ (x – Mean)² / n )
• Coefficient of Variation = (SD / Mean) × 100
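The "root mean squared deviation from the mean" corresponds to the population standard deviation, which the standard library exposes as `statistics.pstdev` (it divides by n, whereas `statistics.stdev` divides by n – 1). The sketch below uses made-up data and takes the coefficient of variation as SD / Mean × 100.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = statistics.mean(data)
sd = statistics.pstdev(data)               # root mean squared deviation from the mean
coefficient_of_variation = sd / mean * 100  # relative dispersion, in percent

print(mean, sd, coefficient_of_variation)
```

Because the coefficient of variation is unit-free, it can be used to compare the spread of variables measured in different units.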
Standard Deviation
• Advantages:
  – Generally regarded as the best measure of dispersion
  – Possible to calculate the combined standard deviation of two or more groups
  – Chebycheff’s Theorem (Chebycheff, 1821–1894)
    • Whatever the distribution, at least 75% of the values fall within ±2 SD of the mean, and at least 88.9% fall within ±3 SD of the mean
  – Is related to other measures (approximately, for a normal distribution):
    • QD = 0.667 SD
    • MD = 0.80 SD
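Chebycheff's bound, at least 1 – 1/k² of the values within k standard deviations of the mean, can be checked empirically on any data set. The sketch below uses a deliberately skewed, made-up sample; the function names are hypothetical.

```python
import statistics


def chebyshev_bound(k):
    """Minimum share of values within k SD of the mean, for any distribution."""
    return 1 - 1 / k ** 2


def share_within(data, k):
    """Observed share of values lying within k population SDs of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return len(inside) / len(data)


skewed = [1, 1, 1, 2, 2, 3, 4, 5, 8, 40]  # made-up, strongly skewed data

for k in (2, 3):
    print(k, chebyshev_bound(k), share_within(skewed, k))
```

The observed shares are never below the bound, although for well-behaved data they are usually much higher than the guaranteed minimum.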
Skewness
• Refers to the asymmetry in the shape of the distribution
• Important to test for skewness in data analysis, as skewed data suggest that the assumption of normality is violated
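One common way to quantify skewness is the moment coefficient, the third central moment divided by the 1.5th power of the second. This is an assumption of the sketch, since the slides do not name a specific formula; the function name `skewness` and the data are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5                        # undefined if all values are equal


print(skewness([1, 2, 3, 4, 5]))       # symmetric data
print(skewness([1, 1, 2, 2, 3, 10]))   # long right tail
```

A positive value indicates a longer right tail, a negative value a longer left tail.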
Kurtosis
• Kurtosis means “bulginess”
• Refers to the degree of flatness or peakedness in the region about the mode of the distribution:
  – Lepto-Kurtic: the curve is more peaked than the normal curve
  – Meso-Kurtic: the curve has the same peakedness as the normal curve
  – Platy-Kurtic: the curve is less peaked than the normal curve
• The kurtosis of the normal curve is taken as 3
• Kurtosis is usually treated as a less serious departure from normality than skewness
• Important to check kurtosis because it shows how the data are distributed around the mode
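A sketch of the moment coefficient of kurtosis, the fourth central moment divided by the square of the second, which equals 3 for a normal curve. The formula choice is an assumption of this example, and the function name `kurtosis` and both samples are made up.

```python
def kurtosis(data):
    """Moment coefficient of kurtosis: m4 / m2**2 (3 for a normal curve)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2                          # undefined if all values are equal


flat = [1, 2, 3, 4, 5]                       # evenly spread values
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]   # most values at the center, two far out

print(kurtosis(flat), kurtosis(peaked))
```

Values below 3 point to a platy-kurtic shape and values above 3 to a lepto-kurtic one; small samples like these give only a rough indication.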
What is Descriptive Statistics?
• The following need to be reported:
  – Arithmetic Mean
  – Median
  – Mode
  – Standard Deviation
  – Variance
  – Kurtosis
  – Skewness
  – Range
  – Minimum
  – Maximum
  – Sum
  – Count
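The twelve figures can be produced in one pass. The sketch below is only illustrative: the function name `describe` is hypothetical, it uses population formulas (dividing by n) for the variance and standard deviation, and it uses moment coefficients for skewness and kurtosis. Statistical packages often report sample-based versions, so their figures may differ slightly.

```python
import statistics


def describe(data):
    """Return the twelve descriptive statistics for a list of numbers."""
    n = len(data)
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)        # population variance (divides by n)
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "standard deviation": statistics.pstdev(data),
        "variance": variance,
        "kurtosis": m4 / variance ** 2,    # undefined when variance is zero
        "skewness": m3 / variance ** 1.5,  # undefined when variance is zero
        "range": max(data) - min(data),
        "minimum": min(data),
        "maximum": max(data),
        "sum": sum(data),
        "count": n,
    }


report = describe([2, 4, 4, 4, 5, 5, 7, 9])  # made-up data
for name, value in report.items():
    print(f"{name}: {value}")
```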