UNIT III LESSON 16: DATA CODING AND ANALYSIS
RESEARCH METHODOLOGY
11.556
© Copy Right: Rai University

Students, today we shall take up the most crucial step in the research process: data coding and data analysis. Data entry and coding come after the desired information has been collected. Once the information is tabulated, it is easy to perform various statistical tests for validity, accuracy and significance. This step seems very simple, although it is not. Gathered information should be presented in such a manner that even a layman understands the what, why, when and how of the information.

Data Entry
Data entry is the process of taking completed questionnaires or surveys and putting them into a form that can readily be analyzed. A series of options needs to be considered when you enter the information you have gathered. You will first have to decide on a file format and then devise a code for analysis.

Decision on File Format
It comprises decisions regarding:
• The way the data will be organized in a file
• Order of information collected
• How subject is referenced
• Constructing individual records
• History of 80-column format
• Application to statistics programs

Devise Code for Analysis
The main points to remember while devising the code for analysis are:
• Set of rules that translates answers into discrete values
• Alphabetical or numerical, depending on measurement scale
• Preserve level of measurement for each item
• General considerations (closed questions):

a. Now, we will discuss these in detail for better understanding.
• First of all, you should try to make the coding translation simple.
• Coding should be done so as to minimize effort and the risk of coding errors.
• At the item level: leave numbers as numbers (numbers can be nominal).
• Perform reverse coding and unfold complex response formats.
• At the test level: code questions in order of appearance.
• Be consistent in assigning values to similar responses.
• Identify the question groups within the test.

b. It should help in facilitating data interpretation.
• How missing data are treated. You should have knowledge about:
i. Non-ascertained information, which has to be recognized: information not obtained because of interviewer or respondent performance.
   • Reason for failure to ask question
   • Failure to obtain appropriate response
   • Refusal to answer question (separate)
ii. Inapplicable information: information that does not apply to a particular respondent.
iii. Unknown information: information as to the respondent's claim of awareness (how to treat the "Don't know" option).

c. Entry of Data
• You should fix the number of translation steps between the subject's response and the readable data file:
   • Computer assisted techniques: 1
   • Digital answer format (Scantron):
   • Entry by hand: 4
• This impacts the ability to check the quality of data entry (accuracy, reliability).

d. Clean Data File
• You should examine each data file to ensure each record is complete and in order.
• You should remove non-legal codes.
• Then you should replace them with information from the original response format.
• Proper importance should be given to verification.

The problem most decision makers must resolve is how to deal with the uncertainty that is inherent in almost all aspects of their jobs. Raw data provide little, if any, information to the decision makers. Thus, they need a means of converting the raw data into useful information. In this lecture note, we will concentrate on some of the frequently used methods of presenting and organizing data.

Frequency Distribution
The easiest method of organizing data is a frequency distribution, which converts raw data into a meaningful pattern for statistical analysis.
The following are the steps of constructing a frequency distribution:
1. Specify the number of class intervals. A class is a group (category) of interest. No totally accepted rule tells us how many intervals are to be used. Between 5 and 15 class intervals are generally recommended. Note that the classes must be both mutually exclusive and all-inclusive. Mutually exclusive means that classes must be selected such that an item can't fall into two classes, and all-inclusive classes are classes that together contain all the data.
2. When all intervals are to be the same width, the following rule may be used to find the required class interval width:
W = (L - S) / K
where W = class width, L = the largest data value, S = the smallest data value, K = number of classes.

Example
Suppose the ages of a sample of 10 students are: 20.9, 18.1, 18.5, 21.3, 19.4, 25.3, 22.0, 23.1, 23.9, and 22.5. We select K = 4 and W = (25.3 - 18.1)/4 = 1.8, which is rounded up to 2. The frequency table is as follows:

Class Interval       Class Frequency    Relative Frequency
18 to under 20       3                  30%
20 to under 22       2                  20%
22 to under 24       4                  40%
24 to under 26       1                  10%

Note that the sum of all the relative frequencies must always be equal to 1.00 or 100%. In the above example, we see that 40% of all students are at least 22 years old but younger than 24. Relative frequency may be determined for both quantitative and qualitative data and is a convenient basis for the comparison of similar groups of different size.

What Frequency Distribution Tells Us
1. It shows how the observations cluster around a central value; and
2. It shows the degree of difference between observations.
For example, in the above problem we know that no student is younger than 18 and that ages below 24 are most typical. The most common age is between 22 and 24, which from general information we know to be higher than usual for students who enter college right after high school and graduate at about age 22. The students in the sample are generally older. It is possible that the population is made up of night students who work on their degrees on a part-time basis while holding full-time jobs. This descriptive analysis provides us with an image of the student sample that is not available from the raw data. As we will see in lecture number 3, frequency distribution is the basis for probability theory.

Stated & True Class Limits
True classes are those classes such that the upper true (or real) limit of a class is the same as the lower true limit of the next class. For comparison, the stated class limits and true (real) class limits are given in the following table:

Stated Limits      True Limits
$600 - $799        $599.50 up to but not including $799.50
$800 - $999        $799.50 up to but not including $999.50

In the first column of the above table the data were rounded to the nearest dollar. For example, $799.50 was rounded up to $800 and tallied in the second class. Any amount over $799 but under $799.50 was rounded down to $799 and included in the first class. Thus, the $600 - $799 class actually includes all data from $599.50 inclusive up to but not including $799.50.
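The frequency distribution steps and the true class limits described above can be sketched in Python. This is only an illustrative sketch: the function names `frequency_table` and `true_limits` are hypothetical, and starting the first class at the smallest value rounded down and rounding the width up to a whole number are assumptions that match the student-age example, not a general rule.

```python
import math

def frequency_table(data, k):
    """Group data into k equal-width classes of the form [lower, upper)."""
    smallest, largest = min(data), max(data)
    # W = (L - S) / K, rounded up to a whole number as in the example (assumption)
    width = math.ceil((largest - smallest) / k)
    # Assumption: the first class starts at the smallest value rounded down
    start = math.floor(smallest)
    rows = []
    for i in range(k):
        lower = start + i * width
        upper = lower + width
        # Half-open classes keep the classes mutually exclusive
        count = sum(1 for x in data if lower <= x < upper)
        rows.append((lower, upper, count, count / len(data)))
    # All-inclusive check: every observation must land in some class
    if sum(row[2] for row in rows) != len(data):
        raise ValueError("classes do not cover all the data; adjust k or width")
    return rows

def true_limits(stated_lower, stated_upper, unit=1.0):
    """True limits of a stated class when data are rounded to the nearest unit."""
    half = unit / 2
    return stated_lower - half, stated_upper + half

ages = [20.9, 18.1, 18.5, 21.3, 19.4, 25.3, 22.0, 23.1, 23.9, 22.5]
table = frequency_table(ages, k=4)
for lower, upper, count, rel in table:
    print(f"{lower} to under {upper}: {count} ({rel:.0%})")

print(true_limits(600, 799))  # stated $600 - $799 class
```

With the ten ages from the example, the loop reproduces the class frequencies 3, 2, 4 and 1 from the table, and `true_limits(600, 799)` gives the $599.50 and $799.50 limits of the first stated class.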
Cumulative Frequency Distribution
When the observations are numerical, cumulative frequency is used. It shows the total number of observations which lie above or below certain key values.
Cumulative frequency for a population = frequency of each class interval + frequencies of preceding intervals.
For example, the cumulative frequencies for the above problem are: 3, 5, 9, and 10.

Presenting Data
Graphs, curves, and charts are used to present data.

Bar charts are used to graph qualitative data. The bars do not touch, indicating that the attributes are qualitative categories and that the variables are discrete, not continuous.

Histograms are used to graph absolute, relative, and cumulative frequencies.

An ogive is also used to graph cumulative frequency. An ogive is constructed by placing a point corresponding to the upper end of each class at a height equal to the cumulative frequency of the class. These points are then connected. An ogive also shows the relative cumulative frequency distribution on the right side axis. A less-than ogive shows how many items in the distribution have a value less than the upper limit of each class. A more-than ogive shows how many items in the distribution have a value greater than or equal to the lower limit of each class. A less-than cumulative frequency polygon is constructed by using the upper true limits and the cumulative frequencies. A more-than cumulative frequency polygon is constructed by using the lower true limits and the cumulative frequencies.

A pie chart is often used in newspapers and magazines to depict budgets and other economic information. A complete circle (the pie) represents the total number of measurements. The size of a slice is proportional to the relative frequency of a particular category. For example, since a complete circle is equal to 360 degrees, if the relative frequency for a category is 0.40, the slice assigned to that category is 40% of 360, or (0.40)(360) = 144 degrees.

A Pareto chart is a special case of the bar chart and is often used in quality control. The purpose of this chart is to show the key causes of unacceptable quality. Each bar in the chart shows the degree of the quality problem for each variable measured.

A time series graph is a graph in which the X axis shows time periods and the Y axis shows the values related to these time periods.
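As a rough sketch of the arithmetic behind the ogive and the pie chart, the following Python snippet computes the less-than and more-than cumulative frequencies and the slice angles for the student-age classes used earlier. The variable names are illustrative, and the stated class limits are used in place of true limits for simplicity (an assumption).

```python
from itertools import accumulate

# Class frequencies and limits from the student-age example
frequencies = [3, 2, 4, 1]
lower_limits = [18, 20, 22, 24]
upper_limits = [20, 22, 24, 26]

# Less-than cumulative frequency: each class plus all preceding classes
less_than = list(accumulate(frequencies))
total = less_than[-1]

# More-than cumulative frequency: items at or above each class's lower limit
more_than = [total - c + f for c, f in zip(less_than, frequencies)]

# Points to plot and connect for each ogive: (class limit, cumulative frequency)
less_than_ogive = list(zip(upper_limits, less_than))
more_than_ogive = list(zip(lower_limits, more_than))

# Pie chart: slice angle = relative frequency * 360 degrees
slice_degrees = [f / total * 360 for f in frequencies]

print(less_than)
print(more_than)
print([round(d, 1) for d in slice_degrees])
```

The less-than values match the cumulative frequencies 3, 5, 9 and 10 given in the text, and the class with relative frequency 0.40 receives a slice of 144 degrees.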
Stem-and-leaf plots offer another method for organizing raw data into groups. These types of plots are similar to the histogram except that the actual data are displayed instead of bars. The stem-and-leaf plot is developed by first determining the stem and then adding the leaves. The stem contains the higher-valued digits and the leaf contains the lower-valued digits. For example, the number 78 can be represented by a stem of 7 and a leaf of 8. Thus, the numbers 34, 32, 36, 20, 20, 22, 54, 55, 52, 68, and 63 can be grouped as follows:

Stem    Leaf
2       0 0 2
3       2 4 6
4
5       2 4 5
6       3 8

Steps to Construct a Stem and Leaf Plot
1. Define the stem and leaf that you will use. Choose the units for the stem so that the number of stems in the display is between 5 and 20.
2. Write the stems in a column arranged with the smallest stem at the top and the largest stem at the bottom. Include all stems in the range of the data, even if there are some stems with no corresponding leaves.
3. If the leaves consist of more than one digit, drop the digits after the first. You may round the numbers to be more precise, but this is not necessary for the graphical description to be useful.
4. Record the leaf for each measurement in the row corresponding to its stem. Omit the decimals, and include a key that defines the units of the leaf.

[Figures: stem-and-leaf plot examples]
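The construction steps can be sketched in Python for the simple case of non-negative whole numbers, where the stem is the tens part and the leaf is the units digit. The function name `stem_and_leaf` is hypothetical, and the sketch does not handle decimals or multi-digit leaves.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, including stems that have no leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 78 -> stem 7, leaf 8
        leaves[stem].append(leaf)
    # Include every stem in the range of the data, even the empty ones
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}

data = [34, 32, 36, 20, 20, 22, 54, 55, 52, 68, 63]
plot = stem_and_leaf(data)

print("Key: 2 | 0 means 20")
for stem, leaf_list in plot.items():
    print(f"{stem} | {' '.join(map(str, leaf_list))}".rstrip())
```

For the eleven numbers from the example, stem 4 appears with no leaves, as in the table in the text.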