Imputation Procedures for Partial Nonresponse: The Case of Family Income and Expenditure Survey (FIES)
A Thesis Presented to The Faculty of the Mathematics Department College of Science De La Salle University - Manila
In Partial Fulfillment of the Requirements for the Degree Bachelor of Science in Statistics Major in Actuarial Science
by Diana Camille B. Cortes James Edison T. Pangan
August 2007
Approval Sheet The thesis entitled Imputation Procedures for Partial Nonresponse: The Case of FIES, submitted by Diana Camille Cortes and James Edison Pangan upon the recommendation of their adviser, has been accepted and approved in partial fulfillment of the requirements for the degree of Bachelor of Science in Statistics Major in Actuarial Science.
Arturo Pacificador Jr., Ph.D. Thesis Adviser
Acknowledgments

The researchers would like to extend their warmest gratitude to the following people, who have undoubtedly contributed to the success of this study:
• To Dr. Jun Pacificador Jr., for his supervision, suggestions and guidance throughout the duration of this thesis.
• To Dr. Ederlina Nocon, for sharing her time during THSMTH1 in teaching us how to use LaTeX.
• To our parents, especially Jed's mother, Mrs. Erlinda Pangan, for constantly reminding Jed about our thesis (i.e. "Tapos na ba ang thesis nyo?", "Is your thesis done yet?").
• To Mark Nanquil and Norman Rodrigo, for helping us in using LaTeX and for their unwavering support for our thesis.
• To our friends from COSCA, the La Salle Debate Society and Math Circle, for their continuous encouragement and support.
• Lastly, to The Lord Almighty, for providing us the strength, patience, wisdom and determination to finish this thesis.
Table of Contents

Title Page
Approval Sheet
Acknowledgments
Table of Contents
Abstract

1 The Problem and Its Background
  1.1 Introduction
  1.2 Statement of the Problem
  1.3 Objectives of the Study
  1.4 Significance of the Study
  1.5 Scope and Limitations

2 Review of Related Literature

3 Conceptual Framework
  3.1 Nonresponse Bias
  3.2 Nonresponse and Its Patterns
  3.3 Types of Nonresponse
  3.4 The Imputation Procedures
    3.4.1 Overall Mean Imputation (OMI)
    3.4.2 Hot Deck Imputation (HDI)
    3.4.3 General Regression Imputation
      Stochastic Regression

4 Methodology
  4.1 Source of Data
    4.1.1 General Background
    4.1.2 Sampling Design and Coverage
    4.1.3 Survey Characteristics
    4.1.4 Survey Nonresponse
  4.2 The Simulation Method
  4.3 Formation of Imputation Classes
  4.4 Performing the Imputation Techniques
    4.4.1 Overall Mean Imputation (OMI)
    4.4.2 Hot Deck Imputation (HDI)
    4.4.3 Deterministic and Stochastic Regression Imputation (DRI) and (SRI)
  4.5 Comparison of Imputation Techniques
    4.5.1 The Bias and Variance of the Estimates
    4.5.2 Comparing the Distributions of the Imputed vs. the Actual Data
    4.5.3 Using Kalton's Criteria in Assessing the Performance of the Imputation Techniques
    4.5.4 Determining the Best Imputation Method

5 Results and Discussion
  5.1 Descriptive Statistics of Second Visit Data Variables
  5.2 Formation of Imputation Classes
    5.2.1 Mean of the Simulated Data by Nonresponse Rate for Each Variable of Interest
    5.2.2 Regression Model Adequacy
    5.2.3 Evaluation of the Different Imputation Methods
      Overall Mean Imputation
    5.2.4 Hot Deck Imputation
    5.2.5 Deterministic Regression Imputation
    5.2.6 Stochastic Regression Imputation
  5.3 Choosing the Best Imputation

6 Conclusion

7 Recommendations for Further Research
Abstract
Several imputation methods have been developed for imputing missing responses. It is often not clear which imputation method is "best" under a particular set of assumptions. In choosing an imputation method, one should consider several factors, including the types of estimates that will be generated, the partial nonresponse rates, the nature of the nonresponse, and the availability of auxiliary data that are highly correlated with the characteristic of interest or with the response propensity.

This study compared the effectiveness of four imputation procedures, namely overall mean imputation, Hot Deck imputation, and deterministic and stochastic regression imputation, using the first visit variable as the auxiliary variable. A total of 4,130 cases were simulated in the study. Values for both variables were set to nonresponse to satisfy the assumption of partial nonresponse. The results of the study provide some support for the following conclusions: (a) for these data, the Hot Deck imputation and overall mean imputation methods are not appropriate for handling nonresponse; (b) when predicting an actual value from the set of nonresponse data, the regression model should have a coefficient of determination of at least 80% for better prediction of the observations; and (c) the imputation classes must be homogeneous to produce less biased estimates.
Chapter 1
The Problem and Its Background

1.1 Introduction
Missing data in sample surveys are inevitable. Missing data occur for various reasons, such as when a respondent moves to another location, refuses to participate in the survey, or is unable to answer specific items in the survey. This failure to obtain responses from the units selected in the sample is called nonresponse. There are several types of nonresponse: (a) unit nonresponse refers to the failure to collect any data from a sample unit; (b) item nonresponse refers to the failure to collect valid responses to one or more items from a responding sample unit; and (c) partial nonresponse occurs when there is a failure to collect responses for large sets of items from a responding unit.
In surveys with more than one round of data collection, the problem of nonresponse becomes more complicated. In surveys of this type, it is possible for a unit to respond in the first round of the survey but fail to respond in succeeding rounds. Hence, partial nonresponse occurs.
The effect of nonresponse must not be ignored, since it leads to biased estimates which, if large, result in inaccurate conclusions. Bias due to nonresponse is believed to be a function of the nonresponse rate and the difference in characteristics between responding and nonresponding units: the larger the nonresponse rate, or the wider the difference between the responding and nonresponding units, the larger the resulting bias.
In practice, there are three ways of handling missing data: discarding the missing values, applying weighting adjustments, or using imputation techniques. Weighting adjustment is based on matching nonrespondents to respondents in terms of the data available on nonrespondents and increasing the weights of the matched respondents to account for the missing values; the weights of respondents are typically multiplied by the inverse of the response rate, so that the adjustment is proportionate to the amount of nonresponse. This is often applied for unit nonresponse. On the other hand, imputation is also used by statisticians to account for nonresponse, usually in the case of item and partial nonresponse. In imputation, a missing value is replaced by a reasonable substitute for the missing information. Once nonresponse has been dealt with, whether by weighting adjustments or imputation, researchers can proceed with their data analysis.
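To make the weighting route concrete, the following is a minimal sketch, not the NSO's actual procedure, of a weighting-class adjustment: within each adjustment class, respondent weights are inflated by the inverse of the class response rate so that respondents stand in for nonrespondents. The class labels and field names are illustrative assumptions.

```python
def adjust_weights(units):
    """units: list of dicts with keys 'class', 'weight', 'responded'.
    Returns the responding units with nonresponse-adjusted weights."""
    # Weighted response rate per adjustment class.
    totals, resp = {}, {}
    for u in units:
        c = u["class"]
        totals[c] = totals.get(c, 0.0) + u["weight"]
        if u["responded"]:
            resp[c] = resp.get(c, 0.0) + u["weight"]
    rates = {c: resp.get(c, 0.0) / totals[c] for c in totals}
    # Multiply respondent weights by the inverse of the class response rate.
    return [
        dict(u, weight=u["weight"] / rates[u["class"]])
        for u in units if u["responded"]
    ]

sample = [
    {"class": "urban", "weight": 10.0, "responded": True},
    {"class": "urban", "weight": 10.0, "responded": False},
    {"class": "rural", "weight": 20.0, "responded": True},
]
adjusted = adjust_weights(sample)
# The urban response rate is 0.5, so the responding urban unit's weight
# doubles to 20.0; the sum of adjusted weights still equals the original 40.0.
```

Note how the adjusted weights preserve the total weight of the sample, which is the point of the adjustment: the respondents now represent the whole class, nonrespondents included.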
The Family Income and Expenditure Survey (FIES) is an example of a survey with more than one round of data collection. The FIES is a nationwide survey of households conducted every three years by the National Statistics Office (NSO), with two visits to the sample unit per survey period, in order to provide information on the country's income distribution, spending patterns and poverty incidence. Like any other survey, the FIES encounters the problem of missing data, particularly nonresponse during the second visit. Given the various contributions that this survey can provide, it is important to have precise estimates of these indicators.
With the FIES 1997 as its data set, this paper focuses on dealing with partial nonresponse through the use of imputation techniques. Specifically, applying some of the imputation techniques used in Kalton's (1983) study of the 1978 Research Panel Survey for the Income Survey Development Program (ISDP), entitled Compensating for Missing Data, the purpose of this paper is to examine the estimates produced from imputed values at various nonresponse rates and to determine which imputation technique is appropriate for the FIES data.
1.2 Statement of the Problem
This paper attempts to answer the following questions:
1. Which imputation technique is the most appropriate for the FIES data?
2. Will the imputation methods generate less biased estimates than the estimates generated by ignoring nonresponse in this study?
3. How do varying nonresponse rates affect the results for each imputation method?
1.3 Objectives of the Study
The paper will attempt to achieve the following objectives:
1. To compare the imputation techniques, namely Overall Mean Imputation, Hot Deck Imputation, Deterministic Regression and Stochastic Regression, based on their efficiency and ability to recapture the deleted values, by generating missing values in the FIES 1997 second visit data and imputing them using the first visit data of the same survey.
2. To investigate the effect of varying rates of missing observations, particularly 10%, 20% and 30% nonresponse rates, on the precision of the estimates.
1.4 Significance of the Study
Nonresponse is a common problem in conducting surveys. The presence of nonresponse results in incomplete data, which can pose serious problems during data analysis, particularly in the generation of statistically reliable estimates. For this reason, the use of imputation techniques enables researchers to account for the difference in the way nonrespondents would have answered the survey questions compared to the respondents. This then helps reduce the nonresponse bias in the survey estimates.
Secondly, since most statistical packages require the use of complete data before conducting any procedure for data analysis, the use of imputation techniques can ensure consistency of results across analyses, something that an incomplete data set cannot fully provide.
Third, most countries in the developed world, such as the United States, Canada, the UK and the Netherlands, already employ imputation techniques in their respective national statistical offices. In a country such as the Philippines, where data collection is very difficult, especially in some regions like the National Capital Region (NCR), imputation can ease the problems of data collection and nonresponse. This can even put us at par with our counterparts in the developed world in terms of statistical research.
Lastly, in the case of the FIES, whose primary objective is to provide information about the country's income distribution, spending patterns and poverty incidence, it is important to ensure the precision of the estimates in the survey. Given the great impact of this survey on the country, employing imputation techniques provides statisticians with a method for handling nonresponse, which could lead to more meaningful generalizations about our country's income distribution and spending patterns. Hence, with less biased estimates and more consistent results, this study can help our policymakers and economists provide better solutions for improving the lives of Filipinos.
1.5 Scope and Limitations
Throughout this paper, only the Family Income and Expenditure Survey (FIES) 1997 will be used to tackle the problem of nonresponse and to examine the impact of the different imputation methods applied to the dataset. Other methods of handling nonresponse will not be covered in this paper.
Second, the researchers will only focus on using the FIES 1997 first visit data to impute the partial nonresponse present in the second visit. This paper also assumes that the first visit data are complete and that the pattern of nonresponse follows the Missing Completely At Random (MCAR) case. The Missing Completely At Random case holds if the probability of response to Y is unrelated to the value of Y or to any other variables, making the missing data randomly distributed across all cases (Musil, 2002). If the pattern of nonresponse does not satisfy the MCAR assumption, the imputation techniques may not achieve their purpose.
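Under the MCAR assumption, simulated nonresponse of this kind amounts to deleting a fixed fraction of values completely at random. A minimal sketch, illustrative only and not the study's actual simulation code:

```python
import random

def simulate_mcar(values, rate, seed=None):
    """Set a `rate` fraction of values to None, chosen completely at
    random -- the MCAR mechanism assumed in this study."""
    rng = random.Random(seed)
    n_missing = round(len(values) * rate)
    drop = set(rng.sample(range(len(values)), n_missing))
    return [None if i in drop else v for i, v in enumerate(values)]

data = list(range(100))
for rate in (0.10, 0.20, 0.30):        # the study's nonresponse rates
    with_gaps = simulate_mcar(data, rate, seed=1)
    print(rate, with_gaps.count(None))  # 10, 20 and 30 missing values
```

Because the deleted positions are chosen independently of the values themselves, the probability of a value being missing is unrelated to that value, which is exactly the MCAR condition described above.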
As for the imputation techniques, only four imputation methods will be applied in this paper, namely: Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).
With regard to the extent to which these imputation methods will be applied and evaluated, this paper will only cover the partial nonresponse occurring in the National Capital Region (NCR), since NCR is noted as the region with the highest nonresponse rate. Also, the variables to be imputed in this study are the Total Income (TOTIN2) and Total Expenditures (TOTEX2) in the second visit of the FIES data. The evaluation of the efficacy and appropriateness of the four imputation methods will be limited to the following: (a) the nonresponse bias and variances of the imputed data, (b) an assessment of the distributions of the imputed vs. the actual data, and (c) the criteria set in the report entitled Compensating for Missing Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation.
Chapter 2
Review of Related Literature

Much research effort has been devoted to the efficacy of various imputation methods. In the report entitled Compensating for Missing Survey Data, the author carried out two simulation studies using the data in the 1978 ISDP Research Panel to compare some imputation methods. The first study compared imputation methods for the variable Hourly Rate of Pay while the second dealt with the imputation of the variable Quarterly Earnings. For both studies, the author stratified the data into imputation classes, constructed data sets with missing values by randomly deleting some of the recorded values in the original dataset, and then applied the various imputation methods to fill in the missing values. This process was replicated ten times to ensure consistency of the results. Once the imputation methods had been applied, the three measures for evaluating the effectiveness of imputation methods, namely the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation, were obtained and averaged across the ten trials. (Kalton, 1983)
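The three measures named above can be computed directly from the actual and imputed values of the deleted cases. A small sketch, assuming deviations are taken as imputed minus actual:

```python
from math import sqrt

def kalton_criteria(actual, imputed):
    """Kalton's three measures over the cases whose values were deleted
    and then imputed: Mean Deviation, Mean Absolute Deviation and
    Root Mean Square Deviation."""
    devs = [y_hat - y for y, y_hat in zip(actual, imputed)]
    n = len(devs)
    md = sum(devs) / n                     # signed bias of the imputations
    mad = sum(abs(d) for d in devs) / n    # ability to recapture values
    rmsd = sqrt(sum(d * d for d in devs) / n)
    return md, mad, rmsd

# Toy example: imputations that consistently undershoot give a negative MD.
actual  = [100.0, 120.0, 140.0]
imputed = [ 95.0, 115.0, 135.0]
print(kalton_criteria(actual, imputed))   # (-5.0, 5.0, 5.0)
```

The Mean Deviation measures systematic over- or underestimation, while the Mean Absolute Deviation and Root Mean Square Deviation measure how closely individual deleted values are reconstructed, which is why the two kinds of criteria can rank methods differently.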
For the first study, imputing the variable Hourly Rate of Pay, eight methods were used, namely the Grand Mean Imputation (GM), the Class Mean Imputation using eight imputation classes (CM8), the Class Mean Imputation using ten imputation classes (CM10), Random Imputation with eight imputation classes (RM8), Random Imputation with ten imputation classes (RM10), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN) and Multiple Regression Imputation plus a randomly chosen respondent residual (MR). Using the Mean Deviation criterion, the results showed that all mean deviations were negative, indicating that the imputed values underestimated the actual values. Moreover, the results show that the Grand Mean Imputation (GM) has the greatest underestimation among the eight procedures. Meanwhile, for the Mean Absolute Deviation and Root Mean Square Deviation, which measure the ability to reconstruct the deleted values, the results showed that the Grand Mean Imputation fared the worst on both criteria. In addition, the Multiple Regression Imputation (MI) obtained the best measures for the two criteria, and the procedures with a greater number of imputation classes (i.e. CM8 vs. CM10, RM8 vs. RM10) yielded slightly better results for the two criteria. (Kalton, 1983)
For the second study, the imputation of Quarterly Earnings, ten imputation procedures were used. These are the Grand Mean Imputation (GM), the Class Mean Imputation using eight imputation classes (CM8), the Class Mean Imputation using twelve imputation classes (CM12), Random Imputation with eight imputation classes (RM8), Random Imputation with twelve imputation classes (RM12), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), Multiple Regression Imputation plus a randomly chosen respondent residual (MR), Mixed Deductive and Random Imputation using eight imputation classes (DI8) and Mixed Deductive and Random Imputation using twelve imputation classes (DI12). Using the first criterion, the Mean Deviation, the results showed that the Grand Mean (GM) obtained a positive bias. This implied that the Grand Mean Imputation is not an effective imputation method for this study. The results also showed that the regression imputation procedures have almost similar results, producing almost unbiased estimates. In addition, the Class Mean Imputation methods (CM8 and CM12) have measures similar to those of the Random Imputation methods. Nevertheless, all methods produced relatively small mean deviations except for the last two methods. Comparing the Mean Absolute Deviations and the Root Mean Square Deviations, the results show that the Grand Mean Imputation obtained values similar to the regression procedures with residuals (i.e. MN and MR). The results also show that the measures for the RM8, RM12, MN and MR procedures are over one third larger than those for deterministic procedures such as CM8, CM12 and MI. (Kalton, 1983)
To further investigate the relatively larger biases of the DI8 and DI12 procedures, the author divided the data into deductive and non-deductive cases. This shed further light on the Mean Deviations and Mean Absolute Deviations of the various imputation methods. It was found that the mean deviations are positive in the deductive case and negative in the non-deductive case for all of the procedures. This explains why there were relatively small deviations in the previous results, since the measures for the two cases tend to cancel out. It also showed that the DI8 and DI12 results are similar to those of RM8, RM12, CM8 and CM12 in the non-deductive cases but largely different in the deductive cases. This explains the larger values of DI8 and DI12 in the previous results. (Kalton, 1983)
At the end of the two studies, it was shown that the imputation procedures tend to underestimate the Hourly Rate of Pay and, in the case of the Grand Mean Imputation, overestimate the Quarterly Earnings. Moreover, mean imputation appears to be the weakest imputation method in the studies, since it distorted the distribution of the original data. Lastly, Kalton's study shows the impact of increasing the number of imputation classes: doing so yields slightly better values for the three criteria.
In contrast to Kalton's criteria for measuring the performance of imputation procedures, the paper entitled A Comparison of Imputation Techniques for Missing Data by C. Musil, C. Warner, P. Yobas and S. Jones presents a much simpler approach to evaluating the performance of imputation techniques: the means, standard deviations and correlation coefficients of the original data are compared with the statistics obtained from five methods, namely Listwise Deletion, Mean Imputation, Deterministic Regression, Stochastic Regression and the EM Method. The Expectation Maximization (EM) Method is an iterative procedure that generates missing values by alternating expectation (E-step) and maximization (M-step) algorithms. The E-step calculates expected values based on all complete data points, while the M-step replaces the missing values with the E-step-generated values and then recomputes new expected values. (Musil, Warner, Yobas and Jones, 2002)
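The alternation of E- and M-steps described above can be illustrated, in much simplified form, for a single incomplete variable regressed on a complete covariate. A full EM implementation would also propagate the residual variance, which this sketch omits; the function and variable names are illustrative:

```python
def em_style_impute(x, y, iters=20):
    """Iterate between refitting a simple regression of y on x using the
    currently filled data (M-step) and replacing the missing y values with
    their expected values under that fit (E-step)."""
    filled = [yi if yi is not None else 0.0 for yi in y]
    for _ in range(iters):
        # M-step: least-squares fit of y on x using the current filled data.
        n = len(x)
        mx, my = sum(x) / n, sum(filled) / n
        b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, filled)) / \
            sum((xi - mx) ** 2 for xi in x)
        a = my - b * mx
        # E-step: replace missing entries with their expected values.
        filled = [yi if yi is not None else a + b * xi
                  for xi, yi in zip(x, y)]
    return filled

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, 8.0]      # y = 2x with one value missing
print(em_style_impute(x, y))    # the gap converges towards 6.0
```

Each pass pulls the filled value closer to the regression line implied by the observed pairs, so the imputation stabilises at the fitted value even from a poor starting point.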
Using the Center for Epidemiological Studies data on stress and health ratings of older adults, the authors imputed a single variable, namely the functional health rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects of each imputation method. Except for the Listwise Deletion and Mean Imputation, the researchers used the SPSS Missing Value Analysis function for the Deterministic Regression, Stochastic Regression and EM Method. For the correlations, the researchers obtained the correlations of the imputed variable with the variables age, gender and self-assessed health rating, for the original data and for each of the five methods. (Musil, Warner, Yobas and Jones, 2002)

The results show that, comparing the mean of the original data with those of the five methods, all imputed values underestimated the mean. The closest to the original data was the Stochastic Regression, followed very closely by the EM Method, Deterministic Regression, Listwise Deletion and Mean Imputation. The same results also hold for the standard deviations. For the correlations, however, the EM Method produced the correlation values closest to the original data, followed closely by the Stochastic Regression, Deterministic Regression, Listwise Deletion and Mean Imputation. Hence, the findings suggest that the Stochastic Regression and EM Method performed better, while the Mean Imputation is the least effective. (Musil, Warner, Yobas and Jones, 2002)
In another study by Nordholt, entitled Imputation: Methods, Simulation Experiments and Practical Examples, the author described two simulation experiments on the Hot Deck Method. The first experiment examined whether the Hot Deck Method performs better than leaving the records with nonresponse out of the data set when analyzing a variable, which is known as the Available Case Method. This was done by constructing a fictitious data set of four variables, two of which were used for the imputation. Nonresponse rates of 5%, 10% and 20% were then applied, and the simulation process was replicated 50 times. The data set containing the missing values was first analyzed using the Available Case Method, followed by the Hot Deck Imputation. As in the methodology of Musil et al., descriptive statistics such as the mean, variance and correlation were computed. Moreover, the absolute differences between the original data and the Available Case Method, and between the original data and the Hot Deck Method, were computed. Based on these criteria, the results show that the Hot Deck performs better than the Available Case Method. Also, the Hot Deck, while it had results closer to the original data, has a tendency to underestimate the values. In terms of the absolute differences, it was observed that these values increase as the percentage of missing values increases. (Nordholt, 1998)
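Random hot-deck imputation within classes, the method studied by Nordholt, can be sketched as follows; the record layout, field names and class function are illustrative assumptions, not the survey's actual variables:

```python
import random

def hot_deck_impute(records, key, class_of, seed=None):
    """Random hot deck within imputation classes: each missing value is
    replaced by the value of a randomly chosen donor record from the
    same class. Assumes every class contains at least one donor."""
    rng = random.Random(seed)
    donors = {}
    for r in records:
        if r[key] is not None:
            donors.setdefault(class_of(r), []).append(r[key])
    out = []
    for r in records:
        if r[key] is None:
            pool = donors[class_of(r)]
            r = dict(r, **{key: rng.choice(pool)})
        out.append(r)
    return out

records = [
    {"region": "NCR", "income": 350.0},
    {"region": "NCR", "income": None},
    {"region": "CAR", "income": 120.0},
    {"region": "CAR", "income": None},
]
filled = hot_deck_impute(records, "income", lambda r: r["region"], seed=0)
# Each missing income is drawn from a donor in the same region.
```

Because every imputed value is an actually observed value from the same class, hot deck preserves the support of the variable, which is why it can track the original distribution more closely than a mean fill while still underestimating spread when donors are few.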
Nordholt's second simulation study focused on the effects of covariates, otherwise known as imputation classes, on the quality of the Hot Deck Imputation. Using data from the Dutch Housing Demand Survey of Statistics Netherlands, the variable value of the house was chosen as the variable to be imputed because of its importance and the frequency of nonresponse occurring in that variable. For this study, the observations under category 13 (value worth at least 150,000) and category 22 (value worth at 300,000) were changed into missing values. The rationale for this choice was to ensure that the original values from these categories would not be used as replacements for the variable to be imputed, since they are no longer in the file. Imputation classes were then created once the missing values had been identified. A table showing the number of respondents before and after imputation showed that in every category except 13 and 22, which were set as missing values, the number of respondents increased after the imputation. This showed that the remaining records have an equal probability of becoming a donor record for an imputation and that not all imputations give values near category 13 or 22. Nordholt also explored the Available Case Method and Hot Deck Method on this real-life data. As in the first study, the Hot Deck fared better than the Available Case Method. (Nordholt, 1998)
Lastly, Nordholt addressed several questions regarding imputation. Using examples of how imputation is applied in real-life surveys such as the Dutch Housing Demand Survey, the European Community Household Panel Survey (ECHP) and the Dutch Structure of Earnings Survey, he outlined four criteria for deciding which variables to impute. These are the importance of a variable, the percentage of nonresponse, the predictability of the missing values and the cost of imputation. He also mentioned the importance of estimating the duration of the imputation process, since study results need to be timely. The duration, according to Nordholt, depends on the number of variables to be imputed, the available capacity, the user-friendliness of the imputation package and the desired imputation quality. These issues must be settled before conducting any imputation process and choosing the appropriate imputation strategy. (Nordholt, 1998)
Two undergraduate theses have conducted similar studies on imputation. The first, by Salvino and Yu, assessed the efficiency of Mean Imputation versus the Hot Deck Imputation Technique by applying these techniques to the 1991 Census on Agriculture and Fisheries (CAF) data. In their research, they generated an incomplete data set using the GAUSS software for the imputed variables, which were the counts of cattle, hogs and chickens. To determine which of the two is better, the variances were compared; on this basis, the Hot Deck Imputation Technique was judged better. The design effect was also considered by dividing the variance of the Hot Deck Imputation by that of the Mean Imputation; since the resulting ratio was less than one, they again concluded that the Hot Deck Imputation Technique is the better option. (Salvino and Yu, 1996)
Another undergraduate thesis, by Cheng and Sy, focused on assessing imputation techniques on clinical data. The authors employed four methods of imputation, namely Mean Imputation, Hot Deck Imputation, Linear Regression and Multiple Linear Regression. They assessed the efficacy of the imputation techniques by looking at the accuracy and precision of the estimates. Accuracy was measured by the percentage error, and the variance of these percentage errors was the basis for the precision of the estimates. The results show that Linear Regression was the best method, followed closely by Multiple Regression, then Hot Deck and finally Mean Imputation. (Cheng and Sy, 1999)
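Deterministic and stochastic regression imputation, two of the methods compared in these theses and in the present study, differ only in whether a random residual is added to the regression prediction. A sketch under a simple linear model with one complete auxiliary variable; the data and names are illustrative:

```python
import random
from math import sqrt

def regression_impute(x, y, stochastic=False, seed=None):
    """Impute missing y from a complete covariate x via simple linear
    regression fitted on the observed pairs. With stochastic=True, a
    normal residual is added to each prediction so the imputed values
    keep the spread of the observed ones."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    mx = sum(xi for xi, _ in obs) / n
    my = sum(yi for _, yi in obs) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in obs) / \
        sum((xi - mx) ** 2 for xi, _ in obs)
    a = my - b * mx
    resid_sd = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in obs) / n)
    return [yi if yi is not None
            else a + b * xi + (rng.gauss(0, resid_sd) if stochastic else 0.0)
            for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, None, 8.2, None]
print(regression_impute(x, y))                       # deterministic fill
print(regression_impute(x, y, stochastic=True, seed=7))
```

The deterministic version places every imputed value exactly on the fitted line, which shrinks the variance of the completed data; the stochastic version trades a little extra noise in each value for a better-preserved distribution, which is the trade-off the literature above keeps returning to.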
Chapter 3
Conceptual Framework

3.1 Nonresponse Bias
In most surveys, there is a large propensity for post-analysis results to become invalid because of missing data. Missing data can be discarded, ignored or substituted through some procedure. When data are deleted or ignored in generating estimates, nonresponse bias becomes a problem (Kalton, 1983). This section examines the nonresponse bias that results from discarding the missing data and using only the data from the responding units in the survey analysis.
To better convey the concept of nonresponse bias, this section pertains only to nonresponse in general; the types and patterns of nonresponse are discussed later in the subsections of this chapter.
Consider a simple random sample (SRS) of size $n$ drawn from a population of size $N$, where the variable $y$ contains missing data. The population is assumed to be divisible into two groups, the first group being the respondents and the other being the nonrespondents.
Let $R$ be the number of respondents and $M$ ($M$ stands for missing) be the number of nonrespondents in the population, with $R + M = N$; the corresponding sample quantities are $r$ and $m$, with $r + m = n$. Let $\bar{R} = R/N$ and $\bar{M} = M/N$ be the proportions of respondents and nonrespondents in the population, and let $\bar{r} = r/n$ and $\bar{m} = m/n$ be the response and nonresponse rates in the sample. The population total and mean are given by $Y = Y_r + Y_m = R\bar{Y}_r + M\bar{Y}_m$ and $\bar{Y} = \bar{R}\bar{Y}_r + \bar{M}\bar{Y}_m$, where $Y_r$ and $\bar{Y}_r$ are the total and mean for respondents and $Y_m$ and $\bar{Y}_m$ are the same quantities for the nonrespondents. The corresponding sample quantities are $y = y_r + y_m = r\bar{y}_r + m\bar{y}_m$ and $\bar{y} = \bar{r}\bar{y}_r + \bar{m}\bar{y}_m$ (Kalton, 1983).

If no compensation is made for nonresponse, the respondent sample mean $\bar{y}_r$ is used to estimate $\bar{Y}$. Its bias is given by $B(\bar{y}_r) = E(\bar{y}_r) - \bar{Y}$. The expectation of $\bar{y}_r$ can be obtained in two stages, first conditional on fixed $r$ and then over different values of $r$, i.e. $E(\bar{y}_r) = E_1 E_2(\bar{y}_r \mid r)$, where $E_2$ is the conditional expectation for fixed $r$ and $E_1$ is the expectation over different values of $r$. Thus,
$$E(\bar{y}_r) = E_1\left[\frac{\sum E_2(y_i)}{r}\right] = E_1(\bar{Y}_r) = \bar{Y}_r.$$
Hence, the bias of $\bar{y}_r$ is given by
$$B(\bar{y}_r) = \bar{Y}_r - \bar{Y} = \bar{M}(\bar{Y}_r - \bar{Y}_m).$$
The equation above shows that $\bar{y}_r$ is approximately unbiased for $\bar{Y}$ if either the proportion of nonrespondents $\bar{M}$ is small or the mean for nonrespondents, $\bar{Y}_m$, is close to that for respondents, $\bar{Y}_r$. Since the survey analyst usually has no direct empirical evidence on the magnitude of $(\bar{Y}_r - \bar{Y}_m)$, the only situation in which he can have confidence that the bias is small is when the nonresponse rate is low. However, in practice, even with moderate $\bar{M}$, many survey results escape sizable biases because $(\bar{Y}_r - \bar{Y}_m)$ is fortunately often not large (Kalton, 1983).
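The bias identity for the respondent mean can be checked numerically on a toy population, where the two ways of computing the bias agree exactly:

```python
# Numerical check of the bias identity: the bias of the respondent mean
# equals the nonrespondent proportion times the gap between respondent
# and nonrespondent means. The values below are illustrative.

respondents    = [10.0, 12.0, 14.0]   # Y-values of responding units
nonrespondents = [20.0, 22.0]         # Y-values of nonresponding units

N = len(respondents) + len(nonrespondents)
Ybar   = (sum(respondents) + sum(nonrespondents)) / N   # true mean
Ybar_r = sum(respondents) / len(respondents)            # respondent mean
Ybar_m = sum(nonrespondents) / len(nonrespondents)      # nonrespondent mean
M_bar  = len(nonrespondents) / N                        # proportion missing

bias_direct   = Ybar_r - Ybar
bias_identity = M_bar * (Ybar_r - Ybar_m)
print(bias_direct, bias_identity)   # both equal -3.6
```

Here 40% of the population is missing and nonrespondents have a mean 9 units higher than respondents, so the respondent mean understates the true mean by 0.4 × 9 = 3.6, exactly as the identity predicts.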
In reducing the nonresponse bias caused by missing data, many procedures can be applied, and one of these is imputation. In this study, imputation procedures are applied to compensate for nonresponse and reduce the bias of the estimates. Imputation is briefly defined as the substitution of values for the nonresponse observations. The imputation procedures are discussed in later portions of this chapter.
3.2
Nonresponse and Its Patterns
This section gives a more in-depth explanation of nonresponse and its patterns. It also presents the rationale for why the nonresponse pattern must be identified before creating procedures to address the problem of missing data.
A critical issue in addressing the problem of nonresponse is identifying the pattern of nonresponse, because the pattern influences how missing data should be handled. There are three patterns of nonresponse, namely Missing Completely At Random, Missing At Random and Non-Ignorable Nonresponse. Data are said to be Missing Completely At Random (MCAR) if the probability of having a missing value for Y is unrelated to the value of Y itself or to any other variable in the data set. Data that are MCAR reflect the highest degree of randomness and show no underlying reasons for missing observations that could potentially lead to biased research findings (Musil, Warner, Yobas and Jones, 2002). Hence, the missing data are randomly distributed across all cases, such that the occurrence of missing data is independent of the other variables in the data set.
Another pattern of nonresponse is the Missing At Random (MAR) case. Missing data are considered MAR if the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis. This means that the likelihood of a case having incomplete information on a variable can be explained by other variables in the data set.
Meanwhile, Non-Ignorable Nonresponse (NIN) is regarded as the most problematic nonresponse pattern. When the probability of missing data on Y is related to the value of Y, and possibly to some other variable Z, even after other variables are controlled in the analysis, the case is termed Non-Ignorable Nonresponse. NIN missing data have systematic, nonrandom factors underlying the occurrence of the missing values that are not apparent or otherwise measured. NIN missing data are the most problematic because of their effect on the generalizability of research findings; they may produce biased parameter estimates, such as means, standard deviations, correlation coefficients or regression coefficients. (Musil, Warner, Yobas and Jones, 2002)
These patterns are an important assumption to consider before any imputation takes place. For an imputation procedure to work and achieve statistically acceptable and reliable estimates, the pattern of nonresponse must satisfy either the MCAR or the MAR assumption. For this study, the researchers created missing observations that satisfy the MCAR assumption.
3.3
Types of Nonresponse
Another important issue in dealing with missing data is the type of nonresponse. While the patterns of nonresponse focus on the relationships of the nonresponse variable to other variables, the types of nonresponse focus on the manner in which the observations became missing. Kalton (1983) stressed the importance of differentiating the types of nonresponse: noncoverage, total (unit) nonresponse, item nonresponse and partial nonresponse.
Noncoverage (NC) denotes the failure to include some units of the survey population in the sampling frame. As a consequence, units that are excluded from the frame have no chance of appearing in the sample. NC is not usually considered a type of nonresponse; however, Kalton (1983) loosely classifies it as one for convenience. NC occurs in surveys where units fail to be covered by the sampling frame or where the listing of units is incomplete.
Unit (or total) nonresponse takes place when no information is collected from a sampling unit. There are many causes of this nonresponse, namely, failure to contact the respondent (not at home, moved, or unit not found), refusal to provide information, inability of the unit to cooperate (perhaps due to an illness or a language barrier), or questionnaires that are lost.
Item nonresponse, on the other hand, happens when the information collected from a unit is incomplete because some questions were left unanswered. There are many causes of item nonresponse: the informant lacks the information needed to answer the question; the informant fails to make the effort required to retrieve the information from memory or from his records; the informant refuses to answer because the question is sensitive, embarrassing or, in his perception, irrelevant to the survey's objectives; the interviewer fails to record an answer; or the response is subsequently rejected at an edit check on the grounds that it is inconsistent with other responses (which may include an inconsistency arising from a coding or punching error in the transfer of the response to the computer data file).
Lastly, partial nonresponse is the failure to collect large sets of items for a responding unit. A sampled unit may fail to provide responses in one or more waves of a panel survey, in later phases of a multi-phase data collection procedure (e.g., the second visit of the FIES), or for later items in the questionnaire after breaking off a telephone interview. Other causes include data being unavailable after all possible checking and follow-up; responses that fail to satisfy natural or reasonable constraints known as edits, so that one or more items are designated unacceptable and therefore artificially missing; and causes similar to those of unit (total) nonresponse. In this study, the researchers dealt with partial nonresponse occurring in the second visit of the FIES 1997.
3.4
The Imputation Procedures
Earlier, imputation was listed as one of the many procedures that can be used to deal with nonresponse in order to generate less biased results. Imputation, as defined by Kalton (1983), is the process of replacing a missing value, through available statistical and mathematical techniques, with a value that is considered a reasonable substitute for the missing information.
Imputation has certain advantages. First, utilizing imputation methods helps reduce biases in survey estimates. Second, imputation makes analysis easier and the results simpler to present. Imputation does not make use of complex algorithms to estimate the population parameters in the presence of missing data; hence, much processing time is saved. Lastly, using imputation techniques can ensure consistency of results across analyses, a feature that an incomplete data set cannot fully provide.
On the other hand, imputation also has several disadvantages. There is no guarantee that the results obtained after applying imputation methods will be less biased than those based on the incomplete data set; the biases of the imputed results could even be greater. Hence, the usefulness of imputation depends on the suitability of the assumptions built into the imputation procedures used. Even if the biases of univariate statistics are reduced, there is no assurance that the distribution of the data and the relationships between variables will be preserved. More importantly, imputation is a fabrication of data. Many naive researchers falsely treat the imputed data set as a complete data set of n respondents, as if it were a straightforward sample of size n.
Given that imputed values are substituted for missing responses, there are a variety of ways in which the imputed value may be determined. These are called imputation procedures or imputation methods (IMs). These techniques can implement either statistical procedures or simple mathematical ones, such as replacing an observation with a constant value (e.g., the mean).
There are four IMs applied in this study, namely, the Overall (Grand) Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI). For most imputation methods, imputation classes need to be defined before the imputation can be performed.
Imputation classes are stratification classes that divide the data into groups before imputation takes place. The formation of imputation classes is most useful when the classes are homogeneous, that is, when units with similar characteristics have some propensity to provide the same response. The variables used to define imputation classes are called matching variables. The values substituted for the nonresponse observations are taken from a group of responding observations on the same variable; these records are called donors. The missing observations to be substituted are called recipients.
Problems might arise if imputation classes are not formed correctly for the imputation methods that rely on them. One issue is the number of imputation classes. The larger the number of imputation classes, the greater the possibility of having fewer observations in any one class, which can cause the variance of the estimates under that class to increase. On the other hand, the smaller the number of imputation classes, the more observations each class contains, making the estimates burdened with aggregation bias.
3.4.1
Overall Mean Imputation (OMI)
The mean imputation method is the process by which missing data are imputed by the mean of the available units in the same imputation class (Cheng, 1999). One type of this method is the OMI method, which simply replaces each missing value with the overall mean of the available (responding) units in the population. The overall mean is given by
$$\bar{y}_{omi} = \frac{\sum_{i=1}^{r} y_{ri}}{r} = \bar{y}_r$$
where $\bar{y}_{omi}$ is the mean of the entire sample of responding units for the variable y and $y_{ri}$ is the ith responding observation under y.
In performing this method, there is no need for a homogeneous imputation class: the imputation class for this method is the entire population itself. In fact, in much of the related literature, imputation classes are not a requirement and are often ignored in performing this method.
This method has both advantages and disadvantages. Its main advantage is its universality: it can be applied to any data set. Moreover, this method does not require imputation classes to be homogeneous or the variables to be highly correlated. Without imputation classes, the method is easier to use and results are generated faster. Among the related literature included in this study, this is the most commonly used method for imputing missing data.
Figure 1: Distribution of the Data Before and After Imputation
However, this method has serious disadvantages. Since missing values are imputed by a single value, the distribution of the data becomes distorted (see Figure 1): it becomes too peaked, making it unsuitable for many post-imputation analyses. Second, the method produces large biases and variances because it does not allow variability in the imputation of missing values. Much of the related literature states that this is the least effective method and that its use is highly discouraged.
3.4.2
Hot Deck Imputation (HDI)
One of the most popular and widely known methods is Hot Deck Imputation (HDI). In the HDI method, missing observations are imputed by choosing a value from the set of available units. This value is either selected at random (traditional hot deck), chosen in some deterministic way with or without replacement (deterministic hot deck), or selected based on a measure of distance (nearest-neighbor hot deck). To perform this method, let Y be the variable that contains missing data and X a set of variables with no missing data. In imputing the missing data:
1. Find a set of categorical X variables that are highly associated with Y. The selected X variables will be the matching variables in this imputation.
2. Form a contingency table based on the X variables.
3. If there are cases missing within a particular cell of the table, select a case from the set of available units of the Y variable and impute the chosen Y value to the missing value.
In choosing the value to be substituted for the missing observation, donor and recipient must have similar or exactly the same characteristics. Cheng (1999) stated that the HDI procedure obtains estimates that reflect the actual data more accurately by making use of imputation classes. If the matching variables are closely associated with the variable being imputed, the nonresponse bias should be reduced, similar to the advantage of imputation classes stated earlier.
Example 1: Suppose that a study is conducted among ten people, and assume that three of them refused to answer some of the questions. The missing answer of each unobserved unit is replaced by a known value from an observed unit with similar characteristics, such as sex, degree or course (Course), Dean's Lister (DL), honor student in high school (HS2), and hours of study classes (HSC). Suppose the set of X matching variables is DL and HS2. Choosing the values to be imputed at random gives the results in Table 1.
Table 1: Using the Hot Deck Imputation to Impute the GPA (imputed values in brackets)

Person  Sex  DL  HS2  HSC  GPA
1       M    Y   Y    2    [3.999]
2       F    Y   N    1    3.567
3       F    N   N    0    1.298
4       F    N   N    0    2.781
5       M    N   Y    1    2.344
6       M    N   N    0    1.111
7       M    N   Y    1    [2.781]
8       F    Y   N    1    3.246
9       F    Y   N    1    [3.246]
10      F    Y   Y    1    3.999
The use of hot deck imputation here is justified. First, since the imputed values came from the same class, the nonresponse bias and the variance of the estimates decrease, because the observations within an imputation class are homogeneous. Had the OMI method been used here, the bias and variance of the estimates would certainly have increased. More importantly, the distribution of the data was preserved; with OMI, the distribution would surely have been distorted, since only one value would have been substituted for all the missing values.
Like OMI, this method has certain advantages. One major attraction, cited by Kazemi (2005), is that the imputed values are all actually observed values. Another is the absence of out-of-range or impossible values; out-of-range values are one of the problems of the Deterministic Regression Imputation (DRI) procedure, which will be tackled in the next section. More importantly, the shape of the distribution is preserved: since imputation classes are introduced, the chance of distorting the distribution decreases.
On the other hand, the method also has disadvantages. First, in order to form imputation classes, all X variables must be categorical. Second, the possibility of generating a distorted data set increases as the nonresponse rate increases: observations from the donor record might be used repeatedly for the missing values, distorting the shape of the distribution. Third, the number of imputation classes must be limited, to ensure that every missing value has a donor in its class.
3.4.3
General Regression Imputation
Like the mean imputation and HDI methods, this procedure is one of the most widely used imputation methods. The method of imputing missing values via least-squares regression is known as the regression imputation (RI) method. This technique can be seen as a generalization of group mean imputation (GMI), another type of mean imputation which, unlike OMI, uses imputation classes.
There are many ways of creating a regression model. In Kalton's study, the variable for which imputations are needed, y, is regressed on the matching variables $(x_1, x_2, \ldots, x_p)$ for the units providing a response on y. The imputation classes in this method are the categories of the matching variables, which are transformed into dummy variables in the model. The matching variables may be quantitative or qualitative, the latter being incorporated into the regression model by means of dummy variables. The missing value may then be imputed in two basic ways: (a) using the predicted value from the model, given the values of the matching variables for the record with a missing response, or (b) using this predicted value plus some type of randomly chosen residual. The former is called Deterministic Regression Imputation (DRI) and the latter Stochastic Regression Imputation (SRI). (Kalton, 1983)
In comparing the accuracy and efficiency of these methods, it is helpful if the methods being compared use the same imputation classes. In Kalton's study, two quantitative matching variables were considered, each with a few categories, so that no further categorization was needed. The general model based on imputation classes has the form
$$\hat{y}_k = \hat{\beta}_0 + \sum_i \hat{\beta}_i x_{ik} + e_k$$
where $\hat{\beta}_0$ and $\hat{\beta}_i$ are the parameter estimates computed from the m responding units, $x_{ik}$ is the dummy independent variable (the ith matching variable for the kth nonresponding unit), $e_k$ is the random residual, and $\hat{y}_k$ is the predicted value to be imputed for the kth nonresponding unit.
Stochastic Regression
Using the predicted value from the model alone corresponds to mean value imputation within the restricted model, and hence has the same undesirable distributional properties. A good case therefore exists for including an estimated residual. There are various ways in which this could be done, depending on the assumptions made about the residuals. The following are some of the more obvious possibilities:
1. Assume that the errors are homoscedastic and normally distributed, $N(0, \sigma_e^2)$. Then $\sigma_e^2$ could be estimated by the residual variance from the regression, $s_e^2$, and the residual for a recipient could be chosen at random from $N(0, s_e^2)$.
2. Assume that the errors are heteroscedastic and normally distributed, with $\sigma_{ej}^2$ being the residual variance in some group j. Estimate $\sigma_{ej}^2$ by $s_{ej}^2$, and choose a residual for a recipient in group j from $N(0, s_{ej}^2)$.
3. Assume that the residuals all come from the same, unspecified, distribution. Then estimate $y_k$ by $\hat{y}_k + \hat{e}_k$, where $\hat{e}_k$ is the estimated residual of a randomly chosen donor.
4. The assumption in (3) accepts the linearity and additivity of the model. If there are doubts about these assumptions, it may be better to take not a randomly chosen donor but one close to the recipient in terms of its x-values (see Kalton, 1983). In the limit, if a donor with the same set of x-values is found, this procedure reduces to assigning that donor's y-value to the recipient.
RI has both advantages and disadvantages. It has the potential to produce imputed values closer to the actual ones than the rough-and-ready, assumption-free imputation-class approaches. However, it is a time-consuming operation, and it is often unrealistic to consider its application for all the items with missing values in a survey. For the method to be effective, imputing predicted values near the actual values, a high $R^2$ is needed. (Kalton, 1983)
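Options (1) and (3) above can be sketched in a few lines. This is a minimal Python illustration, not the thesis's implementation; the observed and fitted values are invented, and a recipient's predicted value of 2.7 is assumed.

```python
import random
import statistics

# Sketch of residual-selection options (1) and (3) for stochastic
# regression imputation; all data are invented for illustration.
y_obs = [3.1, 2.8, 3.6, 2.2, 3.0]            # observed responses (donors)
y_fit = [3.0, 2.9, 3.4, 2.4, 2.9]            # fitted values from the regression
resid = [o - f for o, f in zip(y_obs, y_fit)]

y_hat_recipient = 2.7                        # predicted value for a recipient

# Option 1: homoscedastic normal errors -> draw a residual from N(0, s_e^2)
s_e = statistics.stdev(resid)
imputed_normal = y_hat_recipient + random.gauss(0, s_e)

# Option 3: no distributional assumption -> attach the residual of a
# randomly chosen donor to the predicted value
imputed_donor = y_hat_recipient + random.choice(resid)

print(imputed_normal, imputed_donor)
```

Option (3) keeps every imputed residual equal to one actually observed in the donor set, while option (1) can produce any real value, including ones outside the observed range.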
As for the deterministic and stochastic versions, a few disadvantages should be noted. Under DRI, the distribution becomes too peaked and the variance is underestimated. Under SRI, while the deterministic imputed value may itself be feasible, it is possible that after the residual is added the resulting value is unfeasible.
Chapter 4
Methodology

4.1 Source of Data
The purpose of this section is to give an overview of the data used for this study, the Family Income and Expenditures Survey (FIES) 1997.
4.1.1
General Background
The Family Income and Expenditures Survey (FIES) 1997 is a nationwide survey with two visits per survey period on the same households conducted by the National Statistics Office (NSO) every three years. The objectives of the survey are as follows: 1. to gather data on family income and family living expenditures and related information affecting income and expenditure levels and patterns in the Philippines; 2. to determine the sources of income and income distribution, levels of living and spending patterns, and the degree of inequality among families,
3. to provide benchmark information to update weights in the estimation of the consumer price index; and 4. to provide information for the estimation of the country's poverty threshold and incidence.
4.1.2
Sampling Design and Coverage
The sampling design of the FIES 1997 is a stratified multi-stage design consisting of 3,416 Primary Sampling Units (PSUs) for the provincial estimates, with a subsample of 2,247 PSUs serving as a master sample for the regional-level estimates. (National Statistics Office [NSO], 1997-2005)
This multi-stage sampling design involved three stages: first, the selection of sample barangays; second, the selection of sample enumeration areas, which are physically delineated portions of a barangay; and third, the selection of sample households. The sampling frames and stratification of the three stages were based on the 1995 Census of Population (POPCEN) and the 1990 Census of Population and Housing (CPH). Through this method, a sample of 41,000 households participated in the survey. (NSO, 1997-2005)
4.1.3
Survey Characteristics
The FIES 1997 questionnaire contains about 800 data items, with the questions asked by the interviewer of the respondent of the selected sample household. A respondent is defined as the person who manages the finances of the family, or any member of the family who can give reliable information for the questionnaire. (NSO, 1997-2005)
For the list of items or variables gathered in the FIES 1997, see the Appendix.
4.1.4
Survey Nonresponse
Two types of nonresponse occurred in the 1997 FIES. The first type, resulting from factors such as being unaware of the question, being unwilling to provide the answer, or omission of the question during the interview, is called item nonresponse. (NSO, 1997-2005)
The other type, due to households being temporarily away, on vacation, not at home, demolished, or having transferred residence during the second visit, is called partial nonresponse. This type of nonresponse accounted for only 3.6% of the total number of respondents. (NSO, 1997-2005)
The NSO devised only deductive imputation to address the problem of item nonresponse, while no specific method was mentioned to compensate for partial nonresponse. (NSO, 1997-2005)
Hence, the researchers focused on the comparison of imputation procedures for partial nonresponse. The first selection made by the researchers was the choice of the regional data set to which the imputation techniques would be applied. The National Capital Region (NCR) was chosen because it was noted as the region with the highest nonresponse rate. The data consist of 4,130 observations and 39 categorical variables, the rest being continuous variables pertaining to the income and expenditures of the respondents. As for the variables to be imputed, the researchers chose two, namely Total Income (TOTIN) and Total Expenditure (TOTEX). These variables were chosen because of their importance to the FIES and the frequency of missing values in these observations.
4.2
The Simulation Method
In order to investigate and make an empirical comparison of the statistical properties of the estimates with imputed values under the selected imputation methods, a data set with missing observations was simulated. The simulation creates an artificial data set whose missing observations indicate which values will be imputed.
The algorithm for this simulation procedure is as follows:
1. To get the number of observations to be set to missing for each nonresponse rate, the total number of observations, 4,130, was multiplied by the indicated nonresponse rate. The nonresponse rates used for this study were 10%, 20% and 30%. The rationale for setting different nonresponse rates is that the study aims to investigate the effect of varying nonresponse rates on each imputation method.
2. A random number was assigned to each observation of the 1997 FIES second-visit variables TOTIN2 and TOTEX2. This was done to satisfy the assumptions that the data exhibit partial nonresponse and that the missing observations follow the Missing Completely At Random (MCAR) nonresponse pattern.
3. The second-visit observations for both variables were sorted in ascending order of their corresponding random numbers.
4. The first 10% of the sorted second-visit data for both variables were selected and set as missing observations. The same procedure was applied for the data sets containing 20% and 30% nonresponse rates, respectively.
5. The missing observations were flagged, to distinguish the imputed from the actual values during the data analysis.
This simulation method was implemented using the Decimal BASIC program SIMULATION.BAS (see Appendix for the source code), where the files Simulated Values for Income (SIMI) and Simulated Values for Expenditure (SIME), matrices containing the missing observations for income and expenditure, were stored for use in the application of the imputation methods.
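The steps above can be sketched as follows. This is a minimal Python analogue of SIMULATION.BAS, not a reproduction of it: `totin2` is a tiny invented list standing in for the 4,130 second-visit values, and missing observations are marked with `None`.

```python
import random

# Sketch of the MCAR simulation step: pair each observation with a
# random number, sort by it, and flag the first 10% as missing.
random.seed(1997)
totin2 = [float(v) for v in range(100, 120)]   # 20 hypothetical observations
nonresponse_rate = 0.10

# Step 1: number of observations to set to missing at this rate
n_missing = round(len(totin2) * nonresponse_rate)

# Steps 2-3: assign a random number to each observation and sort by it;
# the random numbers are unrelated to the values, which is what makes
# the resulting pattern MCAR.
order = sorted(range(len(totin2)), key=lambda i: random.random())

# Steps 4-5: flag the first n_missing sorted positions as missing
flags = [False] * len(totin2)
for i in order[:n_missing]:
    flags[i] = True                            # flagged artificially-missing

simulated = [None if f else v for v, f in zip(totin2, flags)]
print(sum(flags))   # 2
```

Because every observation gets the same chance of being flagged, the flagged subset is a simple random sample of the data, satisfying the MCAR assumption by construction.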
4.3
Formation of Imputation Classes
Imputation classes are stratification classes that divide the data in order to produce groups with similar characteristics. Assuming that units with the same characteristics have the propensity to give the same response, the formation of imputation classes helps reduce the bias of the estimates.
The steps undertaken in the formation of the imputation classes are as follows:
1. The researchers identified the potential matching variables: candidate variables that could have an association with the variables of interest (i.e., TOTIN and TOTEX).
2. A categorical variable from the first-visit data had to meet three criteria in order to be selected as a candidate variable: the variable must be known; it must be easy to measure; and the probability of missing observations for the variable must be small.
3. For variables with many categories, the researchers reduced the number of categories, because having too many categories can increase heterogeneity and the bias of the estimates. This was done using the Recode function of the software Statistica.
4. Measures of association were tested on the matching variables. The chi-squared test was applied first, to determine whether each candidate variable is a significant factor for the variables of interest.
5. Other tests of association followed, namely the phi coefficient, Cramer's V and the contingency coefficient. The candidate variable with the greatest degree of association was chosen as the matching variable used to group the data into their respective imputation classes. All these tests were made using the statistical packages Statistica and SPSS. The results of these tests will be presented in the next chapter.
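The association checks in steps 4 and 5 can be illustrated on a small contingency table. The thesis used Statistica and SPSS for this; the sketch below computes the chi-squared statistic and Cramer's V by hand on an invented 2x2 table of a candidate matching variable against a categorized variable of interest.

```python
import math

# Hypothetical 2x2 contingency table: candidate matching variable
# (rows) vs. categorized variable of interest (columns).
table = [[30, 10],
         [15, 45]]

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / n
        chi2 += (obs - exp) ** 2 / exp

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
k = min(len(table), len(table[0]))
cramers_v = math.sqrt(chi2 / (n * (k - 1)))
print(round(chi2, 2), round(cramers_v, 3))   # 24.24 0.492
```

Among several candidates, the one with the largest Cramer's V (closest association with the variable of interest) would be chosen as the matching variable.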
4.4 Performing the Imputation Techniques

4.4.1 Overall Mean Imputation (OMI)
The Overall Mean Imputation (OMI) is an imputation procedure in which the missing observations are replaced with the mean of the available units of the variable. As stated in the Conceptual Framework, this imputation method does not require the formation of imputation classes, which makes it the simplest of the four methods in this study.
The procedure for applying the Overall Mean Imputation (OMI) is as follows:
1. The overall mean of each variable of interest, TOTIN and TOTEX, for the first visit was computed using the formula
$$\bar{y}_{omi} = \frac{\sum_{i=1}^{r} y_{ri}}{r}$$
where $\bar{y}_{omi}$ is the overall mean of the first-visit TOTIN or TOTEX, $y_{ri}$ is the ith first-visit observation of the variable, and r is the total number of responding units for the first-visit variable.
2. Using the nonresponse data sets generated, the missing observations of the second-visit variables TOTIN and TOTEX were replaced with the overall means of the first-visit TOTIN and TOTEX.
The Overall Mean Imputation (OMI) was implemented through the Decimal BASIC program OMI.BAS (see Appendix for the source code).
4.4.2
Hot Deck Imputation (HDI)
The Hot Deck Imputation (HDI) is an imputation procedure in which the missing observations are replaced by values chosen from the set of available units.
The steps undertaken in applying the Hot Deck Imputation are as follows:
1. The donor and recipient records for each imputation class and variable were identified.
2. The missing observations of the second-visit TOTIN2 and TOTEX2 were assigned to their respective recipient records for each imputation class, while the corresponding first-visit observations were placed in their respective donor records for each imputation class.
3. The values substituted for the missing observations were randomly chosen from the donor record of each imputation class.
The Hot Deck Imputation was implemented through the Decimal BASIC program HOT DECK.BAS (see Appendix for the source code).
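The steps above can be sketched as follows. This is a minimal Python analogue of HOT DECK.BAS, not a reproduction: the records, class labels ("A", "B") and donor pools are invented, with `None` marking the recipients.

```python
import random

# Sketch of hot-deck imputation: within each imputation class, a
# recipient's missing second-visit value is replaced by a randomly
# chosen first-visit donor value from the same class. Data invented.
random.seed(7)
records = [
    {"cls": "A", "totin2": 120.0},
    {"cls": "A", "totin2": None},    # recipient in class A
    {"cls": "A", "totin2": 130.0},
    {"cls": "B", "totin2": 90.0},
    {"cls": "B", "totin2": None},    # recipient in class B
]

# First-visit values serve as the donor pool, grouped by imputation class
donors = {"A": [118.0, 132.0], "B": [88.0, 95.0]}

# Each recipient draws a donor value from its own class only
for rec in records:
    if rec["totin2"] is None:
        rec["totin2"] = random.choice(donors[rec["cls"]])

print([r["totin2"] for r in records])
```

Because donors and recipients are matched within a class, every imputed value is an actually observed value from a similar household, which is the property that lets HDI preserve the shape of the distribution.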
4.4.3
Deterministic and Stochastic Regression Imputation (DRI) and (SRI)
Deterministic Regression Imputation (DRI) is a procedure that involves fitting a least-squares regression in which Y is regressed on the matching variables $(x_1, x_2, \ldots, x_p)$ in order to predict the missing value. On the other hand, Stochastic Regression Imputation (SRI) employs a similar procedure but adds an error term e to the estimated value in order to predict the missing data.
The steps employed for the regression imputation are as follows:
1. A logarithmic transformation was applied to the first- and second-visit variables TOTIN and TOTEX. The rationale for this transformation is that the income and expenditure variables are not normally distributed; moreover, logarithmic transformations help correct non-linearity in the regression equation.
2. The regression equation was formed after the transformation. For this study, only one predictor variable was used, and the general formula for the regression equation is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{e}_i$$
where $\hat{y}$ is the predicted observation of the second-visit variable TOTIN or TOTEX, $\hat{\beta}_0$ and $\hat{\beta}_1$ are the parameter estimates, x is the first-visit variable, and $\hat{e}_i$ is the random residual term. Note that for DRI, $\hat{e}_i = 0$.
3. For the stochastic regression, which involves the computation of the error term, the following steps were made: (a) a frequency distribution of the residuals was created; (b) the class means of the frequency distribution were used to obtain the error terms for the regression equation.
4. Model validation of the regression equations followed. This diagnostic checking requires the following assumptions to be satisfied: (a) linearity; (b) normality of the error terms; (c) independence of the error terms; (d) constancy of variance. The results of the diagnostic checking of each regression equation used in this study are presented in the Appendix.
5. The missing observations were replaced by the predicted values from the corresponding regression equations.
4.5
Comparison of Imputation Techniques
4.5.1
The Bias and Variance of the Estimates
The primary objective of using imputation techniques is to be able to generate statistically reliable estimates. To check if the imputation techniques produce reliable estimates, the bias and the variances of the estimates must be examined.
The variances of the actual and imputed data were obtained to compare the ability of the imputation techniques to mirror the actual data, and to determine the effect of the varying nonresponse rates on the performance of the imputation techniques.
On the other hand, the bias of the estimates was computed as follows:
1. The mean of the responding units, ȳ_r, was computed.
2. The mean of the nonresponding units, ȳ_m, was computed.
3. The bias of the estimates was computed by taking the difference between the means of the responding and nonresponding units, divided by the difference between the number of responding units and the number of nonresponding units.
The results of this section will be presented in the next chapter.
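Read literally, the three steps can be sketched as below. This is a hypothetical Python sketch of the steps as stated; note that the more common form of the nonresponse bias in the literature weights the difference of means by the nonresponse rate (m/n) rather than dividing by the difference in counts, so the denominator here simply follows the text.

```python
import numpy as np

def bias_of_estimate(y, missing):
    """Bias following steps 1-3 above.

    y: values for all sampled cases; missing: boolean mask marking the
    nonresponding units.
    """
    y_r = y[~missing].mean()           # step 1: mean of responding units
    y_m = y[missing].mean()            # step 2: mean of nonresponding units
    n_r = (~missing).sum()
    n_m = missing.sum()
    return (y_r - y_m) / (n_r - n_m)   # step 3, as described in the text
```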
4.5.2
Comparing the Distributions of the Imputed vs. the Actual Data
In order to determine which imputation method was able to maintain the distribution of the actual data, a goodness-of-fit test was utilized. For this study, the researchers chose the Kolmogorov-Smirnov (K-S) test. The Kolmogorov-Smirnov test is a goodness-of-fit test concerned with the degree of agreement between the distribution of a set of sampled (observed) values and some specified theoretical distribution (Siegel, 1988). In this study, the researchers were concerned with how the imputation methods affected the distribution of the FIES 1997 data.
The following steps were taken for the Kolmogorov-Smirnov test:
1. Income and expenditure deciles were created. The creation of these deciles was based on the second visit actual FIES 1997 data.
2. The obtained deciles were used as the upper bounds of the frequency classes.
3. A Frequency Distribution Table (FDT) for each trial was created. For this part, the researchers used the SPSS aggregate function to generate the FDT.
4. The FDT includes the Relative Cumulative Frequency (RCF) for both the imputed and the actual distribution. RCFs are computed by dividing the cumulative frequency by the total number of observations.
5. The absolute value of the difference between the actual data RCF and the imputed RCF was computed using Microsoft Excel.
6. The test statistic for the Kolmogorov-Smirnov test, which is the maximum deviation D, was determined using the formula:

D = max|RCF_imputed − RCF_actual|

7. Since this is a large sample case and assuming a 0.05 level of significance, the critical value is computed as:

1.36 / √N, with N = 4,130

8. If D is less than the critical value, the conclusion follows that the imputed data maintains the same distribution as the actual data.
The results of this test will be presented in the next chapter.
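The steps above can be sketched as follows. This is a minimal Python sketch under our own assumptions; the function and variable names are illustrative, and the top frequency class is left open-ended rather than bounded by the sample maximum.

```python
import numpy as np

def ks_decile_test(actual, imputed, coeff=1.36):
    """Decile-based Kolmogorov-Smirnov check, following steps 1-8 above.

    Frequency classes are bounded above by the deciles of the actual data
    (steps 1-2); D = max|RCF_imputed - RCF_actual| (step 6) is compared
    with the large-sample 0.05 critical value 1.36/sqrt(N) (step 7).
    """
    n = actual.size
    # nine decile upper bounds plus an open-ended top class
    bounds = np.append(np.percentile(actual, np.arange(10, 100, 10)), np.inf)
    rcf_act = np.array([(actual <= b).mean() for b in bounds])
    rcf_imp = np.array([(imputed <= b).mean() for b in bounds])
    d = np.abs(rcf_imp - rcf_act).max()
    crit = coeff / np.sqrt(n)
    return d, crit, d < crit   # True: same distribution maintained

# With N = 4,130 the critical value 1.36/sqrt(4130) is roughly 0.0212.
```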
4.5.3
Using Kalton’s Criteria in Assessing the Performance of the Imputation Techniques
Lastly, the researchers adopted Kalton’s measures for evaluating the effectiveness of imputation methods. These measures are: (a) Mean Deviation (MD), (b) Mean Absolute Deviation (MAD) and (c) Root Mean Square Deviation (RMSD).
The Mean Deviation (MD) measures the bias of the imputed values. This is represented by the formula:

MD = Σ(ŷ_mi − y_mi) / m,  i = 1, 2, ..., m

where ŷ_mi is the imputed value of the variable TOTIN or TOTEX and y_mi is the actual value of the variable TOTIN or TOTEX for case i = 1, 2, ..., m.
The Mean Absolute Deviation (MAD) is a criterion for measuring the closeness with which the deleted values are reconstructed. This is represented by the formula:

MAD = Σ|ŷ_mi − y_mi| / m,  i = 1, 2, ..., m

where ŷ_mi is the imputed value of the variable TOTIN or TOTEX and y_mi is the actual value of the variable TOTIN or TOTEX for case i = 1, 2, ..., m.
The Root Mean Square Deviation (RMSD) is the square root of the mean of the squared deviations between the imputed and actual observations. Like the MAD, it measures the closeness with which the deleted values are reconstructed. This is expressed as:

RMSD = √( Σ(ŷ_mi − y_mi)² / m ),  i = 1, 2, ..., m

where ŷ_mi is the imputed value of the variable TOTIN or TOTEX and y_mi is the actual value of the variable TOTIN or TOTEX for case i = 1, 2, ..., m.
These three criteria for measuring the performance of the imputation techniques were implemented using the Decimal BASIC program. After each imputation method is performed, the program computes the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation, and the results were saved in the corresponding Kalton's Criteria for Expenditure (CRITEX) and Kalton's Criteria for Income (CRITIN) files.
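Although the thesis implemented these criteria in Decimal BASIC, the three measures can be sketched in a few lines; the following is a hypothetical Python equivalent.

```python
import numpy as np

def kalton_criteria(y_imputed, y_actual):
    """Kalton's three measures over the m imputed cases.

    MD   = sum(yhat_i - y_i) / m            (bias of the imputed values)
    MAD  = sum(|yhat_i - y_i|) / m          (closeness of reconstruction)
    RMSD = sqrt(sum((yhat_i - y_i)**2) / m)
    """
    d = np.asarray(y_imputed, dtype=float) - np.asarray(y_actual, dtype=float)
    m = d.size
    return d.sum() / m, np.abs(d).sum() / m, np.sqrt((d ** 2).sum() / m)
```

Note that offsetting positive and negative deviations can drive the MD toward zero while the MAD and RMSD remain large, which is why all three are reported.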
4.5.4
Determining the Best Imputation Method
To answer the primary objective of this study, which is determining the best or most appropriate imputation technique for FIES 1997, the researchers ranked the four imputation techniques based on the criteria discussed in the previous sections. The selection of the best method is done independently for each variable of interest and nonresponse rate. The ranking of the imputation methods covered the following: Nonresponse Bias (NB), Estimated Percentage of Correct Distribution of the Imputed Data (PCD), Mean Deviation (MD), Mean Absolute Deviation (MAD) and Root Mean Square Deviation (RMSD).
The procedure for ranking is as follows:
1. On each criterion mentioned above, the imputation methods were ranked on a scale of 1 to 4, with 1 indicating the best imputation method and 4 the worst.
2. For each variable of interest and nonresponse rate, these rankings were obtained and summarized for each criterion.
3. The rankings obtained by a particular imputation method across the criteria are added.
4. The imputation method with the lowest total is considered the best imputation method for the respective variable of interest and nonresponse rate.
The results of the ranking procedure will be presented in the next chapter.
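The ranking procedure can be sketched as follows. This is a Python sketch with illustrative names; "lower is better" is assumed for every criterion passed in, so a measure like PCD, where higher is better, would be negated first. Average ranks for ties follow the convention stated later in the thesis.

```python
def rank_methods(scores):
    """Rank the methods 1 (best) to 4 (worst) per criterion and total them.

    scores: {criterion: {method: value}}, where a lower value is better.
    Tied methods receive the average of the ranks they span.
    Returns (best_method, rank_totals); the lowest total wins.
    """
    methods = sorted(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for vals in scores.values():
        ordered = sorted(methods, key=lambda m: vals[m])
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and vals[ordered[j + 1]] == vals[ordered[i]]:
                j += 1                     # extend the block of tied values
            avg_rank = (i + j) / 2 + 1     # ranks are 1-based
            for k in range(i, j + 1):
                totals[ordered[k]] += avg_rank
            i = j + 1
    return min(totals, key=totals.get), totals
```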
Chapter 5
Results and Discussion

5.1
Descriptive Statistics of Second Visit Data Variables
Table 2 shows the descriptive statistics of the second visit variables of interest (VI), TOTEX2 and TOTIN2. These were computed to give a brief idea of how much a household spends and earns over a period of time, to measure the differences in the statistics between the two variables, and to allow comparison with the results of other tests later on. These descriptive statistics will also be used in evaluating the imputation classes (IC), i.e., how well the observations are grouped.
Table 2: Descriptive Statistics

Variable   Mean       Std. Dev   Min       Max       N
TOTEX2     102389.8   129866.6   8926.00   3903978   4130
TOTIN2     134119.4   216934.9   9067.00   4357180   4130
The average total spending of a household in the National Capital Region (NCR) is about Php 102,389.80, while the average total earnings amounted to Php 134,119.40, a difference of more than thirty thousand pesos. Observations of TOTIN2 are larger and more spread out than those of TOTEX2, given its larger mean and standard deviation respectively. The dispersion can also be seen by looking at the minimum and maximum of the two variables. The range of TOTIN2, which measured more than four million, against that of TOTEX2, about one million lower, is a further sign of the extreme variability of the observations.
5.2
Formation of Imputation Classes
Table 3 shows the results of the chi-square test, which was done to determine whether the candidate matching variables (MV) are associated with the VIs. The MV stated in the methodology must be highly correlated with the variables of interest. The first visit variables of interest, TOTIN1 and TOTEX1, were grouped into four categories in order to satisfy the assumptions of the association tests. The first visit VIs were used as the variables to be tested for association rather than the second visit variables of interest, since the second visit VIs already contained missing data.
The candidate matching variables that were tested are the provincial area codes (PROV), recoded education status (CODES1) and recoded total employed household members (CODEP1). PROV has four categories, CODES1 has three, and CODEP1 also has four. Originally, CODES1 and CODEP1 had more categories than they have now (more than 60 and 7 categories, respectively), so the matching variables were recoded and further grouped into smaller categories.
The resulting number for each candidate is the χ2 test statistic, with its p-value below it.
Table 3: Tests of Association for Matching Variable: The Chi-Square Test of Independence
The chi-squared test of association for the candidates and the variables of interest showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and CODEX1. The p-values for all the candidates were less than 0.0001, indicating that the association is highly significant. The results of the succeeding tests of association will determine which of the three candidates will be chosen as the MV of the study; the chi-squared test alone is insufficient since it cannot determine the best MV.
Table 4 shows the other tests of association, namely the Phi coefficient, Cramér's V and the contingency coefficient. These tests were done in order to assess the degree of association of the candidates with CODIN1 and CODEX1.
Table 4: Tests of Association for Matching Variable: Degree of Association
The table above displays the degree of association between the candidates and the variables of interest. All the tests showed weak association; in real, complex data, associations between variables tend to be small or absent altogether. Among the candidates, only CODES1 measured at the minimum of 20% required for dividing the data into imputation classes. This result is sufficient to select CODES1 as the MV for this data.
To describe the CODES1 imputation classes in detail, descriptive statistics were computed for each imputation class. Table 5 shows the descriptive statistics of each imputation class of the data. These statistics will tell whether the best MV decreases the variability of the observations. In checking the variability of each imputation class, the standard deviation is used and compared with the overall standard deviation of the variables of interest.
Table 5: Descriptive Statistics of the Data Grouped into Imputation Classes.
The table above shows that for both VIs, IC1, the class with the largest number of observations, produced less spread than the other two ICs. IC2 and IC3 produced large standard deviations, but these are offset by the low value from IC1, which holds the largest proportion of the data. The large standard deviation and mean of IC3 may be explained by the majority of the extreme values being contained in that class.
5.2.1
Mean of the Simulated Data by Nonresponse Rate for Each Variable of Interest
Table 6 shows the means of both VIs under the varying rates of nonresponse. This was generated to give a brief description of the effects of the nonresponse rate on the population mean when the missing values are ignored. More importantly, the results below will serve as input in comparing the estimates from the imputed data for each imputation method (IM).
Table 6: Means of the Retained and Deleted Observations
The mean of the observations set to nonresponse and the mean of the observations retained showed contrasting results. As the nonresponse rate gets larger, the mean of the observations set to nonresponse increases, while the mean of the retained observations decreases. It is possible that large values were set to nonresponse, which increased the means of the data sets containing nonresponse across the varying nonresponse rates.
Comparing the means for the varying nonresponse rates under each VI, the results showed little difference between the population mean ignoring the missing data and the population mean of the actual data. However, as described above, as the number of missing values increases, the deviation between the means of the actual and retained data slowly increases.
5.2.2
Regression Model Adequacy
Table 7 shows the different regression models for all VIs and nonresponse rates (NRRs) that were checked for adequacy. The columns are as follows: (a) VI, (b) the nonresponse rate (NRR), (c) IC, (d) the prediction model, (e) the coefficient of determination (R2) and (f) the F-statistic with its p-value.
Table 7: Model Adequacy Results
The results showed that all of the models fit their respective data sets adequately. The highest R2 in Table 7 was 93.2%, the coefficient of determination for the third imputation class of the TOTEX2 variable under the highest NRR, while the lowest was 70.3%, for the first imputation class of the TOTIN2 variable under the 20% NRR. It is interesting to note that for all NRRs and VIs, the third IC generated the highest R2 among the ICs; the lowest R2 among the models for the third imputation class is 88.8%, from the 30% NRR of the TOTEX2 variable. Conversely, the first IC generated the lowest R2 for all NRRs and VIs. (For the other figures and graphs of the fitted models, see the Appendix.)
5.2.3
Evaluation of the Different Imputation Methods
In the evaluation of the different imputation methods (IMs), the results of each IM will be discussed independently. For each IM, the discussion of results proceeds as follows: (1) nonresponse bias and variance of the population estimates of the imputed data, (2) distribution of the imputed data using the Kolmogorov-Smirnov goodness-of-fit test, and (3) other measures of variability using the mean deviation (MD), mean absolute deviation (MAD) and root mean square deviation (RMSD).
The table of results contains the following columns: (a) VI, (b) NRR, (c) the bias of the population mean of the imputed data, Bias(ŷ′), (d) the variance of the population mean of the imputed data, Var(ŷ′), (e) the estimated percentage of correct distribution of the imputed data set relative to the actual data set (PCD), (f) Mean Deviation (MD), (g) Mean Absolute Deviation (MAD) and (h) Root Mean Square Deviation (RMSD).
Overall Mean Imputation Table 8 shows the results of the different criteria in evaluating the newly created data with imputations using the overall mean imputation (OMI) method.
Table 8: Criteria Results for the OMI Method
1. Nonresponse Bias and Variance
In column (c) of Table 8, the results show that as the nonresponse rate increases for both VIs, the value of the nonresponse bias decreases. The decrease in the bias for TOTIN2 was faster and more dramatic than for TOTEX2: the bias changed by almost 500% under the 20% NRR, and at the highest NRR the rate of decrease almost tripled that of the 20% NRR. In contrast, the decrease in the bias for TOTEX2 is much slower. The biases at the 20% and 30% NRRs for TOTIN2 are more than six times larger than those for TOTEX2.
The variances for all NRRs and VIs are all zero because the population mean of the imputed data set is constant. The data were not simulated one thousand times, unlike for hot deck imputation (HDI) and stochastic regression imputation (SRI); with only a single simulation, the OMI method does not create a sampling distribution for the mean of the created data.
2. Distribution of the Imputed Data
Results in column (e) of Table 8 show that for all nonresponse rates and variables, the OMI method failed to maintain the distribution of the actual data. This was expected, primarily because every missing observation in each data set was replaced by a single value: the overall mean of the first visit of the VI.
Results from other studies state that OMI is one of the worst among all imputation methods. It is remarked that even though it is a simple process, it obviously produces inaccurate results. Cases that vary significantly from the imputed values were the primary cause of inaccuracy. Also, the use of only a single value to impute for the missing data distorts the distribution of the data: the distribution becomes too peaked, which makes this method unsuitable for many post-imputation analyses (Cheng, 1999).
3. Other Measures of Variability
The three criteria in Table 8 under columns (f), (g) and (h) show the other measures of variability of the imputed data. In all the criteria, the values for TOTEX2 increase as the nonresponse rate increases. However, this is not the case for TOTIN2: surprisingly, the data with twenty percent of observations imputed have the highest values for the three criteria.
It is worth noting that the mean deviation, which focuses on each observation, contrasted with the results of the bias, which focuses on the population mean of the imputed data. The mean deviations for all nonresponse rates under the TOTEX2 variable overestimated the actual data, whereas the population mean of the imputed data underestimated it. Likewise, for the other variable, where the mean deviation indicates an underestimate, the bias indicates the opposite, an overestimate.
5.2.4
Hot Deck Imputation
Table 9 shows the results of the different criteria in evaluating imputed data with imputations using the hot deck imputation (HDI3) method with three imputation classes.
Table 9: Criteria Results for the HDI Method
1. Nonresponse Bias and Variance
Similar to the results of the OMI method, the bias of the population mean of the imputed data increases for both variables as the NRR increases. As with OMI, for the TOTIN2 variable, the bias of the data with twenty percent imputations is more than four times the bias of the data with ten percent imputed, and almost half the bias of the data with thirty percent imputed. The bias in the TOTIN2 variable under this method is slightly worse than under the OMI method.
Results similar to OMI were seen for the other variable, TOTEX2, where for the data containing 30% imputations the bias becomes negative. The bias appears to decrease in value as the NRR increases. The biases for the first and second NRRs under HDI3 were better than those of OMI.
The variance of the population mean of the imputed data increases by more than one hundred percent as the nonresponse rate increases. The data containing the lowest number of imputations provided the least spread of the population means, and the data containing the largest number of imputations provided the worst spread.
2. Distribution of the Imputed Data
Results in column (e) show that for TOTIN2, the imputed data maintained the distribution of the actual data for the data sets containing ten and twenty percent imputations. For the other variable, only the data with ten percent imputations maintained the distribution of the actual data across all one thousand data sets; for the data with twenty percent imputations, only 969 of the 1000 data sets maintained the distribution of the actual data.
For the data sets containing the largest number of imputations, both variables failed to maintain the distribution of the actual data. Worse, none of the simulated data sets for TOTEX2 registered the same distribution as the actual data, while only a lone data set did so for the other variable. The researchers consider the possibility that more than one recipient shared the same donor, or that the majority of the imputations came from one particular area of the record.
3. Other Measures of Variability
For the three remaining criteria, the values generated were better than those of the OMI method. For both variables, the MD criterion showed an underestimation of the actual observations: while the OMI method overestimated the deleted actual values for the TOTIN2 variable, HDI3 underestimated them, and the underestimation rapidly increases as the nonresponse rate increases. The magnitude of the MD for TOTIN2 is larger for HDI3 than for OMI at all nonresponse rates. Similarly, the MAD and RMSD for TOTIN2 were unusually large compared to OMI. It seems that the imputation classes were not as effective for the TOTIN2 variable as for the TOTEX2 variable, where the majority of values across all nonresponse rates and criteria showed HDI3 to be better than OMI.
5.2.5
Deterministic Regression Imputation
Table 10 shows the results of the different criteria in evaluating the imputed data using the deterministic regression imputation method with three imputation classes (DRI).
Table 10: Criteria Results for the DRI Method
1. Nonresponse Bias and Variance Looking at Table 10, the bias for all NRR and VI showed negative results
which indicates that the population mean of the imputed data is underestimated. The results for the nonresponse bias of this method are similar to those of the previous two methods in that TOTIN2 is underestimated. However, unlike OMI and HDI, where the bias increases tremendously as the nonresponse rate increases, the increase in bias for this method is much slower: the bias of the data set with twenty percent imputations is just twice the bias of the data set with the lower percentage of imputations. For the TOTEX2 variable, this method produces more biased estimates at all NRRs than the two previous methods.
As in the OMI method, the variance for this method is also zero since the population mean is constant due to a single simulation of the missing observations.
2. Distribution of the Imputed Data
Contrary to the results of the OMI method under this criterion, the DRI maintained the distribution for all NRRs and VIs. It even surpassed the HDI, since all of the imputed data sets under all NRRs and VIs preserved the same distribution as the actual data. It is interesting to note that the regression models used in this study did not show the results expected from the related literature. Earlier studies that used categorical auxiliary variables, which correspond to the matching variables in this study, concluded that deterministic regression, like mean imputation, generates distorted and peaked distributions. In this study, however, the independent variable was the first visit VI, and each imputation class had its own fitted model with a high R2, which made the difference.
3. Other Measures of Variability
Similar to the results for the nonresponse bias, the MD for all VIs and NRRs underestimates the actual observations. The underestimation is almost stable across the NRRs because the rate of change is very small compared to the two previous IMs. The MAD and RMSD show better results than OMI and HDI, providing imputed values closer to the actual observations. As with OMI and HDI, TOTIN2 has larger values for the MAD and RMSD criteria. Fitting models with high R2 was the key factor that made this method better than the other two IMs previously evaluated.
5.2.6
Stochastic Regression Imputation
Table 11 shows the results of the different criteria in evaluating the imputed data using the stochastic regression imputation method with three imputation classes (SRI).

Table 11: Criteria Results for the SRI Method
1. Nonresponse Bias and Variance
The only method that produced reasonable estimates is the SRI method; the random residual added to the deterministic predicted observation made the difference. There is clearly no relationship between the nonresponse bias of the population mean estimates and the nonresponse rate: the biases fluctuate from one nonresponse rate to another. This method provided the least bias at the highest nonresponse rate for both TOTEX2 and TOTIN2. While the other methods reached a four digit bias, the SRI generated a much smaller bias than the other three methods. Moreover, there is a huge disparity at the third nonresponse rate, where it produced less than twenty percent of the bias produced by its deterministic counterpart.
The variances of the SRI proved to be much better than those of its model-free counterpart, the HDI. Across all nonresponse rates, there is a clear and huge disparity between the variances of the SRI and HDI: variances from the HDI are almost ten times larger than those from the SRI.
2. Distribution of the Imputed Data
Results from the SRI were better than those of its model-free counterpart, the HDI method, which also simulated the data 1000 times. Unlike hot deck imputation, stochastic regression imputation maintained the same distribution for all imputed data sets at the first and third nonresponse rates. It also outperformed the former at the second nonresponse rate for the TOTEX2 variable. One reason why 16 of the 1000 sets failed to maintain the distribution of the actual data for the imputed data sets containing twenty percent, or 826, imputations might be the unfeasibility of the predicted values.
In earlier studies, stochastic regression imputation performs better than any of the four methods used here. The random residual is added to the deterministic predicted value to preserve the distribution of the data. However, even if the original deterministic imputed values are feasible, their stochastic counterparts need not be: after adding the residual to the deterministic imputation, unfeasible values may well result (Nordholt, 1998).
3. Other Measures of Variability
Similar to the results for the nonresponse bias, the MD shows no relationship with the NRR, fluctuating from one NRR to another. On this criterion, the SRI outperformed its regression counterpart but was outperformed by the two other methods. In contrast to the observations for the MD criterion, on the MAD and RMSD the SRI follows closely behind the DRI method and provides better values than the two other methods.
In the review of related literature, stochastic regression performs far better than deterministic regression. The researchers point to the same reason as in the previous criterion: it is likely that the predicted values are unrealistic compared to the deterministic predicted values.
After comparing the different methods with the criteria proposed in the methodology, the distributions of the true values (TVs) that were deleted and of the imputed values (IVs) from each imputation procedure were computed for all VIs and nonresponse rates. Tables 12, 13 and 14 show the frequency distributions of the methods with their corresponding relative frequencies (RFs) for the first, second and third nonresponse rates respectively. The RFs for the 1000 simulated data sets from HDI and SRI were averaged. The first column lists the VIs' frequency classes; these are the same classes used in the Kolmogorov-Smirnov goodness-of-fit test in determining the estimated percentage of similar distributions of the imputed data. The second column gives the relative frequencies of the actual data, and the succeeding columns give those of the imputation methods.
Table 12: Distribution of the True Values and Imputed Values from the imputation procedures: 10% NRR
Table 13: Distribution of the True Values and Imputed Values from the imputation procedures: 20% NRR
Table 14: Distribution of the True Values and Imputed Values from the imputation procedures: 30% NRR
For the actual and imputed data with the lowest number of observations set to missing, the distortion of the distribution created by the OMI method is clearly illustrated. The OMI method assigns the mean of the first visit VI to all the missing cases; as a result, the distribution of the missing values, all replaced by a single value, concentrates in one frequency class. The three methods which implemented imputation classes gave a better outcome than OMI by spreading the distribution of the imputed data.
For the HDI method, at all nonresponse rates, most of the imputed observations clustered in the first frequency class, that is, less than 37859.5 for TOTEX2 and 40570 for TOTIN2. Clustering also formed in the last frequency class for TOTEX2 at the first and third nonresponse rates, and in the second frequency class for TOTIN2 at all nonresponse rates. The percentage of the data in the lowest class for TOTEX2 and TOTIN2 ranges from 14-16% across all nonresponse rates, compared to the actual percentage, which ranges only from 9-11%.
While there is an overrepresentation of the data there, an underrepresentation was observed in the interval 86103-126254.5 for the 10% and 20% nonresponse imputed data sets and in the interval 63265-101947 for the 30% nonresponse imputed data sets. The percentage in the indicated interval for the 10% and 20% cases totaled about 30% under the actual data, while the imputed data totaled less than 30%.
The two regression imputation methods, unlike hot deck and OMI which had major clusters, produced more spread-out distributions, although some areas are underrepresented. The failure to include a random residual term in deterministic regression resulted in a severe underrepresentation of the data, particularly in the first frequency class. On the other hand, the SRI, which included a random residual, provided better results than DRI. However, in some areas the added residual produced a significant excess, mostly in the last frequency class.
5.3
Choosing the Best Imputation Method
In this section, the rankings across all the tests are the basis for determining which IM is chosen as the best for this particular study and data. The selection of the best method is done independently for each VI and NRR. The rankings are based on a four-point system wherein a rank of 4 denotes the worst IM for a specific criterion and 1 denotes the best. In case of ties, average ranks are substituted. The IM with the smallest rank total is declared the best IM for the particular VI and NRR. The ranking of the IMs covers the following criteria: (a) nonresponse bias, (b) distribution of correct distributions, and (c) other measures of variability. All in all, there are five criteria on which each IM is ranked.
Tables 15, 16 and 17 show the rankings of the different imputation methods for the 10%, 20% and 30% NRRs respectively. Each table is divided into six columns: the first column represents the VI, the second the criterion, and the third up to the sixth columns the imputation methods.
Table 15: Ranking of the Different Imputation Methods: 10% NRR
Table 16: Ranking of the Different Imputation Methods: 20% NRR
Table 17: Ranking of the Different Imputation Methods: 30% NRR
The rankings show that the two regression imputation methods provided better results than their model-free counterparts. For all nonresponse rates under the TOTIN2 variable, the two regression methods tied as the best imputation method, and surprisingly the HDI finished as the worst imputation method, behind OMI. Under the TOTEX2 variable, mixed rankings were seen across the nonresponse rates, though the regression methods still performed well: the SRI method finished first at the 10% and 30% NRRs and third at the 20% NRR, while the DRI method finished third, first and second at the 10%, 20% and 30% NRRs respectively. While the HDI was the worst IM for TOTIN2, the OMI was the worst IM for TOTEX2, ranking last at both the 10% and 20% NRRs and third at the 30% NRR.
In conclusion, the best imputation method for this study, using the 1997 FIES data, is Stochastic Regression Imputation, followed very closely by Deterministic Regression Imputation. The SRI never ranked last on any criterion, NRR or VI, whereas the DRI was the worst IM on the nonresponse bias and mean deviation criteria. The researchers identified the HDI as the worst IM in this study: it rated poorly on the majority of criteria under each NRR and VI.
Chapter 6
Conclusion
This paper discussed a range of imputation methods for compensating for partial nonresponse in survey data and presented evidence on the advantages and disadvantages of each. It showed that when applying imputation procedures, it is important to consider the type of analysis and the type of point estimator of interest. Whether the goal is to produce unbiased and efficient estimates of means, totals, proportions and official aggregated statistics, or a complete data file that can be used for a variety of analyses by different users, the researcher should first clearly identify the type of analysis and the type of estimator that suits his or her purpose.
Anyone faced with having to make decisions about imputation procedures will usually have to choose some compromise between what is technically effective and what is operationally expedient. If resources are limited, this is a hard choice. This study aims to help future researchers in choosing the most appropriate imputation technique.
There are other issues to consider in determining which imputation method should be used under a particular assumption. Several practical issues concern ease of implementation, such as the difficulty of programming, the amount of time required and the complexity of the procedures used.
For our particular implementation, all of the methods were coded in a programming language because no available software could generate imputations for all the methods the researchers intended to use. Of all the methods, the overall mean imputation was the simplest and easiest to use and to program. The other three methods required the formation of imputation classes. The two regression imputations were the hardest to program and the most time-consuming.
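To illustrate the gap in implementation effort, the two simplest of the four methods can be sketched in Python. This is a minimal illustration, not the thesis' actual program: missing values are marked `None`, the record layout and function names are our own, and the hot deck draws donors with replacement within an imputation class, as the study describes:

```python
import random

def overall_mean_impute(values):
    """Overall mean imputation (OMI): replace each missing entry
    with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def hot_deck_impute(records, key, target, rng=random):
    """Hot deck imputation (HDI): within each imputation class
    (records sharing `key`), replace a missing `target` value with a
    donor value drawn at random with replacement."""
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(r[key], []).append(r[target])
    out = []
    for r in records:
        if r[target] is None:
            r = dict(r, **{target: rng.choice(donors[r[key]])})
        out.append(r)
    return out

print(overall_mean_impute([10, None, 20, None]))  # [10, 15.0, 20, 15.0]
```

The regression methods additionally require fitting and validating a model per imputation class, which accounts for most of the extra programming effort.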
The performance of several imputation methods in imputing partial nonresponse observations was compared using the 1997 Family Income and Expenditure Survey (FIES) data set. A set of criteria was computed for each method, based on the data set with imputed values and the data set with actual values, to find the best imputation method for this data set. The criteria for judging the best method were the bias and variance estimates of the population mean of the imputed data, the preservation of the distribution of the actual data, and the other measures of accuracy and precision incorporated from the study of Kalton (1983).
The results show that the choice of imputation method significantly affected the estimates of the actual data. The similarity between the two best methods, namely the deterministic and stochastic regression imputation methods, was due in part to the adequacy and predictive power of the models.
The bias and variance estimates of the population mean of the imputed data varied considerably across imputation methods, and it was unexpected that the hot deck imputation method yielded the highest estimates for the majority of the nonresponse rates and variables. Stochastic regression, on the other hand, was the best method on this criterion, delivering relatively small biases and variances in the majority of the tests.
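The bias-and-variance criterion above can be made concrete with a small sketch. This is our illustration of the comparison, not the thesis' code: the nonresponse bias of the mean is taken as the mean of the completed (imputed) data minus the mean of the fully observed data, and the variance is the usual sample variance:

```python
def nonresponse_bias(actual, imputed):
    """Bias of the mean under imputation: mean of the completed data
    minus the mean of the true, fully observed data."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(imputed) - mean(actual)

def sample_variance(xs):
    """Unbiased sample variance (divisor n - 1)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

# hypothetical example: one imputed value (6) overshoots the true value (4)
print(nonresponse_bias([1, 2, 3, 4], [1, 2, 3, 6]))  # 0.5
```

Computing these two quantities for each method, VI and NRR yields the entries that were subsequently ranked.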
The distributions of the imputed data under each method were checked for preservation of the distribution using the Kolmogorov-Smirnov goodness-of-fit test. Of the methods used in this study, both regression imputation methods retained the distribution of the data, especially the deterministic regression imputation, which generated exactly the same distribution as the actual data.
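The Kolmogorov-Smirnov comparison rests on one statistic: the largest vertical gap between the empirical distribution functions of the actual and imputed data. A minimal stdlib-only sketch (our own helper, not the thesis' program; in practice one would compare the statistic against a critical value or use a library routine for the p-value):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(s, x):
        # fraction of the sample that is <= x
        return bisect.bisect_right(s, x) / len(s)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

print(ks_statistic([1, 2, 3], [1, 2, 3]))  # 0.0 (identical distributions)
print(ks_statistic([1, 2], [3, 4]))        # 1.0 (completely separated)
```

A statistic near zero, as the deterministic regression imputation achieved, indicates that the imputed data preserve the actual distribution.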
In the other tests of accuracy and precision, namely the mean deviation, mean absolute deviation and root mean square deviation, the methods produced mixed results across all nonresponse rates. Some methods did not consistently or clearly yield good results; only half of the methods performed well on one particular criterion, the preservation of the distribution of the data. Elsewhere, inconsistency was evident in the frequently alternating rankings of the methods.
Given the criteria and procedures for judging the best imputation procedure among the four methods, the selection of the best method was difficult. Consequently, to determine the best method of imputing nonresponse observations for each variable in the study, the methods were ranked according to several criteria: a rank of 1 indicates the best imputation method on a criterion, while a rank of 4 indicates the worst.
After comparing the methods, the two regression methods, namely the deterministic and stochastic regression imputation methods, gave outstanding results, ranking first and second (or vice versa) on the majority of the criteria. The researchers concluded that the stochastic regression imputation procedure is the best imputation method for this study.
The efficiency of the method was supported by the coefficient of determination of the model and by the random residual added to the deterministic imputed value. Adding random residuals to the deterministic imputation helped make the estimates less biased than those of its deterministic counterpart.
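The relationship between the two regression methods can be sketched as follows. This is an illustrative simple-regression version, not the thesis' multi-predictor models: the deterministic imputation is the fitted value b0 + b1·x, and the stochastic version adds a residual, here drawn at random from the observed residuals (one common way to introduce the stochastic term):

```python
import random

def fit_simple_regression(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

def stochastic_regression_impute(x_obs, y_obs, x_missing, rng=random):
    """Deterministic prediction plus a randomly drawn observed residual."""
    b0, b1 = fit_simple_regression(x_obs, y_obs)
    residuals = [y - (b0 + b1 * x) for x, y in zip(x_obs, y_obs)]
    return [b0 + b1 * x + rng.choice(residuals) for x in x_missing]
```

Dropping the `rng.choice(residuals)` term recovers the deterministic regression imputation; the added residuals restore the natural scatter that purely fitted values would otherwise flatten out.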
The deterministic regression imputation method performed much better than the Hot Deck imputation method. It is surprising that the Hot Deck method was less efficient than deterministic regression, since in related studies it emerged as the better of the two. Most likely the selection of donors with replacement, rather than the imputation classes, caused its poor performance: if the imputation classes were the cause of its low ranking, then the estimates of both regression imputation methods would be as poor as those of the Hot Deck imputation, even with an adequate model.
Chapter 7
Recommendations for Further Research
In this study, we compared four imputation methods commonly used in dealing with partial nonresponse data under the assumption of MAR. However, other methods are currently being developed and improved. For example, multiple imputation involves independently imputing more than one value for each nonresponse value. Multiple imputation is an important and powerful form of imputation and has the advantage that variance estimation under imputation can be carried out comparatively easily (Kalton, 1983).
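The variance-estimation advantage of multiple imputation comes from its standard combining rules (Rubin's rules, which we sketch here as an illustration; the function name is our own): the m completed data sets each yield a point estimate and a within-imputation variance, and the spread between the m estimates supplies the between-imputation component:

```python
def combine_multiple_imputations(estimates, variances):
    """Rubin's combining rules: pool m point estimates and their
    within-imputation variances into one estimate and total variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    u_bar = sum(variances) / m                      # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total

# hypothetical: three completed data sets give estimates 10, 12, 11
print(combine_multiple_imputations([10, 12, 11], [1, 1, 1]))  # (11.0, 2.333...)
```

The between-imputation term b is what single-imputation methods cannot provide, which is why their naive variance estimates tend to understate the true uncertainty.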
Regarding variance estimation, further studies should implement the jackknife variance estimator, which is more often used in comparing the variance estimates of imputation methods. Rao and Shao (1992) proposed an adjusted jackknife variance estimator for use with imputation methods related to the Hot Deck procedure; this estimator is said to be asymptotically unbiased.
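For reference, the basic delete-one jackknife (on which the Rao-Shao adjusted estimator builds) can be sketched as follows. This is the standard unadjusted jackknife, written by us for illustration, not the Rao-Shao adjustment itself:

```python
def jackknife_variance(sample, estimator):
    """Delete-one jackknife variance: recompute the estimator with each
    observation left out, then scale the spread of those n estimates."""
    n = len(sample)
    theta_i = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    theta_bar = sum(theta_i) / n
    return (n - 1) / n * sum((t - theta_bar) ** 2 for t in theta_i)

# sanity check: for the sample mean, the jackknife variance equals s^2 / n
mean = lambda s: sum(s) / len(s)
print(jackknife_variance([1, 2, 3, 4], mean))  # 0.41666..., i.e. (5/3)/4
```

The Rao-Shao adjustment additionally corrects the imputed values in each delete-one replicate so that the estimator remains approximately unbiased under hot-deck-type imputation.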
Future researchers may test other methods on the same data set and compare the results with those presented in this paper. They could also compare the results of this study with those of multiple imputation and the Rao-Shao jackknife variance estimator. However, more advanced knowledge of statistics, including Bayesian statistics, is needed to use these procedures; the complexity of the methods, especially the two regression imputations, could hinder future researchers in using modern variance estimators.
It is also suggested that matching variables be selected using advanced modern statistical methods such as CHAID analysis. The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980; according to Ripley, 1996, the CHAID algorithm is a descendant of THAID, developed by Morgan and Messenger, 1973). CHAID builds non-binary trees (i.e., trees where more than two branches can attach to a single root or node) using a relatively simple algorithm that is particularly well suited to the analysis of larger datasets. Also, because the CHAID algorithm often effectively yields many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies (StatSoft, 2003).
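The core of CHAID's first step, choosing the predictor most associated with the response via a chi-square test, can be sketched in a simplified form. This is a rough illustration of the idea only (it compares raw chi-square statistics rather than adjusted p-values, and omits CHAID's category-merging step); the record layout and function names are our own:

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for a two-way contingency table."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    return sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(len(row)) for j in range(len(col)))

def best_split(records, predictors, response):
    """Pick the candidate matching variable most associated with the
    response: the one with the largest chi-square statistic."""
    def table(p):
        cats = sorted({r[p] for r in records})
        resp = sorted({r[response] for r in records})
        return [[sum(1 for r in records if r[p] == c and r[response] == y)
                 for y in resp] for c in cats]
    return max(predictors, key=lambda p: chi_square_stat(table(p)))
```

Applied to candidate matching variables against a response indicator, such a procedure would give a data-driven alternative to choosing imputation-class variables by judgment.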
In pursuing regression imputation, instead of creating a model for each imputation class, which can be time-consuming and frustrating since not all models yield the same result, dummy variables should be inserted into the model. These dummy variables represent the categories of the matching variables. This would save time and money, since only one model is created and tested.
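The dummy-variable encoding this recommendation relies on can be sketched as follows. This is our own minimal helper for illustration: each category of the matching variable becomes a 0/1 indicator column, with the first level dropped as the reference category so the columns can join the other predictors in a single regression model:

```python
def dummy_encode(categories):
    """Encode a categorical matching variable as 0/1 indicator columns,
    dropping the first (reference) level to avoid perfect collinearity
    with the intercept."""
    levels = sorted(set(categories))
    return [[1.0 if c == lev else 0.0 for lev in levels[1:]]
            for c in categories]

# e.g. an urban/rural matching variable becomes one indicator column
print(dummy_encode(['urban', 'rural', 'urban']))  # [[1.0], [0.0], [1.0]]
```

Appending these columns to the design matrix lets one fitted equation play the role of the separate per-class models, with the dummy coefficients capturing the between-class level shifts.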
These researchers strongly recommend using a statistical package that can generate imputations faster and more easily than custom programming, while producing less biased estimates. This would save the considerable research time otherwise spent debugging a computer program, and would prevent crashes due to memory overload.
Bibliography
[1] Cheng, J.H. and Sy, F. (1997). A Comparison of Several Techniques of Imputation on Clinical Data.
[2] Kalton, G. (1983). Compensating for Missing Survey Data. Michigan.
[3] Musil, C., Warner, C., Yobas, P.K. and Jones, S. (2002). A Comparison of Imputation Techniques for Handling Missing Data. Western Journal of Nursing Research, Vol. 24, No. 7, 815-829.
[4] Neter, J., Wasserman, W. and Kutner, M.H. Applied Linear Statistical Models, 2nd ed. Homewood, Illinois: Richard D. Irwin, Inc.
[5] Nordholt, E.S. (1998). Imputation: Methods, Simulation Experiments and Practical Examples. International Statistical Review, Vol. 66, No. 2, 157-180.
[6] Salvino, S. and Yu, A.C. (1996). Some Approaches in Dealing With Nonresponse in Survey Operations With Applications to the 1991 Marinduque Census of Agriculture and Fisheries Data.
[7] No author. CHAID Analysis. StatSoft Electronic Statistics Textbook [Electronic version]. Retrieved 29 July 2007, from http://www.statsoft.com/textbook/stchaid.html
[8] StatSoft, Inc. (2005). STATISTICA (data analysis software system), version 7.1. www.statsoft.com.
[9] Rao, J.N.K. and Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, Vol. 79, 811-822.
Appendix Appendix A Items and Information Gathered in the FIES 1997
Appendix B Source Codes of the Imputation Programs
Appendix C Model Validation of the Regression Equations used in the Regression Imputation Procedures