Chapter 7
Conclusion
This paper discussed a range of imputation methods that compensate for partial nonresponse in survey data and presented evidence on the advantages and disadvantages of each method. It showed that, when applying imputation procedures, it is important to consider the type of analysis and the type of point estimator of interest. Whether the goal is to produce unbiased and efficient estimates of means, totals, proportions and official aggregate statistics, or to produce a complete data file that different users can apply to a variety of analyses, the researcher should first clearly identify the type of analysis and the estimator of interest that suit that purpose.
Anyone who has to make decisions about imputation procedures will usually have to choose some compromise between what is technically effective and what is operationally expedient. If resources are limited, this is a hard choice. It is hoped that this study will help guide that choice.
Other issues must also be considered in determining which imputation method should be used under a particular assumption. Several of them are practical and concern the ease of implementation, such as the difficulty of programming, the amount of time required and the complexity of the procedures used.
For this particular implementation, all of the methods were coded in a programming language, because no available software could generate imputations for all the methods the researchers intended to use. Of all the methods, overall mean imputation was the simplest to use and the easiest to program. The other three methods required the formation of imputation classes. The two regression imputation methods were the hardest to program and the most time consuming.
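To illustrate the difference in programming effort, the following is a minimal sketch of overall mean imputation and of hot deck imputation within imputation classes. It is not the researchers' program: the function names, the `income` and `region` variables and the toy values are hypothetical, missing values are assumed to be coded as `None`, and donors are assumed to be drawn at random with replacement.

```python
import random
from statistics import mean

def overall_mean_impute(values):
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def hot_deck_impute(values, classes, rng):
    """Replace each missing value with a donor value drawn at random, with
    replacement, from the respondents in the same imputation class.
    Assumes every class has at least one respondent."""
    donors = {}
    for v, c in zip(values, classes):
        if v is not None:
            donors.setdefault(c, []).append(v)
    return [rng.choice(donors[c]) if v is None else v
            for v, c in zip(values, classes)]

# Hypothetical toy data: one survey variable and one class variable.
income = [100.0, None, 140.0, 300.0, None, 340.0]
region = ["A", "A", "A", "B", "B", "B"]

mean_filled = overall_mean_impute(income)
deck_filled = hot_deck_impute(income, region, random.Random(0))
```

Mean imputation needs no auxiliary information, whereas the hot deck version already requires a class variable and a donor pool for each class.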
The performance of several imputation methods in imputing observations with partial nonresponse was compared using the 1997 Family Income and Expenditure Survey (FIES) data set. A set of criteria was computed for each method from the data set with imputed values and the data set with actual values, in order to find the best imputation method for this data set. The criteria for judging the best method were the bias and variance estimates of the population mean of the imputed data, the preservation of the distribution of the actual data, and other measures of accuracy and precision adopted from the study of Kalton (1983).
The results show that the choice of imputation method significantly affected how well the imputed data reproduced the estimates from the actual data. The similarity between the two best methods, namely the deterministic and stochastic regression imputation methods, was due in part to the adequacy and predictive power of the models.
The bias and variance estimates of the population mean of the imputed data varied considerably across imputation methods. Unexpectedly, the hot deck imputation method produced the highest estimates at the majority of the nonresponse rates and for the majority of the variables. Stochastic regression, on the other hand, was the best method on this criterion, since it delivered relatively small biases and variances in the majority of the tests.
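One way such bias and variance figures could be obtained is sketched below. The sketch assumes that nonresponse is simulated by deleting values completely at random at a given rate, that the deletion is repeated over several replicates, and that the bias is the average imputed mean minus the actual mean. These are illustrative assumptions, not a description of the study's procedure; `bias_and_variance`, `mean_impute` and the data are hypothetical.

```python
import random
from statistics import mean, pvariance

def mean_impute(values):
    """Overall mean imputation, used here only as an example method."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def bias_and_variance(actual, impute, rate, replicates, rng):
    """Delete a share `rate` of the values at random, impute them, and
    compare the mean of the imputed data with the mean of the actual data."""
    true_mean = mean(actual)
    n_missing = round(rate * len(actual))
    estimates = []
    for _ in range(replicates):
        missing = set(rng.sample(range(len(actual)), n_missing))
        data = [None if i in missing else v for i, v in enumerate(actual)]
        estimates.append(mean(impute(data)))
    # Bias: average estimate minus actual mean; variance: spread of the estimates.
    return mean(estimates) - true_mean, pvariance(estimates)

# Hypothetical actual values with a 30% simulated nonresponse rate.
actual = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 18.0, 16.0, 10.0, 13.0]
bias, variance = bias_and_variance(actual, mean_impute, 0.3, 200, random.Random(1))
```

Any imputation function that takes a list with `None` entries and returns a completed list can be passed as `impute`, so the same routine can be reused across methods and nonresponse rates.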
The distribution of the imputed data from each method was checked against the distribution of the actual data using the Kolmogorov-Smirnov goodness-of-fit test. Of the methods used in this study, both regression imputation methods retained the distribution of the data, especially deterministic regression imputation, which generated exactly the same distribution as the actual data.
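The statistic behind this check can be sketched as follows. The sketch assumes the two-sample form of the test, in which the imputed data are compared directly with the actual data, and it computes only the test statistic, not a p-value. The function name and the toy values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical distribution functions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both step functions change only at observed values, so the largest
    # gap is attained at one of the pooled data points.
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)

# Hypothetical example: the two middle values were lost and mean-imputed.
actual = [100.0, 140.0, 300.0, 340.0]
mean_imputed = [100.0, 220.0, 220.0, 340.0]
d = ks_statistic(actual, mean_imputed)
```

A statistic of zero means the two empirical distributions coincide; larger values indicate that imputation has distorted the shape of the distribution, as happens when many imputed values pile up at a single point.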
In the other tests of accuracy and precision, namely the mean deviation, mean absolute deviation and root mean square deviation, the methods gave mixed results across all nonresponse rates. Some methods did not yield consistently and clearly good results. Only half of the methods performed very well on one particular criterion, the preservation of the distribution of the data. On the other criteria the results were inconsistent, with the rankings of the methods alternating frequently.
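For reference, the three measures can be computed as in the sketch below. It assumes each measure is taken over the imputed cases only, as a function of the differences between the imputed values and the actual values they replace. The function name and the numbers are hypothetical.

```python
from math import sqrt

def deviation_measures(imputed, actual):
    """Return (mean deviation, mean absolute deviation, root mean square
    deviation) of the imputed values from the actual values."""
    diffs = [i - a for i, a in zip(imputed, actual)]
    n = len(diffs)
    md = sum(diffs) / n                        # signed errors can cancel out
    mad = sum(abs(d) for d in diffs) / n       # average size of the error
    rmsd = sqrt(sum(d * d for d in diffs) / n) # penalizes large errors more
    return md, mad, rmsd

# Hypothetical imputed values and the actual values they stand in for.
md, mad, rmsd = deviation_measures([12.0, 8.0, 10.0], [10.0, 10.0, 10.0])
```

In this toy case the mean deviation is zero even though two of the three imputations are wrong, which is one reason the three measures can rank the same methods differently.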
Given the criteria and procedures for judging the best imputation procedure among the four methods, selecting the best method was difficult. Consequently, to determine the best method of imputing nonresponse observations for each variable in the study, the methods were ranked on several criteria. For each criterion, a method received a rank of 1 if it performed best and a rank of 4 if it performed worst.
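A ranking of this kind can be sketched as follows. The sketch assumes that a smaller value is better on every criterion, that ties do not occur, and that the ranks are combined by simple summation; the combination rule is an assumption, since only the per-criterion ranks are described here. All numbers are invented for illustration and are not results of the study.

```python
def rank_methods(scores):
    """scores maps each criterion to {method: value}, smaller being better.
    Returns the sum of the per-criterion ranks (1 = best) for each method."""
    totals = {}
    for by_method in scores.values():
        ordered = sorted(by_method, key=by_method.get)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return totals

# Invented values for four methods on three criteria.
scores = {
    "abs_bias": {"mean": 4.0, "hot_deck": 5.0, "det_reg": 2.0, "sto_reg": 1.0},
    "variance": {"mean": 3.0, "hot_deck": 6.0, "det_reg": 2.0, "sto_reg": 1.5},
    "ks_stat":  {"mean": 0.30, "hot_deck": 0.20, "det_reg": 0.05, "sto_reg": 0.10},
}
totals = rank_methods(scores)
best = min(totals, key=totals.get)  # method with the smallest rank sum
```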
After the methods were compared, the two regression methods, namely deterministic regression and stochastic regression imputation, gave the outstanding results. They ranked first and second, in one order or the other, on the majority of the criteria. The researchers concluded that stochastic regression imputation is the best imputation method for this study.
The efficiency of the imputation method was supported by the coefficient of determination of the model and by the random residual added to the deterministic imputed value. The random residuals added to the deterministic imputation made the estimates less biased than those of its deterministic counterpart.
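The relation between the two regression methods can be sketched with a simple linear model. The sketch assumes a single auxiliary variable `x` that is observed for every unit, a least-squares line fitted on the respondents, and a random residual drawn from the respondents' own residuals. Drawing from the observed residuals is only one common choice, and the names and data are hypothetical.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b1 * mx, b1

def regression_impute(x, y, rng=None):
    """Deterministic regression imputation when rng is None; stochastic
    regression imputation (prediction plus a random residual) otherwise."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs, ys = zip(*pairs)
    b0, b1 = fit_line(xs, ys)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in pairs]
    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = b0 + b1 * xi                # deterministic part
            if rng is not None:
                yi += rng.choice(residuals)  # added random residual
        completed.append(yi)
    return completed

# Hypothetical data: the third value of y is missing.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, None, 8.2, 9.8]
deterministic = regression_impute(x, y)
stochastic = regression_impute(x, y, random.Random(0))
```

The deterministic value lies exactly on the fitted line, while the stochastic value is moved off the line by a residual, which restores some of the variability that the fitted line alone removes.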
The deterministic regression imputation method performed much better than the hot deck imputation method. It is surprising that hot deck imputation was less efficient than deterministic regression, since in the related studies it emerged as the better of the two methods. Most likely the selection of donors with replacement, and not the imputation classes, was the cause of this poor performance. If the imputation classes were the cause, then the estimates from both regression imputation methods could be as poor as those from hot deck imputation even if the model is adequate.