Imputation Procedures for Partial Nonresponse: The Case of 1997 Family Income and Expenditure Survey (FIES)

A Thesis Presented to The Faculty of the Mathematics Department College of Science De La Salle University - Manila

In Partial Fulfillment of the Requirements for the Degree Bachelor of Science in Statistics Major in Actuarial Science

by Diana Camille B. Cortes James Edison T. Pangan

August 2007

Approval Sheet

The thesis entitled Imputation Procedures for Partial Nonresponse: The Case of 1997 FIES, submitted by Diana Camille B. Cortes and James Edison T. Pangan, upon the recommendation of their adviser, has been accepted and approved in partial fulfillment of the requirements for the degree of Bachelor of Science in Statistics Major in Actuarial Science.

ARTURO Y. PACIFICADOR JR., Ph.D.
Thesis Adviser

PANEL OF EXAMINERS

RECHEL G. ARCILLA, Ph.D.
Chairperson

IMELDA E. de MESA, M.O.S.
Member

MICHELE G. TAN, M.S.
Member

Date of Oral Defense: August 25, 2007

Acknowledgments

The researchers would like to extend their warmest gratitude to the following people, who have undoubtedly contributed to the success of this study:

• To Dr. Jun Pacificador Jr., for his supervision, suggestions and guidance throughout the duration of this thesis.
• To our panelists, Dr. Rechel Arcilla, Prof. Imelda de Mesa and Ms. Michele Tan, for helping us improve our thesis.
• To Dr. Ederlina Nocon, for providing us with the LaTeX software during THSMTH1.
• To our parents, especially Jed's mother, Prof. Erlinda Pangan, for constantly reminding the researchers about the thesis (i.e., "Tapos na ba ang thesis nyo?", that is, "Is your thesis done yet?").
• To Mark Nanquil and Norman Rodrigo, for helping us use LaTeX and for their unwavering support of our thesis.
• To our classmates and friends from COSCA, the La Salle Debate Society and the Math Circle, for their continuous encouragement and support.
• Lastly, to the Lord Almighty, for providing us the strength, patience, wisdom and determination to finish this thesis.

Table of Contents

Title Page
Approval Sheet
Acknowledgments
Table of Contents
Abstract

1 The Problem and Its Background
  1.1 Introduction
  1.2 Statement of the Problem
  1.3 Objectives of the Study
  1.4 Significance of the Study
  1.5 Scope and Limitations

2 Review of Related Literature

3 Conceptual Framework
  3.1 Definition of Terms
  3.2 Types of Nonresponse
  3.3 Patterns of Nonresponse
  3.4 Nonresponse Bias
  3.5 The Imputation Process
    3.5.1 Overall Mean Imputation
    3.5.2 Hot Deck Imputation
    3.5.3 Regression Imputation
      Deterministic Regression
      Stochastic Regression

4 Methodology
  4.1 Source of Data
    4.1.1 General Background
    4.1.2 Sampling Design and Coverage
    4.1.3 Survey Characteristics
    4.1.4 Survey Nonresponse
  4.2 The Simulation Method
  4.3 Formation of Imputation Classes
  4.4 Performing the Imputation Methods
    4.4.1 Overall Mean Imputation (OMI)
    4.4.2 Hot Deck Imputation (HDI)
    4.4.3 Deterministic and Stochastic Regression Imputation (DRI and SRI)
  4.5 Comparison of Imputation Methods
    4.5.1 The Bias of the Mean of the Imputed Data
    4.5.2 Comparing the Distributions of the Imputed vs. the Actual Data
    4.5.3 Other Measures in Assessing the Performance of the Imputation Methods
    4.5.4 Determining the Best Imputation Method

5 Results and Discussion
  5.1 Descriptive Statistics of Second Visit Data Variables
  5.2 Formation of Imputation Classes
    5.2.1 Mean of the Simulated Data by Nonresponse Rate for Each Variable of Interest
  5.3 Regression Model Adequacy
  5.4 Evaluation of the Imputation Methods
    5.4.1 Overall Mean Imputation
    5.4.2 Hot Deck Imputation
    5.4.3 Deterministic Regression Imputation
    5.4.4 Stochastic Regression Imputation
  5.5 Distribution of the True vs. Imputed Values
  5.6 Choosing the Best Imputation

6 Conclusion

7 Recommendations for Further Research

List of Tables

• Table 1: Imputed Values of GPA Using HDI
• Table 2: Descriptive Statistics of the 1997 FIES Second Visit
• Table 3: The Candidate MV PROV and Its Categories
• Table 4: The Candidate MV CODEP1 and Its Categories
• Table 5: The Candidate MV CODES1 and Its Categories
• Table 6: Chi-Square Test of Independence for the Matching Variable
• Table 7: Measures of Association for Matching Variable
• Table 8: Descriptive Statistics of the Data Grouped into Imputation Classes
• Table 9: Means of the Retained and Deleted Observations
• Table 10: Model Adequacy Results
• Table 11: Criteria Results for the OMI Method
• Table 12: Criteria Results for the HDI Method
• Table 13: Criteria Results for the DRI Method
• Table 14: Criteria Results for the SRI Method
• Table 15: Ranking of the Different Imputation Methods: 10% NRR
• Table 16: Ranking of the Different Imputation Methods: 20% NRR
• Table 17: Ranking of the Different Imputation Methods: 30% NRR


List of Figures

• Figure 1: Distribution of the Data Before and After Imputation
• Figure 2: Bar Chart for TOTIN2, 10% NRR
• Figure 3: Bar Chart for TOTIN2, 20% NRR
• Figure 4: Bar Chart for TOTIN2, 30% NRR
• Figure 5: Bar Chart for TOTEX2, 10% NRR
• Figure 6: Bar Chart for TOTEX2, 20% NRR
• Figure 7: Bar Chart for TOTEX2, 30% NRR

Abstract

Several imputation methods have been developed for imputing missing responses, and it is often not clear which method is best for a particular application. In choosing an imputation method, several factors should be considered, such as the types of estimates to be generated, the type and pattern of nonresponse, and the availability of auxiliary data that are highly correlated with the characteristic of interest or with the response propensity.

This study compared the effectiveness of four imputation methods, namely Overall Mean, Hot Deck, Deterministic Regression and Stochastic Regression Imputation, using the first visit variables as auxiliary variables. Values of the second visit variables Total Income and Total Expenditures (TOTIN2 and TOTEX2) were set to nonresponse to simulate partial nonresponse. The results of the study provide support for the following conclusions: (a) for the 1997 FIES data, the Hot Deck Imputation and Overall Mean Imputation methods are not appropriate for handling partial nonresponse; (b) Stochastic Regression Imputation was selected as the best imputation method; and (c) the imputation classes must be homogeneous to produce less biased estimates.

Chapter 1

The Problem and Its Background

1.1 Introduction

Missing data in sample surveys is inevitable. The problem of missing data occurs for various reasons, such as when the respondent has moved to another location, refuses to participate in the survey, or is unable to answer specific items in the survey. This failure to obtain responses from the units selected in the sample is called nonresponse. There are several types of nonresponse: (a) unit nonresponse refers to the failure to collect any data from a sample unit; (b) item nonresponse refers to the failure to collect valid responses to one or more items from a responding sample unit (e.g., in surveys with only one phase, or when a single phase is considered and the other phases ignored); and (c) partial nonresponse occurs when there is a failure to collect responses for a large set or block of items for a responding unit (e.g., in a two-phase survey, when the same respondent cannot answer in the second phase, so all items for that phase are missing).

The effect of nonresponse must not be ignored, since it leads to biased estimates which, if large, result in inaccurate conclusions. Bias due to nonresponse is believed to be a function of the nonresponse rate and the difference in characteristics between responding and nonresponding units: the larger the nonresponse rate, or the wider the difference in characteristics between the responding and nonresponding units, the larger the bias.

In practice, there are three ways of handling missing data: discarding the missing values, applying weighting adjustments, or using imputation methods. Discarding the missing values, otherwise known as the Available Case Method, excludes the nonresponse records when analyzing the variable of interest. The problem with this method is that it does not account for the difference in characteristics between the responding and nonresponding units; hence, methods for compensating for missing data are applied. Weighting adjustment is based on matching nonrespondents to respondents in terms of the data available on nonrespondents and increasing the weights of the matched respondents to account for the missing values; in practice, the respondents' weights are often multiplied by the inverse of the response rate. This is usually applied for unit nonresponse. Imputation, on the other hand, is used to account for nonresponse in the case of item and partial nonresponse: a missing value is replaced by a reasonable substitute for the missing information. Once nonresponse has been dealt with, whether by weighting adjustments or imputation, researchers can proceed with their data analysis.
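The weighting-adjustment idea above can be sketched numerically. Everything in this sketch (the two weighting classes, response rates and income distributions) is an invented illustration, not FIES data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey: 1,000 sampled households in two weighting
# classes (say urban = 0, rural = 1) with different mean incomes.
income = np.concatenate([rng.normal(50_000, 5_000, 500),
                         rng.normal(20_000, 3_000, 500)])
klass = np.repeat([0, 1], 500)

# Simulate unit nonresponse: the rural class responds less often.
responded = rng.random(1_000) < np.where(klass == 0, 0.9, 0.6)

# Weighting adjustment: within each class, multiply the respondents'
# base weights by the inverse of that class's response rate, so the
# matched respondents also stand in for the nonrespondents.
weights = np.ones(1_000)
for c in (0, 1):
    in_c = klass == c
    weights[in_c] /= responded[in_c].mean()

naive = income[responded].mean()        # ignores the nonresponse pattern
adjusted = np.average(income[responded], weights=weights[responded])
```

Because the high-income class responds more often, the naive respondent mean overstates the population mean, while the weighted estimate lands much closer to it.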

The Family Income and Expenditure Survey (FIES) is an example of a survey with more than one round of data collection. The FIES is a nationwide survey of households conducted every three years by the National Statistics Office (NSO), with two visits to the sample unit per survey period, in order to provide information on the country's income distribution, spending patterns and poverty incidence. Like any other survey, the FIES encounters the problem of missing data, particularly nonresponse during the second visit. Given the various contributions that this survey can provide, it is important to have precise estimates of the income and expenditure indicators.

With the 1997 FIES as its data set, this paper focuses on dealing with partial nonresponse through the use of imputation methods. It aims to examine the effects of imputed values in coming up with estimates for the missing data at various nonresponse rates. Furthermore, the study aims to determine which imputation method is appropriate for the FIES data by applying some of the methods examined in Compensating for Missing Survey Data (Kalton, 1983), a study of the 1978 Research Panel Survey of the Income Survey Development Program (ISDP).


1.2 Statement of the Problem

This paper attempts to answer the following questions:

1. Which imputation method is the most appropriate for the FIES data?

2. How do varying nonresponse rates affect the results of each imputation method?

1.3 Objectives of the Study

The paper aims to achieve the following objectives:

1. To compare the imputation methods, namely Overall Mean Imputation, Hot Deck Imputation, Deterministic Regression and Stochastic Regression, based on their efficiency and ability to recapture deleted values, by generating missing values in the 1997 FIES second visit data and imputing them using the first visit data of the same survey.

2. To investigate the effect of varying rates of missing observations, particularly 10%, 20% and 30% nonresponse rates, on the precision of the estimates.


1.4 Significance of the Study

Nonresponse is a common problem in conducting surveys. Its presence creates incomplete data, which can pose serious problems during data analysis, particularly in the generation of statistically reliable estimates. For this reason, imputation methods are used to account for the difference between respondents and nonrespondents, which helps reduce the nonresponse bias in the survey estimates.

Since most statistical packages require the use of complete data before conducting any procedure for data analysis, the use of imputation methods can ensure consistency of results across analyses, something that an incomplete data set cannot fully provide.

In a news article by Obanil (2006) entitled Topmost Floor of the NSO Building Gutted by Fire, posted at Manila Bulletin Online, it was reported that on October 3, 2006, around one million pesos' worth of documents were destroyed by fire. Among the documents lost were the first visit FIES questionnaires for the NCR, which at the time of the fire had not yet been encoded.

In terms of statistical research, most countries in the developed world, such as the United States, Canada, the UK and the Netherlands, already employ imputation methods in their national statistical offices. In a country such as the Philippines, where data collection is very difficult, especially in regions like the National Capital Region (NCR), imputation can ease the problems of data collection and nonresponse.

More importantly, given the great impact of this survey on the country, employing imputation methods provides statisticians with a way of handling nonresponse, which could lead to more meaningful generalizations about the country's income distribution, spending patterns and poverty incidence. With less biased estimates and more consistent results, this can help policymakers and economists craft better solutions for improving the lives of Filipinos.

1.5 Scope and Limitations

Throughout this paper, only the 1997 Family Income and Expenditure Survey (FIES) will be used to tackle the problem of nonresponse and to examine the impact of the different imputation methods applied to the dataset. With regard to the extent to which these imputation methods will be applied and evaluated, this paper covers only the partial nonresponse occurring in the National Capital Region (NCR), since the NCR is the region with the highest nonresponse rate. The variables to be imputed in this study are Total Income (TOTIN2) and Total Expenditures (TOTEX2) from the second visit of the FIES data.

The researchers will only focus on using the first visit data of the 1997 FIES to impute the partial nonresponse present in the second visit. This paper also assumes that the first visit data are complete and that the pattern of nonresponse follows the Missing Completely at Random (MCAR) case. MCAR holds if the probability of response to Y is unrelated to the value of Y or to any other variables, making the missing data randomly distributed across all cases (Musil et al., 2002). If the pattern of nonresponse does not satisfy the MCAR assumption, imputation methods may not achieve their purpose.
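Under MCAR, simulating nonresponse amounts to deleting values completely at random, with deletion unrelated to the variable itself. A minimal sketch, using an invented stand-in for a second-visit variable such as TOTIN2:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented stand-in for a positively skewed income variable.
totin2 = rng.lognormal(mean=11.0, sigma=0.6, size=500)

def make_mcar(y, rate, rng):
    """Delete a fraction `rate` of the values completely at random.
    The deleted positions are unrelated to y (or anything else),
    which is exactly what the MCAR assumption requires."""
    y = y.astype(float).copy()
    drop = rng.choice(y.size, size=round(rate * y.size), replace=False)
    y[drop] = np.nan
    return y

y10 = make_mcar(totin2, 0.10, rng)   # 10% nonresponse rate
y30 = make_mcar(totin2, 0.30, rng)   # 30% nonresponse rate
```

The same routine, run at 10%, 20% and 30%, produces the nonresponse rates studied in this thesis.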

As for the imputation methods, only four will be applied in this paper, namely: Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).

In evaluating the efficacy and appropriateness of the four imputation methods, the assessment will be limited to the following: (a) the bias of the mean of the imputed data, (b) a comparison of the distributions of the imputed vs. the actual data, and (c) the criteria mentioned in Compensating for Missing Survey Data (Kalton, 1983), namely the Mean Deviation, the Mean Absolute Deviation and the Root Mean Square Deviation.
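The three Kalton-style criteria can be written down compactly. The sketch below is an illustration, not code from the thesis, and uses the standard definitions of the three deviations:

```python
import numpy as np

def deviation_criteria(actual, imputed):
    """Criteria comparing imputed values with the true (deleted) values:
      Mean Deviation            MD   = mean(imputed - actual)
      Mean Absolute Deviation   MAD  = mean(|imputed - actual|)
      Root Mean Square Deviation RMSD = sqrt(mean((imputed - actual)^2))
    """
    d = np.asarray(imputed, float) - np.asarray(actual, float)
    return {"MD": d.mean(),
            "MAD": np.abs(d).mean(),
            "RMSD": np.sqrt((d ** 2).mean())}

# Toy check: overall mean imputation of three deleted values, where the
# respondent mean happens to equal 20.
actual = np.array([10.0, 20.0, 30.0])
imputed = np.full(3, 20.0)
crit = deviation_criteria(actual, imputed)
```

Note how MD can be zero (errors cancel) even when MAD and RMSD are large, which is why all three are reported together.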

Chapter 2

Review of Related Literature

Much research effort has been devoted to assessing the efficacy of various imputation methods. In the report entitled Compensating for Missing Survey Data, two simulation studies were carried out using data from the 1978 Income Survey Development Program (ISDP) Research Panel to compare several imputation methods. The first study compared imputation methods for the variable Hourly Rate of Pay, while the second dealt with the imputation of the variable Quarterly Earnings. For both studies, the author stratified the data into imputation classes, constructed data sets with missing values by randomly deleting some of the recorded values from the original dataset, and then applied the various imputation methods to fill in the missing values. This process was replicated ten times to ensure consistency of the results. Once the imputation methods had been applied, three measures for evaluating their effectiveness, namely the Mean Deviation, the Mean Absolute Deviation and the Root Mean Square Deviation, were obtained and averaged across the ten trials (Kalton, 1983).
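The delete-impute-replicate protocol described above can be sketched as follows. This is illustrative only: the data, the 20% deletion rate and the use of overall mean imputation are assumptions for the sketch, not details taken from Kalton's report:

```python
import numpy as np

rng = np.random.default_rng(2024)

# Invented respondent values standing in for a survey variable.
y = rng.lognormal(mean=10.0, sigma=0.5, size=1_000)

def mean_deviation(actual, imputed):
    return float(np.mean(imputed - actual))

# Repeatedly delete a random 20% of the values, impute them (overall
# mean imputation here, for brevity), score the imputations, and
# average the criterion over the replications.
scores = []
for _ in range(10):
    miss = rng.random(y.size) < 0.20
    imputed = np.full(miss.sum(), y[~miss].mean())  # overall mean imputation
    scores.append(mean_deviation(y[miss], imputed))

avg_md = np.mean(scores)
```

Averaging over replications smooths out the randomness of which values happen to be deleted, which is why Kalton reports criteria averaged across ten trials.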

For the first study, imputing the variable Hourly Rate of Pay, eight methods were used: Grand Mean Imputation (GM), Class Mean Imputation using eight imputation classes (CM8), Class Mean Imputation using ten imputation classes (CM10), Random Imputation within eight imputation classes (RC8), Random Imputation within ten imputation classes (RC10), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), and Multiple Regression Imputation plus a randomly chosen respondent residual (MR). Using the Mean Deviation criterion, the results showed that all mean deviations were negative, indicating that the imputed values underestimated the actual values, and that Grand Mean Imputation produced the greatest underestimation among the eight procedures. For the Mean Absolute Deviation and the Root Mean Square Deviation, which measure the ability to reconstruct the deleted values, Grand Mean Imputation fared the worst on both criteria. Multiple Regression Imputation (MI) obtained the best measures on the two criteria, and the procedures with a greater number of imputation classes (i.e., CM10 vs. CM8, RC10 vs. RC8) yielded slightly better results (Kalton, 1983).

For the second study, the imputation of Quarterly Earnings, ten procedures were used: Grand Mean Imputation (GM), Class Mean Imputation using eight imputation classes (CM8), Class Mean Imputation using twelve imputation classes (CM12), Random Imputation within eight imputation classes (RC8), Random Imputation within twelve imputation classes (RC12), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), Multiple Regression Imputation plus a randomly chosen respondent residual (MR), Mixed Deductive and Random Imputation using eight imputation classes (DI8), and Mixed Deductive and Random Imputation using twelve imputation classes (DI12). Using the first criterion, the Mean Deviation, the results showed that Grand Mean Imputation obtained a positive bias, implying that it is not an effective imputation method for this study. The regression imputation procedures had similar results, producing almost unbiased estimates, and the Class Mean Imputation methods (CM8 and CM12) had measures similar to those of the Random Imputation methods. Nevertheless, all methods produced relatively small mean deviations except for the last two. Comparing the Mean Absolute Deviations and Root Mean Square Deviations, Grand Mean Imputation obtained values similar to the regression procedures with residuals (MN and MR), while the RC8, RC12, MN and MR values were over one third larger than those of deterministic procedures such as CM8, CM12 and MI (Kalton, 1983).

To further investigate the relatively larger biases of the DI8 and DI12 procedures, the author divided the data into deductive and non-deductive cases. This shed further light on the Mean Deviations and Mean Absolute Deviations of the various imputation methods. It was found that the mean deviations were positive in the deductive case and negative in the non-deductive case for all of the procedures, which explains the relatively small deviations in the earlier results: the measures from the two cases tend to cancel out. It also showed that the DI8 and DI12 results are similar to those of RC8, RC12, CM8 and CM12 in the non-deductive cases but differ greatly in the deductive cases, which explains the larger DI8 and DI12 values in the earlier results (Kalton, 1983).

Taken together, the two studies showed that the imputation procedures tended to underestimate the Hourly Rate of Pay and, in the case of the Grand Mean, to overestimate the Quarterly Earnings. Moreover, mean imputation appeared to be the weakest imputation method in both studies, since it distorted the distribution of the original data. Lastly, Kalton's study showed the impact of increasing the number of imputation classes: doing so yields better values on the three criteria.

In contrast to Kalton's criteria for measuring the performance of imputation procedures, a paper entitled A Comparison of Imputation Techniques for Missing Data by C. Musil, C. Warner, P. Yobas and S. Jones (2002) presented a much simpler approach to evaluating imputation techniques: the means, standard deviations and correlation coefficients of the original data were compared with the statistics obtained from five methods, namely Listwise Deletion, Mean Imputation, Deterministic Regression, Stochastic Regression and the EM Method. The Expectation Maximization (EM) Method is an iterative procedure that generates missing values using expectation (E-step) and maximization (M-step) algorithms: the E-step calculates expected values based on all complete data points, while the M-step replaces the missing values with the E-step generated values and then recomputes new expected values.

Using the Center for Epidemiological Studies data on stress and health ratings of older adults, the authors imputed a single variable, the functional health rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects of each imputation method. Except for Listwise Deletion and Mean Imputation, the researchers used the SPSS Missing Value Analysis function for the Deterministic Regression, Stochastic Regression and EM methods. For the correlations, they obtained the correlation values of the imputed variable with the variables age, gender and self-assessed health rating, for the original data and for each of the five methods (Musil et al., 2002). The results show that all five methods underestimated the mean of the original data. The closest to the original was Stochastic Regression, followed very closely by the EM Method, Deterministic Regression, Listwise Deletion and Mean Imputation; the same ordering held for the standard deviations. For the correlations, however, the EM Method produced values closest to the original data, followed closely by Stochastic Regression, Deterministic Regression, Listwise Deletion and Mean Imputation. Hence, the findings suggest that Stochastic Regression and the EM Method performed better, while Mean Imputation was the least effective (Musil et al., 2002).
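The difference between deterministic and stochastic regression imputation discussed above can be sketched as follows. The data, the linear model and the 20% MCAR rate are invented for illustration; this is not the Musil et al. analysis:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: x observed for everyone (a first-visit analogue),
# y missing for some units (a second-visit analogue).
n = 400
x = rng.normal(10, 2, n)
y = 3.0 * x + rng.normal(0, 1.5, n)
miss = rng.random(n) < 0.20          # ~20% MCAR nonresponse in y

# Fit y ~ x on the respondents only (slope first in np.polyfit).
b1, b0 = np.polyfit(x[~miss], y[~miss], 1)

# Deterministic regression imputation (DRI): predicted value only.
y_dri = y.copy()
y_dri[miss] = b0 + b1 * x[miss]

# Stochastic regression imputation (SRI): add a residual drawn from
# the respondents' empirical residuals, restoring the spread that
# DRI flattens out.
resid = y[~miss] - (b0 + b1 * x[~miss])
y_sri = y.copy()
y_sri[miss] = b0 + b1 * x[miss] + rng.choice(resid, size=miss.sum())
```

DRI places every imputed value exactly on the regression line, which shrinks the variance of the completed data; SRI's added residual is what lets it reproduce the original distribution more faithfully.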

In another study, Imputation: Methods, Simulation Experiments and Practical Examples by Nordholt (1998), the author described two simulation experiments on the Hot Deck Method. The first focused on whether the Hot Deck Method performs better than leaving the records with nonresponse out of the data set when analyzing a variable, which is known as the Available Case Method. This was done by constructing a fictitious data set with four variables, two of which were used for the imputation. Nonresponse rates of 5%, 10% and 20% were applied, and the simulation process was replicated 50 times. The data set containing the missing values was first analyzed using the Available Case Method and then with Hot Deck Imputation. As in the methodology of Musil et al. (2002), descriptive statistics such as the mean, variance and correlation were computed, along with the absolute differences between the original data and the Available Case Method, and between the original data and the Hot Deck Method. Based on these criteria, the results show that Hot Deck performs better than the Available Case Method, although Hot Deck, while closer to the original data, tends to underestimate the values. The absolute differences were also observed to increase as the percentage of missing values increases.
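A minimal random hot deck within imputation classes, in the spirit of Nordholt's experiments, can be sketched as follows; the toy data and class structure are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

def hot_deck(y, classes, rng):
    """Random hot deck within imputation classes: each missing value is
    replaced by the value of a randomly chosen respondent (donor)
    drawn from the same imputation class."""
    y = y.copy()
    for c in np.unique(classes):
        in_c = classes == c
        miss = in_c & np.isnan(y)
        donors = y[in_c & ~np.isnan(y)]
        if miss.any() and donors.size:
            y[miss] = rng.choice(donors, size=miss.sum())
    return y

# Toy data: two imputation classes, one missing value in each.
y = np.array([1.0, 2.0, np.nan, 10.0, 11.0, np.nan])
classes = np.array([0, 0, 0, 1, 1, 1])
filled = hot_deck(y, classes, rng)
```

Because donors come only from the same class, each imputed value is a genuinely observed value from a similar unit, which is why hot deck preserves the shape of the distribution better than mean imputation.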

Nordholt's second simulation study focused on the effect of covariates, otherwise known as imputation classes, on the quality of Hot Deck Imputation. Using data from the Dutch Housing Demand Survey of Statistics Netherlands, the variable "value of the house" was chosen as the variable to be imputed because of its importance and the frequency of nonresponse in that variable. For this study, the observations under category 13 (value of at least 150,000) and category 22 (value of at least 300,000) were changed into missing values. The rationale for this choice was to ensure that the original values from these categories could not be used as replacements for the variable being imputed, since they were no longer in the file. Imputation classes were then created once the missing values had been identified. A table showing the number of respondents before and after imputation showed that in every category except 13 and 22, which were set to missing, the number of respondents increased after imputation. This showed that the remaining records had an equal probability of becoming a donor record and that not all imputations give values near categories 13 or 22. Nordholt also compared the Available Case Method and the Hot Deck Method on this real-life data; as in the first study, Hot Deck fared better than the Available Case Method (Nordholt, 1998).

Lastly, Nordholt addressed several questions regarding imputation. Using examples of how imputation is applied in real-life surveys such as the Dutch Housing Demand Survey, the European Community Household Panel Survey (ECHP) and the Dutch Structure of Earnings Survey, he outlined four criteria for deciding which variables to impute: the importance of a variable, the percentage of nonresponse, the predictability of missing values and the cost of imputation. He also noted the importance of estimating the duration of the imputation process, since studies need to be timely. The duration, according to Nordholt, depends on the number of variables to be imputed, the available capacity, the user-friendliness of the imputation package and the desired imputation quality. These issues must be settled before conducting any imputation process and choosing an appropriate imputation strategy (Nordholt, 1998).

Two undergraduate theses have conducted similar studies on imputation. The first, by Salvino and Yu (1996), assessed the efficiency of Mean Imputation versus the Hot Deck Imputation Technique by applying both to the 1991 Census on Agriculture and Fisheries (CAF) data. In their research, they generated an incomplete data set using the Gauss software for the imputed variables, which were the counts of cattle, hogs and chicken. To determine which method was better, the variances were compared; on this basis the Hot Deck Imputation Technique was judged better. The design effect was also considered by dividing the variance of the Hot Deck Imputation by that of the Mean Imputation; since the resulting ratio was less than one, they again concluded that the Hot Deck Imputation Technique was the better option.

Another undergraduate thesis, by Cheng and Sy (1999), focused on assessing imputation techniques on clinical data. Using data from DATA: A Collection of Problems from Many Fields for the Student and Research Worker, the authors employed four imputation methods, namely Mean Imputation, Hot Deck Imputation, Linear Regression and Multiple Linear Regression. Using the statistical package SAS, they performed diagnostic checking of the regression models and examined the R2 values of the different linear combinations of regressors to arrive at the regression equation used for the imputation. They then used MS FoxPro to create a tool for implementing their imputations. Once the results were obtained, they assessed the efficacy of the imputation techniques by looking at the accuracy and precision of the estimates: accuracy was measured by the percentage error, and the variance of these percentage errors was the basis for precision. The results show that Linear Regression was the best method, followed closely by Multiple Regression, then Hot Deck and finally Mean Imputation. It can be noted, however, that the criteria for determining the best imputation method in this study were not extensive, as only the percentage error and its variance were used, and the study did not explore the use of imputation classes to improve the accuracy and precision of the imputation methods.

Chapter 3

Conceptual Framework

3.1 Definition of Terms

Bias is defined as the difference between the expected value of an estimator and the true value of the parameter being estimated. The bias is expressed by

Bias(θ̂) = E[θ̂] − θ

where θ̂ is the estimator and θ is the true value of the parameter. The bias of an estimator can be positive, negative or zero. An estimator having nonzero bias is said to be a biased estimator.

Accuracy is the extent to which estimates are close to the value of the parameter. Precision is the extent to which estimates are close to one another. Efficiency is a measure of how well a method accomplishes its task with respect to a set of criteria. Nonresponse is the failure to collect a valid response for a particular unit.


3.2 Types of Nonresponse

The types of nonresponse are distinguished by the way in which observations come to be missing. Kalton (1983) stressed the importance of differentiating three types of nonresponse: unit (or total) nonresponse, item nonresponse, and partial nonresponse. Unit (or total) nonresponse takes place when no information was collected from a sampling unit. It has many causes, namely, failure to contact the respondent (not at home, moved, or the unit not being found), refusal to give information, inability of the unit to cooperate (due, say, to illness or a language barrier), or lost questionnaires.

Item nonresponse, on the other hand, happens when the information collected from a unit is incomplete because some of the questions were not answered. Its causes include refusal to answer a question for lack of the information needed, failure to make the effort required to retrieve the information from memory or from records, refusal to answer because the question is sensitive, failure of the interviewer to record an answer, or rejection of the response at an edit check on the grounds that it is inconsistent with other responses (including inconsistencies arising from coding or punching errors in the transfer of the response to the computer data file).

Lastly, partial nonresponse is the failure to collect large sets of items for a responding unit. A sampled unit may fail to provide responses in one or more waves of a panel survey, in later phases of a multi-phase data collection procedure (e.g., the second visit of the FIES), or for later items in a questionnaire after breaking off a telephone interview. Other causes include data being unavailable after all possible checking and follow-up, inconsistency of responses that fail to satisfy natural or reasonable constraints known as edits (in which case one or more items are designated as unacceptable and therefore are artificially missing), and causes similar to those given for unit (total) nonresponse. In this study, the researchers dealt with partial nonresponse occurring in the second visit of the 1997 FIES.

3.3 Patterns of Nonresponse

A critical issue in addressing the problem of nonresponse is identifying the pattern of nonresponse, because the pattern influences how missing data should be handled. There are three patterns of nonresponse, namely, Missing Completely At Random, Missing At Random, and Non Ignorable Nonresponse.

Missing data are said to be Missing Completely At Random (MCAR) if the probability of having a missing value for Y is unrelated to the value of Y itself or to any other variable in the data set. Data that are MCAR reflect the highest degree of randomness and show no underlying reasons for missing observations that could potentially lead to biased research findings (Musil et al., 2002). Hence, the missing data are randomly distributed across all cases, such that the occurrence of missing data is independent of the other variables in the data set. An example of the MCAR pattern is when a sample unit in the survey fails to provide an answer to the total monthly expenditure because the unit cannot be reached.

Another pattern of nonresponse is the Missing At Random (MAR) case. Missing data are considered MAR if the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis. This means that the likelihood of a case having incomplete information on a variable can be explained by other variables in the data set. An example of the MAR pattern is when a sampling unit fails to provide an answer to the total monthly expenditure because the household is headed by a male. The missing information about the total monthly expenditure then depends on the gender of the sampling unit and not on the total monthly expenditure itself.

Meanwhile, Non Ignorable Nonresponse (NIN) is regarded as the most problematic nonresponse pattern. When the probability of missing data on Y is related to the value of Y, and possibly to some other variable Z, even when other variables are controlled in the analysis, the case is termed NIN. NIN missing data have systematic, nonrandom factors underlying the occurrence of the missing values that are not apparent or otherwise measured. NIN missing data are the most problematic because of their effect on the generalizability of research findings; they may produce biased parameter estimates, such as means, standard deviations, correlation coefficients or regression coefficients (Musil et al., 2002). An example of the NIN pattern is when a sampling unit from the higher income groups fails to provide information even when the gender of the unit is controlled: continuing the example from the MAR pattern, the sampling unit also withheld an answer because he was a high income earner. This is NIN since the nonresponse also depends on the income group even after the gender of the unit is controlled (Musil et al., 2002).

These patterns are an important consideration before any imputation takes place. For an imputation procedure to work and achieve statistically acceptable and reliable estimates, the pattern of nonresponse must satisfy either the MCAR or the MAR assumption. For this study, the researchers created missing observations that satisfy the MCAR assumption.

3.4 Nonresponse Bias

In most surveys, there is a substantial risk that post-analysis results become invalid because of missing data. Missing data can be discarded, ignored, or substituted through some procedure. When data are deleted or ignored in generating estimates, nonresponse bias becomes a problem (Kalton, 1983). The effect of deleting missing data on the nonresponse bias is illustrated below.

Suppose the population is divided into two groups or strata: the first group consists of all units in the population for which measurements will be obtained (respondents), and the second group consists of those units for which no measurement will be obtained (nonrespondents).

To arrive at the proper estimation of the nonresponse bias, the following quantities are defined.

Let R be the number of respondents and M (M stands for missing) the number of nonrespondents in the population, with R + M = N. Assume that a simple random sample (SRS) with replacement of size n is drawn from the population; the numbers of respondents and nonrespondents in the sample are r and m, with r + m = n. Let R̄ = R/N and M̄ = M/N be the proportions of respondents and nonrespondents in the population, and let r̄ = r/n and m̄ = m/n be the response and nonresponse rates in the sample.

The population total and mean are given by Y = Yr + Ym = RȲr + MȲm and Ȳ = R̄Ȳr + M̄Ȳm, where Yr and Ȳr are the total and mean for the respondents and Ym and Ȳm are the same quantities for the nonrespondents. The corresponding sample quantities are y = yr + ym = rȳr + mȳm and ȳ = r̄ȳr + m̄ȳm (Kalton, 1983).

If no compensation is made for the nonresponse, the respondent sample mean ȳr is used to estimate Ȳ. Its bias is given by B(ȳr) = E[ȳr] − Ȳ. The expectation of ȳr can be obtained in two stages, first conditional on fixed r and then over different values of r, i.e., E[ȳr] = E1E2[ȳr], where E2 is the conditional expectation for fixed r and E1 is the expectation over different values of r.

The expectation of ȳr is given by

E[ȳr] = E1[E2[Σ yri / r]] = E1[Ȳr] = Ȳr.

Hence, the bias of ȳr is given by

B(ȳr) = Ȳr − Ȳ = M̄(Ȳr − Ȳm).

The equation above shows that ȳr is approximately unbiased for Ȳ if either the proportion of nonrespondents M̄ is small or the mean for the nonrespondents, Ȳm, is close to that of the respondents, Ȳr. Since the survey analyst usually has no direct empirical evidence on the magnitude of (Ȳr − Ȳm), the only situation in which he can be confident that the bias is small is when the nonresponse rate is low. In practice, however, even with moderate M̄, many survey results escape sizable biases because (Ȳr − Ȳm) fortunately is often not large (Kalton, 1983).
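The bias formula above can be checked with a small numerical illustration (the population figures below are hypothetical, not from the FIES):

```python
# Hypothetical two-stratum population: respondents and nonrespondents.
R, M = 800, 200                    # counts of respondents and nonrespondents
N = R + M
ybar_r, ybar_m = 5000.0, 8000.0    # group means (nonrespondents earn more)

M_bar = M / N                      # proportion of nonrespondents, M-bar
ybar = (R / N) * ybar_r + M_bar * ybar_m   # true population mean

# Bias of the respondent mean: B(ybar_r) = M_bar * (ybar_r - ybar_m)
bias = M_bar * (ybar_r - ybar_m)

# The identity ybar_r - ybar = M_bar * (ybar_r - ybar_m) holds exactly
assert abs((ybar_r - ybar) - bias) < 1e-9
print(ybar, bias)                  # 5600.0 -600.0
```

Here the respondent mean understates the population mean by 600 because the nonrespondents, who earn more on average, make up 20% of the population.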

There are many procedures that can be applied to reduce the nonresponse bias caused by missing data, and one of them is imputation. In this study, imputation methods are applied to eliminate nonresponse and reduce the bias of the estimates. Imputation is briefly defined as the substitution of values for the nonresponse observations.


3.5 The Imputation Process

Imputation is one of the many procedures that can be used to deal with nonresponse in order to generate less biased results. It is the process of replacing a missing value, through available statistical and mathematical techniques, with a value that is considered a reasonable substitute for the missing information (Kalton, 1983).

Imputation has certain advantages. First, utilizing imputation methods helps reduce biases in survey estimates. Second, imputation makes analysis easier and the results simpler to present: imputation does not make use of complex algorithms to estimate the population parameters in the presence of missing data, so much processing time is saved. Lastly, using imputation methods can ensure consistency of results across analyses, a feature that an incomplete data set cannot fully provide.

On the other hand, imputation also has several disadvantages. There is no guarantee that the results obtained after applying imputation methods will be less biased than those based on the incomplete data set; the usefulness of imputation therefore depends on the suitability of the assumptions built into the imputation procedures used. Even if the biases of univariate statistics are reduced, there is no assurance that the distribution of the data and the relationships between variables will be preserved. More importantly, imputation is a fabrication of data, and many naive researchers falsely treat the imputed data as if it were a straightforward complete sample of size n.


There are four imputation methods (IMs) applied in this study, namely, the Overall (Grand) Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI). For most imputation methods, imputation classes need to be defined before the method is performed.

Imputation classes are stratification classes that divide the data into groups before imputation takes place. The formation of imputation classes is most useful when the classes are homogeneous groups, that is, units with similar characteristics that have some propensity to provide the same response. The variables used to define imputation classes are called matching variables. The values substituted for the nonresponse observations are taken from a group of responding observations on the variable; these records are called donors. The records whose missing observations are to be substituted are called recipients.

Problems might arise if imputation classes are not formed with caution. One issue is the number of imputation classes, which must be set deliberately for each method. The larger the number of imputation classes, the greater the chance of having few observations in a class, which can inflate the variance of the estimates within that class. On the other hand, the smaller the number of imputation classes, the more observations each class tends to contain, making the estimates burdened with aggregation bias.


3.5.1 Overall Mean Imputation

The mean imputation method is the process by which missing data are imputed by the mean of the available units of the imputation class to which they belong (Cheng and Sy, 1999). One type of this method is the Overall Mean Imputation (OMI), one of the most widely used methods for imputing missing data. The OMI method simply replaces each missing value with the overall mean of the available (responding) units in the same population. The overall mean is given by

ȳomi = (Σ_{i=1}^{r} yri) / r = ȳr

where ȳomi is the mean of the entire sample of responding units of the y variable and yri is the i-th responding observation on y.

There are many advantages and disadvantages of this method. The advantage of using this method is its universality. This means that it can be applied to any data set. Moreover, this method does not require the use of imputation classes. Without imputation classes, the method becomes easier to use and results are generated faster.

Figure 1: Distribution of the Data Before and After Imputation

However, this method has serious disadvantages. Since missing values are imputed by a single value, the distribution of the data becomes distorted (Figure 1): it becomes too peaked, making it unsuitable for many post-imputation analyses. Second, the method produces large biases and variances because it does not allow variability in the imputed values. Much of the related literature states that this is the least effective method; thus, this method is never recommended for use.


3.5.2 Hot Deck Imputation

One of the most popular and widely known methods is the Hot Deck Imputation (HDI) method, in which missing observations are imputed by choosing a value from the set of available units. The value is either selected at random (traditional hot deck), chosen in some deterministic way with or without replacement (deterministic hot deck), or chosen based on a measure of distance (nearest-neighbor hot deck). To perform this method, let Y be the variable that contains missing data and X a set of variables with no missing data. The missing data are imputed as follows:

1. Find a set of categorical X variables that are highly associated with Y. The selected X variables will be the matching variables in the imputation.

2. Form a contingency table based on the X variables.

3. If there are cases missing within a particular cell of the table, select a case from the set of available units of the Y variable and impute the chosen Y value to the missing value.

A donor and the recipient of its value must have similar or exactly the same characteristics. Cheng and Sy (1999) stated that the HDI method gives estimates that reflect the actual data more accurately when imputation classes are used. If the matching variables are closely associated with the variable being imputed, the nonresponse bias should be reduced.

Example 1: Suppose that a survey is conducted with a sample of ten people, three of whom refused to provide their Grade Point Average (GPA) for the previous term. The missing answer of each nonrespondent is replaced by a known value from a responding unit with similar characteristics, such as sex, degree or course (Course), Dean's Lister (DL), honor student in high school (HS2), and hours of study classes (HSC). Suppose the set of X matching variables highly associated with GPA consists of the variables DL and HS2. Table 1 shows the data with imputed values; values in parentheses are the imputed values that were randomly chosen within their respective imputation classes.

Table 1: Imputed Values of GPA using HDI

Like OMI, this method has certain advantages. One major attraction, cited by Kazemi (2005), is that the imputed values are all actual values. More importantly, the shape of the distribution is preserved: since imputation classes are introduced, the chance of distorting the distribution decreases.

On the other hand, it also has a set of disadvantages. First, in order to form the imputation classes, all X variables must be categorical. Second, as the nonresponse rate increases, so does the possibility of generating a distorted data set: if the same donor observations are used repeatedly for the missing values, the shape of the distribution can become distorted. Third, the number of imputation classes must be limited to ensure that every missing value has a donor in its class.

3.5.3 Regression Imputation

Like the mean imputation and HDI methods, this procedure is among the most widely used imputation methods. Imputing missing values via least-squares regression is known as the Regression Imputation (RI) method. There are many ways of building a regression for imputing the missing observations. The y variable for which imputations are needed is regressed on the auxiliary variables (x1, x2, x3, ..., xp) for the units providing a response on y. These auxiliary variables may be quantitative or qualitative, the latter being incorporated into the regression model by means of dummy variables. There are two basic types of the RI method: (a) Deterministic Regression Imputation and (b) Stochastic Regression Imputation.

In comparing the accuracy and efficiency of RI methods, it is helpful if the methods being compared use the same imputation classes.

Deterministic Regression

The use of the predicted value from the model, given the values of the auxiliary variables (which contain no missing data), for a record with a missing response on the variable y is called Deterministic Regression Imputation (DRI). This method can be seen as a generalization of the mean imputation method. The model for DRI is given by

ŷk = β̂0 + Σi β̂i xik

where ŷk is the predicted value for the k-th nonresponding unit to be imputed, β̂0 and the β̂i are the parameter estimates, and xik is an auxiliary variable, either quantitative or a dummy variable, for the k-th nonresponding unit.

There are advantages and disadvantages to using DRI. DRI has the potential to produce imputed values close to the missing observations; for the predicted value to be near the actual value, a high R2 is needed. However, the method is a time-consuming operation, and it is often unrealistic to consider applying it to all the items with missing values in a survey.

Using DRI can also underestimate the variance of the estimates, and it can distort the distribution of the data. One major disadvantage of this method is that it can produce out-of-range or unfeasible values (e.g., predicting a negative age).

Stochastic Regression

The predicted value from the deterministic regression model has undesirable distributional properties similar to those of the mean imputation method. To compensate, an estimated residual is added to the predicted value. The use of the predicted value plus some type of randomly chosen estimated residual is called the Stochastic Regression Imputation (SRI) method. The model for SRI is given by

ŷk = β̂0 + Σi β̂i xik + êk

where ŷk is the predicted value for the k-th nonresponding unit to be imputed, β̂0 and the β̂i are the parameter estimates, xik is an auxiliary variable, either quantitative or a dummy variable, for the k-th nonresponding unit, and êk is the randomly chosen residual for the k-th nonresponding unit.

There are various ways in which this could be done, depending on the assumptions made about the residuals. Some possibilities are the following:

1. Assume that the errors are homoscedastic and normally distributed, N(0, σe²). Then σe² could be estimated by the residual variance from the regression, se², and the residual for a recipient could be chosen at random from N(0, se²).

2. Assume that the errors are heteroscedastic and normally distributed, with σej² being the residual variance in some group j. Estimate σej² by sej², and choose a residual for a recipient in group j from N(0, sej²).

3. Assume that the residuals all come from the same, unspecified, distribution. Then estimate yk by ŷk + êk, where êk is the estimated residual for a randomly chosen donor.

4. The assumption in (3) accepts the linearity and additivity of the model. If there are doubts about these assumptions, it may be better to take not a randomly chosen donor but one close to the recipient in terms of its x-values (Kalton, 1983). In the limit, if a donor with the same set of x-values is found, this procedure reduces to assigning that donor's y-value to the recipient.

There are advantages and disadvantages to using SRI. Like DRI, this method can produce imputed values close to the missing observations if the model has a high R2. The method is likewise a time-consuming operation, and it is often unrealistic to consider applying it to all the items with missing values in a survey. SRI can also produce out-of-range values beyond those of the predicted value alone: it is possible that after the residual is added to a feasible deterministic imputation, an unfeasible value results.

Chapter 4

Methodology

4.1 Source of Data

The purpose of this section is to give an overview of the data used in this study, the 1997 Family Income and Expenditure Survey (FIES).

4.1.1 General Background

The 1997 FIES is a nationwide survey, with two visits per survey period to the same households, conducted by the National Statistics Office (NSO) every three years. The objectives of the survey are as follows:

1. to gather data on family income and family living expenditures and related information affecting income and expenditure levels and patterns in the Philippines;

2. to determine the sources of income and income distribution, levels of living and spending patterns, and the degree of inequality among families;

3. to provide benchmark information to update weights in the estimation of the consumer price index; and

4. to provide information for the estimation of the country's poverty threshold and incidence.

4.1.2 Sampling Design and Coverage

The sampling design of the 1997 FIES is a stratified multi-stage sampling design consisting of 3,416 Primary Sampling Units (PSUs) for the provincial estimates; the PSUs in the 1997 FIES are barangays. A subsample of 2,247 PSUs comprises the master sample for the regional-level estimates (NSO, 1997-2005).

This multi-stage sampling design involved three stages: first, the selection of sample barangays; second, the selection of sample enumeration areas, which are subdivisions of barangays; and third, the selection of sample households. The sampling frame and stratification for the three stages were based on the 1995 Census of Population (POPCEN) and the 1990 Census of Population and Housing (CPH). Through this method, a sample of 41,000 households participated in the survey (NSO, 1997-2005).

4.1.3 Survey Characteristics

The 1997 FIES questionnaire contains about 800 data items, with questions asked by the interviewer of the respondent of the selected sample household. A respondent is defined as the household head, the person who manages the finances of the family, or any member of the family who can give reliable information for the questionnaire (NSO, 1997-2005).

The items or variables gathered in the 1997 FIES are listed in Appendix A.

4.1.4 Survey Nonresponse

Two types of nonresponse occurred in the 1997 FIES. The first type, which resulted from factors such as being unaware of the question, being unwilling to provide an answer, or omission of the question during the interview, is item nonresponse. This type of nonresponse amounted to only 2.1% of the total number of respondents (NSO, 1997-2005).

The other type of nonresponse, due to households being temporarily away, on vacation, not at home, demolished, or transferred to another residence during the second visit, is partial nonresponse. This type of nonresponse amounted to 3.6% of the total number of respondents (NSO, 1997-2005).

The NSO devised only deductive imputation for the problem of item nonresponse, while no specific method was mentioned to compensate for the partial nonresponse (NSO, 1997-2005).

Hence, the researchers focused on the comparison of imputation procedures for partial nonresponse. The researchers chose the regional data set to which the imputation methods would be applied: the National Capital Region (NCR), noted as the region with the highest nonresponse rate. The data consist of 4,130 households with 39 categorical variables, the rest being continuous variables pertaining to the income and expenditures of the respondents. As the variables to be imputed, the researchers chose the second visit Total Income (TOTIN2) and Total Expenditure (TOTEX2). These variables were selected because of their importance to the FIES and the frequency of missing values in these observations.

4.2 The Simulation Method

To investigate and make an empirical comparison of the statistical properties of the estimates with imputed values under the selected imputation methods, a data set with missing observations was simulated. The simulation creates an artificial data set with missing observations indicating which values will be imputed.

The algorithm for the simulation procedure is as follows:

1. To get the number of observations to be set to missing for each nonresponse rate, the total number of observations in the complete 1997 FIES data set, 4,130, was multiplied by the indicated nonresponse rate. The nonresponse rates used for this study were 10%, 20% and 30%. The rationale for setting different nonresponse rates is that the study aims to investigate the effect of varying nonresponse rates on each imputation method.

2. Each observation from the matrix of random numbers was assigned to both observations of the 1997 FIES second visit variables TOTIN2 and TOTEX2. This was done to satisfy the assumptions that the data exhibit partial nonresponse and that the missing observations follow the Missing Completely At Random (MCAR) nonresponse pattern.

3. The second visit observations for both variables were sorted in ascending order of their corresponding random numbers.

4. The first 10% of the sorted second visit data for both variables were selected and set to missing. The same procedure was applied for the data sets with 20% and 30% nonresponse rates, respectively.

5. The missing observations were flagged, to distinguish the imputed from the actual values during the data analysis.

This simulation method was implemented with the Decimal Basic program SIMULATION.BAS (Appendix B), where the files Simulated Values for Income (SIMI) and Simulated Values for Expenditure (SIME), matrices containing the missing observations for income and expenditure, were stored for use in the application of the imputation methods.
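The thesis implements this procedure in Decimal Basic (SIMULATION.BAS); the same steps — attach a random number to each record, sort, and flag the first fraction as missing — can be sketched as follows (the function name and the toy data are illustrative, not taken from the FIES files):

```python
import random

def simulate_mcar(values, rate, seed=12345):
    """Set a fraction `rate` of observations to missing, MCAR-style:
    pair each record with a uniform random number, sort by it, and
    mark the first n*rate records as missing (None), returning flags."""
    rng = random.Random(seed)
    keyed = [(rng.random(), i) for i in range(len(values))]
    keyed.sort()                                   # ascending random order
    n_missing = round(len(values) * rate)
    missing_idx = {i for _, i in keyed[:n_missing]}
    data = [None if i in missing_idx else v for i, v in enumerate(values)]
    flags = [i in missing_idx for i in range(len(values))]
    return data, flags

totin2 = [float(v) for v in range(1, 101)]         # toy second-visit incomes
data, flags = simulate_mcar(totin2, rate=0.10)
print(sum(flags))                                   # 10 records set to missing
```

Because the random keys are independent of the values themselves, which records go missing is unrelated to income or expenditure, which is exactly the MCAR assumption the study requires.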


4.3 Formation of Imputation Classes

Imputation classes are stratification classes that divide the data in order to produce groups that have similar characteristics. Assuming that the units that have the same characteristics have the propensity to give the same response, the formation of imputation classes would help reduce the bias of the estimates.

The steps undertaken in the formation of the imputation classes are as follows:

1. The researchers identified the potential matching variables, the candidate variables that could have an association with the variables of interest (i.e., TOTEX2 and TOTIN2).

2. A categorical variable from the first visit data must fit the following criteria in order to be selected as a candidate variable: first, the variable must be known; second, it must be easy to measure; and lastly, the probability of missing observations for the variable must be small.

3. For variables with many categories, the researchers reduced the number of categories, since too many categories can increase heterogeneity and the bias of the estimates. This was done with the Recode function of the software Statistica.

4. Measures of association were tested on the matching variables. The Chi-Square Test for Independence was applied first, to determine whether each candidate variable is a significant factor for the variables of interest.

5. Other measures for evaluating the degree of association of the matching variables with the variables of interest followed, namely the Phi coefficient, Cramer's V and the contingency coefficient. The candidate variable with the greatest degree of association was chosen as the matching variable that groups the data into their respective imputation classes. All these tests were performed using the statistical packages Statistica and SPSS, and the results are presented in the next chapter.
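The thesis ran these tests in Statistica and SPSS; an equivalent computation of the chi-square statistic and Cramer's V for one candidate variable can be sketched as follows (the contingency table is purely illustrative, not an FIES cross-tabulation):

```python
import math

def chi_square_and_cramers_v(table):
    """Chi-square statistic and Cramer's V for an r x c contingency table,
    given as a list of rows of observed counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))       # smaller table dimension
    v = math.sqrt(chi2 / (n * (k - 1)))      # Cramer's V, in [0, 1]
    return chi2, v

# Hypothetical 2x2 table: candidate variable vs. a categorized TOTIN2
table = [[30, 10],
         [10, 30]]
chi2, v = chi_square_and_cramers_v(table)
print(round(chi2, 2), round(v, 3))           # 20.0 0.5 for this table
```

The candidate with the largest Cramer's V (or other association measure) across variables of interest would then define the imputation classes, mirroring step 5 above.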


4.4 Performing the Imputation Methods

4.4.1 Overall Mean Imputation (OMI)

The Overall Mean Imputation (OMI) is an imputation procedure in which the missing observations are replaced with the mean of the available units of the variable. As noted in the Conceptual Framework, this imputation method does not require the formation of imputation classes, which makes it the simplest of the four methods in this study. The procedure for the Overall Mean Imputation is as follows:

1. The overall mean of the first visit variables of interest, TOTIN1 and TOTEX1, was computed using the formula

ȳomi = (Σ_{i=1}^{r} yri) / r

where ȳomi is the overall mean for the first visit variable of interest (TOTEX1 or TOTIN1), yri is an observation of that variable, and r is the total number of responding units for it.

2. Using the generated nonresponse data sets, the missing observations of the second visit variables TOTEX2 and TOTIN2 were replaced with the overall means of the first visit TOTEX1 and TOTIN1.

The Overall Mean Imputation was implemented through the Decimal Basic program OMI.BAS (Appendix B).
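The thesis implements OMI in Decimal Basic (OMI.BAS); the two steps can be sketched in Python as follows (the function name and toy values are illustrative, not FIES records):

```python
def overall_mean_imputation(first_visit, second_visit):
    """Replace missing second-visit values (None) with the overall mean
    of the responding first-visit values, as in the OMI procedure."""
    responding = [v for v in first_visit if v is not None]
    y_bar = sum(responding) / len(responding)      # overall mean, y-bar
    return [y_bar if v is None else v for v in second_visit]

totin1 = [100.0, 200.0, 300.0, 400.0]              # toy first-visit incomes
totin2 = [110.0, None, 290.0, None]                # toy second visit, 2 missing
print(overall_mean_imputation(totin1, totin2))     # every None -> 250.0
```

Note that every recipient receives the same value, which is precisely why OMI peaks the distribution as described in Section 3.5.1.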


4.4.2 Hot Deck Imputation (HDI)

The Hot Deck Imputation (HDI) is an imputation procedure in which the missing observations are replaced by values chosen from the set of available units.

The steps undertaken in applying the Hot Deck Imputation are as follows:

1. The donor and recipient records for each imputation class and variable were first identified.

2. The missing observations of the second visit TOTIN2 and TOTEX2 were assigned to their respective recipient records for each imputation class, while the first visit TOTIN1 and TOTEX1 observations were placed in their respective donor records for each imputation class.

3. The values substituted for the missing observations were randomly chosen from the donor record of each imputation class.

The Hot Deck Imputation was implemented through the Decimal Basic program HOT DECK.BAS (Appendix B).
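A sketch of random within-class hot deck imputation, assuming each record already carries an imputation-class label (the class labels, function name and toy values below are hypothetical, not from HOT DECK.BAS):

```python
import random

def hot_deck_impute(records, seed=2007):
    """records: list of (imputation_class, value), value None if missing.
    Each missing value receives a randomly chosen donor value from the
    responding units of the same imputation class."""
    rng = random.Random(seed)
    donors = {}
    for cls, value in records:
        if value is not None:
            donors.setdefault(cls, []).append(value)   # build donor pools
    return [(cls, rng.choice(donors[cls]) if value is None else value)
            for cls, value in records]

records = [("A", 10.0), ("A", 12.0), ("A", None),
           ("B", 50.0), ("B", None)]
imputed = hot_deck_impute(records)
print(imputed)   # every None replaced by an actual value from its own class
```

Because donors are drawn only from the same class, every imputed value is a real observed value from a unit with matching characteristics, which is the property of HDI emphasized in Section 3.5.2.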


4.4.3 Deterministic and Stochastic Regression Imputation (DRI and SRI)

Deterministic Regression Imputation (DRI) is a procedure that involves the generation of a least squares regression equation in which Y is regressed on the auxiliary variables (x_1, x_2, ..., x_p) in order to predict the missing values. Stochastic Regression Imputation (SRI) employs the same procedure, but adds a random error term \hat{e} to the predicted value when generating imputed values for the missing data. The steps employed for the regression imputation are as follows:

1. A logarithmic transformation was applied to the first visit variables of interest, TOTEX1 and TOTIN1, as well as to the second visit variables of interest, TOTEX2 and TOTIN2. The rationale for this transformation is that the income and expenditure variables are not normally distributed; moreover, logarithmic transformations help correct the non-linearity of the regression equation.

2. The regression equation was formed after the transformation. For this study, only one predictor variable was used, and the general formula for the regression equation is

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{e}_i

where \hat{y} is the predicted observation for the second visit variable TOTIN2 or TOTEX2, \hat{\beta}_0 and \hat{\beta}_1 are the parameter estimates, x is the first visit variable, and \hat{e}_i is the random residual term. Note that for DRI, \hat{e}_i = 0.

3. For the stochastic regression, which involves the computation of the error term, the following steps were made:

(a) A frequency distribution of the residuals was created. This involved the following steps:
    i. The residuals were grouped into class intervals, and the frequency of each interval was obtained.
    ii. The relative frequencies and relative cumulative frequencies were computed.
(b) The class means of the frequency distribution were used to obtain the error terms for the regression equation.

4. Diagnostic checking requires the fitted model to satisfy the following assumptions: (a) linearity, (b) normality of the error terms, (c) independence of the error terms, and (d) constancy of variance. The results of the diagnostic checking of each regression equation used in this study are presented in Appendix C.

5. The missing observations were replaced by the predicted values from the corresponding regression equation.


4.5 Comparison of Imputation Methods

4.5.1 The Bias of the Mean of the Imputed Data

The primary objective of using imputation methods is to generate statistically reliable estimates. To check whether the imputation methods produce reliable estimates, and to determine the effect of the varying nonresponse rates on their performance, the first of the three criteria, the bias of the sample mean, was measured.

To compute the bias of the mean of the imputed data, the following procedures were implemented:

1. The mean of the imputed data, \bar{y}', was computed. For the Hot Deck and Stochastic Regression Imputations, the means of the 1,000 simulated data sets were averaged.

2. The mean of the actual data, \bar{y}, was computed.

3. The bias of the mean of the imputed data was computed as the difference between (1) and (2).

The results of this section are presented in the next chapter.
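As a minimal sketch of these three steps (with a hypothetical bias_of_mean helper, not the program actually used):

```python
import numpy as np

def bias_of_mean(imputed_sets, actual):
    """Bias of the mean of the imputed data: the average of the
    imputed-data means (over the simulated sets) minus the actual mean."""
    imputed_mean = np.mean([np.mean(s) for s in imputed_sets])
    return imputed_mean - np.mean(actual)
```

For OMI and DRI there is only one imputed data set, so the list contains a single element; for HDI and SRI it holds the 1,000 simulated sets.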


4.5.2 Comparing the Distributions of the Imputed vs. the Actual Data

In order to determine which imputation method maintained the distribution of the actual data, a goodness-of-fit test was utilized. For this study, the researchers chose the Kolmogorov-Smirnov (K-S) test, a goodness-of-fit test concerned with the degree of agreement between the distribution of a set of sampled (observed) values and some specified theoretical distribution (Siegel, 1988). In this study, the researchers were concerned with how the imputation methods affected the distribution of the 1997 FIES data.

The following steps were made for the Kolmogorov-Smirnov test:

1. Income and expenditure deciles were created, based on the second visit actual 1997 FIES data.

2. The obtained deciles were used as the upper bounds of the frequency classes.

3. A Frequency Distribution Table (FDT) for each trial was created. For this part, the researchers used the SPSS aggregate function to generate the FDT.

4. The FDT includes the Relative Cumulative Frequency (RCF) for both the imputed and the actual distribution. RCFs are computed by dividing the cumulative frequency by the total number of observations.

5. The absolute value of the difference between the actual-data RCF and the imputed-data RCF was computed using Microsoft Excel.

6. The test statistic for the Kolmogorov-Smirnov test, the maximum deviation D, was determined using the formula

D = \max |RCF_{imputed} - RCF_{actual}|

7. Since this is a large-sample case, at a 0.05 level of significance the critical value is computed as

\frac{1.36}{\sqrt{N}}, \quad N = 4{,}130

8. If D is less than the critical value, the conclusion is that the imputed data maintain the distribution of the actual data.

To provide additional information on the distribution of the imputed vs. the actual data, the frequency distributions of the actual (deleted) and the imputed values were also compared. This was done in order to show the effect of the imputed values on the distribution of the data set.
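The decision rule in steps 6-8 can be sketched as (a hypothetical ks_decision helper):

```python
import math

def ks_decision(rcf_imputed, rcf_actual, n, coeff=1.36):
    """Kolmogorov-Smirnov check as used in the study: compare the maximum
    RCF deviation D with the large-sample 0.05 critical value 1.36/sqrt(N)."""
    d = max(abs(a - b) for a, b in zip(rcf_imputed, rcf_actual))
    critical = coeff / math.sqrt(n)
    # True means the imputed data maintain the actual distribution
    return d, d < critical
```

With N = 4,130 the critical value is about 0.0212, so even small distortions of the RCFs lead to rejection.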

In performing this comparison, the following steps were made:

1. Income and expenditure deciles were created; the same deciles used in the previous test were used here.

2. The obtained deciles were used as the upper bounds of the frequency classes.

3. A Frequency Distribution Table (FDT) for both the imputed and the actual values was generated.

4. For Hot Deck and Stochastic Regression, which had 1,000 simulated sets, the Relative Frequencies (RF) for each frequency class were averaged over the 1,000 RFs.

5. To illustrate how the imputation methods reconstructed or distorted the actual (deleted) values, bar charts were created for each nonresponse rate and variable of interest.

The results of this test are presented in the next chapter.
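The FDT construction in steps 1-4 can be sketched as follows (a minimal illustration with a hypothetical relative_frequencies helper, standing in for the SPSS aggregate step):

```python
import numpy as np

def relative_frequencies(values, decile_bounds):
    """Build the relative-frequency column of an FDT, using the
    income/expenditure deciles as the upper bounds of the classes."""
    edges = [-np.inf] + list(decile_bounds) + [np.inf]
    counts, _ = np.histogram(values, bins=edges)
    return counts / len(values)
```

Averaging these RF vectors over the 1,000 simulated sets (for HDI and SRI) gives the bars plotted in Figures 2 through 7.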

4.5.3 Other Measures in Assessing the Performance of the Imputation Methods

Lastly, the researchers adopted measures used by Kalton (1983) in his report entitled Compensating for Missing Data for evaluating the effectiveness of imputation methods. These measures are: (a) Mean Deviation (MD), (b) Mean Absolute Deviation (MAD) and (c) Root Mean Square Deviation (RMSD).

The Mean Deviation (MD) measures the bias of the imputed values. It is given by the formula

MD = \frac{\sum_{i=1}^{m} (\hat{y}_{mi} - y_{mi})}{m}

where \hat{y}_{mi} is the imputed value for the variable TOTEX2 or TOTIN2 and y_{mi} is the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.

According to Kalton (1983), the Mean Absolute Deviation (MAD) is a criterion for measuring how closely the imputed values reconstruct the actual values that were set to missing. It is given by the formula

MAD = \frac{\sum_{i=1}^{m} |\hat{y}_{mi} - y_{mi}|}{m}

where \hat{y}_{mi} is the imputed value for the variable TOTEX2 or TOTIN2, and y_{mi} is the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.

The Root Mean Square Deviation (RMSD) is the square root of the mean of the squared deviations between the imputed and actual observations. Like the MAD, it measures the closeness with which the deleted values are reconstructed. It is expressed as

RMSD = \sqrt{\frac{\sum_{i=1}^{m} (\hat{y}_{mi} - y_{mi})^2}{m}}

where \hat{y}_{mi} is the imputed value for the variable TOTEX2 or TOTIN2, and y_{mi} is the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.
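The three criteria can be computed together, as in this sketch (a hypothetical deviation_criteria helper; the study computed them in Decimal BASIC):

```python
import math

def deviation_criteria(imputed, actual):
    """Kalton's (1983) criteria over the m cases set to missing:
    Mean Deviation, Mean Absolute Deviation, Root Mean Square Deviation."""
    m = len(actual)
    diffs = [y_hat - y for y_hat, y in zip(imputed, actual)]
    md = sum(diffs) / m
    mad = sum(abs(d) for d in diffs) / m
    rmsd = math.sqrt(sum(d * d for d in diffs) / m)
    return md, mad, rmsd
```

MD can cancel positive and negative errors, which is why MAD and RMSD are needed to measure closeness rather than bias.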

These three criteria for measuring the performance of the imputation methods were implemented using the Decimal BASIC program. After each imputation method is performed, the program computes the Mean Deviation, Mean Absolute Deviation, and Root Mean Square Deviation, and the results are saved in the corresponding Criteria for Expenditure (CRITEX) and Criteria for Income (CRITIN) files.


4.5.4 Determining the Best Imputation Method

To answer the primary objective of this study, which is determining the most appropriate imputation technique for the 1997 FIES, the researchers ranked the four imputation methods based on the criteria discussed in the previous sections. The selection of the best method was done independently for each variable of interest and nonresponse rate. The ranking of the imputation methods covered the following: the Bias of the Mean of the Imputed Data; the Estimated Percentage of Correct Distribution of the Imputed Data (PCD), which refers to the proportion, out of the total number of simulated data sets, of imputed data sets that reconstructed the distribution of the actual data set; the Mean Deviation (MD); the Mean Absolute Deviation (MAD); and the Root Mean Square Deviation (RMSD).

The ranking procedure is as follows:

1. Under each criterion mentioned above, the imputation methods were ranked on a scale of 1 to 4, with 1 indicating the best imputation method and 4 the worst.

2. For each variable of interest (i.e. TOTEX2, TOTIN2), the rankings obtained by a particular imputation method across all criteria were added.

3. The imputation method with the lowest total was considered the best imputation method for the respective variable of interest and nonresponse rate.

The results of the ranking procedure are presented in the next chapter.
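The ranking procedure can be sketched as follows (a hypothetical best_method helper with illustrative rank values, not the study's actual rankings):

```python
def best_method(rank_table):
    """Sum each imputation method's ranks across all criteria
    (1 = best, 4 = worst); the lowest total wins."""
    totals = {method: sum(ranks) for method, ranks in rank_table.items()}
    best = min(totals, key=totals.get)
    return best, totals
```

For example, with hypothetical ranks over five criteria, the method holding rank 1 everywhere would total 5 and be selected.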

Chapter 5

Results and Discussion

5.1 Descriptive Statistics of Second Visit Data Variables

Table 2 shows the descriptive statistics of the second visit variables of interest (VI), TOTEX2 and TOTIN2. These were computed to give a brief picture of how much a household spends and earns over a period of time, to measure the differences in the statistics between the two variables, and to compare the results with other tests later on.

Table 2: Descriptive Statistics of the 1997 FIES Second Visit

Variable    Mean        Std. Dev     Min        Max         N
TOTEX2      102,389.8   129,866.6    8,926.00   3,903,978   4,130
TOTIN2      134,119.4   216,934.9    9,067.00   4,357,180   4,130

The average total spending of a household in the National Capital Region (NCR) is about Php 102,389.80, while the average total earnings amounted to Php 134,119.40, a difference of more than thirty thousand pesos. It can be noted that TOTIN2 has a larger mean and standard deviation than TOTEX2. The dispersion can also be seen just by looking at the minimum and maximum of the two variables.

5.2 Formation of Imputation Classes

Tables 3, 4 and 5 show the candidate matching variables along with their respective categories and scope. The candidate MVs that were tested are the provincial area codes (PROV), the recoded education status (CODES1), and the recoded total employed household members (CODEP1). The candidate PROV has four categories and is the only matching variable that was not recoded. The other candidates, CODEP1 (the recoded total employed household members) and CODES1 (the recoded education status), were reduced to a smaller number of groups, since the original numbers of categories for these two candidate MVs were 7 and 99, respectively. As mentioned in the previous chapters, the number of categories was reduced into smaller groups to minimize the heterogeneity and the bias of the estimates.

Table 3: The Candidate MV PROV and its Categories

Table 4: The Candidate MV CODEP1 and its Categories

Table 5: The Candidate MV CODES1 and its Categories

Table 6 shows the results of the Chi-Square Test of Independence performed to determine whether the candidate matching variables (MVs) are associated with the VIs. As stated in the methodology, the MV must be highly correlated with the variables of interest. The first visit VIs were used as the variables tested for association, rather than the second visit VIs, since the second visit VIs already contained missing data.

Table 6: Chi-Square Test of Independence for the Matching Variable

The Chi-Square test of association between the candidates and the variables of interest showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and CODEX1. The p-values for all the candidates were less than 0.0001, indicating that the associations are highly significant. The succeeding measures of association determine which of the three candidates is chosen as the MV of the study.

Table 7 shows the other measures of association, namely the Phi Coefficient, Cramér's V and the Contingency Coefficient. These measures were computed in order to assess the degree of association of the candidates with CODIN1 and CODEX1.

Table 7: Measures of Association for Matching Variable

The measures of association showed small degrees of association with the variables CODIN1 and CODEX1. This kind of result is expected in real complex data, given the larger variability among the observations. Table 7 clearly shows that CODES1 is the MV exhibiting the largest association with the variables, and is therefore the MV that can best ensure that the ICs are homogeneous. Thus, CODES1 was chosen as the MV for this data.
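For illustration, the three measures in Table 7 can be computed from a two-way contingency table as sketched below (a hypothetical association_measures helper, not the actual computation used in the study):

```python
import numpy as np

def association_measures(table):
    """Chi-square statistic plus Phi, Cramer's V and the Contingency
    Coefficient for a two-way contingency table of observed counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    phi = np.sqrt(chi2 / n)
    k = min(table.shape) - 1
    cramers_v = np.sqrt(chi2 / (n * k))
    contingency = np.sqrt(chi2 / (chi2 + n))
    return chi2, phi, cramers_v, contingency
```

All three measures are increasing functions of the chi-square statistic, which is why the candidate with the largest chi-square also tops Table 7.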

To provide a detailed description of the CODES1 imputation classes, the descriptive statistics for each imputation class were obtained. Table 8 shows the descriptive statistics of each imputation class of the data. These statistics indicate whether the best MV decreases the variability of the observations. In checking the variability of each imputation class, the standard deviation was used and compared with the overall standard deviation of the variables of interest.

Table 8: Descriptive Statistics of the Data Grouped into Imputation Classes.

The table above indicates that IC1 is the imputation class with the smallest standard deviation of the three ICs. IC2 and IC3 produced large standard deviations; however, these are offset by the low value of IC1, which holds the largest proportion of the data. A possible reason why the standard deviation and the mean of IC3 are large is that the majority of the extreme values fall in that class.


5.2.1 Mean of the Simulated Data by Nonresponse Rate for Each Variable of Interest

Results in Table 9 show the means of both second visit VIs, TOTEX2 and TOTIN2, under all NRRs. These were generated to be used as inputs in the evaluation of the mean of the imputed data for each IM.

Table 9: Means of the Retained and Deleted Observations

The means of the observations set to nonresponse and of the observations retained showed contrasting results. For both variables, TOTEX2 and TOTIN2, as the nonresponse rate increases, the mean of the observations set to missing (deleted) also increases. Conversely, the mean of the observations retained decreases as the nonresponse rate increases. Perhaps the large values that were set to nonresponse increased the means of the data sets containing nonresponse at the varying rates. Hence, as the number of missing values increases, the deviation between the means of the actual and retained data slowly increases.


5.3 Regression Model Adequacy

Table 10 shows the different regression models for all VIs and nonresponse rates (NRRs) that were checked for adequacy. The columns are represented as follows: (a) VI, (b) the nonresponse rate (NRR), (c) IC, (d) the prediction model, (e) the coefficient of determination (R2 ) and (f) the F-statistic and its corresponding p-value in parenthesis.

For the notation used in Table 10, the codes IC1, IC2 and IC3 represent the first, second and third imputation classes, respectively. In the regression equations used for the regression imputation, \hat{y}_i represents the dependent variable, the predicted value of the second visit variable TOTIN2 or TOTEX2. Logarithmic transformations were utilized in order to correct the non-linearity of the regression equations. The code LNFVE1_i is the logarithmic transformation of the observation of the first visit variable Total Expenditure (TOTEX1) under IC1. Similarly, LNFVI1_i is the logarithmic transformation of the first visit observation of the variable Total Income (TOTIN1) under IC1. The same notation applies to LNFVE2_i and LNFVE3_i under IC2 and IC3 for the variable TOTEX1, and to LNFVI2_i and LNFVI3_i under IC2 and IC3 for the variable TOTIN1.

Table 10: Model Fitted Results

Table 10 shows the regression models used for the regression imputations under their respective VIs and ICs. Before these equations were used to impute missing values, diagnostic checking of the models, covering linearity, normality of error terms, independence of error terms and constancy of variance, was performed.

First, the researchers examined the coefficient of determination (R²) of each regression equation in order to determine the explanatory power of the first visit VI for the second visit VI. A large value of R² is a good indication of how well the model fits the data. The highest R² in Table 10 was 93.2%, for the model fitted under TOTEX2, IC3, with 30% NRR. The lowest coefficient of determination, 70.3%, was for the model fitted under TOTIN2, IC1, with 20% NRR. For all NRRs and VIs, the third IC generated the highest R² while the first IC produced the lowest.

Second, the models were checked for the assumption of linearity, using the ANOVA tables presented in Appendix C. The results of the diagnostic checking showed that all models satisfied the linearity assumption; the p-values for all the models were less than 0.0001, indicating that the linearity of the models is highly significant.

Third, the models were checked for the assumption of normality. For this study, the researchers examined the Normal Probability Plot (NPP) of each regression model, which can be found in Appendix C. The normal probability plots of all models moderately follow an S-shaped pattern, which indicates that the residuals are not normal but rather lognormal. However, the shape of the NPP improved after the logarithmic transformation was applied. Since the data are complex survey data, the models were used even though the assumption of normal residuals was not perfectly achieved.

Fourth, in testing for the assumption of independence of the error terms, the Durbin-Watson test was implemented. Results in Appendix C show that all of the models satisfy the assumption of independence.
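The Durbin-Watson statistic behind this check can be sketched as follows (a minimal illustration with a hypothetical durbin_watson function; the full test compares the statistic against tabulated critical values, with results in Appendix C):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic on an ordered residual series.
    Values near 2 suggest independent errors; values near 0 or 4
    suggest positive or negative autocorrelation, respectively."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(r * r for r in residuals)
    return num / den
```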

Lastly, to check if the residuals satisfy homoscedasticity or the equality of variances, a scatter plot of the residuals against the predicted values was obtained. Results in Appendix C showed that there were no distinct patterns evident in the scatter plot. The logarithmic transformation resolved the problem of heteroscedasticity.

Hence, given this discussion, the results show that the assumptions for the diagnostic checking of the regression equations used for the regression imputations were satisfied.


5.4 Evaluation of the Imputation Methods

To determine the effect of the nonresponse rates on the results of each imputation method (IM), the different IMs were evaluated, and the results of each IM are discussed independently. For each IM, the discussion of results proceeds as follows: (1) the bias of the mean of the imputed data, (2) the distribution of the imputed data using the Kolmogorov-Smirnov goodness-of-fit test, and (3) other measures of variability: the mean deviation (MD), mean absolute deviation (MAD), and root mean square deviation (RMSD).

The tables of results contain the following columns: (a) variable of interest (VI), (b) nonresponse rate (NRR), (c) the bias of the mean of the imputed data, Bias(\bar{y}'), (d) the percentage of correct distribution of the imputed data relative to the actual data set out of 1,000 trials (PCD), (e) MD, (f) MAD, and (g) RMSD.


5.4.1 Overall Mean Imputation

Table 11 shows the results of the different criteria in evaluating the imputed data using the overall mean imputation (OMI) method.

Table 11: Criteria Results for the OMI Method

1. Bias of the Mean of the Imputed Data. Column (c) of Table 11 shows that as the NRR increases, the bias for TOTEX2 slowly decreases in magnitude. The rationale is the decrease in magnitude of the respondents' mean as the NRR increases: as the respondents' mean decreases, the variability caused by imputing a single value (i.e. the mean of TOTEX1, the total expenditure of the first visit data, equal to 105,566.9) that is higher than the mean of the actual data set also decreases.

On the other hand, the results for TOTIN2 were the opposite of those for TOTEX2. The bias of the mean of the imputed data for TOTIN2 rapidly increases in magnitude as the NRR increases. The rationale is again the decrease in magnitude of the respondents' mean as the NRR increases; however, unlike in TOTEX2, the imputed value (i.e. the mean of TOTIN1, the total income of the first visit data, equal to 121,820.7) is much lower than the actual mean of the data set.

2. Distribution of the Imputed Data. Results in column (d) of Table 11 show that for all NRRs and VIs, the OMI method failed to maintain the distribution of the actual data. This was expected, primarily because every missing observation of a VI was replaced by a single value, the overall mean of the first visit VI. Related studies that performed OMI state that this method is one of the worst among all IMs since it distorts the distribution of the data: the distribution becomes too peaked, which makes the method unsuitable for many post-analyses (Cheng and Sy, 1999).

3. Other Measures of Variability. The three criteria in columns (e), (f) and (g) of Table 11 show the other measures of variability of the imputed data. The values of the MAD and RMSD increase in magnitude as the NRR increases for TOTEX2; the data with the highest percentage of imputed values have the highest values on the three measures of variability for TOTEX2. It is worth noting that a huge increase in magnitude is seen in all three criteria between the twenty and thirty percent NRRs for TOTEX2.

For TOTIN2, the data with twenty percent imputed observations have the highest values on all three measures of variability. Unlike for TOTEX2, surprisingly, the values of the three measures of variability under the highest NRR are the lowest.

5.4.2 Hot Deck Imputation

Table 12 shows the results of the different criteria in evaluating the imputed data using the Hot Deck Imputation (HDI) method with three imputation classes.

Table 12: Criteria Results for the HDI Method

1. Bias of the Mean of the Imputed Data. Similar to the results of the OMI method for the TOTIN2 variable, as the NRR increases, the bias of the mean of the imputed data rapidly increases. For the TOTEX2 variable, the biases fluctuated as the NRR increased. For both TOTEX2 and TOTIN2, the data with the highest NRR have the largest bias. For the TOTEX2 variable, the data with twenty percent NRR provided the least bias, while for TOTIN2 the data with the lowest NRR yielded the smallest bias.

2. Distribution of the Imputed Data. Results in column (d) show that for TOTIN2, the data containing ten and twenty percent imputed observations maintained the distribution of the actual data. For TOTEX2, only the data with ten percent imputed observations maintained the distribution of the actual data across all one thousand data sets; for the data with twenty percent imputations, 969 of the 1,000 data sets maintained the distribution of the actual data set.

For both TOTEX2 and TOTIN2, the data with the highest number of imputed observations failed to maintain the distribution of the actual data. Worse, none of the simulated data sets for TOTEX2 registered the same distribution as the actual data, while only a lone data set did so for TOTIN2. The researchers attribute this to the possibility that more than one recipient shared the same donor.

3. Other Measures of Variability. The three criteria in columns (e), (f) and (g) of Table 12 show the other measures of variability of the imputed data. For the variable TOTEX2, the following results were obtained: (i) the data containing twenty percent imputed values yielded the smallest values of the MD and RMSD, (ii) the data with the lowest number of imputations yielded the largest values of the MD and RMSD, and (iii) the MAD is the only criterion whose values increase as the NRR increases.

For the variable TOTIN2, the following results were obtained: (i) all three criteria increase as the NRR increases, (ii) the results for the three criteria were larger than those for TOTEX2, and (iii) the data with the largest number of imputations generated the highest values on the three criteria.


5.4.3 Deterministic Regression Imputation

Table 13 shows the results of the different criteria in evaluating the imputed data using the Deterministic Regression Imputation (DRI) method with three imputation classes.

Table 13: Criteria Results for the DRI Method

1. Bias of the Mean of the Imputed Data. Looking at column (c) of Table 13, the bias slowly increases in magnitude as the NRR increases for TOTEX2 and TOTIN2. Compared to OMI and HDI, where the bias increases tremendously with the NRR, the increase in bias for DRI is much slower: the bias of the data with twenty percent NRR is just twice that of the data set with ten percent NRR. For TOTEX2, however, this method produces a larger bias of the mean of the imputed data at all NRRs than OMI and HDI.

2. Distribution of the Imputed Data. Contrary to the results of the OMI method under this criterion, results in column (d) show that the imputed data maintained the distribution of the actual data at all NRRs and VIs. DRI even outperformed HDI, since all of the imputed data sets under all NRRs and VIs preserved the distribution of the actual data. It is interesting to note that the regression models used in this study did not show the results reported in the related literature. Earlier studies that made use of categorical auxiliary variables, i.e. matching variables transformed into dummy variables, concluded that DRI is just the same as mean imputation. However, in this study, the independent variable was the first visit VI, and for each imputation class the fitted model registered a good R².

3. Other Measures of Variability. The three criteria in columns (e), (f) and (g) of Table 13 show the other measures of variability of the imputed data. The following results were obtained: First, the results on the three criteria are almost stable as the NRR increases for TOTEX2 and TOTIN2; the rate of change of the MD, MAD and RMSD values is minimal compared to OMI and HDI. Second, the MAD and RMSD values are smaller than those of OMI and HDI for both TOTEX2 and TOTIN2. Fitting models with high R² was the key factor that made this method better than the two IMs previously evaluated.


5.4.4 Stochastic Regression Imputation

Table 14 shows the results of the different criteria in evaluating the imputed data using the Stochastic Regression Imputation (SRI) method with three imputation classes.

Table 14: Criteria Results for the SRI Method

1. Bias of the Mean of the Imputed Data. Looking at column (c) of Table 14, for TOTEX2 and TOTIN2, this method yielded much better results than DRI. The biases for TOTEX2 and TOTIN2 do not follow the pattern of the previous three methods, in which the bias increases with the NRR; instead, the biases fluctuate from one NRR to another. Compared to the three methods previously evaluated, this method provided the least bias at the highest NRR for both TOTEX2 and TOTIN2: while the other methods reached a four-digit bias, SRI generated only a three-digit bias. Moreover, there is a huge disparity at the third NRR, where SRI produced less than twenty percent of the bias produced by its deterministic counterpart.

2. Distribution of the Imputed Data. Looking at column (d) of Table 14, SRI showed better results than HDI, which also simulated the data 1,000 times. Unlike HDI, SRI maintained the same distribution for all imputed data sets at the first and third nonresponse rates, and it also outperformed HDI at the twenty percent NRR. In earlier studies, stochastic regression imputation performed better than any of the other three methods used here; the random residual added to the deterministic predicted value preserves the distribution of the data.

3. Other Measures of Variability. The three criteria in columns (e), (f) and (g) of Table 14 show the other measures of variability of the imputed data. The following results were obtained: First, similar to the results for the bias of the mean of the imputed data, the results for TOTIN2 on all the criteria fluctuate from one NRR to another. Second, for TOTEX2, only the RMSD criterion increases as the NRR increases, while the MAD and MD fluctuate from one NRR to another. Third, the data with the highest NRR yielded the lowest results on the MD criterion. Fourth, for TOTIN2, the data with twenty percent NRR yielded the largest values on the three criteria.


5.5 Distribution of the True vs. Imputed Values

To provide additional information on the distribution of the imputed data discussed previously, the distributions of the true (deleted) values (TVs) and the imputed values (IVs) from each of the IMs were obtained for all the VIs and NRRs. Figures 2, 3 and 4 show the bar graphs for the 10%, 20% and 30% NRRs for TOTIN2, respectively. For TOTEX2, Figures 5, 6 and 7 show the bar graphs for the 10%, 20% and 30% NRRs, respectively.

Figure 2: Bar Chart for TOTIN2, 10% NRR

Figure 3: Bar Chart for TOTIN2, 20% NRR

Figure 4: Bar Chart for TOTIN2, 30% NRR

Figure 5: Bar Chart for TOTEX2, 10% NRR

Figure 6: Bar Chart for TOTEX2, 20% NRR

Figure 7: Bar Chart for TOTEX2, 30% NRR

For the OMI method, the figures clearly illustrate the distortion of the distribution. Since the OMI method assigns the mean of the first visit VI to all the missing cases, all the imputed values concentrate in one particular frequency class. The three other methods, which implemented imputation classes, gave a better outcome than OMI by spreading the distribution of the imputed data.

For the HDI method, all the figures clearly show over-representation in the first frequency class, i.e. less than 37,859.5 for TOTEX2 and less than 40,570.0 for TOTIN2. Over-representation can also be seen in Figure 5 in the second frequency class, and in Figures 6 and 7 in the last frequency class.

While there is over-representation of the data for HDI, under-representation was also observed. Looking at Figures 2, 3 and 4, under-representation is seen in the seventh frequency class (128,000.0 - 161,669.0) for TOTIN2.


The two regression imputation methods, unlike HDI and OMI which had major clusters, produced more spread-out distributions, although some areas are under-represented. The failure to include a random residual term in deterministic regression resulted in a severe under-representation of the data, particularly in the first frequency class of all the figures. Looking at Figures 2, 3 and 4, under-representation can also be seen in the last frequency class for TOTIN2. SRI, which adds a random residual, provided better results than DRI; however, in some areas the added random term produced a significant excess, mostly in the last frequency class. This can be seen in Figures 5, 6 and 7 for TOTEX2 and in Figures 2 and 4 for TOTIN2.


5.6 Choosing the Best Imputation

In this section, the rankings across all the tests are the basis for determining which IM is the best for this particular study and data. The selection of the best method is carried out independently for each VI and NRR. The rankings use a four-point system in which a rank of 4 denotes the worst IM for a given criterion and a rank of 1 denotes the best. In case of ties, the average of the tied ranks is substituted. The IM with the smallest rank total is declared the best IM for the particular VI and NRR. The ranking covers the following criteria: (a) bias of the mean of the imputed data (N.B.), (b) percentage of correct distributions (PCD), and (c) other measures of variability, namely MD, MAD, and RMSD. In all, each IM is ranked on five criteria.
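The four-point ranking with average ranks for ties can be sketched as follows. The bias figures below are hypothetical, and smaller scores are taken to mean a better method on the criterion.

```python
def rank_with_ties(scores):
    """Rank scores 1 (best, smallest) to n (worst); tied scores
    receive the average of the ranks they span."""
    order = sorted(scores)
    return [
        sum(i + 1 for i, s in enumerate(order) if s == score)
        / order.count(score)
        for score in scores
    ]

# Hypothetical absolute biases for OMI, HDI, DRI, SRI on one criterion
biases = {"OMI": 0.042, "HDI": 0.051, "DRI": 0.013, "SRI": 0.013}
ranks = dict(zip(biases, rank_with_ties(list(biases.values()))))
# DRI and SRI tie for ranks 1 and 2, so each receives the average, 1.5
```

Summing such ranks over the five criteria gives the rank total used to pick the best IM per VI and NRR.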

Tables 15, 16, and 17 show the rankings of the different imputation methods for the 10%, 20%, and 30% NRR, respectively. For each NRR, the table lists, in order: (a) the VIs, (b) the criteria, and the ranks for (c) OMI, (d) HDI, (e) DRI, and (f) SRI.

Table 15: Ranking of the Different Imputation Methods: 10% NRR

Table 16: Ranking of the Different Imputation Methods: 20% NRR

Table 17: Ranking of the Different Imputation Methods: 30% NRR

Rankings show that the two regression IMs provided better results than their model-free counterparts. For all nonresponse rates under the TOTIN2 variable, the two regression imputation methods tied as the best IM, and surprisingly HDI finished as the worst IM, behind OMI. Under the TOTEX2 variable, mixed rankings were seen across the nonresponse rates, though the regression methods still provided good results. SRI finished first in the 10% and 30% NRR and third in the 20% NRR, while DRI finished third, first, and second in the 10%, 20%, and 30% NRR, respectively. While HDI was the worst IM for TOTIN2, OMI was the worst IM for TOTEX2, ranking last for both the 10% and 20% NRR and third for the 30% NRR.

In conclusion, the best imputation method for this study, using the 1997 FIES data, is Stochastic Regression Imputation, followed very closely by Deterministic Regression Imputation. SRI never ranked last on any criterion under any NRR or VI, unlike DRI, which ranked worst on the bias of the mean of the imputed data and on the MD criterion. The researchers selected HDI as the worst IM in this study: it rated poorly in the majority of the criteria under each NRR and VI.

Chapter 6

Conclusion

This paper discussed a range of imputation methods to compensate for partial nonresponse in survey data and provided empirical evidence of the advantages and disadvantages of each. It showed that when applying imputation procedures, it is important to consider the type of analysis and the type of point estimator of interest. Whether the goal is to produce unbiased and efficient estimates of means, totals, proportions, and official aggregated statistics, or a complete data file that can be used for a variety of analyses by different users, the researcher should first clearly identify the type of analysis that suits his or her purpose. In addition, several practical issues involving ease of implementation, such as the difficulty of programming, the amount of time required, and the complexity of the procedures used, must also be taken into consideration.

Anyone faced with having to make decisions about imputation procedures will usually have to choose some compromise between what is technically effective and what is operationally expedient. If resources are limited, this is a hard choice. This study aims to help future researchers in choosing the most appropriate imputation technique for the case of partial nonresponse.

For our particular implementation, all of the methods were coded in a programming language because no available software could generate imputations for all the methods needed for this study. Of all the methods, overall mean imputation was the easiest to program and use. The other three methods required the formation of imputation classes, and both regression imputations were the hardest to program and the most time-consuming.

The performance of several imputation methods in imputing partial nonresponse observations was compared using the 1997 Family Income and Expenditure Survey (FIES) data set. To find the best imputation method, a set of criteria was computed for each method from the data set with imputed values and the data set with the actual values. The criteria for judging the best method were the bias and variance estimates of the imputed data, the preservation of the distribution of the actual data, and the other measures of accuracy and precision adopted from the study of Kalton (1983).

The results show that the choice of imputation method significantly affected the estimates of the actual data. The similarities between the two best methods, the Deterministic and Stochastic Regression imputation methods, were due in part to the adequacy and predictive power of the models.

The bias and variance estimates of the imputed data varied considerably across imputation methods, and it was unexpected that the Hot Deck Imputation method produced the highest estimates for the majority of the nonresponse rates and variables. Stochastic Regression, on the other hand, was the best method on this criterion, since the majority of its results produced relatively small biases and variances.

The distributions of the imputed data from each method were checked for preservation of the distribution using the Kolmogorov-Smirnov goodness-of-fit test. Among the methods used in this study, both regression imputation methods retained the distribution of the data, especially Deterministic Regression Imputation, which generated exactly the same distribution as the actual data.
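As a sketch of the statistic behind this check, the function below (illustrative only; the study's actual computation may differ in details such as the critical values used) computes the two-sample Kolmogorov-Smirnov statistic: the largest absolute gap between the empirical distribution functions of two samples, such as the true and the imputed data.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D: the largest absolute
    gap between the two empirical distribution functions (ECDFs)."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))

    def ecdf(sorted_sample, x):
        # fraction of the sample at or below x
        return sum(1 for v in sorted_sample if v <= x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Identical samples give D = 0: the distribution is perfectly preserved
d_same = ks_statistic([1, 2, 3], [1, 2, 3])
# Shifted samples give a large D: the distribution is distorted
d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A small D (below the critical value for the chosen significance level) means the imputed data preserved the distribution of the actual data.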

In the other tests of accuracy and precision, namely the mean deviation, mean absolute deviation, and root mean square deviation, the methods provided mixed results across all nonresponse rates. Some methods did not consistently or clearly yield good results. Only half of the methods provided strong results on one particular criterion, the preservation of the distribution of the data; elsewhere, inconsistency was evident in the alternating rankings of the methods.
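The three deviation measures can be computed as below. The true and imputed values are hypothetical, and the formulas follow the standard definitions of these measures; whether they match the study's exact variants from Kalton (1983) is an assumption.

```python
import math

def deviation_measures(true_vals, imputed_vals):
    """Accuracy measures over the n imputed cases:
    MD   = mean deviation (signed; reveals the direction of bias)
    MAD  = mean absolute deviation
    RMSD = root mean square deviation (penalizes large errors)"""
    n = len(true_vals)
    diffs = [iv - tv for tv, iv in zip(true_vals, imputed_vals)]
    md = sum(diffs) / n
    mad = sum(abs(d) for d in diffs) / n
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)
    return md, mad, rmsd

# Hypothetical true (deleted) values and their imputations
md, mad, rmsd = deviation_measures([100, 200, 300], [110, 190, 330])
# deviations are +10, -10, +30: MD = 10, MAD = 50/3, RMSD = sqrt(1100/3)
```

MD can mask offsetting errors (here +10 and -10 cancel), which is why MAD and RMSD are reported alongside it.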

Given the criteria and procedures for judging the best imputation procedure among the four methods, the selection of the best method was difficult. Consequently, in order to determine the best method of imputing nonresponse observations for each variable in the study, the methods were ranked according to several criteria. A rank of 1 indicates the best imputation method for a criterion, while a rank of 4 indicates the worst.

After comparing the methods, the two regression methods, namely Deterministic and Stochastic Regression Imputation, gave outstanding results. It can therefore be concluded that the Stochastic Regression Imputation procedure is the best imputation method for this study, since it did not rank poorly on any criterion under any NRR or VI.

The efficiency of this imputation method was supported by the R2 of the model and by the random residual added to the deterministic imputed value. The added residuals made the estimates less biased than those of the deterministic counterpart.

Deterministic regression imputation performed much better than Hot Deck imputation. It is surprising that Hot Deck was less efficient than deterministic regression, whereas in related studies it emerged as the better method. Most likely its poor performance was caused by the selection of donors with replacement rather than by the imputation classes: had the imputation classes been the cause of its low ranking, both regression imputation methods' estimates could have been as poor as Hot Deck's even with an adequate model.
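The conjectured difference between donor selection with and without replacement can be sketched as follows; the donor values are hypothetical, and this is an illustration of the mechanism, not the study's hot-deck program.

```python
import random

def hot_deck_impute(donors, n_missing, with_replacement, rng):
    """Draw a donor value for each missing case within an imputation
    class. Sampling with replacement lets one donor be reused many
    times, which can distort the imputed distribution; sampling
    without replacement uses each donor at most once."""
    if with_replacement:
        return [rng.choice(donors) for _ in range(n_missing)]
    return rng.sample(donors, n_missing)

rng = random.Random(42)
donors = [31000, 45000, 52000, 61000, 75000]
with_repl = hot_deck_impute(donors, 3, True, rng)
without_repl = hot_deck_impute(donors, 3, False, rng)
# without replacement, the three imputed values are always distinct donors
```

With replacement, a single donor can by chance dominate the imputations for a class, inflating both bias and variance, which is consistent with the poor HDI rankings observed here.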

Chapter 7

Recommendations for Further Research

In this study, we compared four imputation methods commonly used in dealing with partial nonresponse, under the assumption of MCAR. However, other methods are currently being developed and improved. For example, the multiple imputation method independently imputes more than one value for each nonresponse value. Multiple imputation is an important and powerful form of imputation and has the advantage that variance estimation under imputation can be carried out comparatively easily (Kalton, 1983).
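As a sketch of why multiple imputation eases variance estimation, the function below applies Rubin's combining rules (a standard formulation, not taken from this thesis) to m point estimates and their within-imputation variances; the input figures are hypothetical.

```python
import statistics

def combine_mi_estimates(estimates, variances):
    """Rubin's rules for m multiply-imputed data sets: the combined
    point estimate is the mean of the m estimates, and the total
    variance adds the between-imputation variance (inflated by
    1 + 1/m) to the average within-imputation variance."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total_var = within + (1 + 1 / m) * between
    return q_bar, total_var

# Hypothetical mean-income estimates from m = 3 imputed data sets
q_bar, total_var = combine_mi_estimates([101.0, 99.0, 103.0],
                                        [4.0, 4.5, 3.5])
```

The between-imputation term is exactly the extra uncertainty due to the missing data, which single imputation cannot capture.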

Regarding variance estimation, further studies should implement proper variance estimators such as the jackknife variance estimator, which is often used in comparing the variance estimates of imputation methods. Rao and Shao (1992) proposed an adjusted jackknife variance estimator for use with imputation methods related to the Hot Deck procedure; this estimator is said to be asymptotically unbiased.
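The ordinary delete-one jackknife, on which the Rao-Shao adjustment builds, can be sketched as follows (the adjustment itself, which recomputes imputations in each replicate, is not shown; the sample values are hypothetical).

```python
def jackknife_variance(sample, estimator):
    """Delete-one jackknife variance estimate: recompute the
    estimator n times, each time leaving one observation out, and
    scale the spread of the replicates by (n - 1) / n."""
    n = len(sample)
    leave_one_out = [
        estimator(sample[:i] + sample[i + 1:]) for i in range(n)
    ]
    theta_dot = sum(leave_one_out) / n
    return (n - 1) / n * sum((t - theta_dot) ** 2 for t in leave_one_out)

mean = lambda xs: sum(xs) / len(xs)
var_hat = jackknife_variance([2.0, 4.0, 6.0, 8.0], mean)
# for the sample mean this reproduces the usual s^2 / n
```

For a naively imputed data set this plain jackknife understates the variance, which is precisely what the Rao-Shao adjustment corrects.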

Future researchers may test other methods on the same data set and compare the results with those presented in this paper. They could also compare the results of this study with those of multiple imputation and the Rao-Shao jackknife variance estimator. Using these procedures, however, requires more advanced knowledge of statistics, including Bayesian statistics. The complexity of the methods, especially both regression imputations, could hinder future researchers in the use of modern variance estimators.

It is also suggested that matching variables be selected using advanced statistical methods such as CHAID analysis. The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980); according to Ripley (1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and Messenger (1973). CHAID builds non-binary trees (i.e., trees where more than two branches can attach to a single root or node) based on a relatively simple algorithm that is particularly well suited to the analysis of larger datasets. Also, because the CHAID algorithm often effectively yields many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies (Statsoft, 2003).

In pursuing regression imputation, instead of creating a model for each imputation class, which can be time-consuming and frustrating since not all models yield the same results, dummy variables should be inserted into the model. These dummy variables represent the categories of the matching variables. This would save time and money, since only one model is created and tested.
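The dummy-variable encoding suggested here can be sketched as follows; the class labels are hypothetical, and one level is dropped as the reference class so the combined regression model stays full-rank.

```python
def dummy_encode(categories):
    """Build one 0/1 dummy column per category except a reference
    level, so a single regression model can cover every imputation
    class instead of fitting one model per class."""
    levels = sorted(set(categories))
    reference, others = levels[0], levels[1:]
    rows = [
        [1.0 if c == lvl else 0.0 for lvl in others]
        for c in categories
    ]
    return rows, others

# Hypothetical matching variable with three imputation classes
rows, columns = dummy_encode(["rural", "urban", "semi-urban", "rural"])
# columns == ['semi-urban', 'urban']; 'rural' is the reference class
```

These columns are appended to the design matrix alongside the continuous predictors, so class membership shifts the intercept within one shared model.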

The researchers strongly recommend using a statistical package that can generate imputations faster and more easily, with less biased estimates, than custom programming. It would save the considerable time otherwise spent debugging a computer program and would prevent crashes due to memory overload.

Bibliography

[1] Cheng, J.H. and Sy, F. (1997). A Comparison of Several Techniques of Imputation on Clinical Data (Undergraduate Thesis, De La Salle University).

[2] Kalton, G. (1983). Compensating for Missing Survey Data. Michigan.

[3] Kazemi, I. (2005). Methods for Missing Data: Center for Applied Statistics. Retrieved 12 March 2007, from http://www.cas.lancs.ac.uk/shortcourses/notes/missingdata/Session5.pdf

[4] Musil, C., Warner, C., Yobas, P.K. and Jones, S. (2002). A Comparison of Imputation Techniques for Handling Missing Data. Western Journal of Nursing Research, Vol. 24, No. 7, 815-829.

[5] National Statistics Office (NSO) (1997-2005). Technical Notes on the 1997 Family Income and Expenditure Survey (FIES). Retrieved 18 June 2007, from http://www.census.gov.ph/data/technotes/notefies.html

[6] Neter, J., Wasserman, W. and Kutner, M.H. Applied Linear Statistical Models, 2nd ed. Homewood, Illinois: Richard D. Irwin, Inc.

[7] Nordholt, E.S. (1998). Imputation: Methods, Simulation Experiments and Practical Examples. International Statistical Review, Vol. 66, No. 2, 157-180.

[8] Obanil, R. (2006, October 3). Topmost Floor of NSO Building Gutted by Fire. The Manila Bulletin Online. Retrieved 28 August 2007, from http://www.mb.com.ph/issues/2006/10/03/MTN2061037203.html

[9] Salvino, S. and Yu, A.C. (1996). Some Approaches in Dealing With Nonresponse in Survey Operations With Applications to the 1991 Marinduque Census of Agriculture and Fisheries Data (Undergraduate Thesis, De La Salle University).

[10] Siegel, S. (1988). Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.

[11] No author. CHAID Analysis [Electronic version], Electronic Statsoft Textbook. Retrieved 29 July 2007, from http://www.statsoft.com/textbook/stchaid.html

[12] StatSoft, Inc. (2005). STATISTICA (data analysis software system), version 7.1. www.statsoft.com.

Appendices

Appendix A: Items and Information Gathered in the FIES 1997

Appendix B: Source Codes of the Imputation Programs

Appendix C: Model Validation of the Regression Equations used in the Regression Imputation Procedures
