Revised Thesis Again

  • November 2019
Abstract

Several imputation methods have been developed for imputing missing responses, and it is often not clear which method is best for a particular condition. In choosing an imputation method, several factors should be considered, such as the types of estimates that will be generated, the type and pattern of nonresponse, and the availability of auxiliary data that are highly correlated with the characteristic of interest or with the response propensity.

This study compared the effectiveness of four imputation procedures, namely Overall Mean, Hot Deck, Deterministic Regression and Stochastic Regression Imputation, using the first-visit variable as the auxiliary variable. Values of the second-visit variables Total Income and Total Expenditures (TOTIN2 and TOTEX2) were set to nonresponse to satisfy the assumption of partial nonresponse. The results of the study provide some support for the following conclusions: (a) for the 1997 FIES data, the Hot Deck Imputation and Overall Mean Imputation methods are not appropriate for handling partial nonresponse data; (b) stochastic regression imputation was selected as the best imputation method; and (c) the imputation classes must be homogeneous to produce less biased estimates.

Chapter 1

The Problem and Its Background

1.1 Introduction

Missing data in sample surveys is inevitable. The problem of missing data occurs for various reasons, such as when the respondent has moved to another location, refuses to participate in the survey or is unable to answer specific items in the survey. This failure to obtain responses from the units selected in the sample is called nonresponse. There are several types of nonresponse: (a) unit nonresponse refers to the failure to collect any data from a sample unit; (b) item nonresponse refers to the failure to collect valid responses to one or more items from a responding sample unit (i.e. in surveys with only one phase, or when a single phase is considered and the other phases are ignored); and (c) partial nonresponse occurs when there is a failure to collect responses for a large set or block of items for a responding unit (i.e. in surveys with more than one phase, when the same respondent cannot answer in the succeeding phases of the survey).

The effect of nonresponse must not be ignored, since it leads to biased estimates, and a large bias makes the estimates inaccurate. Bias due to nonresponse is believed to be a function of the nonresponse rate and the difference in characteristics between responding and nonresponding units: the larger the nonresponse rate, or the wider the difference between the responding and nonresponding units, the larger the bias.
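The dependence of the bias on the nonresponse rate and on the respondent/nonrespondent gap can be illustrated with a small sketch. It relies on the standard identity that the respondent mean differs from the full-sample mean by the nonresponse rate times the difference between the two group means. The function names and the figures below are hypothetical and are not taken from the FIES data.

```python
def nonresponse_bias(resp_mean, nonresp_mean, nonresp_rate):
    """Bias of the respondent mean relative to the full-sample mean."""
    return nonresp_rate * (resp_mean - nonresp_mean)


def full_sample_mean(resp_mean, nonresp_mean, nonresp_rate):
    """Mean over respondents and nonrespondents combined."""
    return (1 - nonresp_rate) * resp_mean + nonresp_rate * nonresp_mean


# Hypothetical group means, chosen only for illustration.
resp_mean, nonresp_mean = 100_000.0, 120_000.0
for rate in (0.10, 0.20, 0.30):
    bias = nonresponse_bias(resp_mean, nonresp_mean, rate)
    # The bias equals the gap between the respondent mean and the full mean.
    gap = resp_mean - full_sample_mean(resp_mean, nonresp_mean, rate)
    print(f"rate={rate:.0%} bias={bias:.1f} gap={gap:.1f}")
```

Under these assumed figures, the magnitude of the bias grows in proportion to the nonresponse rate, and it vanishes when the two groups have the same mean.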

In practice, there are three ways of handling missing data: discarding the missing values, applying weighting adjustments or using imputation techniques. Discarding the missing values, otherwise known as the Available Case Method, excludes the nonresponse records when analyzing the variable of interest. The problem with this method is that it does not account for the difference in characteristics between the responding and nonresponding units. Hence, methods for compensating for missing data are applied. The first is weighting adjustment, which matches nonrespondents to respondents in terms of the data available on nonrespondents and increases the weights of the matched respondents to account for the missing values. In practice, the weights of the respondents are often multiplied by the inverse of the response rate. This is often applied for unit nonresponse. Imputation, on the other hand, is used by statisticians to account for nonresponse usually in the case of item and partial nonresponse. In imputation, a missing value is replaced by a reasonable substitute for the missing information. Once nonresponse has been dealt with, whether by weighting adjustments or imputation, researchers can proceed with their data analysis.
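A minimal sketch of the weighting adjustment described above is given here, assuming a single adjustment class in which every respondent weight is multiplied by the inverse of the weighted response rate. The function name and the weights are hypothetical; in practice the adjustment would be done separately within classes of matched respondents and nonrespondents.

```python
def adjust_weights(base_weights, responded):
    """Inflate respondent weights by the inverse of the weighted response rate."""
    total = sum(base_weights)
    resp_total = sum(w for w, r in zip(base_weights, responded) if r)
    if resp_total == 0:
        raise ValueError("no respondents to carry the weight")
    factor = total / resp_total  # inverse of the weighted response rate
    # Nonrespondents drop out of the analysis, so their weight becomes zero.
    return [w * factor if r else 0.0 for w, r in zip(base_weights, responded)]


# Hypothetical class of four units, one of which did not respond.
weights = adjust_weights([10.0, 10.0, 10.0, 10.0], [True, True, False, True])
print(weights)
```

The adjusted weights of the three respondents add up to the original total weight of the class, so the respondents stand in for the unit that did not respond.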

The Family Income and Expenditure Survey (FIES) is an example of a survey with more than one round of data collection. The FIES is a nationwide survey of households conducted every three years by the National Statistics Office (NSO), with two visits to each sample unit per survey period, in order to provide information on the country’s income distribution, spending patterns and poverty incidence. Like any other survey, the FIES encounters the problem of missing data, particularly nonresponse during the second visit. Given the various contributions that this survey can provide, it is important to have precise estimates of the income and expenditure indicators.

With the 1997 FIES as the data set for this study, this paper focuses on dealing with partial nonresponse through the use of imputation techniques. It aims to examine the effects of imputed values in producing estimates for the missing data at various nonresponse rates. Furthermore, the study aims to determine which imputation technique is appropriate for the FIES data by applying some of the methods used in Compensating for Missing Survey Data by Kalton (1983), a study based on the 1978 Research Panel Survey for the Income Survey Development Program (ISDP).

1.2 Statement of the Problem

This paper attempts to answer the following questions:

1. Which imputation technique is the most appropriate in handling partial nonresponse for the FIES data?
2. How do varying nonresponse rates affect the results for each imputation method?

1.3 Objectives of the Study

The paper will attempt to achieve the following objectives:

1. To compare the imputation techniques, namely overall mean imputation, hot deck imputation, deterministic regression imputation and stochastic regression imputation, in compensating for partial nonresponse in the FIES.
2. To investigate the effect of varying rates of missing observations, particularly the effect of 10%, 20% and 30% nonresponse rates, on the precision of the estimates.

1.4 Significance of the Study

Nonresponse is a common problem in conducting surveys. The presence of nonresponse in surveys produces incomplete data, which could pose serious problems during data analysis, particularly in the generation of statistically reliable estimates. The use of imputation techniques makes it possible to account for the difference between respondents and nonrespondents, which helps reduce nonresponse bias in the survey estimates.

Since most statistical packages require the use of complete data before conducting any procedure for data analysis, the use of imputation techniques can ensure consistency of results across analyses, something that an incomplete data set cannot fully provide.

A news article by Obanil (2006) entitled Topmost Floor of the NSO Building gutted by Fire, posted at Manila Bulletin Online, reported that on October 3, 2006, documents worth around 1 million pesos were destroyed by the fire. Among the documents gutted by the fire was the first-visit questionnaire of the FIES for the NCR, which at the time of the fire had not yet been encoded.

In terms of statistical research, most countries in the developed world, such as the United States, Canada, the UK and the Netherlands, already employ imputation techniques in their respective national statistical offices. In a country such as the Philippines, where data collection is very difficult, especially in some regions like the National Capital Region (NCR), imputation can ease the problems of data collection and nonresponse.

More importantly, given the great impact of this survey on the country, employing imputation techniques will give statisticians a method for handling nonresponse, which could lead to more meaningful generalizations about our country’s income distribution, spending patterns and poverty incidence. Estimates with less bias and more consistent results can in turn help our policymakers and economists provide better solutions for improving the lives of Filipinos.

1.5 Scope and Limitations

Throughout this paper, only the data from the 1997 Family Income and Expenditure Survey (FIES) will be used to tackle the problem of nonresponse and to examine the impact of the different imputation methods applied to the dataset. With regard to the extent to which these imputation methods will be applied and evaluated, this paper will only cover the partial nonresponse occurring in the National Capital Region (NCR), since NCR is noted as the region with the highest nonresponse rate. Also, the variables that will be imputed in this study are the Total Income (TOTIN2) and Total Expenditures (TOTEX2) of the second visit of the FIES data.

The researchers will only focus on using the 1997 FIES first-visit data to impute the partial nonresponse present in the second visit. This paper also assumes that the first-visit data is complete and that the pattern of nonresponse follows the Missing Completely at Random (MCAR) case. The MCAR case holds if the probability of response to Y is unrelated to the value of Y itself or to any other variables, making the missing data randomly distributed across all cases (Musil et al., 2002). If the pattern of nonresponse does not satisfy the MCAR assumption, the imputation methods may not achieve their purpose.
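One way to create nonresponse that satisfies the MCAR assumption is to delete a fixed share of the recorded values uniformly at random, without reference to any variable. The sketch below assumes this simple scheme; the function name, the seed and the placeholder values are hypothetical, and the sample size of 4130 is the sum of the retained and deleted counts reported in the Chapter 5 notes of this document.

```python
import random


def delete_mcar(values, rate, seed=0):
    """Set round(rate * n) entries to None, chosen uniformly at random."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    # Every unit has the same chance of deletion, whatever its value.
    missing = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing else v for i, v in enumerate(values)]


# Placeholder values standing in for a second-visit variable.
second_visit = [float(i) for i in range(4130)]
for rate in (0.10, 0.20, 0.30):
    masked = delete_mcar(second_visit, rate)
    print(rate, sum(v is None for v in masked))
```

Because the deleted units are picked without looking at the values, the retained units remain a random subsample of the full data, which is what the MCAR case requires.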

As for the imputation techniques, only four imputation methods will be applied for this paper namely: Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI). Other methods of handling nonresponse will not be covered in this paper.
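The first two methods can be sketched as follows. This is only an illustration under stated assumptions: OMI fills every missing value with the mean of all respondents, and HDI is shown as a random draw of a respondent (donor) value from the same imputation class. The donor selection rule, the function names and the toy data are assumptions and may differ from the procedure actually applied to the FIES data.

```python
import random
from statistics import mean


def overall_mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def hot_deck_impute(values, classes, seed=0):
    """Replace each None with a random respondent value from the same class."""
    rng = random.Random(seed)
    donors = {}
    for v, c in zip(values, classes):
        if v is not None:
            donors.setdefault(c, []).append(v)
    # A class without any respondent raises KeyError: no donor is available.
    return [rng.choice(donors[c]) if v is None else v
            for v, c in zip(values, classes)]


# Toy data: two hypothetical imputation classes with one missing value each.
values = [10.0, None, 14.0, 100.0, None, 120.0]
classes = ["IC1", "IC1", "IC1", "IC2", "IC2", "IC2"]
print(overall_mean_impute(values))
print(hot_deck_impute(values, classes))
```

In this toy example the overall mean gives both missing units the same value regardless of class, while the hot deck keeps each imputed value within the range of its own class.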

The evaluation of the efficacy and appropriateness of the four imputation methods will be limited to the following: (a) the bias of the mean of the imputed data, (b) an assessment of the distributions of the imputed versus the actual data and (c) the criteria mentioned in the report entitled Compensating for Missing Survey Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute Deviation and the Root Mean Square Deviation.
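The three deviation criteria can be computed as in the sketch below. It assumes that each deviation is taken as the imputed value minus the actual (deleted) value, so that a negative Mean Deviation indicates underestimation; the function name and the numbers are hypothetical.

```python
from math import sqrt


def deviation_measures(imputed, actual):
    """Return the Mean Deviation, Mean Absolute Deviation and RMS Deviation."""
    diffs = [i - a for i, a in zip(imputed, actual)]
    n = len(diffs)
    md = sum(diffs) / n                       # signed: shows the direction of bias
    mad = sum(abs(d) for d in diffs) / n      # average size of the errors
    rmsd = sqrt(sum(d * d for d in diffs) / n)  # penalizes large errors more
    return md, mad, rmsd


# Hypothetical imputed values compared with the values that were deleted.
md, mad, rmsd = deviation_measures([8.0, 12.0, 7.0], [10.0, 10.0, 10.0])
print(md, mad, rmsd)
```

The Mean Deviation can be close to zero even when individual errors are large, because positive and negative deviations cancel out; the other two measures do not allow this cancellation.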

Chapter 2

Review of Related Literature

Much research effort has been devoted to the efficacy of various imputation methods. In the report entitled Compensating for Missing Survey Data, two simulation studies were carried out using the data in the 1978 Income Survey Development Program (ISDP) Research Panel to compare some imputation methods. The first study compared imputation methods for the variable Hourly Rate of Pay while the second dealt with the imputation of the variable Quarterly Earnings. For both studies, the author stratified the data into its imputation classes, constructed data sets with missing values by randomly deleting some of the recorded values in the original dataset and then applied the various imputation methods to fill in the missing values. This process was replicated ten times to ensure consistency of the results. Once the imputation methods had been applied, the three measures for evaluating the effectiveness of imputation methods, namely the Mean Deviation, Mean Absolute Deviation and the Root Mean Square Deviation, were obtained and averaged across the ten trials (Kalton, 1983).

For the first study, the imputation of the variable Hourly Rate of Pay, eight methods were used, namely the Grand Mean Imputation (GM), the Class Mean Imputation using eight imputation classes (CM8), the Class Mean Imputation using ten imputation classes (CM10), Random Imputation with eight imputation classes (RC8), Random Imputation with ten imputation classes (RC10), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN) and Multiple Regression Imputation plus a randomly chosen respondent residual (MR). Using the Mean Deviation criterion, the results showed that all mean deviations were negative, indicating that the imputed values underestimated the actual values. Moreover, the Grand Mean Imputation (GM) had the greatest underestimation among the eight procedures. Meanwhile, for the Mean Absolute Deviation and Root Mean Square Deviation, which measure the ability to reconstruct the deleted value, the Grand Mean Imputation fared the worst on both criteria. In addition, the Multiple Regression Imputation (MI) obtained the best measures for the two criteria, and the procedures with a greater number of imputation classes (i.e., CM8 vs. CM10, RC8 vs. RC10) yielded slightly better results for the two criteria (Kalton, 1983).

For the second study, the imputation of Quarterly Earnings, ten imputation procedures were used. These are the Grand Mean Imputation (GM), the Class Mean Imputation using eight imputation classes (CM8), the Class Mean Imputation using twelve imputation classes (CM12), Random Imputation with eight imputation classes (RC8), Random Imputation with twelve imputation classes (RC12), Multiple Regression Imputation (MI), Multiple Regression Imputation plus a random residual chosen from a normal distribution (MN), Multiple Regression Imputation plus a randomly chosen respondent residual (MR), Mixed Deductive and Random Imputation using eight imputation classes (DI8) and Mixed Deductive and Random Imputation using twelve imputation classes (DI12). Using the first criterion, the Mean Deviation, the results showed that the Grand Mean (GM) obtained a positive bias. This implied that the grand mean imputation is not an effective imputation method for the study. The results also showed that the regression imputation procedures have almost similar results, producing almost unbiased estimates. In addition, the Class Mean Imputation methods (CM8 and CM12) have measures similar to those of the Random Imputation methods. Nevertheless, all methods produced relatively small mean deviations except for the last two methods (DI8 and DI12). Comparing the Mean Absolute Deviations and the Root Mean Square Deviations, the results show that the Grand Mean Imputation obtained values similar to those of the regression procedures with residuals (i.e. MN and MR). The results also show that the values for the RC8, RC12, MN and MR procedures are over one third larger than those for deterministic procedures such as CM8, CM12 and MI (Kalton, 1983).

To further investigate the relatively larger biases of the DI8 and DI12 procedures, the author divided the data into the deductive and non-deductive cases. This shed further light on the Mean Deviations and Mean Absolute Deviations of the various imputation methods. It was found that the mean deviations are positive in the deductive case and negative in the non-deductive case for all of the procedures. This explains why there are relatively small deviations in the previous results, since the measures for the two cases tend to cancel out. It also showed that the DI8 and DI12 results are similar to those of RC8, RC12, CM8 and CM12 in the non-deductive cases but are largely different in the deductive cases. This explains the larger values of DI8 and DI12 in the previous results (Kalton, 1983).

At the end of the two studies, the results showed that the imputation procedures tend to overestimate the Hourly Rate of Pay and underestimate the Quarterly Earnings. Moreover, mean imputation appeared to be the weakest imputation method in both studies, since it distorted the distribution of the original data. Lastly, Kalton’s study shows that increasing the number of imputation classes yields better values for the three criteria.

In contrast to Kalton’s criteria for measuring the performance of imputation procedures, the paper entitled A Comparison of Imputation Techniques for Missing Data by C. Musil, C. Warner, P. Yobas and S. Jones presented a much simpler approach to evaluating the performance of imputation techniques: computing the means, standard deviations and correlation coefficients, then comparing the statistics of the original data with the statistics obtained from five methods, namely Listwise Deletion, Mean Imputation, Deterministic Regression, Stochastic Regression and the EM Method. The Expectation Maximization (EM) Method is an iterative procedure that generates missing values by using expectation (E-step) and maximization (M-step) algorithms. The E-step calculates expected values based on all complete data points, while the M-step replaces the missing values with the E-step generated values and then recomputes new expected values (Musil et al., 2002).

Using the Center for Epidemiological Studies data on stress and health ratings of older adults, the authors imputed a single variable, namely the functional health rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects of each imputation method. The researchers used the SPSS Missing Value Analysis function for the Deterministic Regression, Stochastic Regression and EM Method, but not for Listwise Deletion and Mean Imputation. For the correlations, the researchers obtained the correlations of the imputed variable, in the original data and under each of the five methods, with the variables age, gender and self-assessed health rating (Musil et al., 2002). The results show that, comparing the mean of the original data with the means from the five methods, all methods underestimated the mean. The closest to the original data was the Stochastic Regression, followed very closely by the EM Method, Deterministic Regression, Listwise Deletion and Mean Imputation. The same results also hold for the standard deviations. For the correlations, however, the EM Method produced the closest correlation values to the original data, followed closely by the Stochastic Regression, Deterministic Regression, Listwise Deletion and Mean Imputation. Hence, the findings suggest that the Stochastic Regression and EM Method performed better, while Mean Imputation was the least effective (Musil et al., 2002).

Chapter 4: Methodology

p. 43, Section 4.4.3: Deterministic and Stochastic Regression Imputation

Deterministic Regression Imputation (DRI) is a procedure that involves the generation of a least squares regression equation where Y ….. …. With an additional procedure of adding an error term ê to the predicted value in order to generate imputed values for missing data.
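The two regression procedures can be illustrated with a minimal sketch. It assumes a simple linear regression of the second-visit variable on a single first-visit variable and, for the stochastic version, an error term drawn at random from the respondents' residuals. The model form, the residual scheme, the function names and the toy data are assumptions for illustration only; the fitted model in this study is not reproduced here.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def regression_impute(x, y, stochastic=False, seed=0):
    """Fill None entries of y from x; optionally add a respondent residual."""
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = intercept + slope * xi      # deterministic prediction
            if stochastic:
                yi += rng.choice(residuals)  # assumed residual scheme
        filled.append(yi)
    return filled


# Toy first-visit (x) and second-visit (y) values with one missing response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, 9.8, 12.0]
print(regression_impute(x, y))
print(regression_impute(x, y, stochastic=True))
```

The deterministic version places every imputed value exactly on the fitted line, which understates the spread of the data; adding a residual is intended to restore some of that variability.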

p. 44: The diagnostic checking requires the fitted model to satisfy the following assumptions:

p. 45: Refer to the errata sheet and remove the discussion of the variance. Change the title to: Bias of the mean of the imputed data

Chapter 5: Results and Discussion

p. 54: Table 3: Chi-Square Test of Independence for the Matching Variable

p. 55: Table 4: Measures of Association for Matching Variable

          Phi-Coefficient     Cramer's V          Contingency Test
          CODIN1    CODEX1    CODIN1    CODEX1    CODIN1    CODEX1
PROV      0.192     0.183     0.111     0.105     0.188     0.18
CODES1    0.386     0.408     0.273     0.288     0.36      0.378
CODEP1    0.295     0.216     0.17      0.125     0.283     0.211
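The three measures of association in Table 4 are all functions of the chi-square statistic of a contingency table. The sketch below uses the usual textbook definitions (phi as the square root of chi-square over n, Cramer's V dividing further by the smaller of rows minus one and columns minus one, and the contingency coefficient as the square root of chi-square over chi-square plus n). The function name and the 2 x 2 table of counts are hypothetical and are not the FIES cross-tabulations.

```python
from math import sqrt


def association_measures(table):
    """Phi, Cramer's V and the contingency coefficient from a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Pearson chi-square; assumes no row or column total is zero.
    chi2 = sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_totals)
        for obs, ct in zip(row, col_totals)
    )
    k = min(len(row_totals), len(col_totals)) - 1
    phi = sqrt(chi2 / n)
    cramers_v = sqrt(chi2 / (n * k))
    contingency = sqrt(chi2 / (chi2 + n))
    return phi, cramers_v, contingency


# Hypothetical 2 x 2 table of counts.
phi, v, c = association_measures([[30, 10], [10, 30]])
print(phi, v, c)
```

Under these definitions the contingency coefficient equals phi divided by the square root of one plus phi squared. The values reported in Table 4 are consistent with this relation; for example, 0.192 divided by the square root of 1 + 0.192 squared is approximately 0.188.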

p. 56 (changed font and font size)

Descriptive Statistics

VI        IC     Valid n    Mean        Minimum     Maximum     Std. Dev
TOTIN2    IC1    2635       93588.32    9067.000    1340900     75619.52
TOTIN2    IC2    1434       186940.9    14490.00    4215480     281852.3
TOTIN2    IC3    61         643191.2    54790.00    4357180     829409.3
TOTEX2    IC1    2635       74866.68    9025.000    731937.0    47517.69
TOTEX2    IC2    1434       135510.8    13575.00    3203978     151984.3
TOTEX2    IC3    61         413184.0    40505.00    2726603     532577.1

p. 57 (edited, changed font and font size)

                 Observations retained    Observations deleted
VI        NRR    n       Mean             n       Mean
TOTEX2    10%    3717    102748.610       413     99160.235
TOTEX2    20%    3304    102219.791       826     103069.697
TOTEX2    30%    2891    100709.947       1239    106309.365
TOTIN2    10%    3717    134821.662       413     127799.121
TOTIN2    20%    3304    133624.722       826     136098.155
TOTIN2    30%    2891    130685.596       1239    142131.636

Table 7: Model Fitted Results
