Revised Defense Presentation

  • November 2019
  • PDF

IMPUTATION PROCEDURES FOR PARTIAL NONRESPONSE: The Case of the Family Income and Expenditure Survey (FIES)

Diana Camille B. Cortes
James Edison T. Pangan

THE PROBLEM AND ITS BACKGROUND

Statement of the Problem
• Which imputation technique is the most appropriate for the FIES data?
• How do varying nonresponse rates affect the results of each imputation method?

Objectives of the Study
• To compare four imputation techniques, namely Overall Mean Imputation, Hot Deck Imputation, Deterministic Regression Imputation and Stochastic Regression Imputation, on their efficiency and their ability to recapture deleted values, by generating the missing values in the FIES 1997 second visit data from the first visit data of the same survey.
• To investigate the effect of varying rates of missing observations, particularly nonresponse rates of 10%, 20% and 30%, on the precision of the estimates.

Scope and Limitations
• The Family Income and Expenditure Survey (FIES) 1997 was used to study the problem of nonresponse and to examine the impact of the different imputation methods.
• The researchers focused on using the first visit data of the FIES 1997 to impute the partial nonresponse present in the second visit.
• This paper assumes that the first visit data are complete and that the pattern of nonresponse is Missing Completely at Random (MCAR).
• Only four imputation methods are applied in this paper: Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).
• This paper covers only the partial nonresponse occurring in the National Capital Region (NCR).
• The variables imputed in this study are Total Income (TOTIN2) and Total Expenditures (TOTEX2) in the second visit of the FIES 1997 data.
• The evaluation of the efficacy and appropriateness of the four imputation methods was limited to the following:
  1. Nonresponse bias of the imputed data
  2. Assessment of the distributions of the imputed vs. the actual data
  3. The criteria set in the report Compensating for Missing Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation

CONCEPTUAL FRAMEWORK

Definition of Terms:
• Nonresponse is the failure to collect a valid response for a particular unit.
• Bias is the difference between the expected value of an estimator and the true value of the parameter being estimated.
• Accuracy is the extent to which estimates are close to the value of the parameter.
• Precision is the extent to which estimates are close to one another.
• Efficiency is a measure of how well a method performs against a set of criteria.

Types of Nonresponse
• Unit (Total) Nonresponse (UN) occurs when no information is collected from a sampling unit. Reasons include:
  1. Failure to contact the respondent
  2. Inability to cooperate
  3. Lost questionnaires
• Item Nonresponse (IN) is the failure to collect complete information because some of the questions go unanswered. Reasons include:
  1. The informant lacks the necessary information
  2. The informant fails to make the effort to retrieve it from memory or by consulting records
  3. Refusal to answer because of the sensitivity of the questions
  4. Failure to record an answer
  5. Responses are subsequently rejected at an edit check
• Partial Nonresponse (PN) is the failure to collect large sets of items for a responding unit. Reasons include:
  1. The respondent fails to provide information in one or more waves of a panel survey, or in later phases of a multi-phase data collection procedure
  2. Later items in the questionnaire are lost after a telephone interview is broken off
  3. The data remain unavailable after all possible checking and follow-up
  4. The responses are inconsistent and do not satisfy natural or reasonable constraints known as edits
  5. Causes similar to those of total nonresponse

Patterns of Nonresponse
• The assumed pattern of nonresponse matters because it influences how missing data are handled, particularly in the implementation of the imputation procedures used in this study.
• There are three patterns of nonresponse: Missing Completely At Random (MCAR), Missing At Random (MAR) and NonIgnorable Nonresponse (NIN).
• Missing Completely At Random (MCAR) occurs if the probability of missing data on Y is unrelated to Y itself or to other variables in the data. Data with this kind of nonresponse have the highest degree of randomness and show no underlying reasons for missing observations that could bias research findings.
• Missing At Random (MAR) occurs if the probability of missing data on Y is unrelated to Y itself after controlling for other variables in the data. This means that the likelihood of a case having incomplete information on a variable can be explained by other variables in the data set. As with MCAR, data of this type also show some randomness.
• NonIgnorable Nonresponse (NIN) occurs if the probability of missing data on Y is related to the value of Y, and possibly to some other variable Z, even when other variables are controlled for in the analysis. Unlike MCAR and MAR, this type is not random: systematic, nonrandom factors that are not apparent or measured underlie the occurrence of the missing values.

Nonresponse Bias
• Nonresponse bias becomes a problem when missing data are ignored, deleted or discarded in the analysis.
• The population is divided into two groups: respondents and nonrespondents.
• Let R and M be the numbers of respondents and nonrespondents in the population, with N = R + M.
• Suppose a simple random sample (SRS) of size n is drawn from the population of N units, and the variable y contains missing data.
• Let r and m, with n = r + m, be the corresponding sample quantities.

• The proportions of respondents and nonrespondents in the population are given by:

  $\bar{R} = R/N$ and $\bar{M} = M/N$

• The proportions of respondents and nonrespondents in the sample are given by:

  $\bar{r} = r/n$ and $\bar{m} = m/n$

• The population total and mean are given by:

  $Y = R\bar{Y}_r + M\bar{Y}_m$ and $\bar{Y} = \bar{R}\bar{Y}_r + \bar{M}\bar{Y}_m$

  where $\bar{Y}_r$ and $\bar{Y}_m$ are the population means of the respondents and the nonrespondents.
• The corresponding sample total and mean are given by:

  $y = r\bar{y}_r + m\bar{y}_m$ and $\bar{y} = \bar{r}\bar{y}_r + \bar{m}\bar{y}_m$

• If no compensation is made, the respondent sample mean $\bar{y}_r$ is used to estimate the population mean. Its bias is given by:

  $B(\bar{y}_r) = E(\bar{y}_r) - \bar{Y}$

• The expectation of the respondent sample mean can be obtained in two stages, first conditional on fixed r and then over different values of r, that is:

  $E(\bar{y}_r) = E_1 E_2(\bar{y}_r)$

  where $E_2$ is the conditional expectation for fixed r and $E_1$ is the expectation over different values of r.
• The expectation of the respondent sample mean is given by:

  $E(\bar{y}_r) = E_1(\bar{Y}_r) = \bar{Y}_r$

• Hence, the bias of the respondent sample mean is given by:

  $B(\bar{y}_r) = \bar{Y}_r - \bar{Y} = \bar{M}(\bar{Y}_r - \bar{Y}_m)$
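
A small numerical check can make the bias of the respondent mean concrete. The sketch below assumes the standard result that this bias equals the nonrespondent share of the population times the difference between the respondent and nonrespondent means; the function name and all numbers are hypothetical and are not taken from the FIES data.

```python
def respondent_mean_bias(resp_mean, nonresp_mean, nonresp_share):
    """Bias of the respondent mean: nonrespondent share times the
    difference between the respondent and nonrespondent means."""
    return nonresp_share * (resp_mean - nonresp_mean)


# Hypothetical population: 20% nonrespondents whose mean is lower.
resp_mean, nonresp_mean, nonresp_share = 120.0, 100.0, 0.20

# Population mean as the share-weighted average of the two group means.
pop_mean = (1 - nonresp_share) * resp_mean + nonresp_share * nonresp_mean

bias = respondent_mean_bias(resp_mean, nonresp_mean, nonresp_share)
print(f"population mean = {pop_mean:.1f}, bias of respondent mean = {bias:.1f}")
```

The bias vanishes when the two group means coincide or when there are no nonrespondents, which is why ignoring nonresponse is harmless only in those cases.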

Imputation Process
• Imputation is the process of replacing a missing value, through available statistical and mathematical techniques, with a value that is considered a reasonable substitute for the missing information.
• Imputation is one of many procedures that can be used to deal with nonresponse, with the aim of producing unbiased results.
• The advantages (benefits) and disadvantages (dangers) of using imputation are listed below.

Advantages
• Helps reduce biases in survey estimates
• Makes analysis easier and presentation simpler
• Ensures consistency of the results across analyses, a feature that an incomplete data set cannot fully provide

Disadvantages
• Biases might be greater after imputation
• The distribution of the data might be distorted
• The imputed data might be falsely treated as if they were a complete data set

Imputation Procedures or Methods (IM)
• Imputation Procedures or Methods (IM) are techniques applied to replace missing values.
• These techniques can be statistical or simply mathematical procedures, such as replacing an observation with a constant value (e.g. the mean).
• There are four IMs applied in this study, namely the Overall (Grand) Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).

Imputation Classes
• An Imputation Class (IC) is a stratification class that divides the data into groups before imputation takes place.
• ICs are most useful when they divide the data into homogeneous groups.
• The variables used to define ICs are called Matching Variables (MV).
• The observations with a response are called donors.
• The observations whose missing values will be replaced by a response are called recipients.
• Problems might arise if ICs are not formed with caution. One of them is deciding on the number of ICs.
• As the number of imputation classes increases, each class tends to contain fewer observations, which tends to inflate the variances within the class.
• As the number of imputation classes decreases, each class tends to contain more observations, which tends to inflate the aggregation bias within the class.

Imputation Process: Overall Mean Imputation (OMI)
• OMI simply replaces each missing value with the overall mean of the available (responding) units in the same population.
• The overall mean is given by $\bar{y}_r = \frac{1}{r}\sum_{i=1}^{r} y_i$, where $y_i$ is the value for the i-th of the r responding units.
• The IC is the entire population itself.
• In much of the related literature, ICs are not required and are therefore excluded when performing this method.

Advantages
• Can be applied to any data set
• Easier to use and generates results faster

Disadvantages
• The distribution of the data becomes distorted
• Produces large biases and variances because it does not allow variability in the imputation of missing values

Imputation Process: Hot Deck Imputation (HDI)
• HDI is the process by which missing observations are imputed by choosing a value from the set of available units.
• Values are selected at random (traditional hot deck), in some deterministic way (deterministic hot deck) or by some function of distance (nearest-neighbor hot deck).
• In performing this method, let Y be the variable that contains missing observations and let Xi be the i-th variable that has no missing observations. The procedure is as follows:
  1. Find a set of categorical X variables that are highly associated with Y. The X variables selected will be the matching variables in this imputation.
  2. Form a contingency table based on the X variables.
  3. If a case within a particular cell of the table has a missing value, select a case from the available units in that cell and impute its Y value for the missing value.
  4. The donor chosen and the recipient must have similar or exactly the same characteristics.

Advantages
• The shape of the distribution is preserved
• No out-of-range values are produced
• Imputed values are all actual values

Disadvantages
• All X variables must be categorical
• The distribution of the data may be distorted when one observation from the donor record is used multiple times
• The number of ICs must be limited to ensure that all missing values will have a donor

Imputation Process: General Regression Imputation
• The method of imputing missing values through least-squares regression is known as the regression imputation (RI) method.
• The y-variable for which imputations are needed is regressed on the auxiliary variables for the units providing a response on y.
• These auxiliary variables may be quantitative or qualitative, the latter being incorporated into the regression model by means of dummy variables.
• Accuracy and efficiency can be compared effectively when the other imputation procedures also use ICs.

Deterministic Regression
• Deterministic Regression Imputation (DRI) uses the predicted value from the model, given the values of the auxiliary variables (which contain no missing data), for each record with a missing response in the variable y.
• The general model based on imputation classes is in the form:

Stochastic Regression
• The predicted value from the deterministic regression model has undesirable distributional properties similar to those of the mean imputation method. To compensate, an estimated residual is added to the predicted value.
• The use of this predicted value plus some type of randomly chosen estimated residual is called the Stochastic Regression Imputation (SRI) method.
• The model for SRI is given by:
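
The two regression models can be written side by side. The following is only a sketch in generic notation, assuming a linear model with p auxiliary variables fitted within each imputation class; the symbols $b_j$, $x_{jk}$ and $\hat{e}_k$ are notation introduced here for illustration and may differ from the form used in the original slides.

```latex
% Deterministic regression imputation: the predicted value only
\hat{y}_k = b_0 + \sum_{j=1}^{p} b_j x_{jk}

% Stochastic regression imputation: the predicted value plus a
% randomly chosen estimated residual
\hat{y}_k = b_0 + \sum_{j=1}^{p} b_j x_{jk} + \hat{e}_k
```

Here k indexes a record with a missing y, and the coefficients are estimated from the records in the same class that responded on y.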

Imputation Process: General Regression Imputation (GRI)

Advantages
• Has the potential to produce closer imputed values if the model has a high R2

Disadvantages
• The distribution becomes too peaked and the variance is underestimated
• The operation is time-consuming, and it is often unrealistic to apply it to all items with missing values in a survey
• Infeasible values may be generated

METHODOLOGY

The Simulation Method
• The simulation method is a procedure that creates an artificial data set with missing observations to indicate which values will be imputed.
• The objective of the simulation is to allow an empirical comparison of the statistical properties of the estimates with imputed values.
• The algorithm for the simulation procedure is as follows:
  1. To get the number of observations to be set to missing for each nonresponse rate, the total number of observations was multiplied by the indicated nonresponse rate. The nonresponse rates used for this study were 10%, 20% and 30%.
  2. Each observation from the matrix of random numbers was assigned to both observations of the 1997 FIES second visit variables TOTIN2 and TOTEX2.
  3. The second visit observations for both variables were sorted in ascending order by their corresponding random number.
  4. The first 10% of the sorted second visit data for both variables were selected and set to missing. The same procedure was followed for the data sets containing 20% and 30% nonresponse.
  5. The missing observations were flagged. This was done to distinguish the imputed from the actual values during the data analysis.
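
The deletion steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the function names and the ten data values are hypothetical, and the sketch reads the procedure as deleting the same records for both TOTIN2 and TOTEX2.

```python
import random


def draw_missing_flags(n_records, rate, seed=None):
    """Return a list of booleans; True marks a record set to missing."""
    rng = random.Random(seed)
    n_missing = round(n_records * rate)                      # step 1: how many to delete
    keys = [rng.random() for _ in range(n_records)]          # step 2: one random number per record
    order = sorted(range(n_records), key=keys.__getitem__)   # step 3: sort records by random number
    deleted = set(order[:n_missing])                         # step 4: first rate*n records
    return [i in deleted for i in range(n_records)]          # step 5: flags for later analysis


def apply_flags(values, flags):
    """Replace flagged values with None, leaving the input list intact."""
    return [None if flag else v for v, flag in zip(values, flags)]


# Hypothetical second-visit values; the same flags are applied to both variables.
totin2 = [150.0, 90.0, 210.0, 75.0, 130.0, 60.0, 300.0, 110.0, 95.0, 180.0]
totex2 = [120.0, 85.0, 160.0, 70.0, 100.0, 58.0, 240.0, 98.0, 90.0, 140.0]
flags = draw_missing_flags(len(totin2), rate=0.30, seed=1997)
totin2_missing = apply_flags(totin2, flags)
totex2_missing = apply_flags(totex2, flags)
print(sum(flags), "records flagged as missing")
```

Because every record receives an independent random number, each record is equally likely to be deleted, which matches the MCAR assumption stated in the scope of the study.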

Formation of Imputation Classes
• The steps undertaken in the formation of imputation classes are as follows:
  1. The researchers identified the potential matching variables, which are the candidate variables that could have an association with the variables of interest (i.e. TOTIN2 and TOTEX2).
  2. A categorical variable from the first visit data must meet the following criteria to be selected as a candidate variable:
     a. The variable must be known
     b. The variable must be easy to measure
     c. The probability of missing observations is small
  3. For variables with many categories, the researchers reduced the number of categories, because having too many categories can increase heterogeneity and the bias of the estimates. This was done with the Recode function of the software Statistica.
  4. Measures of association were computed for the matching variables. The Chi-Squared test was applied first, to determine whether each candidate variable is significantly associated with the variables of interest.
  5. Other measures of association followed, namely the Phi coefficient, Cramer's V and the Contingency Coefficient. From these, the candidate variable with the greatest degree of association was chosen as the matching variable that groups the data into their respective imputation classes.
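
The association measures used in steps 4 and 5 all derive from the Chi-Squared statistic of a contingency table. The sketch below uses the textbook definitions (phi = sqrt(chi2/n), Cramer's V = sqrt(chi2/(n(k-1))) with k the smaller table dimension, contingency coefficient = sqrt(chi2/(chi2+n))); the function name and the 2x2 table are hypothetical, and the study itself used Statistica rather than this code.

```python
import math


def association_measures(table):
    """table: list of rows of observed counts.
    Returns the chi-squared statistic and three derived measures."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot))              # smaller table dimension
    return {
        "chi2": chi2,
        "phi": math.sqrt(chi2 / n),
        "cramers_v": math.sqrt(chi2 / (n * (k - 1))),
        "contingency_c": math.sqrt(chi2 / (chi2 + n)),
    }


# Hypothetical cross-tabulation of a candidate matching variable (rows)
# against a recoded variable of interest (columns).
measures = association_measures([[10, 20], [20, 10]])
print(measures)
```

For a 2x2 table the phi coefficient and Cramer's V coincide, so the two only tell different stories when a candidate variable has more than two categories.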

Overall Mean Imputation
  1. The overall mean of the first visit variables, TOTIN1 and TOTEX1, was computed. The equation used for the computation is $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, where $y_i$ is the first visit value of the i-th of the n observations.
  2. Using the nonresponse data sets generated, the missing observations for the second visit variables, TOTIN2 and TOTEX2, were replaced with the overall means of the first visit variables, TOTIN1 and TOTEX1.
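
The two steps above amount to a single substitution. A minimal sketch follows, assuming missing values are stored as None; the function name and the data are hypothetical.

```python
from statistics import mean


def overall_mean_impute(second_visit, first_visit):
    """Replace each missing second-visit value with the first-visit overall mean."""
    fill = mean(first_visit)                        # step 1: overall mean of the first visit
    return [fill if v is None else v for v in second_visit]   # step 2: substitute


# Hypothetical values for one variable across the two visits.
totin1 = [100.0, 140.0, 180.0]
totin2 = [110.0, None, 175.0]
print(overall_mean_impute(totin2, totin1))
```

Every missing value receives the same number, which is the source of the peaked distribution listed among the disadvantages of this method.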

Hot Deck Imputation
  1. The donor and recipient records for each imputation class and variable were first identified.
  2. The missing observations in TOTIN2 and TOTEX2 were assigned to their respective recipient records for each imputation class, while the observations in TOTIN1 and TOTEX1 were placed in their respective donor records for each imputation class.
  3. The values substituted for the missing observations were randomly chosen from the donor record for each imputation class.
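
A random hot deck within imputation classes can be sketched as follows. This is an illustration under assumptions: the function name, the class labels and the data are hypothetical, and the sketch treats every first visit value in a class as a possible donor, which is one reading of the steps above.

```python
import random


def hot_deck_impute(second_visit, first_visit, classes, seed=None):
    """Fill each missing second-visit value with a randomly chosen
    first-visit value from the same imputation class."""
    rng = random.Random(seed)
    donors = {}
    for value, cls in zip(first_visit, classes):    # build the donor record per class
        donors.setdefault(cls, []).append(value)
    return [
        rng.choice(donors[cls]) if v is None else v  # recipients draw a donor at random
        for v, cls in zip(second_visit, classes)
    ]


# Hypothetical data: two imputation classes, one recipient in each.
classes = ["low", "low", "high", "high"]
totex1 = [10.0, 12.0, 50.0, 55.0]
totex2 = [None, 13.0, 52.0, None]
print(hot_deck_impute(totex2, totex1, classes, seed=1))
```

Since the draw is random, repeating the imputation gives different data sets, which is why the study averages results for this method over many simulated sets.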

Regression Imputation
  1. A logarithmic transformation was applied to TOTIN1, TOTIN2, TOTEX1 and TOTEX2, because the income and expenditure variables are not normally distributed and because logarithmic transformations help correct nonlinearity in the regression equation.
  2. The regression equations were formed after the transformation. For this study, only one predictor variable was used, and the general formula for the regression equation is $\hat{y} = b_0 + b_1 x$, where y is the log-transformed second visit variable and x is the log-transformed predictor.
  3. For the stochastic regression, which involves the computation of the error term, the following steps were taken:
     a. A frequency distribution of the residuals was created.
     b. The class means of the frequency distribution were used to obtain the error terms for the regression equation.
  4. The regression equations were then validated. This diagnostic checking requires the following assumptions to be satisfied:
     a. Linearity
     b. Normality of error terms
     c. Independence of error terms
     d. Constancy of variance
  5. The missing observations were replaced by the predicted value using the corresponding regression equation.
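
The regression steps can be sketched with a simple least-squares fit on the log scale. Several choices here are assumptions made for illustration and are not stated in the source: the first visit value is used as the single predictor, imputed values are transformed back with the exponential, and the stochastic variant draws one observed residual at random instead of using class means of a residual frequency distribution. All names and data are hypothetical; in the study, a fit of this kind would be run separately within each imputation class.

```python
import math
import random


def fit_log_line(x, y):
    """Least-squares fit of ln(y) on ln(x); returns intercept, slope, residuals."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(lx, ly)]
    return intercept, slope, residuals


def regression_impute(second_visit, first_visit, stochastic=False, seed=None):
    """Impute missing second-visit values from first-visit values."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(first_visit, second_visit) if y is not None]
    b0, b1, residuals = fit_log_line([p[0] for p in pairs], [p[1] for p in pairs])
    imputed = []
    for x, y in zip(first_visit, second_visit):
        if y is None:
            prediction = b0 + b1 * math.log(x)      # deterministic part
            if stochastic:
                prediction += rng.choice(residuals)  # add a randomly drawn residual
            y = math.exp(prediction)                 # back to the original scale (assumption)
        imputed.append(y)
    return imputed


# Hypothetical data in which the second visit is exactly twice the first.
first = [1.0, 2.0, 4.0, 8.0]
second = [2.0, 4.0, None, 16.0]
dri = regression_impute(second, first)
sri = regression_impute(second, first, stochastic=True, seed=7)
print(dri, sri)
```

With a perfect fit the residuals are essentially zero, so both variants return nearly the same value; on real data the stochastic variant spreads the imputed values around the fitted line.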

Comparison of Imputation Techniques
• The imputation methods were compared using the following:
  a. Bias of the mean of the imputed data
  b. The distribution of the imputed vs. actual data
  c. Other measures for assessing the performance of the imputation methods

Bias of the mean of the imputed data
• The bias of the sample mean, one of the three criteria, was measured to check whether the imputation methods produce reliable estimates and to determine the effect of the varying nonresponse rates on their performance.
  1. The mean of the imputed data was computed. For hot deck and stochastic regression imputation, the average of the means of the 1000 simulated data sets was computed.
  2. The mean of the actual data was computed.
  3. The bias of the mean of the imputed data was computed as the difference between (1) and (2).

Comparing the Distribution of the Imputed vs. Actual Data
• A goodness-of-fit test was used to compare the distributions.
• The Kolmogorov-Smirnov Test was used.
• The Kolmogorov-Smirnov Test is a goodness-of-fit test concerned with the degree of agreement between the distribution of a set of sampled (observed) values and some specified theoretical distribution (Siegel, 1988).
  1. Income and expenditure deciles were created, based on the actual second visit FIES 1997 data.
  2. The obtained deciles were used as upper bounds of the frequency classes.
  3. A Frequency Distribution Table (FDT) was created for each trial.
  4. The FDT includes the Relative Cumulative Frequency (RCF) for both the imputed and the actual distribution. RCFs are computed by dividing the cumulative frequency by the total number of observations.
  5. The absolute value of the difference between the actual RCF and the imputed RCF was computed for each class.
  6. The test statistic for the Kolmogorov-Smirnov Test, which is the maximum deviation D, was determined by using this equation: $D = \max_k \left| RCF_{actual,k} - RCF_{imputed,k} \right|$, where k indexes the frequency classes.
  7. Since this is a large sample case and assuming a 0.05 level of significance, the critical value for this is computed using the formula:
  8. If D is less than the critical value, it is concluded that the imputed data maintain the same distribution as the actual data.
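
The comparison steps above can be sketched in Python. The critical value is an assumption: the sketch uses the common large-sample approximation 1.36 / sqrt(n) for a one-sample test at the 0.05 level, which may not be the exact formula used in the study. The function names, the class bounds and the data are hypothetical.

```python
import math


def relative_cumulative_freq(values, upper_bounds):
    """Share of values at or below each class upper bound."""
    n = len(values)
    return [sum(v <= b for v in values) / n for b in upper_bounds]


def ks_decile_test(actual, imputed, upper_bounds, coef=1.36):
    """Return (D, critical value, True if the distribution is maintained)."""
    rcf_actual = relative_cumulative_freq(actual, upper_bounds)
    rcf_imputed = relative_cumulative_freq(imputed, upper_bounds)
    d = max(abs(a - b) for a, b in zip(rcf_actual, rcf_imputed))   # maximum deviation
    critical = coef / math.sqrt(len(actual))   # assumed large-sample form, 0.05 level
    return d, critical, d < critical


# Hypothetical data: the class bounds stand in for the deciles of the actual data.
actual = list(range(1, 101))
bounds = [10, 20, 30, 40, 50, 60, 70, 80, 90, float("inf")]
same = ks_decile_test(actual, actual, bounds)
peaked = ks_decile_test(actual, [50] * 100, bounds)   # every value replaced by one constant
print(same, peaked)
```

The second call mimics the extreme case of a single imputed constant: the cumulative frequencies jump at one class, so D is large and the distribution is judged not to be maintained.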

Comparing the Distribution of the True (Deleted) vs. Imputed Values
• To provide additional information on the distribution of the imputed vs. actual data, the frequency distribution of the true (deleted) values was compared with that of the imputed values.
  1. Income and expenditure deciles were created. The deciles used in the previous test were also used here.
  2. The obtained deciles were used as upper bounds of the frequency classes.
  3. A Frequency Distribution Table (FDT) for both the imputed values and the actual values was generated.
• For the hot deck and stochastic regression imputations, which had 1000 sets each, the relative frequencies (RF) for each frequency class were averaged over the 1000 RFs.
• To illustrate how the imputation methods reconstructed or distorted the actual deleted values, bar charts were created for each nonresponse rate and variable of interest.

Other Measures in Assessing the Performance of the Imputation Methods
• Kalton used three criteria in assessing the performance of imputation methods, namely:
  a. Mean Deviation (MD)
  b. Mean Absolute Deviation (MAD)
  c. Root Mean Square Deviation (RMSD)
• In the equations below, $\hat{y}_i$ is the imputed value and $y_i$ the actual value of the i-th of the m deleted observations.
• The Mean Deviation (MD) measures the bias of the imputed values. This is represented by the equation: $MD = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)$
• The Mean Absolute Deviation (MAD) is a criterion for measuring the closeness with which the deleted values are reconstructed. This is represented by the equation: $MAD = \frac{1}{m}\sum_{i=1}^{m} |\hat{y}_i - y_i|$
• The Root Mean Square Deviation (RMSD) is the square root of the mean of the squared deviations between the imputed and actual observations. Like the MAD, it measures the closeness with which the deleted values are reconstructed. This is expressed as: $RMSD = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2}$
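
The three criteria are straightforward to compute once the deleted values and their imputed replacements are paired up. A minimal sketch follows, assuming the usual definitions in which each criterion averages over the deleted observations; the function name and the numbers are hypothetical.

```python
import math


def kalton_criteria(imputed, actual):
    """MD, MAD and RMSD of imputed values against the deleted actual values."""
    deviations = [i - a for i, a in zip(imputed, actual)]
    m = len(deviations)
    return {
        "MD": sum(deviations) / m,                            # signed: positive means overestimation
        "MAD": sum(abs(d) for d in deviations) / m,           # average size of the error
        "RMSD": math.sqrt(sum(d * d for d in deviations) / m),  # penalizes large errors more
    }


# Hypothetical imputed values and the actual values they replace.
criteria = kalton_criteria(imputed=[12.0, 8.0, 10.0], actual=[10.0, 10.0, 10.0])
print(criteria)
```

In this example the errors cancel in the MD but not in the MAD or RMSD, which shows why an unbiased method can still reconstruct individual values poorly.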

Determining the Best Imputation Method
• The four imputation methods were ranked to answer the primary objective of the study. The best method was selected independently for each variable of interest and nonresponse rate. The ranking covered the following:
  a. Nonresponse Bias
  b. Estimated Percentage of Correct Distribution
  c. Mean Deviation
  d. Mean Absolute Deviation
  e. Root Mean Square Deviation
  1. For each criterion mentioned above, the imputation methods were ranked on a scale of 1 to 4, with 1 indicating the best imputation method and 4 the worst.
  2. For each variable of interest and nonresponse rate, the rankings for each criterion were obtained and summarized.
  3. The rankings obtained by a particular imputation method across the criteria were added.
  4. The imputation method with the lowest total was considered the best imputation method for the respective variable of interest and nonresponse rate.
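
The ranking procedure can be sketched as a sum of ranks. The sketch assumes every criterion has been expressed so that a smaller value is better (for example, an absolute bias, or the negative of a percentage of correct distribution) and it does not handle ties. The method labels A to D and all scores are placeholders, not results from the study.

```python
def total_ranks(scores):
    """scores maps criterion -> {method: value}, where smaller is better.
    Returns {method: sum of ranks}, with rank 1 the best on each criterion."""
    totals = {}
    for per_method in scores.values():
        ordered = sorted(per_method, key=per_method.get)      # best method first
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return totals


# Hypothetical scores for four placeholder methods on two criteria.
scores = {
    "abs_bias": {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0},
    "MAD": {"A": 9.0, "B": 7.0, "C": 5.0, "D": 4.0},
}
totals = total_ranks(scores)
best = min(totals, key=totals.get)   # lowest total is considered the best method
print(totals, best)
```

Running this once per variable of interest and nonresponse rate mirrors the study's decision to select the best method independently for each combination.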

RESULTS AND DISCUSSION

Descriptive Statistics of Second Visit Data
• Average Total Spending in NCR: Php 102,389.80
• Average Total Earnings in NCR: Php 134,119.40
• TOTIN2 has a larger mean and standard deviation than TOTEX2.
• There is greater dispersion and variability in TOTIN2 than in TOTEX2.

Formation of Imputation Classes
• Three candidate matching variables were selected, namely Provincial Area Code (PROV), Recoded Education Status (CODES1) and Recoded Total Employed Household Members (CODEP1).
• The Chi-Squared test of association between the candidates and the variables of interest showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and CODEX1.
• The p-values for all the candidates were less than 0.0001, indicating that the association is highly significant.
• Only the candidate matching variable CODES1 reached at least 20% on all three measures of association.
• The other candidate matching variables showed weak association.

Descriptive Statistics of the Data Grouped into Imputation Classes (Table 5)
• The purpose of the descriptive statistics is to tell whether the selected matching variable decreases the variability of the observations.
• Variability was checked by comparing the standard deviation within each IC with the overall standard deviation of the variables of interest.
• The first IC produced a smaller spread than the other two ICs. The second and third ICs had larger standard deviations; however, this is compensated for by the lower standard deviation of the first IC.

Mean of the Simulated Data (Table 6)
• When the nonresponse rate increases, the mean of the deleted observations increases for both variables.
• When the nonresponse rate increases, the mean of the retained observations decreases for both variables.
• The results showed that as the number of missing values increases, the deviation between the means of the actual and retained data slowly increases.

Regression Model Adequacy (Table 7)
• All the regression equations used for this study satisfied the model validation assumptions of linearity, normality of error terms, independence of error terms and constancy of variance.
• The highest R2 measured is 93.2%, from the third imputation class of TOTEX2 under the 30% nonresponse rate.
• The lowest R2 measured is 70.3%, from the first imputation class of TOTIN2 under the 20% nonresponse rate.
• The third imputation class generated the highest R2 and the first imputation class the lowest R2 for all variables of interest and nonresponse rates.

Results for the Overall Mean Imputation
• For the nonresponse bias and variance:
  – As the nonresponse rate (NRR) increases, the bias decreases for both TOTIN2 and TOTEX2, with the bias for TOTEX2 decreasing more slowly than that for TOTIN2.
  – The variance is zero for all nonresponse rates and variables of interest because the population mean of the imputed data set is constant.
• For the distribution of the imputed data:
  – The OMI method failed to maintain the distribution of the actual data.
  – Imputing a single value for all the missing data distorts the distribution of the data, making it too peaked.
• For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation):
  – For TOTEX2, the values of all three criteria increase as the nonresponse rate increases.
  – For TOTIN2, the data with the 20% NRR have the highest values for all three criteria.
  – For TOTEX2, the Mean Deviation shows that the OMI underestimates the actual data at the 10% and 20% NRR, in contrast to the bias, which indicates overestimation; at the 30% NRR the reverse holds.
  – For TOTIN2, the Mean Deviation shows that the OMI overestimates the actual data at the 10% and 20% NRR, in contrast to the bias, which indicates underestimation.

Results for the Hot Deck Imputation
• For the nonresponse bias and variance:
  – The bias of the population mean of the imputed data decreases for both variables as the NRR increases.
  – For the variable TOTEX2 with the 30% NRR, the bias becomes negative.
  – The biases at the 10% and 20% NRR were better under the HDI than under the OMI.
  – The variance of the imputed data increases by more than one hundred percent as the nonresponse rate increases.
  – The data with the 10% NRR had the smallest spread of the population means, while the data with the largest number of imputations (30% NRR) had the largest spread.
• For the distribution of the imputed data:
  – For the variable TOTIN2 with 10% and 20% nonresponse, the HDI was able to maintain the distribution of the actual data.
  – For the variable TOTEX2, the HDI was able to maintain the distribution of the actual data for the 10% NRR.
  – For both TOTIN2 and TOTEX2 under the 30% NRR, the HDI failed to maintain the distribution of the actual data, with estimated percentages of correct distribution of 1% and 0% respectively.
• For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation):
  – For the variable TOTIN2, the MD shows that the values were underestimated for all NRR, and the MD decreases as the NRR increases. The MAD and RMSD values were unusually large compared to those of the OMI.
  – For the variable TOTEX2, the MD shows that the HDI overestimates the actual data at the 10% and 20% NRR, which is consistent with the bias, while at the 30% NRR the reverse holds. The MAD and RMSD showed that the HDI was better than the OMI.

Results for the Deterministic Regression Imputation (DRI)
• For the nonresponse bias and variance:
  – For both variables TOTIN2 and TOTEX2, the nonresponse bias shows that the DRI underestimated the actual values.
  – Unlike the results for the OMI and HDI, where the bias increases tremendously as the nonresponse rate increases, the increase in bias for this method is much slower.
  – For the variable TOTEX2, the DRI produced more biased estimates than the OMI and HDI for all NRR.
  – As with the OMI, the variance for this method is zero, since the population mean is constant due to a single simulation of the missing observations.
• For the distribution of the imputed data:
  – The DRI was able to maintain the distribution of the actual data for all NRR and variables of interest (TOTIN2 and TOTEX2).
  – This result is contrary to previous studies, which indicate that the DRI could give the same results as the OMI.
• For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation):
  – The MD for both TOTIN2 and TOTEX2 shows that the DRI underestimates the actual observations for all NRR. The underestimation is almost stable across NRR because the rate of change is very small compared to the OMI and HDI.
  – The MAD and RMSD for both TOTIN2 and TOTEX2 were smaller for all NRR, which shows that this method is better than the OMI and HDI.

Results for the Stochastic Regression Imputation (SRI) •

For the nonresponse bias and variance: –

For both TOTIN2 and TOTEX2, SRI showed no relationship between the nonresponse rate and the nonresponse bias of the estimated population mean. The biases fluctuate from one nonresponse rate to another. SRI also had the least bias at the 30% NRR.



Across all methods and nonresponse rates, there is a huge disparity between the variances of SRI and HDI: the HDI variances are almost ten times larger than those of SRI.

Results for the Stochastic Regression Imputation (SRI) •

For the distribution of the imputed data: –

For both TOTIN2 and TOTEX2, the SRI was able to maintain the distribution of the actual data for the 10% and 30% NRR.



For the 20% NRR of the variable TOTEX2, the SRI was better than the HDI in retaining the distribution of the actual data.

Results for the Stochastic Regression Imputation (SRI) •

For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation): –

For the variables TOTIN2 and TOTEX2, the MD fluctuates from one NRR to another. Even so, the SRI values are second only to those of DRI, and SRI performed better than OMI and HDI.



The same holds for the MAD and RMSD: SRI ranked second to DRI but outperformed OMI and HDI.

Distribution of the True Values vs. Imputed Values •





For the OMI, under all nonresponse rates and variables of interest, the tables illustrate the distortion of the distribution: because the missing values are replaced by a single value, they are concentrated in a single frequency class.

For the HDI method, in all nonresponse rates, most of the imputed observations clustered in the first frequency class for both variables TOTIN2 and TOTEX2.

Clustering under HDI also formed in the last frequency class for TOTEX2 at the 10% and 30% NRR, and in the second frequency class for TOTIN2 at all nonresponse rates.
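The OMI distortion described on this slide can be reproduced with a small sketch. The class width, the helper name and the values are hypothetical and chosen only for illustration. Every missing case receives the respondent mean, so all imputed cases fall into the one frequency class that contains that mean.

```python
import statistics
from collections import Counter

def class_index(value, lower, width):
    """Index of the frequency class [lower + k*width, lower + (k+1)*width)."""
    return int((value - lower) // width)

# Toy values for illustration only (not FIES data)
observed = [5.0, 12.0, 18.0, 25.0, 33.0, 47.0]   # respondents
n_missing = 4                                     # nonrespondents

# Overall mean imputation: every missing case gets the same value
mean_value = statistics.fmean(observed)
imputed = [mean_value] * n_missing

counts_before = Counter(class_index(v, 0.0, 10.0) for v in observed)
counts_after = Counter(class_index(v, 0.0, 10.0) for v in observed + imputed)
mean_class = class_index(mean_value, 0.0, 10.0)
```

Comparing `counts_before` with `counts_after` shows that only the class holding the mean grows, which is the spike the frequency tables display.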

Distribution of the True Values vs. Imputed Values •

For the regression imputations, both methods produced a more spread-out distribution in all NRRs and variables of interest, although some areas are underrepresented.



The failure to include a random residual term in deterministic regression resulted in a severe underrepresentation of the data, in particular in the first frequency class, under all NRRs and variables of interest.
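The role of the random residual can be sketched as below. This is an assumption-laden illustration, not the authors' procedure: it uses a simple linear regression, draws the residual from a normal distribution with the residual standard deviation of the respondents, and all names, the seed and the values are hypothetical. Nonrespondents who share the same first-visit value receive identical imputations under DRI but different ones under SRI.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def regression_impute(x_resp, y_resp, x_miss, rng=None):
    """rng=None gives deterministic imputation (DRI);
    passing a random.Random adds a normal residual (SRI)."""
    a, b = fit_line(x_resp, y_resp)
    if rng is None:
        return [a + b * x for x in x_miss]
    # Residual standard deviation with n - 2 degrees of freedom
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(x_resp, y_resp))
    sigma = math.sqrt(sse / (len(x_resp) - 2))
    return [a + b * x + rng.gauss(0.0, sigma) for x in x_miss]

# Toy values for illustration only (not FIES data)
x_resp = [10.0, 20.0, 30.0, 40.0, 50.0]
y_resp = [14.0, 18.0, 33.0, 37.0, 53.0]
x_miss = [25.0, 25.0, 25.0, 25.0]   # same first-visit value for all four

dri = regression_impute(x_resp, y_resp, x_miss)
sri = regression_impute(x_resp, y_resp, x_miss, random.Random(1997))

spread_dri = statistics.pvariance(dri)   # identical imputations
spread_sri = statistics.pvariance(sri)   # imputations scatter around the line
```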

Ranking the Imputation Methods •

For TOTIN2 under all NRR: SRI and DRI tied for first, followed by OMI and then HDI.



For TOTEX2 under 10% NRR: SRI ranked first, followed by HDI, DRI and OMI.



For TOTEX2 under 20% NRR: HDI and DRI tied for first, followed by SRI and OMI.



For TOTEX2 under 30% NRR: SRI ranked first, followed by DRI, OMI and HDI.

Ranking the Imputation Methods •

Overall: –

Using the 1997 FIES data, the best imputation method for this study is Stochastic Regression Imputation.



The worst imputation method for this study is the Hot Deck Imputation.

CONCLUSION

In Summary… •

Many considerations come into play before using any imputation method, such as the type of analysis and the type of estimator of interest that will suit the analyst's purpose.



Practical issues such as resources available, difficulty in programming, amount of time it takes to implement each method, and complexity of procedures should also be taken into consideration when selecting which imputation method to use.

In Summary… •

Results show that the choice of imputation method significantly affected the estimates of the actual data.



The bias and variance estimates of the imputed data varied considerably across imputation methods and, unexpectedly, HDI gave the highest estimates for the majority of the nonresponse rates and variables.



In terms of the distribution, both regression imputation methods, especially the DRI, retained the distribution of the data.

In Summary… •

In the other tests of accuracy and precision, namely, the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation, the different methods provided mixed results in all nonresponse rates.



After comparing and ranking the four methods, the SRI procedure is considered the best imputation method for this study. This can be attributed to the random residuals added to the deterministic imputation, which helped make the estimates less biased than those of its deterministic counterpart.

In Summary… •

Surprisingly and in contrast with most previous studies, the Hot Deck Imputation method was the least effective for this study. The selection of donors with replacement might be the cause of its poor performance.



Nevertheless, anyone faced with having to make decisions about imputation procedures will usually have to choose some compromise between what is technically effective and what is operationally expedient.
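The donor selection with replacement mentioned above as a possible cause of HDI's poor performance can be contrasted with selection without replacement in a short sketch. The function, its parameters, the seed and the donor values are hypothetical, and the study's actual hot deck procedure (imputation classes, matching variable) is not reproduced here.

```python
import random
from collections import Counter

def hot_deck(donors, n_missing, rng, replace=True):
    """Draw one donor value per missing case from a single imputation class.

    replace=True  : a donor may be used any number of times.
    replace=False : each donor is used at most once.
    """
    if not donors:
        raise ValueError("imputation class has no donors")
    if replace:
        return rng.choices(donors, k=n_missing)
    if n_missing > len(donors):
        raise ValueError("not enough donors to sample without replacement")
    return rng.sample(donors, k=n_missing)

# Toy donor values for illustration only (distinct, not FIES data)
donors = [110.0, 125.0, 140.0, 155.0, 170.0, 185.0]
rng = random.Random(1997)

with_replacement = hot_deck(donors, 5, rng, replace=True)
without_replacement = hot_deck(donors, 5, rng, replace=False)

# How many times each donor value was used
reuse_with = Counter(with_replacement)
reuse_without = Counter(without_replacement)
```

With replacement, a single donor can supply several recipients, so one unusual donor value may be duplicated. Without replacement, `reuse_without` never exceeds one use per donor.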

RECOMMENDATIONS

Recommendations for Further Research •

Explore the use and effectiveness of the Multiple Imputation Method.



Implement the use of the Jackknife Variance Estimation.



For selecting a matching variable, advanced modern statistical methods like the Chi-squared Automatic Interaction Detector (CHAID) analysis can be utilized for further studies.
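As a starting point for the jackknife recommendation, a plain delete-one jackknife for the variance of a mean might look like the sketch below. It is a simplified illustration under strong assumptions: simple random sampling, no survey weights, and imputed values treated as if they were observed. The function name and the values are hypothetical.

```python
import statistics

def jackknife_variance_of_mean(values):
    """Delete-one jackknife estimate of the variance of the sample mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(values)
    # Mean recomputed with the i-th observation left out
    leave_one_out = [(total - v) / (n - 1) for v in values]
    centre = statistics.fmean(leave_one_out)
    return (n - 1) / n * sum((t - centre) ** 2 for t in leave_one_out)

# Toy completed data set for illustration only (not FIES data)
completed = [12.0, 15.0, 9.0, 20.0, 14.0]
v_jack = jackknife_variance_of_mean(completed)

# For the mean, the jackknife reproduces the textbook estimate s^2 / n
v_textbook = statistics.variance(completed) / len(completed)
```

A version suited to imputed survey data would also need to account for the imputation step and the sample design within each replicate, which this sketch does not attempt.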

Recommendations for Further Research •

For regression imputation, instead of creating models for each imputation class, dummy variables should be inserted in the model.



Use a statistical package that can generate imputations faster and more easily, in order to save time otherwise lost to debugging and to computer crashes caused by memory overload.
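The dummy-variable recommendation for regression imputation can be sketched as a single model with one intercept per imputation class and a shared slope. This is one possible reading of the recommendation, not the authors' specification. The class labels, function name and values are hypothetical. The fit centers the data within each class, which gives the same slope as least squares with class dummy variables.

```python
import statistics
from collections import defaultdict

def fit_dummy_model(classes, x, y):
    """Fit y = a_c + b*x: one intercept per class, one common slope."""
    groups = defaultdict(list)
    for c, xi, yi in zip(classes, x, y):
        groups[c].append((xi, yi))
    # Class means of x and y
    means = {c: (statistics.fmean([p[0] for p in pts]),
                 statistics.fmean([p[1] for p in pts]))
             for c, pts in groups.items()}
    # Slope from within-class centered sums of squares and cross products
    sxy = sum((xi - means[c][0]) * (yi - means[c][1])
              for c, xi, yi in zip(classes, x, y))
    sxx = sum((xi - means[c][0]) ** 2 for c, xi in zip(classes, x))
    slope = sxy / sxx
    intercepts = {c: my - slope * mx for c, (mx, my) in means.items()}
    return intercepts, slope

# Toy respondents in two hypothetical imputation classes (not FIES data)
classes = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [12.0, 14.0, 16.0, 22.0, 24.0, 26.0]

intercepts, slope = fit_dummy_model(classes, x, y)

# Deterministic imputation for a nonrespondent in class B with x = 4
imputed_value = intercepts["B"] + slope * 4.0
```

Compared with fitting a separate model in every class, the pooled slope is estimated from all respondents, which may help when some classes contain few cases.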

THANK YOU VERY MUCH!!!

Diana Camille B. Cortes James Edison T. Pangan
