IMPUTATION PROCEDURES FOR PARTIAL NONRESPONSE The Case of the Family Income and Expenditure Survey (FIES)
Diana Camille B. Cortes James Edison T. Pangan
THE PROBLEM AND ITS BACKGROUND
Statement of the Problem • Which imputation technique is the most appropriate for the FIES data? • Will the imputation methods generate less biased estimates in contrast to ignoring nonresponse? • How do varying nonresponse rates affect the results for each imputation method?
Objectives of the Study • To compare four imputation techniques, namely Overall Mean Imputation, Hot Deck Imputation, Deterministic Regression and Stochastic Regression, on their efficiency and their ability to recapture deleted values, by generating the missing values in the FIES 1997 second visit data from the first visit data of the same survey. • To investigate the effect of varying rates of missing observations, particularly 10%, 20% and 30% nonresponse rates, on the precision of the estimates.
Scope and Limitations • The Family Income and Expenditure Survey (FIES) 1997 was used to tackle the problem of nonresponse and to examine the impact of the different imputation methods. • The researchers focused on using the FIES 1997 data on the first visit to impute the partial nonresponse that is present on the second visit.
Scope and Limitations • This paper also assumes that the first visit data is complete and the pattern of nonresponse follows the Missing Completely at Random (MCAR) case. • Only four imputation methods will be applied for this paper namely: Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).
Scope and Limitations • This paper only covered the partial nonresponse occurring in the National Capital Region (NCR) • The variables imputed for this study would be the Total Income (TOTIN2) and Total Expenditures (TOTEX2) in the second visit of the FIES 1997 data
Scope and Limitations • The evaluation of the efficacy and appropriateness of the four imputation methods was limited to the following: 1. Nonresponse bias and variances of the imputed data 2. Assessment of the distributions of the imputed vs. the actual data 3. The criteria set in the report Compensating for Missing Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation.
CONCEPTUAL FRAMEWORK
Nonresponse Bias • Nonresponse bias becomes a burden when missing data are ignored, deleted or discarded in the analysis. • To keep the concept clear, this section covers only the general idea and sets aside the types and patterns of nonresponse.
Nonresponse Bias • Consider a simple random sample (SRS) of size n drawn from a population of size N, where the variable y contains missing data. • The population is divided into two groups, namely, the respondents and nonrespondents.
Nonresponse Bias • Let R and M be the numbers of respondents and nonrespondents in the population, with N = R + M. • Let r and m, with n = r + m, be the corresponding sample quantities.
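Nonresponse Bias • The proportions of respondents and nonrespondents in the population are given by:
$\bar{R} = R/N$ and $\bar{M} = M/N$
Nonresponse Bias • The proportions of respondents and nonrespondents in the sample are given by:
$\bar{r} = r/n$ and $\bar{m} = m/n$
Nonresponse Bias • The population total and mean are given by: $Y = R\bar{Y}_r + M\bar{Y}_m$ and $\bar{Y} = \bar{R}\bar{Y}_r + \bar{M}\bar{Y}_m$, where $\bar{Y}_r$ and $\bar{Y}_m$ are the population means of the respondents and nonrespondents.
Nonresponse Bias • The corresponding sample total and mean are given by: $y = r\bar{y}_r + m\bar{y}_m$ and $\bar{y} = \bar{r}\bar{y}_r + \bar{m}\bar{y}_m$
Nonresponse Bias • If no compensation is made, the respondent sample mean $\bar{y}_r$ is used to estimate the population mean. Its bias is given by: $B(\bar{y}_r) = E(\bar{y}_r) - \bar{Y}$
Nonresponse Bias The expectation of the respondent sample mean can be obtained in two stages, first conditional on fixed r and then over different values of r, that is: $E(\bar{y}_r) = E_1[E_2(\bar{y}_r \mid r)]$
where E2 is the conditional expectation for fixed r and E1 is the expectation over different values of r
Nonresponse Bias The expectation of the respondent sample mean is given by: $E(\bar{y}_r) = E_1(\bar{Y}_r) = \bar{Y}_r$
Nonresponse Bias Hence, the bias of the respondent sample mean is given by: $B(\bar{y}_r) = \bar{Y}_r - \bar{Y} = \bar{M}(\bar{Y}_r - \bar{Y}_m)$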
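As a numerical check on this derivation, the sketch below computes the bias of the respondent mean in two ways: directly, as the respondent mean minus the overall population mean, and as the nonrespondent proportion times the difference between the respondent and nonrespondent means. The function name and the ten-unit population are hypothetical and serve only to illustrate the identity.

```python
from statistics import mean

def respondent_mean_bias(respondents, nonrespondents):
    """Return the bias of the respondent mean computed two ways.

    direct   : respondent mean minus the full population mean
    identity : (M / N) * (respondent mean - nonrespondent mean)
    """
    population = list(respondents) + list(nonrespondents)
    direct = mean(respondents) - mean(population)
    m_bar = len(nonrespondents) / len(population)  # proportion of nonrespondents
    identity = m_bar * (mean(respondents) - mean(nonrespondents))
    return direct, identity

# Hypothetical population: 6 respondents and 4 nonrespondents.
resp = [100.0, 120.0, 110.0, 130.0, 90.0, 110.0]
nonresp = [150.0, 170.0, 160.0, 180.0]
direct, identity = respondent_mean_bias(resp, nonresp)
print(direct, identity)
```

Both values agree, and the bias vanishes only when the respondent and nonrespondent means coincide or when there are no nonrespondents.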
Patterns of Nonresponse • The assumed pattern of nonresponse is essential because it influences how missing data are handled, particularly in the implementation of the imputation procedures used in this study. • There are three patterns of nonresponse: Missing Completely At Random (MCAR), Missing At Random (MAR) and NonIgnorable Nonresponse (NIN)
Patterns of Nonresponse • Missing Completely At Random (MCAR) occurs if the probability of missing data on Y is unrelated to Y itself or to other variables in the data. Data with this kind of nonresponse have the highest degree of randomness and show no underlying reasons for missing observations that could lead to biased research findings.
Patterns of Nonresponse • Missing At Random (MAR) occurs if the probability of missing data on Y is unrelated to Y itself after controlling for other variables in the data. This means that the likelihood of a case having incomplete information on a variable can be explained by other variables in the data set. Similar to the MCAR assumption, data of this type also show some randomness.
Patterns of Nonresponse • NonIgnorable Nonresponse (NIN) occurs if the probability of missing data on Y is related to the value of Y and possibly to some other variable Z even if other variables are controlled in the analysis. Unlike MCAR and MAR, this type does not exhibit randomness but rather systematic, nonrandom factors underlying the occurrence of the missing values that are not apparent or measured.
Types of Nonresponse • Noncoverage (NC) denotes the failure to include some units of the survey population in the sampling frame. • NC is not usually regarded as a type of nonresponse; Kalton (1983), however, loosely treats it as one for convenience. • This occurs in surveys where the listing of units is incomplete, so that some units are not covered by the sampling frame.
Types of Nonresponse • Unit (Total) Nonresponse (UN) takes place when no information is collected from a sampling unit.
• Reasons include: 1. Failure to contact the respondent 2. Inability to cooperate 3. Questionnaires are lost
Types of Nonresponse • Item Nonresponse (IN) is the failure to collect complete information because some of the questions are not answered.
• Reasons include: 1. The informant lacks the necessary information 2. Failure to make the effort of retrieving it from memory or by consulting records 3. Refusal to answer because of the sensitivity of the questions 4. Failure to record an answer 5. Responses are subsequently rejected at an edit check
Types of Nonresponse • Partial Nonresponse (PN) is the failure to collect large sets of items for a responding unit.
• Reasons include:
1. Failure to provide information in one or more waves of a panel survey or in later phases of a multi-phase survey data collection procedure
2. Later items in the questionnaire are lost after a telephone interview is broken off
3. Unavailability of the data after all possible checking and follow-up
4. Inconsistency of the responses that do not satisfy natural or reasonable constraints known as edits
5. Similar causes to total nonresponse
Imputation Procedures • Imputation is the process of replacing a missing value, through available statistical and mathematical techniques, with a value that is considered to be a reasonable substitute for the missing information.
• Imputation is listed as one of the many procedures that can be used to deal with nonresponse in order to generate unbiased results.
Imputation Procedures • Listed below are the advantages (benefits) and disadvantages (dangers) of using imputation.
Advantages
• Helps reduce biases in survey estimates
• Makes analysis easier and presentation simpler
• Ensures consistency of the results across analyses, a feature that an incomplete data set cannot fully provide
Disadvantages
• Biases might be greater after using imputation
• The distribution of the data might be distorted
• The imputed data may be falsely treated as if it were a complete data set
Imputation Procedures • Imputation Procedures or Methods (IM) are techniques applied to replace missing values.
• These techniques can implement either statistical or simply mathematical procedures, like replacing an observation by a constant value (e.g. the mean).
• There are four IMs applied in this study, namely, the Overall (Grand) Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Regression Imputation (DRI) and Stochastic Regression Imputation (SRI).
Imputation Procedures
• An Imputation Class (IC) is a stratification class that divides the data into groups before imputation takes place.
• Formation of ICs can be very useful if the data are divided into homogeneous groups.
• Variables used to define ICs are called Matching Variables (MV).
• The group of observations with a response are called donors.
• The group of observations that will be substituted by a response are called recipients.
Imputation Procedures • Problems might arise if ICs are not formed with caution; one of them is deciding on the number of ICs.
• As the number of imputation classes increases, each class contains fewer observations, which tends to inflate the variances within the classes.
• As the number of imputation classes decreases, each class contains more observations, which tends to inflate the aggregation bias within the classes.
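Imputation Procedures Overall Mean Imputation (OMI) • OMI simply replaces each missing value by the overall mean of the available (responding) units in the same population.
• The overall mean is given by: $\bar{y}_r = \frac{1}{r}\sum_{i=1}^{r} y_i$, where r is the number of responding units and $y_i$ is the value for the ith responding unit.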
Imputation Procedures Overall Mean Imputation (OMI) • The IC does not need to be homogeneous.
• The IC is the entire population itself.
• In much of the related literature, an IC is not a requirement and is therefore excluded in performing this method.
Imputation Procedures Overall Mean Imputation (OMI)
Advantages
• Can be applied to any data set
• Easier to use and generates results faster
Disadvantages
• Distribution of the data becomes distorted
• Produces large biases and variances because it does not allow variability in the imputation of missing values.
Imputation Procedures Hot Deck Imputation (HDI) •
HDI method is the process by which the missing observations are imputed by choosing a value from the set of available units.
•
Values are either selected at random (traditional hot deck), in some deterministic way (deterministic hot deck) or in some function of distance (nearest-neighbor hot deck).
Imputation Procedures Hot Deck Imputation (HDI) • In performing this method, let Y be the variable that contains missing observations and let Xi be the ith variable that has no missing observations. The procedure is as follows:
1. Find a set of categorical X variables that are highly associated with Y. The X variables selected will be the matching variables in this imputation.
2. Form a contingency table based on the X variables.
Imputation Procedures Hot Deck Imputation (HDI)
3. If there are cases with missing values within a particular cell of the table, select a case from the set of available units for the Y variable and impute the chosen Y value to the missing value.
4. The donor chosen for a missing value must have similar or exactly the same characteristics as the recipient.
Imputation Procedures Hot Deck Imputation (HDI)
Advantages
• The shape of the distribution is preserved
• Nonexistence of out-of-range values
• Imputed values are all actual values
Disadvantages
• All X variables must be categorical
• Distortion of the distribution of the data is possible due to the multiple use of one observation from the donor record.
• The number of ICs must be limited to ensure that all missing values will have a donor.
Imputation Procedures General Regression Imputation •
The method of imputing missing values via the least-squares regression is known to be the regression imputation (RI) method.
•
This technique is seen as the generalization of the group mean imputation (GMI), another type of mean imputation other than OMI that uses imputation classes.
•
There are many ways of creating a regression model. Following Kalton (1983), the variable y for which imputations are needed is regressed on the matching variables, using the units that provide a response on y.
Imputation Procedures General Regression Imputation •
Missing values may be imputed in two basic ways: a. Use the predicted value from the model, given the values of the matching variables for the record with a missing response. This is known as Deterministic Regression Imputation (DRI). b. Use the predicted value plus some type of randomly chosen residual. This is known as Stochastic Regression Imputation (SRI).
Imputation Procedures General Regression Imputation •
In comparing the accuracy and efficiency of this method, it will be helpful if the methods to be compared have the same imputation class.
•
The general model based on imputation classes is in the form:
Imputation Procedures General Regression Imputation Stochastic Regression Imputation
• The predicted value from the model corresponds to mean value imputation in the restricted model, which has undesirable distributional properties. A good case therefore exists for including the estimated residual.
Imputation Procedures General Regression Imputation (GRI)
Deterministic Regression Imputation
• The distribution becomes too peaked and the variance is underestimated.
Stochastic Regression Imputation
• Even if the deterministic predicted value is feasible, the stochastic value need not be.
• After adding the residual, infeasible values can result.
Imputation Procedures General Regression Imputation (GRI)
Advantages
• Has the potential to produce closer imputed values
Disadvantages
• Time-consuming operation, and it is often unrealistic to consider its application for all items with missing values in a survey.
• A high coefficient of determination is required to make the method effective.
METHODOLOGY
The Simulation Method • The simulation method is a procedure that creates an artificial data set with missing observations to indicate which values will be imputed. • The objective of the simulation is to make an empirical comparison of the statistical properties of the estimates with imputed values.
The Simulation Method • The algorithm for the simulation procedure is as follows: 1. To get the number of observations to be set to missing for each nonresponse rate, the total number of observations was multiplied by the indicated nonresponse rate. The nonresponse rates used for this study were 10%, 20% and 30%.
The Simulation Method 2. Each random number from the matrix of random numbers was assigned to the corresponding observation of both 1997 FIES second visit variables, TOTIN2 and TOTEX2. 3. The second visit observations for both variables were sorted in ascending order of their corresponding random numbers.
The Simulation Method 4. The first 10% of the sorted second visit data for both variables were selected and set to missing. The same procedure was followed for the data sets with 20% and 30% nonresponse rates. 5. The missing observations were flagged to distinguish the imputed from the actual values during the data analysis.
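The deletion steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the seed and the 20-record data set are hypothetical, and Python's pseudo-random generator stands in for the matrix of random numbers used in the study.

```python
import random

def simulate_nonresponse(n_obs, rate, seed=0):
    """Flag records as missing under an MCAR-style random deletion.

    One random number is attached to each record, the records are sorted
    by that number, and the first n_obs * rate of them are flagged.
    Returns a list of booleans (True = set to missing).
    """
    rng = random.Random(seed)
    n_missing = int(round(n_obs * rate))
    keys = [rng.random() for _ in range(n_obs)]          # one random number per record
    order = sorted(range(n_obs), key=lambda i: keys[i])  # ascending by random number
    to_delete = set(order[:n_missing])
    return [i in to_delete for i in range(n_obs)]

# Hypothetical data set of 20 records; the same flags would be applied
# to both second visit variables of a record.
flags_10 = simulate_nonresponse(20, 0.10, seed=1)
flags_20 = simulate_nonresponse(20, 0.20, seed=1)
flags_30 = simulate_nonresponse(20, 0.30, seed=1)
print(sum(flags_10), sum(flags_20), sum(flags_30))
```

Because the same seed produces the same ordering, the records deleted at 10% are a subset of those deleted at 20% and 30% in this sketch; whether the study nested its data sets this way is not stated.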
Formation of Imputation Classes •
The steps undertaken in the formation of imputation classes are as follows: 1. The researchers identified the potential matching variables, which are the candidate variables that could have an association with the variables of interest (i.e. TOTIN and TOTEX).
Formation of Imputation Classes 2. The categorical variables from the first visit data must fit into the criteria in order to be selected as a candidate variable. The criteria are as follows: a. The variable must be known b. The variable must be easy to measure c. The probability of missing observations is small
Formation of Imputation Classes 3. For the variables that have many categories, the researchers reduced the number of categories, because having too many categories can increase heterogeneity and the bias of the estimates. This was done with the Recode function of the software Statistica.
Formation of Imputation Classes 4. Measures of association were computed for the candidate matching variables. The Chi-Squared test was applied first, to determine whether each candidate variable is significantly associated with the variables of interest.
5. Other measures, namely the Phi coefficient, Cramér's V and the contingency coefficient, were then computed. From these, the candidate variable with the greatest degree of association was chosen as the matching variable that groups the data into their respective imputation classes.
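The association measures named in these steps can all be derived from the chi-squared statistic of a contingency table. The sketch below uses the standard textbook definitions and a hypothetical 2 x 2 table; it assumes every expected cell count is positive and is not the Statistica procedure used in the study.

```python
import math

def chi_square(table):
    """Pearson chi-squared statistic and total count for a table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def association_measures(table):
    """Phi coefficient, Cramer's V and the contingency coefficient."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))  # smaller of the row and column counts
    return {
        "phi": math.sqrt(chi2 / n),
        "cramers_v": math.sqrt(chi2 / (n * (k - 1))),
        "contingency_c": math.sqrt(chi2 / (chi2 + n)),
    }

# Hypothetical cross-tabulation of a candidate matching variable (rows)
# against a recoded variable of interest (columns).
table = [[30, 10],
         [10, 30]]
measures = association_measures(table)
print(measures)
```

For a 2 x 2 table the Phi coefficient and Cramér's V coincide, which is why both equal 0.5 here.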
Overall Mean Imputation 1. The overall mean for the variables of interest, TOTIN and TOTEX, for the first visit was computed. The equation used for the computation is:
Overall Mean Imputation 2. Using the nonresponse data sets generated, the missing observations for the second visit variables TOTIN and TOTEX were replaced with the overall means of the first visit TOTIN and TOTEX.
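A minimal sketch of these two steps follows, assuming missing second visit values are marked with None. The function name and the three-record data set are hypothetical.

```python
def overall_mean_impute(second_visit, first_visit):
    """Replace missing (None) second visit values by the first visit overall mean."""
    fill = sum(first_visit) / len(first_visit)  # overall mean of the first visit
    return [fill if value is None else value for value in second_visit]

# Hypothetical income values for three households.
first = [100.0, 200.0, 300.0]
second = [110.0, None, 290.0]
imputed = overall_mean_impute(second, first)
print(imputed)
```

Every missing value receives the same constant, which is the reason this method concentrates imputed values in a single frequency class.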
Hot Deck Imputation 1. The donor and recipient records for each imputation class and variable were first identified. 2. The missing observations of the second visit were assigned to their respective recipient records for each imputation class, while the first visit observations were placed in their respective donor records for each imputation class. 3. The values that were substituted for the missing observations were randomly chosen from the donor record for each imputation class.
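The random hot deck described here can be sketched as follows, assuming missing second visit values are marked with None and that each record carries an imputation class label. The function name, class labels and values are hypothetical.

```python
import random

def hot_deck_impute(second_visit, first_visit, classes, seed=0):
    """Random hot deck within imputation classes.

    Donor pools are built from first visit values, grouped by class.
    Each missing (None) second visit value is replaced by a randomly
    chosen donor value from the same class.
    """
    rng = random.Random(seed)
    donors = {}
    for value, ic in zip(first_visit, classes):
        donors.setdefault(ic, []).append(value)
    return [rng.choice(donors[ic]) if value is None else value
            for value, ic in zip(second_visit, classes)]

# Hypothetical records in two imputation classes, "a" and "b".
first = [10.0, 12.0, 50.0, 55.0]
second = [11.0, None, None, 54.0]
classes = ["a", "a", "b", "b"]
imputed = hot_deck_impute(second, first, classes, seed=3)
print(imputed)
```

Donors are drawn with replacement in this sketch, so one donor value may be used several times, which is the source of the distortion noted among the disadvantages of the method.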
Regression Imputation 1. A logarithmic transformation was applied to the first and second visit values of the variables TOTIN and TOTEX. The income and expenditure variables are not normally distributed, and the logarithmic transformation also helps correct non-linearity in the regression equation.
Regression Imputation 2. The regression equation was formed after the transformation. For this study, only one predictor variable was used and the general formula for the regression equation is:
Regression Imputation 3. For the stochastic regression, which involves the computation of the error term, the following steps were made: a. A frequency distribution of the residuals was created. b. The class means of the frequency distribution were used to obtain the error terms for the regression equation.
Regression Imputation 4. Model validation of the regression equations followed. This diagnostic checking requires the following assumptions to be satisfied: a. Linearity b. Normality of error terms c. Independence of error terms d. Constancy of variance
5. The missing observations were replaced by the predicted value using the corresponding regression equation.
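The sketch below illustrates deterministic and stochastic regression imputation on log-transformed values with a single predictor. Several choices are assumptions rather than the study's documented procedure: the predictor is taken to be the log of the first visit value, the stochastic residual is drawn at random from the observed residuals (a simplification of the class-mean approach in step 3), and the prediction is transformed back with the exponential function. All names and values are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def regression_impute(first_visit, second_visit, stochastic=False, seed=0):
    """Impute missing (None) second visit values from first visit values."""
    rng = random.Random(seed)
    # Fit on the log scale using complete pairs only.
    pairs = [(math.log(x), math.log(y))
             for x, y in zip(first_visit, second_visit) if y is not None]
    b0, b1 = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    residuals = [ly - (b0 + b1 * lx) for lx, ly in pairs]
    result = []
    for x, y in zip(first_visit, second_visit):
        if y is not None:
            result.append(y)
            continue
        prediction = b0 + b1 * math.log(x)
        if stochastic:
            prediction += rng.choice(residuals)  # assumed residual draw
        result.append(math.exp(prediction))      # assumed back-transformation
    return result

# Hypothetical data in which the second visit value is exactly twice the first.
first = [10.0, 20.0, 40.0, 80.0]
second = [20.0, 40.0, None, 160.0]
dri = regression_impute(first, second)
sri = regression_impute(first, second, stochastic=True, seed=1)
print(dri[2], sri[2])
```

With an exact relationship the residuals are essentially zero, so both variants return about 80 for the missing value; on real data the stochastic variant spreads the imputed values around the fitted line.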
Comparison of Imputation Techniques •
The imputation methods were compared using the following: a. Bias and Variance of the Estimates b. The Distribution of Imputed vs. Actual Data c. Kalton’s Criteria in Assessing the Performance of Imputation Techniques
Bias and Variance of the Estimates • The variances of the actual data and the imputed data were obtained. • These variances were compared to assess the ability of the imputation techniques to mirror the actual data and to determine the effect of the varying nonresponse rates on the performance of the imputation techniques.
Bias and Variance of the Estimates 1. The mean of the imputed data was computed. For hot deck and stochastic regression imputation, the average of the means of the 1000 simulated data sets was computed. 2. The mean of the actual data was computed. 3. The bias of the mean of the imputed data was computed as the difference between (1) and (2).
Bias and Variance of the Estimates •
For the overall mean and deterministic regression imputation, the variance is zero. On the other hand, for hot deck and stochastic regression imputation, the variance is given by:
and
Comparing the Distribution of the Imputed vs. Actual Data • A goodness-of-fit test was utilized for the comparison of the distributions. • The Kolmogorov-Smirnov Test was used. • The Kolmogorov-Smirnov Test is a goodness-of-fit test concerned with the degree of agreement between the distribution of a set of sampled (observed) values and some specified theoretical distribution (Siegel, 1988).
Comparing the Distribution of the Imputed vs. Actual Data 1. Income and Expenditure deciles were created. The creation of these deciles was based on the second visit actual FIES 1997 data. 2. The obtained deciles were used as upper bounds of the frequency classes.
Comparing the Distribution of the Imputed vs. Actual Data 3. A Frequency Distribution Table (FDT) for each trial was created. 4. The FDT includes the Relative Cumulative Frequency (RCF) for both the imputed and actual distribution. RCFs are computed by dividing the cumulative frequency by the total number of observations.
Comparing the Distribution of the Imputed vs. Actual Data 5. The absolute value of the difference between the actual data RCF and the imputed RCF was computed. 6. The test statistic for the Kolmogorov-Smirnov Test, the maximum deviation D, was determined by using this equation: $D = \max_k |RCF_{actual}(k) - RCF_{imputed}(k)|$, where k indexes the frequency classes.
Comparing the Distribution of the Imputed vs. Actual Data 7. Since this is a large sample case and assuming a 0.05 level of significance, the critical value for this is computed using the formula:
8. If D is less than the critical value, then the conclusion that the imputed data maintains the same distribution of the actual data follows.
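The comparison steps above can be sketched as below. The class bounds and data are hypothetical. The critical value function uses 1.36 divided by the square root of n, the standard large-sample value for the one-sample test at the 0.05 level; that this is the formula the study applied is an assumption, since the formula itself is not reproduced in the text.

```python
import math
from bisect import bisect_left

def relative_cumulative_freq(values, upper_bounds):
    """Relative cumulative frequencies over classes defined by upper bounds."""
    counts = [0] * len(upper_bounds)
    for v in values:
        # First class whose upper bound is >= v; overflow goes to the last class.
        k = min(bisect_left(upper_bounds, v), len(upper_bounds) - 1)
        counts[k] += 1
    rcf, running = [], 0
    for c in counts:
        running += c
        rcf.append(running / len(values))
    return rcf

def ks_statistic(actual, imputed, upper_bounds):
    """Maximum absolute difference between the two RCF columns."""
    rcf_actual = relative_cumulative_freq(actual, upper_bounds)
    rcf_imputed = relative_cumulative_freq(imputed, upper_bounds)
    return max(abs(a - b) for a, b in zip(rcf_actual, rcf_imputed))

def ks_critical_value(n, coefficient=1.36):
    """Assumed large-sample critical value at the 0.05 level."""
    return coefficient / math.sqrt(n)

# Hypothetical data: actual values 1..100, with deciles as class upper bounds.
actual = [float(v) for v in range(1, 101)]
bounds = [10.0 * k for k in range(1, 11)]
all_fifty = [50.0] * 100  # extreme case: every value replaced by one constant

d_same = ks_statistic(actual, actual, bounds)
d_const = ks_statistic(actual, all_fifty, bounds)
critical = ks_critical_value(len(actual))
print(d_same, d_const, critical)
```

In this extreme example D is far above the critical value, so the constant-valued data set would be judged not to maintain the actual distribution.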
Comparing the Distribution of the Imputed vs. Actual Data • To provide additional information on the distribution of the imputed vs. actual data, the frequency distribution of the true (deleted) values was compared with that of the imputed values.
1. Income and Expenditure deciles were created. The deciles used in the previous test were also used here. 2. The obtained deciles were used as upper bounds of the frequency classes.
Comparing the Distribution of the Imputed vs. Actual Data 3. A Frequency Distribution Table (FDT) for both the imputed values and actual values was generated. 4. For the hot deck and stochastic regression imputation, which had 1000 sets, the relative frequencies (RF) for each frequency class were averaged over the 1000 RFs.
Kalton’s Criteria in Assessing the Performance of the Imputation Methods •
Kalton used three criteria in assessing the performance of the imputation methods namely: a. Mean Deviation (MD) b. Mean Absolute Deviation (MAD) c. Root Mean Square Deviation (RMSD)
Kalton’s Criteria in Assessing the Performance of the Imputation Methods • The Mean Deviation (MD) measures the bias of the imputed values. This is represented by the equation: $MD = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$, where $\hat{y}_i$ is the imputed value, $y_i$ is the actual (deleted) value and m is the number of imputed values.
Kalton’s Criteria in Assessing the Performance of the Imputation Methods • The Mean Absolute Deviation (MAD) is a criterion for measuring the closeness with which the deleted values are reconstructed. This is represented by the equation: $MAD = \frac{1}{m}\sum_{i=1}^{m}|\hat{y}_i - y_i|$
Kalton’s Criteria in Assessing the Performance of the Imputation Methods • The Root Mean Square Deviation (RMSD) is the square root of the mean of the squared deviations between the imputed and actual observations. Like the MAD, it measures the closeness with which the deleted values are reconstructed. This is expressed as: $RMSD = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$
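The three criteria can be computed together from the pairs of imputed and actual (deleted) values. The sketch assumes deviations are taken as imputed minus actual, so that a negative MD indicates underestimation; the function name and the three pairs of values are hypothetical.

```python
import math

def kalton_criteria(imputed, actual):
    """Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation."""
    deviations = [i - a for i, a in zip(imputed, actual)]  # imputed minus actual
    m = len(deviations)
    md = sum(deviations) / m
    mad = sum(abs(d) for d in deviations) / m
    rmsd = math.sqrt(sum(d * d for d in deviations) / m)
    return md, mad, rmsd

# Hypothetical imputed values and the actual values they replace.
imputed = [10.0, 12.0, 8.0]
actual = [11.0, 10.0, 8.0]
md, mad, rmsd = kalton_criteria(imputed, actual)
print(md, mad, rmsd)
```

Positive and negative deviations cancel in the MD but not in the MAD or RMSD, which is why a method can show a small MD while still reconstructing individual values poorly.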
Determining the Best Imputation Method • The four imputation methods were ranked to answer the primary objective of the study. The best method was selected separately for each variable of interest and nonresponse rate. The ranking covered the following: a. Nonresponse Bias b. Estimated Percentage of Correct Distribution c. Mean Deviation d. Mean Absolute Deviation e. Root Mean Square Deviation
Determining the Best Imputation Method 1. For each criterion mentioned above, the imputation methods were ranked on a scale of 1 to 4, with 1 indicating the best imputation method and 4 the worst. 2. For each variable of interest and nonresponse rate, these rankings for each criterion were obtained and summarized.
Determining the Best Imputation Method 3. The rankings obtained by a particular imputation method across the criteria were added. 4. The imputation method with the lowest total was considered the best imputation method for the respective variable of interest and nonresponse rate.
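The rank-sum selection can be sketched as follows. The scores are invented placeholders for four unnamed methods A to D and are not results from the study. Two choices are assumptions: criteria are compared on absolute values, and ties are broken by dictionary order, since the text does not say how ties were handled.

```python
def total_ranks(scores, higher_is_better=()):
    """Sum of per-criterion ranks (1 = best) for each method.

    scores maps criterion -> {method: value}. Smaller absolute values are
    treated as better unless the criterion is listed in higher_is_better.
    """
    totals = {}
    for criterion, by_method in scores.items():
        reverse = criterion in higher_is_better
        ordered = sorted(by_method, key=lambda m: abs(by_method[m]), reverse=reverse)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return totals

# Placeholder scores for one variable of interest and one nonresponse rate.
scores = {
    "nonresponse_bias": {"A": 5.0, "B": 3.0, "C": -4.0, "D": 1.0},
    "pct_correct_distribution": {"A": 0.0, "B": 60.0, "C": 100.0, "D": 90.0},
    "rmsd": {"A": 9.0, "B": 12.0, "C": 4.0, "D": 3.0},
}
totals = total_ranks(scores, higher_is_better=("pct_correct_distribution",))
best = min(totals, key=totals.get)
print(totals, best)
```

Only three of the five criteria are shown to keep the example short; the Mean Deviation and Mean Absolute Deviation would be added as further entries of the same form.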
RESULTS AND DISCUSSION
Descriptive Statistics of Second Visit Data
• Average Total Spending in NCR: Php 102,389.80
• Average Total Earnings in NCR: Php 134,119.40
• TOTIN2 has a larger mean and standard deviation than TOTEX2
• There is greater dispersion and variability in TOTIN2 than in TOTEX2
Formation of Imputation Classes •
Three candidate matching variables were selected namely Provincial Area Code (PROV), Recoded Education Status (CODES1), Recoded Total Employed Household Members (CODEP1)
•
The Chi-Squared test of association for the candidates and the variables of interest showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and CODEX1.
•
The p-values for all the candidates were less than 0.0001 indicating that the association is very significant.
Formation of Imputation Classes
•Only the candidate matching variable CODES1 measured at a minimum of 20% for all the three tests. •The other candidate matching variables showed weak association
Descriptive Statistics of the Data Grouped into Imputation Classes (Table 5) •
The purpose of the descriptive statistics is to tell if the selected matching variable decreases the variability of observations.
•
Variability will be checked by using the standard deviation and comparing it with the value from the overall standard deviation of the variables of interest
•
The first IC produced a smaller spread than the other two ICs. The second and third ICs had larger standard deviations; however, this is compensated by the lower standard deviation of the first IC.
Mean of the Simulated Data (Table 6) •
When the nonresponse rate increases, the mean of the observations deleted for both variables increases.
•
When the nonresponse rate increases, the mean of the observations retained for both variables decreases.
•
The results showed that as the number of missing values increases, the deviation between the means of the actual and retained data slowly increases.
Regression Model Adequacy (Table 7) •
All the regression equations used for this study were able to satisfy the model validation assumptions of linearity, normality of error terms, independence of error terms and constancy of variance.
•
The highest r2 measured is at 93.2% from the third imputation class of TOTEX2 under the 30% nonresponse rate.
•
The lowest r2 measured is at 70.3% from the first imputation class of TOTIN2 under the 20% nonresponse rate.
•
The third imputation class generated the highest r2 while the first imputation class generated the lowest r2 for all variables of interest and nonresponse rates.
Results for the Overall Mean Imputation •
For the nonresponse bias and variance: – As the nonresponse rate increases for both TOTIN2 and TOTEX2, the value of the bias decreases with TOTEX2 having a slower decrease in the bias than TOTIN2. – The variance for all nonresponse rates and variables of interest are all zero because the population mean of the imputed data set is constant.
Results for the Overall Mean Imputation •
For the distribution of the imputed data: – the OMI method failed to maintain the distribution of the actual data. – a single value to be imputed for the missing data distorts the distribution of the data such that the distribution becomes too peaked.
Results for the Overall Mean Imputation •
For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation): –
In the three criteria, the values for TOTEX2 are increasing as the nonresponse rate increases.
–
For TOTIN2, the data with 20% NRR have the highest values for all the three criteria.
Results for the Overall Mean Imputation •
For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation): –
For TOTEX2, the Mean Deviation shows that OMI underestimates the actual data at the 10% and 20% NRR, in contrast to the bias, which indicates overestimation; at the 30% NRR the reverse holds.
–
For TOTIN2, the Mean Deviation shows that OMI overestimates the actual data at the 10% and 20% NRR, in contrast to the bias, which indicates underestimation.
Results for the Hot Deck Imputation •
For the nonresponse bias and variance: – The bias of the population mean of the imputed data decreases for both variables as the NRR increases. – For the variable TOTEX2 with 30% NRR, the bias becomes negative. – HDI produced smaller biases than OMI at the 10% and 20% NRR.
Results for the Hot Deck Imputation •
For the nonresponse bias and variance: – The variance of the imputed data increases by more than one hundred percent as the nonresponse rate increases. – The data with 10% NRR provided the least spread of the population means and the data which contained the largest number of imputation or 30% NRR provided the worst spread.
Results for the Hot Deck Imputation •
For the distribution of the imputed data: –
For the variable TOTIN2 with 10% and 20% nonresponse, the HDI was able to maintain the distribution of the actual data.
–
For the variable TOTEX2, the HDI was able to maintain the distribution of the actual data for the 10% NRR.
–
For both TOTIN2 and TOTEX2 under the 30% NRR, the HDI failed to maintain the distribution of the actual data with 1% and 0% respectively
Results for the Hot Deck Imputation •
For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation): –
For the variable TOTIN2, the MD shows that the values were underestimated at all NRR, and the MD decreases as the NRR increases. The MAD and RMSD values were unusually large compared with OMI.
–
For the variable TOTEX2, the MD shows that HDI overestimates the actual data at the 10% and 20% NRR, which is consistent with the bias, while at the 30% NRR the reverse holds. The MAD and RMSD show that HDI performed better than OMI.
Results for the Deterministic Regression Imputation (DRI) •
For the nonresponse bias and variance: –
For both variables TOTIN2 and TOTEX2, the nonresponse bias show that the DRI underestimated the actual values.
–
Unlike the results in OMI and HDI where the bias increases tremendously as the nonresponse rate increases, the increase in bias for this method is much slower
Results for the Deterministic Regression Imputation (DRI) •
For the nonresponse bias and variance: –
For the variable TOTEX2, DRI produced more biased estimates than OMI and HDI at all NRR.
–
Same with the OMI, the variance for this method is also zero since the population mean is constant due to a single simulation of the missing observations.
Results for the Deterministic Regression Imputation (DRI) •
For the distribution of the imputed data: –
The DRI was able to maintain the distribution of the actual data for all NRR and variables of interest (TOTIN2 and TOTEX2).
–
This result is contrary to previous studies which indicate that DRI could give the same results as that of the OMI.
Results for the Deterministic Regression Imputation (DRI) •
For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation): –
The MD for both TOTIN2 and TOTEX2 for all NRR underestimates the actual observations. The underestimation for all NRR is almost stable because the rate of change is very small as compared to OMI and HDI.
–
The MAD and RMSD for both TOTIN2 and TOTEX2 were smaller at all NRRs, which shows that this method performed better than OMI and HDI.
Results for the Stochastic Regression Imputation (SRI) •
For the nonresponse bias and variance: –
For both TOTIN2 and TOTEX2, the SRI results show no relationship between the nonresponse rate and the nonresponse bias of the estimated population mean. The biases fluctuate from one nonresponse rate to another. The results also showed that this method has the least bias at the 30% NRR.
–
Across all methods and nonresponse rates, there is a clear and large disparity between the variances of SRI and HDI. The HDI variances are almost ten times as large as those of SRI.
Results for the Stochastic Regression Imputation (SRI) •
For the distribution of the imputed data: –
For both TOTIN2 and TOTEX2, the SRI was able to maintain the distribution of the actual data for the 10% and 30% NRR.
–
For the 20% NRR of the variable TOTEX2, the SRI was better than the HDI in retaining the distribution of the actual data.
Results for the Stochastic Regression Imputation (SRI) •
For the other measures of variability (i.e. Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation): –
For the variables TOTIN2 and TOTEX2, the MD fluctuates from one NRR to another. However, the SRI values are second only to those of DRI, and SRI performed better than OMI and HDI.
–
The same holds for the MAD and RMSD: SRI ranked second to DRI but outperformed OMI and HDI.
Distribution of the True Values vs. Imputed Values •
For the OMI, at all nonresponse rates and for both variables of interest, the tables show a distorted distribution: the missing values, all replaced by a single value, are concentrated in a single frequency class.
•
For the HDI method, at all nonresponse rates, most of the imputed observations clustered in the first frequency class for both TOTIN2 and TOTEX2.
•
Clustering under HDI also occurred in the last frequency class for TOTEX2 at the 10% and 30% NRR, and in the second frequency class for TOTIN2 at all nonresponse rates.
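The concentration described for OMI can be seen in a small sketch. It assumes that OMI replaces every missing value with the mean of the observed respondents. The class boundaries, the function names and the data are hypothetical and are not FIES figures.

```python
from statistics import mean

def overall_mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def class_counts(values, edges):
    """Count values falling in each class [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return counts

# Hypothetical data: four of the ten values are treated as missing.
actual = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
with_gaps = [10, None, 30, None, 50, 60, None, 80, 90, None]
edges = [0, 25, 50, 75, 101]

imputed = overall_mean_impute(with_gaps)
print(class_counts(actual, edges))   # → [2, 2, 3, 3]
print(class_counts(imputed, edges))  # → [1, 1, 6, 2]
```

All four imputed values land in the class that contains the respondent mean, so that class is inflated while the others are depleted.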
Distribution of the True Values vs. Imputed Values •
For the regression imputations, both methods produced a more spread-out distribution at all NRRs and for both variables of interest, although some areas are underrepresented.
•
The failure to include a random residual term in deterministic regression resulted in a severe underrepresentation of the data, in particular in the first frequency class, at all NRRs and for both variables of interest.
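The role of the residual term can be illustrated with a minimal sketch. It assumes a simple linear regression of the second-visit value on the first-visit value and, for SRI, a residual drawn at random from the respondents' fitted residuals. The study's actual models and residual scheme are not given on these slides, so the function names, the residual scheme and the simulated data are all assumptions.

```python
import random
from statistics import mean, pvariance

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def regression_impute(x_resp, y_resp, x_miss, stochastic, rng):
    """Impute y for x_miss from a line fitted on the respondents.

    stochastic=False gives deterministic regression imputation (DRI);
    stochastic=True adds a randomly drawn respondent residual (SRI).
    """
    a, b = fit_line(x_resp, y_resp)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_resp, y_resp)]
    imputed = []
    for xi in x_miss:
        value = a + b * xi
        if stochastic:
            value += rng.choice(residuals)
        imputed.append(value)
    return imputed

# Simulated (hypothetical) data: x is the first visit, y the second visit.
rng = random.Random(1997)
x = [rng.uniform(50, 150) for _ in range(200)]
y = [20 + 0.9 * xi + rng.gauss(0, 15) for xi in x]
x_resp, y_resp = x[:140], y[:140]   # respondents
x_miss, y_true = x[140:], y[140:]   # values deleted for the experiment

dri = regression_impute(x_resp, y_resp, x_miss, False, rng)
sri = regression_impute(x_resp, y_resp, x_miss, True, rng)
print(pvariance(y_true), pvariance(dri), pvariance(sri))
```

DRI places every imputed value exactly on the fitted line, so the imputed values carry no residual scatter; SRI adds that scatter back, which is one way to read the difference between the two regression methods noted on this slide.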
Ranking the Imputation Methods •
For TOTIN2 under all NRRs: SRI and DRI tied for first, followed by OMI and then HDI.
•
For TOTEX2 under 10% NRR: SRI ranked first followed by HDI, DRI and OMI.
•
For TOTEX2 under 20% NRR: HDI and DRI tied for first, followed by SRI and OMI.
•
For TOTEX2 under 30% NRR: SRI ranked first followed by DRI, OMI and HDI.
Ranking the Imputation Methods •
Overall: –
The best imputation method for this study is the Stochastic Regression Imputation using the 1997 FIES data.
–
The worst imputation method for this study is the Hot Deck Imputation.
CONCLUSION
In Summary… •
There are many considerations to weigh before using any imputation method, such as the type of analysis and the type of estimator of interest that will suit the analyst's purpose.
•
Practical issues such as resources available, difficulty in programming, amount of time it takes to implement each method, and complexity of procedures should also be taken into consideration when selecting which imputation method to use.
In Summary… •
Results show that the choice of imputation method significantly affected the estimates of the actual data.
•
The bias and variance estimates of the imputed data varied considerably across imputation methods. Unexpectedly, HDI yielded the highest estimates for most of the nonresponse rates and variables.
•
In terms of the distribution, both regression imputation methods retained the distribution of the data, especially DRI.
In Summary… •
In the other tests of accuracy and precision, namely, the Mean Deviation, Mean Absolute Deviation and Root Mean Square Deviation, the different methods provided mixed results in all nonresponse rates.
•
After comparing and ranking the four methods, the SRI procedure is considered the best imputation method for this study. This can be attributed to the random residuals added to the deterministic imputation which helped in making the estimates less biased than its deterministic counterpart.
In Summary… •
Surprisingly and in contrast with most previous studies, the Hot Deck Imputation method was the least effective for this study. The selection of donors with replacement might be the cause of its poor performance.
•
Nevertheless, anyone faced with having to make decisions about imputation procedures will usually have to choose some compromise between what is technically effective and what is operationally expedient.
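The point about donor selection in HDI can be made concrete with a minimal sketch of random hot deck imputation within imputation classes, drawing donors either with or without replacement. The study's actual hot deck procedure and matching variable are not detailed on these slides; the function name, the class labels and the data are hypothetical.

```python
import random
from collections import Counter

def hot_deck(records, with_replacement, rng):
    """Fill missing 'value' fields from random donors in the same class.

    records: list of dicts with keys 'cls' and 'value' (None if missing).
    Returns (imputed_values, donor_use_counts keyed by record index).
    """
    donors_by_class = {}
    for idx, rec in enumerate(records):
        if rec["value"] is not None:
            donors_by_class.setdefault(rec["cls"], []).append(idx)

    pools = {c: list(d) for c, d in donors_by_class.items()}
    use = Counter()
    imputed = []
    for rec in records:
        if rec["value"] is not None:
            imputed.append(rec["value"])
            continue
        if rec["cls"] not in pools:
            raise ValueError(f"no donors in class {rec['cls']!r}")
        pool = pools[rec["cls"]]
        if not pool:
            # Every donor has been used once: start a fresh pass.
            pool.extend(donors_by_class[rec["cls"]])
        donor = rng.choice(pool)
        if not with_replacement:
            pool.remove(donor)
        use[donor] += 1
        imputed.append(records[donor]["value"])
    return imputed, use

# Hypothetical records in two imputation classes.
records = (
    [{"cls": "A", "value": v} for v in (10, 12, 14, None, None)]
    + [{"cls": "B", "value": v} for v in (50, 55, None)]
)
rng = random.Random(0)
filled, use = hot_deck(records, with_replacement=False, rng=rng)
print(filled, dict(use))
```

With replacement, a single donor can be reused many times, which may pile imputed values onto a few donor values; without replacement, donor use is spread more evenly. Whether this explains the HDI results here would need to be checked against the actual data.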
RECOMMENDATIONS
Recommendations for Further Research •
Explore the use and effectiveness of the Multiple Imputation Method.
•
Implement the use of the Jackknife Variance Estimation.
•
For selecting a matching variable, advanced modern statistical methods like the Chi-squared Automatic Interaction Detector (CHAID) analysis can be utilized for further studies.
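As a pointer for the jackknife recommendation, the following is a minimal sketch of the delete-one jackknife variance estimator applied to a sample mean. It ignores the survey design and the adjustments that imputed data would require, so it only illustrates the basic idea; the function name and the data are hypothetical.

```python
from statistics import mean, variance

def jackknife_variance(data, estimator=mean):
    """Delete-one jackknife variance estimate of estimator(data)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the estimate with each observation left out in turn.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    return (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)

# Hypothetical values for illustration only.
data = [12.0, 15.0, 9.0, 20.0, 14.0]
v_jack = jackknife_variance(data)
print(v_jack, variance(data) / len(data))
```

For the sample mean, the delete-one jackknife reproduces the familiar estimate s²/n, which makes a convenient check before applying the same function to other estimators.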
Recommendations for Further Research •
For regression imputation, instead of fitting a separate model for each imputation class, include dummy variables for the imputation classes in a single model.
•
Use a statistical package that can generate imputations faster and more easily, to save the time otherwise lost to debugging and to computer crashes caused by memory overload.
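The dummy-variable suggestion can be sketched as a single regression with one intercept per imputation class and a slope shared by all classes, which is the least-squares fit of a model containing class dummies. Sharing the slope is an assumption made here for brevity, as are the function name and the data; the fit uses within-class centering, which gives the same coefficients as including the dummies explicitly.

```python
from statistics import mean

def fit_common_slope(x, y, cls):
    """Least squares for y = a[cls] + b*x; returns (intercepts, b)."""
    groups = {}
    for xi, yi, c in zip(x, y, cls):
        groups.setdefault(c, []).append((xi, yi))

    # Class means of x and y.
    means = {
        c: (mean(p[0] for p in pts), mean(p[1] for p in pts))
        for c, pts in groups.items()
    }
    # Pool the within-class sums of squares and cross-products.
    sxx = sxy = 0.0
    for c, pts in groups.items():
        xbar, ybar = means[c]
        for xi, yi in pts:
            sxx += (xi - xbar) ** 2
            sxy += (xi - xbar) * (yi - ybar)
    b = sxy / sxx
    intercepts = {c: ybar - b * xbar for c, (xbar, ybar) in means.items()}
    return intercepts, b

# Hypothetical data: two classes on parallel lines with slope 2.
x = [1, 2, 3, 4, 5, 6]
cls = ["A", "A", "A", "B", "B", "B"]
y = [7, 9, 11, 28, 30, 32]   # A: 5 + 2x, B: 20 + 2x
intercepts, slope = fit_common_slope(x, y, cls)
print(intercepts, slope)
```

A missing second-visit value for a unit in class c would then be predicted as intercepts[c] + slope * x, with a random residual added if the stochastic variant is wanted.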
THANK YOU VERY MUCH!!!