CHAPTER 2
CONCEPTUAL FRAMEWORK
Nonresponse Bias
In most surveys, missing data threaten the validity of post-analysis results. Missing data can be thrown away, ignored, or replaced through some substitution procedure. When data are thrown away or ignored in generating estimates, nonresponse bias becomes a problem. This section examines the nonresponse bias that arises when the data are processed with the nonresponse observations excluded. (Kalton, 1983)
For simplicity, suppose that a simple random sample (SRS) of size n is drawn from a population of size N, and that the survey variable y contains missing data. The types and patterns of nonresponse, which are discussed later, do not need to be defined for this section. The population is further assumed to be divided into two groups, respondents and nonrespondents. In reality, this division is an oversimplification because, for some units at least, chance plays a part in whether they respond. The simplified model is nevertheless appealing because its tractability leads to some informative results. (Kalton, 1983)
Let R be the number of respondents and M be the number of nonrespondents (M for missing) in the population, with N = R + M; the corresponding sample quantities are r and m, with r + m = n. Let $\bar{R} = R/N$ and $\bar{M} = M/N$ be the proportions of respondents and nonrespondents in the population, and let $\bar{r} = r/n$ and $\bar{m} = m/n$ be the response and nonresponse rates in the sample.
The population total and mean are given by
$$Y = Y_r + Y_m = R\bar{Y}_r + M\bar{Y}_m$$
and
$$\bar{Y} = \bar{R}\bar{Y}_r + \bar{M}\bar{Y}_m,$$
where $Y_r$ and $\bar{Y}_r$ are the total and mean for respondents and $Y_m$ and $\bar{Y}_m$ are the same quantities for the nonrespondents. The corresponding sample quantities are
$$y = y_r + y_m = r\bar{y}_r + m\bar{y}_m$$
and
$$\bar{y} = \bar{r}\bar{y}_r + \bar{m}\bar{y}_m.$$
(Kalton, 1983)
If no compensation is made for nonresponse, the respondent sample mean $\bar{y}_r$ is used to estimate $\bar{Y}$. Its bias is given by
$$B(\bar{y}_r) = E[\bar{y}_r] - \bar{Y}.$$
The expectation of $\bar{y}_r$ can be obtained in two stages, first conditional on fixed r and then over different values of r, i.e. $E[\bar{y}_r] = E_1 E_2[\bar{y}_r]$, where $E_2$ is the conditional expectation for fixed r and $E_1$ is the expectation over different values of r. Thus,
$$E[\bar{y}_r] = E_1\left[\sum E_2(y_{ri})/r\right] = E_1[\bar{Y}_r] = \bar{Y}_r.$$
Hence the bias of $\bar{y}_r$ is given by
$$B(\bar{y}_r) = \bar{Y}_r - \bar{Y} = \bar{M}(\bar{Y}_r - \bar{Y}_m).$$
The equation shows that $\bar{y}_r$ is approximately unbiased for $\bar{Y}$ if either the proportion of nonrespondents $\bar{M}$ is small or the mean of the nonrespondents, $\bar{Y}_m$, is close to that for respondents, $\bar{Y}_r$. Since the survey analyst usually has no direct empirical evidence on the magnitude of $(\bar{Y}_r - \bar{Y}_m)$, the only situation in which he can have confidence that the bias is small is when the nonresponse rate is low. However, in practice, even with moderate $\bar{M}$ many survey results escape sizable biases because $(\bar{Y}_r - \bar{Y}_m)$ is fortunately often not large. (Kalton, 1983)
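The bias expression can be checked numerically. The following is a minimal sketch in Python, assuming a hypothetical population whose respondents and nonrespondents are known (which is never the case in a real survey); the function name and the simulated values are illustrative only. It compares the bias computed directly as the respondent mean minus the population mean with the product of the nonrespondent proportion and the difference between the two group means.

```python
import random
import statistics

def respondent_mean_bias(resp_values, nonresp_values):
    """Return (direct bias, bias from the formula) of the respondent mean."""
    population = resp_values + nonresp_values
    ybar = statistics.fmean(population)            # population mean
    ybar_r = statistics.fmean(resp_values)         # respondent mean
    ybar_m = statistics.fmean(nonresp_values)      # nonrespondent mean
    m_bar = len(nonresp_values) / len(population)  # proportion of nonrespondents
    direct = ybar_r - ybar                         # respondent mean minus population mean
    formula = m_bar * (ybar_r - ybar_m)            # M-bar times the difference in group means
    return direct, formula

random.seed(1)
# Hypothetical population: 800 respondents and 200 nonrespondents with a lower mean.
respondents = [random.gauss(50, 10) for _ in range(800)]
nonrespondents = [random.gauss(40, 10) for _ in range(200)]
direct, formula = respondent_mean_bias(respondents, nonrespondents)
print(round(direct, 4), round(formula, 4))
```

The two printed numbers agree up to rounding error, and both shrink toward zero as either the nonrespondent share or the gap between the group means shrinks.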
Many procedures can be applied to reduce the nonresponse bias caused by missing data, and one of these is imputation. In this study, imputation procedures are applied to eliminate nonresponse and reduce the bias in the estimates. Imputation is briefly defined as the substitution of values for the nonresponse observations. The imputation procedures are discussed later.
Nonresponse and Its Patterns
In surveys, nonresponse is usually classified by its pattern. In this study, the terms missing data and nonresponse are used interchangeably. Nonresponse can follow one of three patterns. The first is Missing Completely at Random (MCAR): the missing data for a variable Y are MCAR if the probability of having a missing value for Y is unrelated to the value of Y itself or to any other variable in the data set. Data that are MCAR reflect the highest degree of randomness and show no underlying reasons for the missing observations that could bias research findings. With MCAR, the occurrence of missing data is unrelated to the other variables in the data set or to other systematic factors; missing data are randomly distributed across all cases.
The second pattern is Missing at Random (MAR). The missing data for a variable Y are MAR if the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis. MAR data show some randomness in the pattern of data omission. The likelihood of a case having incomplete information on a variable can be explained by other variables in the data set, although the presence or absence of missing values on a variable is not related to the participants' true status on that variable.
The difference between MCAR and MAR lies in the relationship between missingness in Y and the other variables in the data set. Under MCAR, nonresponse is completely independent of the other variables: missingness in Y is related neither to the values of Y nor to the other variables in the data set. Under MAR, missingness in Y is related to the other variables, which can therefore explain the incomplete information on Y.
The last pattern, and the worst of the three, arises when the probability of missing data on Y is related to the value of Y even after other variables are controlled for in the analysis. Such a case is termed Nonignorable Nonresponse (NIN). NIN missing data have systematic, nonrandom factors underlying the occurrence of the missing values that are not apparent or otherwise measured. NIN missing data are the most problematic because they limit the generalizability of research findings and may bias parameter estimates such as means, standard deviations, correlation coefficients or regression coefficients.
These patterns are an important assumption in imputation. For an imputation procedure to work and achieve statistically acceptable estimates, the pattern of nonresponse must satisfy either the MCAR or the MAR assumption. For this study, the researchers created nonresponse that follows the MCAR assumption.
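One way to create nonresponse that satisfies MCAR is to choose the units to be set missing by simple random sampling, without looking at y or at any other variable. The sketch below is only an illustration of that idea, not the procedure actually used in this study; the function name, the data values and the 30% nonresponse rate are hypothetical.

```python
import random

def make_mcar(values, rate, seed=0):
    """Set a fixed share of values to None, chosen completely at random."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    # The positions are drawn without reference to the values themselves,
    # so the probability of being missing is the same for every unit.
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

incomes = [12.5, 8.0, 30.1, 15.2, 9.9, 22.4, 18.3, 11.0, 27.6, 14.8]  # hypothetical data
with_gaps = make_mcar(incomes, rate=0.3, seed=42)
print(with_gaps)
```

Under MAR or NIN, by contrast, the selection of positions would have to depend on another variable or on the value of y itself.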
Nonresponse and Its Types
Another important consideration in imputation is the type of nonresponse. While the patterns of nonresponse concern the relationship between missingness in a variable and the other variables, the types of nonresponse concern how the observations came to be missing. Kalton (1983) stressed the importance of differentiating the types of nonresponse: noncoverage, total (unit) nonresponse, item nonresponse, and partial nonresponse.
Noncoverage (NC) denotes the failure to include some units of the survey population in the sampling frame. As a consequence, units that are excluded from the frame have no chance of appearing in the sample. NC is not usually regarded as a type of nonresponse; however, Kalton (1983) loosely classifies it as one for convenience. NC occurs in surveys where the sampling frame fails to cover some units or the listing of units is incomplete. (Kalton, 1983)
Unit (or total) nonresponse (UN) takes place when no information is collected from a sampled unit. Its causes include failure to contact the respondent (not at home, moved, or unit not found), refusal to provide information, inability of the unit to cooperate (for example, due to illness or a language barrier), and lost questionnaires. (Kalton, 1983)
Item nonresponse (IN) emerges when the information collected from a unit is incomplete because some of the questions were not answered. Its causes include the following: the informant lacks the information needed to answer the question; the informant fails to make the effort required to establish the information by retrieving it from memory or by consulting records; the informant refuses to answer because the question is sensitive or embarrassing, or because he considers it irrelevant to the survey's objectives as he perceives them; the interviewer fails to record an answer (for example, by skipping a question); or the response is subsequently rejected at an edit check on the grounds that it is inconsistent with other responses (which may include an inconsistency arising from a coding or punching error in the transfer of the response to the computer data file). (Kalton, 1983)
Partial nonresponse (PN) is the failure to collect large sets of items for a responding unit. It occurs when a sampled unit fails to provide responses in one or more waves of a panel survey, in later phases of a multi-phase data collection procedure (e.g. the second visit of the FIES), or for later items in the questionnaire after breaking off a telephone interview. (Kalton, 1983)
Other reasons include data that remain unavailable after all possible checking and follow-up; responses that fail edits, that is, natural or reasonable constraints, so that one or more items are designated as unacceptable and therefore become artificially missing; and causes similar to those of unit (total) nonresponse. In this study, the type of nonresponse assumed is PN.
The Imputation Procedures
Imputation was listed earlier as one of the many procedures that can be used to deal with nonresponse in order to generate less biased results. Kalton (1983) defines imputation as the process of replacing a missing value, through available statistical and mathematical techniques, with a value that is considered to be a reasonable substitute for the missing information.
Imputation has certain advantages. In the first place, the aim of imputation is to reduce biases in survey estimates. Secondly, imputation makes analysis easier and the results simpler to present. Complex algorithms to estimate the population parameters in the presence of missing data are not required, and hence much processing time is saved. Lastly, the results obtained from different analyses are bound to be consistent with one another, a feature which need not apply to the results of analyses from an incomplete data set. (Kalton, 1983)
On the other hand, imputation has its pitfalls. There is no guarantee that the results obtained after imputation will be less biased than those based on the incomplete data set; the biases could even be greater. It all depends on the suitability of the assumptions built into the imputation procedure used. Even if the biases of univariate statistics are reduced, there is no assurance that the distribution of the data and the relationships between variables will be preserved; this again depends on the imputation procedure used. More importantly, imputation is a fabrication of data. Many naive researchers wrongly treat the imputed data set as a complete data set for n respondents, as if it were a straightforward sample of size n. (Kalton, 1983)
Given that imputed values are to be substituted for missing responses, there are a variety of methods by which the imputed value may be determined. These are called imputation procedures or imputation methods (IMs). IMs may use statistical procedures or simple mathematical ones, such as replacing a missing observation with a constant value (e.g. the mean).
Four IMs are applied in this study, namely, overall (grand) mean imputation (OMI), hot deck imputation (HDI), deterministic regression imputation (DRI) and stochastic regression imputation (SRI). For most imputation methods, imputation classes need to be defined before the IM can be performed.
Imputation classes (ICs) are stratification classes that divide the data into groups before imputation takes place. ICs are most useful when they divide the data into homogeneous groups, that is, groups whose units have similar characteristics and a similar propensity to respond. The variables used to define ICs are called matching variables (MVs). The values substituted for the nonresponse observations are taken from records with a response on the variable concerned; these records are called donors. The records whose missing observations are to be substituted are called recipients.
Problems may arise for IMs that rely on ICs if the classes are not formed correctly. One concern is the number of ICs, which must be fixed for each method that needs them. The larger the number of ICs, the greater the possibility of having few observations in a class; when this happens, the variance of the estimates from the imputed data in that class increases. On the other hand, the smaller the number of ICs, the more observations each class tends to contain, which burdens the estimates with aggregation bias.
Overall Mean Imputation
The mean imputation method is the process by which a missing value is imputed by the mean of the available units in the imputation class to which it belongs (Cheng, 1999). One type of this method is OMI, which simply replaces each missing value by the overall mean of the available (responding) units in the same population. The overall mean is given by
$$\bar{y}_{omi} = \frac{\sum_{i=1}^{r} y_{ri}}{r} = \bar{y}_r$$
where $\bar{y}_{omi}$ is the mean of all responding units for the variable y, $y_{ri}$ is the value of y for the i-th responding unit, and r is the number of responding units.
For this method, the IC need not be homogeneous, because the IC is the entire population itself. In fact, in much of the related literature, imputation classes are not required and are often ignored when performing this method.
This method has both advantages and disadvantages. Its main advantage is that it is universal: it can be applied to any data set. Another is that the ICs need not be homogeneous and the variables need not be highly correlated. Without ICs, the method is easier to use and results are generated faster. Among the related literature reviewed in this study, this is the most commonly used method for imputing missing data.
Figure 3.1 Distribution of the data before and after imputation has been applied
However, this method has serious disadvantages. First, since all missing values are imputed by a single value, the distribution of the data becomes distorted (see Figure 3.1): it becomes too peaked, making it unsuitable for many post-analyses. Second, it produces large biases and variances because it does not allow variability in the imputation of missing values. Much of the related literature states that this is the least effective method and recommends that it never be used.
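The peaking of the distribution under OMI can be seen in a small example. The following is a minimal sketch, assuming missing values are coded as None; the function name and the data are hypothetical. It replaces every missing value by the respondent mean and then compares the sample variance of the observed values with that of the completed data.

```python
import statistics

def overall_mean_impute(values):
    """Replace each None by the mean of the responding (non-missing) values."""
    responding = [v for v in values if v is not None]
    mean_r = statistics.fmean(responding)
    return [mean_r if v is None else v for v in values]

y = [2.0, None, 4.0, 6.0, None, 8.0]   # hypothetical data, None marks nonresponse
completed = overall_mean_impute(y)
observed = [v for v in y if v is not None]
print(completed)
print(statistics.variance(observed), statistics.variance(completed))
```

The mean is unchanged, but every imputed value sits exactly at the mean, so the spread of the completed data is smaller than the spread of the observed values.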
Hot Deck Imputation
One of the most popular and widely known methods is HDI. HDI is the process by which missing observations are imputed by choosing a value from the set of available units. This value is selected at random (traditional hot deck), in some deterministic way with or without replacement (deterministic hot deck), or on the basis of a measure of distance (nearest-neighbor hot deck). To perform this method, let Y be the variable that contains missing data and X a set of variables with no missing data. To impute the missing data:
1. Find a set of categorical X variables that are highly associated with Y. The X variables selected will be the matching variables in this imputation.
2. Form a contingency table based on the X variables.
3. If a cell of the table contains cases with a missing Y value, select a case from the available (responding) units and impute its Y value for the missing value. The donor and the recipient must have the same, or similar, characteristics.
Cheng (1999) stated that, by forming ICs, the HDI procedure yields estimates that reflect the actual data more accurately. If the matching variables are closely associated with the variable that has missing data, the nonresponse bias should be reduced, which is the same advantage of imputation classes stated earlier.
Example 3.1
Suppose that a study is conducted among ten people and that three of them refused to answer some of the questions. The missing answer of each nonresponding unit is replaced by a known value from a responding unit that has the same, or most of the same, characteristics, such as sex, degree or course (Course), Dean's Lister (DL), honor student in high school (HS2), and hours of study classes (HSC). Suppose the set of X matching variables is DL and HS2. In the table, the imputed values are shown in brackets preceded by a negative sign.
Table 1 Students' GPA
Person  Sex  DL  HS2  HSC  GPA
1       M    Y   Y    2    -[3.999]
2       F    Y   N    1    3.567
3       F    N   N    0    1.298
4       F    N   Y    1    2.781
5       M    N   Y    1    2.334
6       M    N   N    0    1.111
7       M    N   Y    1    -[2.781]
8       F    Y   N    1    3.246
9       F    Y   N    1    -[3.246]
10      F    Y   Y    2    3.999
The use of hot deck imputation here is justified. First, since the imputed values came from the same class as the recipients, the nonresponse bias and the variance of the estimates decrease, because observations within an imputation class are homogeneous. Had the OMI method been used, the bias and variance of the estimates would increase because of the heterogeneity of the observations pooled into its single IC. Most importantly, the distribution of the data is preserved. Under OMI, the distribution would certainly be distorted, since a single value would be substituted for all the missing values.
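Example 3.1 can be sketched in code. The block below is an illustration of the traditional (random) hot deck within imputation classes defined by DL and HS2, using the rows of Table 1 with the three imputed GPAs treated as missing; the function name, the None coding and the seed are assumptions made for this sketch. Because persons 7 and 9 each have two possible donors, a random draw may pick a different donor from the one shown in the table.

```python
import random

# Rows of Table 1: (person, sex, DL, HS2, HSC, GPA); None marks a missing GPA.
records = [
    (1, "M", "Y", "Y", 2, None), (2, "F", "Y", "N", 1, 3.567),
    (3, "F", "N", "N", 0, 1.298), (4, "F", "N", "Y", 1, 2.781),
    (5, "M", "N", "Y", 1, 2.334), (6, "M", "N", "N", 0, 1.111),
    (7, "M", "N", "Y", 1, None), (8, "F", "Y", "N", 1, 3.246),
    (9, "F", "Y", "N", 1, None), (10, "F", "Y", "Y", 2, 3.999),
]

def hot_deck(rows, seed=0):
    """Impute each missing GPA from a random donor in the same (DL, HS2) class."""
    rng = random.Random(seed)
    donors = {}
    for person, sex, dl, hs2, hsc, gpa in rows:
        if gpa is not None:
            donors.setdefault((dl, hs2), []).append(gpa)
    completed = []
    for person, sex, dl, hs2, hsc, gpa in rows:
        if gpa is None:
            # Raises KeyError if a class has no donor; donors may be reused.
            gpa = rng.choice(donors[(dl, hs2)])
        completed.append((person, sex, dl, hs2, hsc, gpa))
    return completed

completed = hot_deck(records, seed=0)
for row in completed:
    print(row)
```

Person 1 has only one possible donor (person 10), so that imputed value is fixed; the values for persons 7 and 9 depend on the random draw within their classes.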
Like OMI, this method has advantages and disadvantages. One major attraction of this method cited by Kazemi (2005) is that the imputed values are all actual observed values. Another is the absence of out-of-range or impossible values. Out-of-range values are one problem of the DRI procedure, which is tackled next. More importantly, the shape of the distribution is preserved: since imputation classes are introduced, the chance of distorting the distribution decreases.
On the other hand, it also has a set of disadvantages. First, in order to form imputation classes, all X variables must be categorical. Second, as the nonresponse rate increases, the same donor record may have to be used repeatedly for several missing values (for example, when donors are drawn with replacement or when a class has fewer donors than recipients), which distorts the shape of the distribution. Third, the number of ICs must be limited to ensure that every missing value has a donor in its class.
General Regression Imputation
Like the mean imputation and HDI methods, this procedure is one of the most widely used imputation methods. Imputing missing values via least-squares regression is known as the regression imputation (RI) method. This technique can be seen as a generalization of group mean imputation (GMI), another type of mean imputation that, unlike OMI, uses ICs.
There are many ways of creating a regression model. In Kalton's (1983) procedure, the variable for which imputations are needed (y) is regressed on the matching variables (x1, x2, …, xp) using the units that provide a response on y. The matching variables may be quantitative or qualitative, the latter being incorporated into the regression model by means of dummy variables; the imputation classes in this method are the categories of the matching variables that were transformed into dummy variables. The missing value may then be imputed in two basic ways: (a) by using the predicted value from the model given the values of the matching variables for the record with a missing response, or (b) by using this predicted value plus some type of randomly chosen residual. The former is called deterministic regression imputation (DRI) and the latter stochastic regression imputation (SRI). (Kalton, 1983)
When comparing the accuracy and efficiency of this method with others, it is helpful if the methods compared use the same ICs. In Kalton's study, two quantitative matching variables were considered, each with only a few categories, so that no further categorization was needed. The general model based on imputation classes has the form:
$$y_k = B_0 + \sum_{i=1}^{m} B_i X_{ik} + e_k$$
where $B_0$ and $B_i$ are the parameter estimates computed from the responding units, $X_{ik}$ is the value of the i-th (dummy) matching variable for the k-th nonresponding unit, the sum runs over the m matching-variable terms in the model, $e_k$ is the random residual, and $y_k$ is the value to be imputed for the k-th nonresponding unit. (Kalton, 1983)
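The deterministic case of this model, in which the residual term is dropped, can be sketched as follows. This is an illustration only and not the computation performed in this study: it reuses the respondents of Table 1 with DL and HS2 coded as 0/1 dummy variables, fits an intercept and two main effects by least squares through the normal equations, and imputes the predicted value for the three recipients. The helper names solve, fit_ols and predict are hypothetical, and the small Gaussian-elimination solver is used only to keep the sketch free of external libraries.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_ols(xs, ys):
    """Least-squares coefficients [B0, B1, ...] via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict(coefs, x):
    """Predicted value B0 + sum of B_i * x_i."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))

# Respondents from Table 1: dummy-coded (DL, HS2) with 1 = Y, 0 = N, and observed GPA.
donors_x = [(1, 0), (0, 0), (0, 1), (0, 1), (0, 0), (1, 0), (1, 1)]
donors_y = [3.567, 1.298, 2.781, 2.334, 1.111, 3.246, 3.999]
coefs = fit_ols(donors_x, donors_y)

# Recipients (persons 1, 7 and 9): impute the predicted value with no residual.
recipients_x = {1: (1, 1), 7: (0, 1), 9: (1, 0)}
imputed = {person: predict(coefs, x) for person, x in recipients_x.items()}
print(imputed)
```

Every recipient with the same dummy values receives the same imputed value, which is why the deterministic version behaves like a mean imputation within classes.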
Stochastic Regression
The use of the predicted value from the model corresponds to mean-value imputation in the restricted model, and hence has the same undesirable distributional properties. A good case therefore exists for including the estimated residual. There are various ways in which this could be done depending on the assumptions made about the residuals. The following are some of the more obvious possibilities:
(1) Assume that the errors are homoscedastic and normally distributed, $N(0, \sigma_e^2)$. Then $\sigma_e^2$ could be estimated by the residual variance from the regression, $s_e^2$, and the residual for a recipient could be chosen at random from $N(0, s_e^2)$.
(2) Assume that the errors are heteroscedastic and normally distributed, with $\sigma_{ej}^2$ being the residual variance in some group j. Estimate the $\sigma_{ej}^2$ by $s_{ej}^2$, and choose a residual for a recipient in group j from $N(0, s_{ej}^2)$.
(3) Assume that the residuals all come from the same, unspecified, distribution. Then estimate $y_k$ by $\hat{y}_k + e_k$, where $\hat{y}_k$ is the predicted value and $e_k$ is the estimated residual for a randomly chosen donor.
(4) The assumption in (3) accepts the linearity and additivity of the model. If there are doubts about these assumptions, it may be better to take not a randomly chosen donor but instead one close to the recipient in terms of his x-values (see Kalton, 1983). In the limit, if a donor with the same set of x-values is found, this procedure reduces to assigning that donor's y-value to the recipient.
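Possibilities (1) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the predicted value and the donor residuals are hypothetical numbers rather than results from this study, the function names are invented for the sketch, and the residual variance is taken as the sum of squared residuals divided by the number of donors minus the number of fitted parameters.

```python
import random

def residual_sd(residuals, n_params):
    """Square root of the residual variance s_e^2 = SSE / (n - n_params)."""
    sse = sum(e * e for e in residuals)
    return (sse / (len(residuals) - n_params)) ** 0.5

def stochastic_impute(predicted, residuals, n_params, method, rng):
    """Add a random residual to the predicted value of a recipient."""
    if method == "normal":   # possibility (1): draw from N(0, s_e^2)
        return predicted + rng.gauss(0.0, residual_sd(residuals, n_params))
    if method == "donor":    # possibility (3): residual of a randomly chosen donor
        return predicted + rng.choice(residuals)
    raise ValueError("unknown method: " + method)

rng = random.Random(7)
donor_residuals = [-0.2, 0.1, 0.3, -0.1, -0.1]  # hypothetical residuals of the donors
predicted_value = 2.5                           # hypothetical prediction for one recipient

imputed_normal = stochastic_impute(predicted_value, donor_residuals, 2, "normal", rng)
imputed_donor = stochastic_impute(predicted_value, donor_residuals, 2, "donor", rng)
print(imputed_normal, imputed_donor)
```

Neither variant checks whether the result is a feasible value of y, so a range check would have to be added where the variable is bounded.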
RI has both advantages and disadvantages. It has the potential to produce imputed values closer to the true values than the assumption-free, rough-and-ready imputation class approaches. However, it is a time-consuming operation, and it is often unrealistic to consider applying it to all the items with missing values in a survey. For the method to be effective, that is, to impute a predicted value that is near the actual value, a high $R^2$ is needed. (Kalton, 1983)
Deterministic and stochastic regression each have further disadvantages. In DRI, the distribution becomes too peaked and the variance is underestimated, because every missing value is imputed by its best prediction. In its stochastic counterpart, even if the deterministic imputed value is feasible, the stochastic imputed value need not be: after the residual is added to the deterministic imputation, an infeasible value may result. (Nordholt, 1998)