Chapter 20: Logistic Regression for Binary Response Variables

The response variable is binary, denoted 0/1.

Example 1 Case Study 20.1: Survival in the Donner Party. What is the relationship between age and sex of individuals in the Donner Party and whether or not they survived?

Example 2 Birthweight data: birthweights of 189 newborns were recorded along with a number of other variables concerning the mother: weight, smoker or not, race, etc. Suppose we were interested only in whether the baby was underweight (defined to be under 2500 gm) or not, and what characteristics of the mother, if any, are associated with having underweight babies.

Example 3 Pronghorn data: In a study of winter habitat selection by pronghorn in the Red Rim area in south-central Wyoming (Source: Manly, McDonald, and Thomas 1993, Resource Selection by Animals, Chapman and Hall, pp. 16-24; data from Ryder 1983), presence/absence of pronghorn during the winters of 1980-81 and 1981-82 was recorded for a systematic sample of 256 plots of 4 ha each. Other variables recorded for each plot were: density (in thousands/ha) of sagebrush, black greasewood, Nuttall's saltbush, and Douglas rabbitbrush, and the slope, distance to water, and aspect. What variables are most strongly associated with presence/absence of pronghorn, and can you formulate a model to predict the probability that pronghorn will be present on a plot?
Could use linear regression to regress 0/1 response on quantitative and categorical explanatory variables. What are the problems with this approach?
Better approach: the logistic regression model.

Y = 0/1 response variable
X1, ..., Xp: explanatory variables

For a particular set of values of the explanatory variables,

µ(Y | X1, ..., Xp) = π = proportion of 1's
• Rather than model µ(Y) = π as a linear function of the explanatory variables, model

logit(π) = ln(π / (1 − π))

as a linear function of the explanatory variables; that is:

logit(π) = β0 + β1X1 + ... + βpXp
• Why use logit(π)? Could you use another function?
Given the value of η = logit(π), we can calculate

π = e^η / (1 + e^η)
Example 3: In the pronghorn example, suppose, hypothetically, the true relationship between probability of use and distance to water (in meters) followed the logit model:

logit(π) = 3 − .0015·Water

Then

π = e^(3 − .0015·Water) / (1 + e^(3 − .0015·Water)).

Calculate π for the following distances (note that it might be easier to compute 1 − π = 1 / (1 + e^(3 − .0015·Water)), then subtract from 1):

Water = 100 m, 1000 m, 3000 m
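As a quick check on the hand calculation, here is a minimal Python sketch using the hypothetical coefficients (3 and −.0015) given above:

import numpy as np

# Hypothetical model from the pronghorn example: logit(pi) = 3 - 0.0015 * water
def prob_of_use(water_m):
    eta = 3 - 0.0015 * water_m               # the linear predictor (logit)
    return np.exp(eta) / (1 + np.exp(eta))   # back-transform to a probability

for d in (100, 1000, 3000):
    print(d, round(prob_of_use(d), 3))
# 100 m  -> pi ~ 0.945
# 1000 m -> pi ~ 0.818
# 3000 m -> pi ~ 0.182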
[Figure: probability of use (0 to 1) plotted against distance to water (0 to 5000 m) under this model.]

• What is the interpretation of the coefficient -.0015 for the variable distance to water? For every 1 m. increase in the distance to water, the log-odds of use decrease by .0015; for every 1 km. increase in distance to water, the log-odds of use decrease by 1.5.
• More meaningful: for every 1 m. increase in the distance to water, the odds of use change by a multiplicative factor of e^(−.0015) = .999; for every 1 km. increase in distance to water, the odds of use change by a multiplicative factor of e^(−1.5) = .223. (We could also reverse these statements; for example, the odds of use increase by a factor of e^(1.5) = 4.48 for every km. closer to water a plot is.)
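A quick numerical check of the odds interpretation (again using the hypothetical coefficient −.0015):

import numpy as np

per_m = np.exp(-0.0015)          # multiplicative change in odds per 1 m increase, ~0.9985
per_km = np.exp(-0.0015 * 1000)  # per 1 km increase, ~0.223
print(per_m, per_km, per_m ** 1000)  # per_m ** 1000 equals per_km
print(1 / per_km)                    # ~4.48: odds factor for every km closer to water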
Variance

The 0/1 response variable Y is a Bernoulli random variable:

µ(Y | X1, ..., Xp) = π
SD(Y | X1, ..., Xp) = √(π(1 − π))

The variance of Y is not constant.
The logistic regression model is an example of a generalized linear model (in contrast to a general linear model, which is the usual regression model with normal errors and constant variance). A generalized linear model is specified by:

1. a link function, which specifies what function of µ(Y) is a linear function of X1, ..., Xp. In logistic regression, the link function is the logit function.

2. the distribution of Y for a fixed set of values of X1, ..., Xp. In the logit model, this is the Bernoulli distribution.
The usual linear regression model is also a generalized linear model. The link function is the identity, that is, f(µ) = µ, and the distribution is normal with constant variance. There are other generalized linear models which are useful (e.g., Poisson response distribution with log link function). A general methodology has been developed to fit and analyze these models.
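For concreteness, here is a sketch of how these two GLMs (logit link with a Bernoulli/binomial response, and identity link with a normal response) could be specified in Python's statsmodels; the data here are purely toy values for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(50, 2)))   # toy design matrix with an intercept
y_bin = rng.integers(0, 2, size=50)             # toy 0/1 response
y_cont = rng.normal(size=50)                    # toy continuous response

# Logistic regression: binomial response family, logit link (the default)
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()

# Ordinary linear regression as a GLM: normal response family, identity link
ols_fit = sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit()

print(logit_fit.params, ols_fit.params)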
Estimation of Logistic Regression Coefficients

Estimation of the parameters in the linear regression model was by least squares. If we assume normal errors with constant variance, the least squares estimators are the same as the maximum likelihood estimators (MLE's). Maximum likelihood estimation is based on a simple principle: the estimates of the parameters in a model are the values which maximize the probability (likelihood) of observing the sample data we have.

Example: Suppose we select a random sample of 10 UM students in order to estimate what proportion of students own a car. We find that 4 out of the 10 own a car. What is the MLE of the proportion p of all students who own a car? What do we think it should turn out to be?
We model the responses from the students as 10 independent Bernoulli trials with probability of success p on each trial. Then the total number of successes, say Y, in 10 trials follows a binomial model:
Pr(Y = y) = (10 choose y) p^y (1 − p)^(10−y),   y = 0, 1, ..., 10

The maximum likelihood principle says to find the value of p which maximizes the probability of observing the number of successes we actually observed in the sample; that is, find p to maximize:

Pr(Y = 4) = (10 choose 4) p^4 (1 − p)^6

This is called the likelihood function. The following is a graph of Pr(Y = 4) versus p. At what value of p does it appear to be maximized?
[Figure: the likelihood Pr(Y = 4) plotted against p, for p between 0 and 1.]
We can also find the exact solution using calculus. Note that finding the value of p to maximize Pr(Y = 4) is equivalent to finding the value of p which maximizes

ln[Pr(Y = 4)] = ln(10 choose 4) + ln[p^4 (1 − p)^6] = ln(10 choose 4) + 4 ln p + 6 ln(1 − p)
Take the derivative with respect to p and set it equal to 0:
Now, solve for p:
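(One worked version of these two steps:

d/dp [4 ln p + 6 ln(1 − p)] = 4/p − 6/(1 − p) = 0  ⇒  4(1 − p) = 6p  ⇒  p̂ = 4/10 = .4)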
By repeating this process but replacing 10 by n (the number of trials) and 4 by y (the number of successes), we find a general expression for the MLE of p in n independent Bernoulli trials:

p̂ = y/n
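The same answer can be checked numerically; a small sketch that evaluates the likelihood on a grid and also hands it to a general-purpose optimizer:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

# Likelihood for y = 4 successes in n = 10 Bernoulli trials
def lik(p):
    return binom.pmf(4, 10, p)

grid = np.linspace(0.001, 0.999, 999)
print(grid[np.argmax(lik(grid))])   # ~0.4 from a grid search

res = minimize_scalar(lambda p: -lik(p), bounds=(0.001, 0.999), method="bounded")
print(res.x)                        # ~0.4 from a numerical optimizer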
• Back to logistic regression: we use the maximum likelihood principle to find estimators of the β's in the logistic regression model. The likelihood function is the probability of observing the particular set of failures and successes that we observed in the sample. But there's a difference from the binomial model above: the model says that the probability of success is possibly different for each subject because it depends on the explanatory variables X1, ..., Xp. Dropping the subscript i to avoid confusion, the probability of success for an individual is:
Pr(Y = 1) = π, where

π = e^(β0 + β1X1 + ... + βpXp) / (1 + e^(β0 + β1X1 + ... + βpXp)).

• Note that we can also write the model for a single individual as Pr(Y = y) = π^y (1 − π)^(1−y), y = 0, 1.
• If the responses are independent, then the probability of observing the set of 0/1 responses we observed is

Pr(Y1 = y1, Y2 = y2, ..., Yn = yn) = π1^y1 (1 − π1)^(1−y1) · π2^y2 (1 − π2)^(1−y2) ··· πn^yn (1 − πn)^(1−yn) = ∏(i=1 to n) πi^yi (1 − πi)^(1−yi)

where each of the πi's is, in turn, a function of the βi's. Therefore, we maximize this likelihood function as a function of the βi's. This cannot be done analytically and must be done numerically using a computer program.
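To make "numerically using a computer program" concrete, here is a minimal sketch that maximizes the Bernoulli log-likelihood directly; the data are simulated, and in practice a packaged routine (such as SPSS's) does this for you:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
true_beta = np.array([-0.5, 1.0])
pi = 1 / (1 + np.exp(-X @ true_beta))
y = rng.binomial(1, pi)

def neg_loglik(beta):
    eta = X @ beta
    # log-likelihood: sum over i of y_i*eta_i - log(1 + exp(eta_i)), written in a stable form
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x)   # maximum likelihood estimates of beta0, beta1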
• SPSS can fit logit models: use Analyze…Regression…Binary Logistic. Identify the binary dependent variable and all the covariates.
• If any of the covariates are categorical, then click Categorical to tell SPSS how to create indicator variables (or another coding) for each categorical variable. If a binary covariate is already coded as 0/1, it's not necessary to identify it as categorical.
• Interaction terms can be added when specifying the covariates in the logistic regression command in SPSS. Select two or more variables (keep the Ctrl key pressed while selecting the second and subsequent variables), then press the "a*b>" button.
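The additive model fit in the example that follows could also be reproduced with statsmodels' formula interface. This is only a sketch: it assumes the birthweight data have been read into a pandas DataFrame named df with the columns listed below; C(race) requests indicator coding, and a term such as lwt:smoke would add an interaction.

import numpy as np
import statsmodels.formula.api as smf

# Assumes df is a pandas DataFrame holding the 189 birthweight records
model = smf.logit("low ~ age + lwt + C(race) + smoke + ptl + ht + ui + ftv", data=df)
fit = model.fit()
print(fit.summary())        # coefficients, SEs, z tests
print(np.exp(fit.params))   # Exp(B): odds ratios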
Example: Birthweight data. The response variable was the 0/1 variable "low", which was an indicator of low birthweight. An additive model with all the covariates was fit. The variables were:

Name    Meaning
low     Birthweight < 2500 gm.
age     Age (years)
lwt     Weight at last menstrual period (lbs)
race    Race (1=White, 2=Black, 3=Other)
smoke   Smoke (0/1)
ptl     Number of premature labors
ht      Hypertension (0/1)
ui      Uterine irritability (0/1)
ftv     Number of physician visits in first trimester
There is a lot of SPSS output. For now, we're interested in only two tables: one that indicates how the categorical variable Race was coded, and the final table of model coefficients (under Block 1).

Categorical Variables Codings

Race (1=White, 2=Black, 3=Other)   Frequency   Parameter coding
                                                (1)      (2)
White                               96           .000     .000
Black                               26          1.000     .000
Other                               67           .000    1.000
I chose Indicator coding for Race with the first category (White) as the reference category. The two columns of this table are the two indicator variables for Race.
Variables in the Equation (Step 1a)

Variable     B       S.E.    Wald    df   Sig.   Exp(B)   95% CI Lower   95% CI Upper
age        -.030     .037     .637    1   .425    .971       .903           1.044
lwt        -.015     .007    4.969    1   .026    .985       .971            .998
race                         7.116    2   .028
race(1)    1.272     .527    5.820    1   .016   3.569      1.270          10.033
race(2)     .880     .441    3.990    1   .046   2.412      1.017           5.723
smoke       .939     .402    5.450    1   .020   2.557      1.163           5.624
ptl         .543     .345    2.474    1   .116   1.722       .875           3.388
ht         1.863     .698    7.136    1   .008   6.445      1.642          25.291
ui          .768     .459    2.793    1   .095   2.155       .876           5.301
ftv         .065     .172     .143    1   .705   1.067       .761           1.497
Constant    .481    1.197     .161    1   .688   1.617

a. Variable(s) entered on step 1: age, lwt, race, smoke, ptl, ht, ui, ftv.
The columns in this table are:

1. B: the estimated coefficients.

2. S.E.: The standard errors (SE's) for the estimated coefficients (the β̂i's) are estimated using asymptotic results for maximum likelihood estimators. An approximate confidence interval for each coefficient can be calculated as β̂i ± z·SE, where z is the appropriate percentile of the standard normal distribution. We don't use the t-distribution; that only applies to the linear regression model with normal errors.

3. Wald: the Wald statistic is (β̂i / SE(β̂i))². It is used as a two-sided test of the hypothesis H0: βi = 0, given that all the other variables are in the model. Under this null hypothesis, β̂i / SE(β̂i) has approximately a N(0,1) distribution. The square of a N(0,1) is a chi-square distribution with 1 d.f., so the Wald statistic can be compared to a χ²(1) to get a test of H0: βi = 0.

4. df: The d.f. for the chi-square distribution which the Wald statistic is compared to. It is 1 except for categorical explanatory variables with more than 2 levels (however, the test for each indicator variable involved in the categorical variable still has 1 d.f.).

5. Sig.: The P-value for the test of H0: βi = 0 using the Wald statistic. This is the area to the right of the Wald statistic in a χ²(1) distribution.

6. Exp(B): exp(β̂i) has the interpretation that it represents how many times the odds of a positive response are multiplied for every one-unit increase in xi, given that the values of the other variables in the model remain the same. For example, Exp(B) for Smoke is 2.557. Since Smoke is 0/1, this means that, according to this model, the odds that a smoker will have a low-birthweight baby are estimated to be 2.557 times the odds that a non-smoker will (95% CI: 1.163 to 5.624; see next column), even after adjusting for last weight, race, and the other variables in the model. Similarly, every one-year increase in age reduces the odds of a low-birthweight baby to .971 of what they were (95% CI: .903 to 1.044), after adjusting for the other variables.

7. 95.0% C.I. for Exp(B): A confidence interval for exp(βi) (the true value) can be calculated as [exp(β̂i − z·SE), exp(β̂i + z·SE)]. SPSS will calculate a confidence interval for each exp(βi) in the model if, under Options, you select "CI for exp(B)."
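As a sanity check, the Wald statistic, P-value, Exp(B), and confidence interval for any row of the table can be reproduced from B and S.E. alone; for example, for Smoke:

import numpy as np
from scipy.stats import chi2

B, SE = 0.939, 0.402             # Smoke row from the SPSS table
wald = (B / SE) ** 2             # ~5.45
pval = chi2.sf(wald, df=1)       # ~.020
odds_ratio = np.exp(B)           # ~2.557
ci = np.exp([B - 1.96 * SE, B + 1.96 * SE])   # ~(1.163, 5.624)
print(wald, pval, odds_ratio, ci)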
Comparing two models

In linear regression, we used the extra-sum-of-squares F test to compare two nested models (two models are nested if one is a special case of the other). The analogous test in logistic regression compares the values of −2 ln(Maximized likelihood function). What is this quantity? Recall that the MLE's of the βi's are the values that maximize the likelihood function of the data (defined a couple of pages ago). So, we find the maximum value of the likelihood function, take the natural log, and multiply by −2.

• The quantity −2 ln(Maximized likelihood) is also called the deviance of a model, since larger values indicate greater deviation from the assumed model. Comparing two nested models by the difference in deviances is a drop-in-deviance test.
SPSS gives the value of −2 ln(Maximized likelihood) in the output. For the full model for low birthweight examined previously:

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      201.285(a)          .162                   .228

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
• The difference between the values of −2 ln(Maximized likelihood function) for a full and reduced model has approximately a chi-square distribution if the null hypothesis that the extra parameters are all 0 is true. The d.f. is the difference in the number of parameters for the two models.
Example: Compare the full birthweight model to one without Smoke. The value of −2 ln(Maximized likelihood function) for the model without Smoke is:

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      206.906(a)          .137                   .192

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
The test statistic is

Drop-in-deviance = 206.906 − 201.285 = 5.621

The P-value is the area to the right of 5.621 in a χ²(1) distribution (since there is one less parameter in the reduced model). Thus, P = .018, which indicates moderately strong evidence of an effect of smoking on the probability of having a low-birthweight baby after adjusting for the other variables in the model.
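The arithmetic of the drop-in-deviance test is easy to reproduce (a small sketch):

from scipy.stats import chi2

dev_reduced, dev_full = 206.906, 201.285
drop = dev_reduced - dev_full     # 5.621
pval = chi2.sf(drop, df=1)        # ~.018 (one extra parameter in the full model)
print(drop, pval)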
• The drop-in-deviance test is a likelihood ratio test (LR test) because it is based on the natural log of the ratio of the maximized likelihoods (the difference of logs is the log of the ratio). The extra-sum-of-squares F-test in linear regression also turns out to be a likelihood ratio test.
• If the full and reduced models differ by only one parameter, as in this example, then the likelihood ratio test is testing the same thing as the Wald test above. In this example, the test statistic is slightly different from the Wald test value of 5.450 and P = .020. The likelihood ratio test is preferred. The two tests will generally give similar results, but not always.
• The relationship between the Wald test and the likelihood ratio test is analogous to the relationship between the t-test for a single coefficient and the F-test in linear regression. However, in linear regression, the two tests are exactly equivalent. Not so in logistic regression.
One can obtain a LR test for each coefficient in SPSS by selecting "Backward-LR" under Method. Also be sure that Display: At each step is checked under Options. This performs a backward stepwise logistic regression. All we are using it for, though, is the output at the first step, where it reports a likelihood ratio test for each variable with all the other variables in the model. The table we are looking for looks like this; the LR test for each term is listed under Step 1.

Model if Term Removed

Step   Variable   Model Log Likelihood   Change in -2 Log Likelihood   df   Sig. of the Change
1      age        -100.965                  .645                        1    .422
       lwt        -103.402                 5.520                        1    .019
       race       -104.376                 7.468                        2    .024
       smoke      -103.453                 5.621                        1    .018
       ptl        -101.916                 2.546                        1    .111
       ht         -104.403                 7.521                        1    .006
       ui         -102.014                 2.742                        1    .098
       ftv        -100.713                  .142                        1    .706
2      age        -100.993                  .559                        1    .455
       lwt        -103.404                 5.381                        1    .020
       race       -104.386                 7.344                        2    .025
       smoke      -103.458                 5.489                        1    .019
       ptl        -101.974                 2.521                        1    .112
       ht         -104.406                 7.384                        1    .007
       ui         -102.053                 2.680                        1    .102
3      lwt        -104.056                 6.126                        1    .013
       race       -105.155                 8.325                        2    .016
       smoke      -103.864                 5.743                        1    .017
       ptl        -102.108                 2.231                        1    .135
       ht         -104.732                 7.479                        1    .006
       ui         -102.449                 2.912                        1    .088
4      lwt        -105.583                 6.949                        1    .008
       race       -106.413                 8.609                        2    .014
       smoke      -105.755                 7.293                        1    .007
       ht         -105.957                 7.698                        1    .006
       ui         -104.124                 4.031                        1    .045
Model Selection

Both AIC and BIC can be used as model selection criteria. As with linear regression models, they are only relative measures of fit, not absolute measures of fit.

AIC = Deviance + 2p
BIC = Deviance + p·log(n)

where p is the number of parameters in the model. Stepwise model selection methods are available in SPSS using likelihood ratio tests or Wald's test. The LR methods are preferred. Other software programs, like S-Plus, have stepwise procedures using AIC or BIC.

Interactions

Interaction terms can be added when specifying the covariates in the logistic regression command in SPSS. Select two or more variables (keep the Ctrl key pressed while selecting the second and subsequent variables), then press the "a*b>" button.

Residuals

There are two standard ways to define a residual in logistic regression.

1. Pearson residual = (yi − π̂i) / √(π̂i(1 − π̂i)).

This is called a "Standardized Residual" in the Save option under Regression…Binary Logistic in SPSS, but it is labeled "Normalized Residual" in the data sheet. The Pearson residual is the raw (unstandardized) residual yi − π̂i standardized by its estimated standard deviation.
2. Deviance residual = √(−2 ln π̂i) if yi = 1; −√(−2 ln(1 − π̂i)) if yi = 0.

This is called the "Deviance Residual" in SPSS. The deviance residuals have the property that the sum of the squared deviance residuals is the deviance D (= −2 ln(maximized likelihood)) for the model.

The Pearson residual is more easily understood, but the deviance residual directly gives the contribution of each point to the lack of fit of the model. There is also a "Studentized residual" in SPSS (but labeled "Standard residual" in the data sheet). It's usually almost identical to the deviance residual, so we won't consider it here.

The residuals in a logistic regression (either Pearson or deviance) are not as useful as in a linear regression unless the data are grouped (see Chapter 21). The residuals are not assumed to have a normal distribution, as in linear regression. Since the residuals don't have a normal distribution, the usual ±2 or ±3 cutoffs don't necessarily apply. Instead, we can simply look for outliers among the distribution of Pearson or deviance residuals. We can plot the values against the predicted probabilities.

Measures of influence

Measures of influence attempt to measure how much individual observations influence the model. Observations with high influence merit special examination. Many measures of influence have been suggested and are used for linear regression. Some of these have analogies for logistic regression. However, the guidelines for deciding what is a big enough value of an influence measure to merit special attention are less developed for logistic regression than for linear regression, and the guidelines developed there are sometimes not appropriate for logistic regression.

Cook's Distance: This is a measure of how much the residuals change when the case is deleted. Large values indicate a large change when the observation is left out. Plot Di against case number or against π̂i and look for outliers.
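Returning to the residual definitions above, here is a minimal sketch of how both kinds are computed from the 0/1 responses and fitted probabilities (the numbers below are toy values, not the birthweight fit):

import numpy as np

y = np.array([1, 0, 1, 0, 0])                  # toy 0/1 responses
pihat = np.array([0.8, 0.3, 0.4, 0.1, 0.6])    # toy fitted probabilities

pearson = (y - pihat) / np.sqrt(pihat * (1 - pihat))

# Deviance residual: sign set by y, magnitude sqrt(-2 ln pihat) or sqrt(-2 ln(1 - pihat))
deviance = np.where(y == 1, np.sqrt(-2 * np.log(pihat)), -np.sqrt(-2 * np.log(1 - pihat)))

print(pearson)
print(deviance)
print(np.sum(deviance ** 2))   # equals the model deviance when pihat comes from a fitted model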
Leverage: The leverage is a measure of the potential influence of an observation. In linear regression, the leverage is a function of how far an observation's covariate vector xi is from the average; it is a function only of the covariate vector. In logistic regression, the leverage is a function of xi and of πi (which must be estimated), so it isn't necessarily the observations with the most extreme xi's which have the most leverage.
Assessing the overall fit of the model

Assessing the fit of a logistic regression model is difficult. Since the response is 0/1, plots of the response versus explanatory variables are not as useful as when the response is continuous. There is no good measure analogous to R²; SPSS will report a couple of R²-like measures that have been proposed for logistic regression, but these don't have the same interpretation as in linear regression and are not particularly good measures of fit.

In theory, a chi-square goodness-of-fit test could be carried out by comparing the observed count in each category (meaning each combination of the values of the explanatory variables) against the expected count. However, unless there are multiple subjects in each combination of the values of the covariates (the explanatory variables), the observed counts are 0's and 1's and the expected counts are simply the π̂i's. But the chi-square approximation for the goodness-of-fit statistic does not hold when the expected counts are all less than 1 (most of them should be above 5 to use the chi-square approximation).

A way around this difficulty is to group the data. The Hosmer-Lemeshow test groups individuals by their value of π̂i into ten groups of about equal size. That is, the tenth of the observations with the lowest values of π̂i are put in the first group, the next tenth lowest values in the second, and so on. It then totals the π̂i's for everyone in each group (that is, it totals the expected number of successes) and does the same for the 1 − π̂i's (the expected number of failures). It then compares the observed and expected counts in the 20 cells using Pearson's X² and compares the result to a chi-square with 8 d.f. The expected counts still need to be large enough for the chi-square approximation to hold, but this is much more likely to happen since we group the data. Since the cases are grouped, this test is "crude" in a sense, but it is a useful tool.

Another goodness-of-fit tool is the classification table. Each observation is classified according to its estimated π: if π̂i > .5, then it's classified as a "success" (a 1), otherwise as a "failure" (a 0). The table of actual group versus predicted group is used to assess how well the classifier does. This is a very crude way of assessing how well the model fits, since it does not measure how far π̂i is from yi, only whether it is on the same side of .5 as yi. In addition, if yi = 1 is rare, then the estimated probabilities may all be less than .5 and the classification table isn't useful in assessing fit (the same is true if yi = 0 is rare).
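A rough sketch of the Hosmer-Lemeshow computation. It assumes arrays y and pihat from a fitted model; real implementations differ in small details such as how ties are split into groups.

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, pihat, g=10):
    order = np.argsort(pihat)
    groups = np.array_split(order, g)        # roughly equal-size groups by sorted pihat
    stat = 0.0
    for idx in groups:
        obs_succ = y[idx].sum()
        exp_succ = pihat[idx].sum()           # expected number of successes in the group
        obs_fail = len(idx) - obs_succ
        exp_fail = (1 - pihat[idx]).sum()     # expected number of failures in the group
        stat += (obs_succ - exp_succ) ** 2 / exp_succ + (obs_fail - exp_fail) ** 2 / exp_fail
    return stat, chi2.sf(stat, df=g - 2)      # compare to chi-square with g - 2 = 8 d.f.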
Example: Birthweight data, using only LastWeight (lwt), Race, and Smoke
First fit a rich model:

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      210.491(a)          .120                   .169

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Variables in the Equation (Step 1a)

Variable              B        S.E.    Wald    df   Sig.   Exp(B)
lwt                 -.012      .045     .073    1   .786    .988
race                                   1.733    2   .420
race(1)            -3.481     2.717    1.642    1   .200    .031
race(2)            -2.969     2.907    1.043    1   .307    .051
smoke               -.679     1.933     .123    1   .726    .507
lwt * race                             1.271    2   .530
lwt by race(1)       .017      .021     .634    1   .426   1.017
lwt by race(2)       .025      .022    1.269    1   .260   1.026
lwt by smoke         .008      .016     .257    1   .612   1.008
race * smoke                           1.911    2   .385
race(1) by smoke    1.308      .953    1.882    1   .170   3.698
race(2) by smoke     .667     1.180     .320    1   .572   1.949
lwt2                 .000      .000     .167    1   .683   1.000
Constant            1.849     3.131     .349    1   .555   6.353

a. Variable(s) entered on step 1: lwt, race, smoke, lwt * race, lwt * smoke, race * smoke, lwt2.
Compare to a main-effects model:

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      215.015(a)          .099                   .139

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
Variables in the Equation (Step 1a)

Variable     B       S.E.    Wald    df   Sig.   Exp(B)
lwt        -.013     .006   4.415    1    .036    .987
race                        8.730    2    .013
race(1)    -.971     .412   5.543    1    .019    .379
race(2)     .320     .526    .370    1    .543   1.377
smoke      1.060     .378   7.850    1    .005   2.886
Constant    .861     .797   1.168    1    .280   2.366

a. Variable(s) entered on step 1: lwt, race, smoke.

• Carry out the drop-in-deviance test:
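One way to carry it out, using the two deviances above. Counting rows in the two coefficient tables, the rich model has 11 estimated coefficients and the main-effects model has 5, so the test has 6 d.f.

from scipy.stats import chi2

drop = 215.015 - 210.491    # 4.524
pval = chi2.sf(drop, df=6)  # ~0.61: no evidence the interaction and squared terms are needed
print(drop, pval)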
Categorical Variables Codings

Race (1=White, 2=Black, 3=Other)   Frequency   Parameter coding
                                                (1)      (2)
White                               96          1.000     .000
Black                               26           .000    1.000
Other                               67           .000     .000
The largest difference is between White and Other. The difference between Black and Other is smaller and not statistically significant. We might consider coding race as simply White/Nonwhite.

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      215.383(a)          .097                   .136

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
Variables in the Equation (Step 1a)

Variable      B        S.E.    Wald    df   Sig.   Exp(B)   95% CI Lower   95% CI Upper
lwt         -.0122    .0060   4.118    1    .042    .988        .976          1.000
Nonwhite    1.077     .373    8.342    1    .004   2.937       1.414          6.102
smoke       1.103     .372    8.775    1    .003   3.013       1.452          6.252
Constant    -.275     .834     .109    1    .742    .760

a. Variable(s) entered on step 1: lwt, Nonwhite, smoke.
Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      7.071        8    .529
Some residual analysis should also be done to see that there are not any outliers or particularly influential points.

• Write the logit equation from the last model for each of the four groups represented by the two categorical explanatory variables:
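(One worked version, using the coefficients from the last table, rounded as reported:

logit(π̂) = -.275 - .0122·lwt + 1.077·Nonwhite + 1.103·smoke

White, nonsmoker:     logit(π̂) = -.275 - .0122·lwt
White, smoker:        logit(π̂) = -.275 + 1.103 - .0122·lwt = .828 - .0122·lwt
Nonwhite, nonsmoker:  logit(π̂) = -.275 + 1.077 - .0122·lwt = .802 - .0122·lwt
Nonwhite, smoker:     logit(π̂) = -.275 + 1.077 + 1.103 - .0122·lwt = 1.905 - .0122·lwt)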
• Compare the odds of a low-birthweight baby for a nonwhite smoking mother who weighs 150 pounds versus a white, nonsmoking mother who weighs 120 pounds.
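(One worked version of this comparison, based on the same fitted equation:

Nonwhite smoker, lwt = 150:    logit(π̂) = 1.905 - .0122(150) = .075,   odds = e^.075 ≈ 1.08
White nonsmoker, lwt = 120:    logit(π̂) = -.275 - .0122(120) = -1.739, odds = e^(-1.739) ≈ 0.18
Ratio of odds: e^(.075 - (-1.739)) = e^1.814 ≈ 6.1

So the estimated odds of a low-birthweight baby for the nonwhite smoking mother weighing 150 lbs are about 6 times the odds for the white nonsmoking mother weighing 120 lbs.)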