Green // Statistics

The Mechanics of Multiple Regression

One of the most important concepts in statistics is the idea of "controlling" for a variable. This lecture is designed to give you a feel for what "controls" are and how they are implemented in the context of multiple regression.

Let's begin by considering an example. In the weeks leading up to the November 2003 election, a group called ACORN sought to bolster support for a ballot proposition in Kansas City. The measure authorized a rise in sales tax in order to fend off cuts to public transportation. ACORN canvassed voters in a predominantly black section of Kansas City, targeting registered voters who had voted in at least one of the five most recent elections. The campaign consisted primarily of door-to-door canvassing conducted during the final two weeks before Election Day.

I was asked to evaluate the effectiveness of this campaign. ACORN identified 28 precincts of potential interest to their campaign; I randomly assigned 14 to the treatment group and 14 to the control group. After the election, voter turnout records were gathered. Voting rates among those living in the treatment and control precincts were calculated. The data may be found at the Kansas City Dataset.

The data may be modeled in a few different ways. The simplest model describes the voter turnout rate (Y) as a linear function of the experimental treatment (X) plus a disturbance term: Y = a + bX + U. Here is an "individual value plot" of the data. Note that all of the X values are either 0 (control) or 1 (treatment), but the plot scatters them a bit in order to make the individual values easier to see.

[Figure: Individual value plot of VOTE03 versus TREATMEN (0 = control, 1 = treatment), with VOTE03 on the vertical axis.]
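For readers curious how such a jittered plot might be drawn outside Minitab, here is a rough matplotlib sketch. The CSV file name is hypothetical (the column names come from the output below), and the amount of jitter is arbitrary.

# Sketch only: a jittered "individual value plot", scattering the 0/1 treatment
# codes slightly so that individual precincts do not overlap.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

kc = pd.read_csv("kansas_city.csv")                     # hypothetical file name
jitter = np.random.uniform(-0.05, 0.05, size=len(kc))   # small horizontal scatter
plt.scatter(kc["TREATMEN"] + jitter, kc["VOTE03"])
plt.xticks([0, 1], ["control", "treatment"])
plt.xlabel("TREATMEN")
plt.ylabel("VOTE03")
plt.show()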

Using regression, we obtain the following results:

Regression Analysis: VOTE03 versus TREATMEN

The regression equation is
VOTE03 = 0.289 + 0.0355 TREATMEN

Predictor   Coef      SE Coef   T       P
Constant    0.28884   0.01778   16.24   0.000
TREATMEN    0.03554   0.02515    1.41   0.169

S = 0.0665291   R-Sq = 7.1%   R-Sq(adj) = 3.6%
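For anyone working outside Minitab, here is a minimal sketch of how one might reproduce this regression in Python, assuming the dataset has been exported to a CSV file with one row per precinct; the file name kansas_city.csv is made up, but the column names are the ones used above.

# Sketch (not the original Minitab session): bivariate regression of turnout on treatment.
import pandas as pd
import statsmodels.formula.api as smf

kc = pd.read_csv("kansas_city.csv")                 # columns: VOTE03, TREATMEN, VOTEAVG
model = smf.ols("VOTE03 ~ TREATMEN", data=kc).fit()
print(model.summary())                              # coefficients, SEs, t, p, R-squared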

The critical numbers here are .036, which suggests that the expected rate of turnout increases by 3.6 percentage points as we move from control to treatment, and .025, which conveys the uncertainty surrounding this experimental effect. The p-value of .169 tells us that there is a 16.9% chance of observing a treatment effect at least this large in absolute value even if the true experimental effect were zero. Ordinarily, we would use a one-tailed test here, because one would suppose that canvassing would increase turnout; in that case, the one-tailed p-value is approximately .09. For what it's worth, that falls a bit short of the conventional statistical significance threshold of .05. (Note that it is just a coincidence that the estimated treatment effect of .036 coincides with the adjusted R-squared of 3.6%. Why, speaking of R-squared, is it not of central concern as we interpret these regression statistics?)
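As a quick check on that arithmetic, here is a small sketch (using scipy, not part of the original analysis) that recovers both p-values from the reported t-ratio and the 28 - 2 = 26 residual degrees of freedom.

# Sketch: recover the two-tailed and one-tailed p-values from the t-ratio of 1.41.
from scipy import stats

t_ratio, df = 1.41, 28 - 2
two_tailed = 2 * stats.t.sf(t_ratio, df)   # ~0.17, matching the Minitab output
one_tailed = stats.t.sf(t_ratio, df)       # ~0.09, the one-tailed value quoted above
print(round(two_tailed, 3), round(one_tailed, 3))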

How can we make this analysis more precise? One answer is to gather more data. Another is to control for other predictors of voter turnout that are not consequences of the treatment. (We'll see why we don't want to control for consequences of the treatment in next week's lectures.) Fortunately, we happen to have just such a predictor at hand. The Kansas City voter file contains extensive information about the past voter turnout of every voter. I calculated the average voting rate over several elections from 1998 through the summer of 2003. Since these votes occurred before the experiment, we need not be concerned that they represent consequences of the treatment. Let's call the past vote average Z and control for it in our revised regression model: Y = a + bX + cZ + U.

In terms of sheer Minitab mechanics, this model is easy to estimate.

Regression Analysis: VOTE03 versus TREATMEN, VOTEAVG

The regression equation is
VOTE03 = - 0.310 + 0.0452 TREATMEN + 1.17 VOTEAVG

Predictor   Coef       SE Coef   T       P
Constant    -0.31046   0.08199   -3.79   0.001
TREATMEN     0.04518   0.01446    3.12   0.004
VOTEAVG      1.1723    0.1591     7.37   0.000

S = 0.0381032   R-Sq = 70.7%   R-Sq(adj) = 68.4%
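Again, for readers working outside Minitab, a minimal sketch of the same model, assuming the same hypothetical CSV export as above:

# Sketch: add the pre-experiment turnout average as a covariate and refit.
import pandas as pd
import statsmodels.formula.api as smf

kc = pd.read_csv("kansas_city.csv")                              # hypothetical file name
model2 = smf.ols("VOTE03 ~ TREATMEN + VOTEAVG", data=kc).fit()
print(model2.params)   # intercept ~ -0.310, TREATMEN ~ 0.045, VOTEAVG ~ 1.17
print(model2.bse)      # standard errors, including ~0.014 for TREATMEN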

Take a close look at these regression results, and compare them to the results presented above. The estimated treatment effect is somewhat larger than before. This pattern is specific to this example, not a general feature of multiple regression: you should not expect coefficients to grow when control variables are added to a regression equation, especially when analyzing experimental data. (Why?) In this case, the estimated treatment effect grows from .036 to .045, but it could have gone the other way. It just happens that the randomly assigned treatment was more likely to go to precincts with below-average VOTEAVG scores. In the plot below, we see that the correlation between TREATMEN and VOTEAVG is slightly negative. Thus, the positive influence of the treatment is somewhat understated in the first regression, because that regression ignores the fact that the treated precincts had slightly lower voting propensities before the experiment got underway.

[Figure: Individual value plot of VOTEAVG versus TREATMEN (0 = control, 1 = treatment), with VOTEAVG on the vertical axis.]
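If you prefer a number to a picture, here is a one-line check of that correlation (same hypothetical CSV as above).

# Sketch: correlation between treatment assignment and prior turnout.
import pandas as pd

kc = pd.read_csv("kansas_city.csv")          # hypothetical file name
r = kc["TREATMEN"].corr(kc["VOTEAVG"])
print(r, r**2)                               # r is slightly negative; r**2 (~0.008) reappears below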

The next thing to notice about the multiple regression output is that the standard error of the estimated treatment effect has dropped from .025 to .014. That may not sound like much, but remember that a drop of this magnitude is tantamount to a dramatic expansion in sample size: in order to reduce the standard error by a factor of 1.78, one would have to increase the sample size by a factor of 3.19. Why the decline in the standard error? A big part of the answer has to do with "s", the estimated standard deviation of the disturbances. The first regression estimated s to be .067; the second regression brought the standard deviation of the unobservables down to .038.

Was there a cost to adding an additional control variable, or "covariate"? The answer is yes. Any time we add an additional covariate we are penalized in two ways. Here's the formula for the standard error of a regression of Y on X, which you should compare to the formula for the standard error of a regression of Y on X controlling for Z.

sqrt{ [Σe_i² / (n - k)] / [(n - 1) Var(X)] } = s / sqrt[(n - 1) Var(X)]
    = .0665291 / sqrt[(27)(.2593)]
    = .02514
    = estimated standard error of b when Y is regressed on X.

sqrt{ [Σe_i² / (n - k)] / [(n - 1) Var(X) (1 - R²_XZ)] } = s / sqrt[(n - 1) Var(X) (1 - R²_XZ)]
    = .0381032 / sqrt[(27)(.2593)(1 - .0081)]
    = .01446
    = estimated standard error of b when Y is regressed on X, controlling for Z.
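Before unpacking the two penalties, here is a short sketch that plugs the numbers reported above into both formulas. The values .2593 (the sample variance of TREATMEN) and .0081 (the squared correlation between TREATMEN and VOTEAVG) are taken from the text, as are the two values of s.

# Sketch: verify the two standard-error calculations shown above.
from math import sqrt

n = 28
var_x = 0.2593      # sample variance of TREATMEN (14 of 28 precincts treated)
r2_xz = 0.0081      # squared correlation between TREATMEN and VOTEAVG
s1 = 0.0665291      # s from the bivariate regression
s2 = 0.0381032      # s from the multiple regression

se_simple   = s1 / sqrt((n - 1) * var_x)                  # ~0.0251
se_multiple = s2 / sqrt((n - 1) * var_x * (1 - r2_xz))    # ~0.0145
print(round(se_simple, 5), round(se_multiple, 5))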

What are the two penalties? First, the number of degrees of freedom (n - k) decreases as we add more variables. In this formula, n is the number of observations and k is the number of parameters that are being estimated. All other things being equal, a decline in the number of degrees of freedom tends to make the numerator bigger, which in turn makes the standard error bigger. Second, correlation between the independent variables makes the denominator smaller, which in turn makes the standard error bigger. Notice that when this correlation is zero, the two standard errors have the same denominators; when the correlation is 1, the standard error in the second formula becomes infinite.

Let's try to get a feel for what multiple regression is actually doing. Here is the scatterplot of VOTE03 against VOTEAVG, with different markers for the treatment and control groups. Imagine that we were to pass two parallel regression lines through these data, one for the red points and another for the black points. These parallel lines would have a slope of 1.17. The vertical distance between the two lines would be .045; this vertical distance reveals the apparent effect of the experimental treatment. Because the treatment variable in this example is a dummy variable, it can be thought of as a variable that generates a shift in the intercept.

[Figure: Scatterplot of VOTE03 versus VOTEAVG, with separate markers for treatment (TREATMEN = 1) and control (TREATMEN = 0) precincts.]
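To see the two parallel lines numerically rather than graphically, here is a brief sketch (same hypothetical CSV) that writes out the control-group and treatment-group lines implied by the fitted model.

# Sketch: the two parallel lines implied by the multiple regression.
# The vertical gap between them is the TREATMEN coefficient (~0.045); both have slope ~1.17.
import pandas as pd
import statsmodels.formula.api as smf

kc = pd.read_csv("kansas_city.csv")                            # hypothetical file name
fit = smf.ols("VOTE03 ~ TREATMEN + VOTEAVG", data=kc).fit()
a, b, c = fit.params[["Intercept", "TREATMEN", "VOTEAVG"]]
print("control line:  VOTE03 =", round(a, 3), "+", round(c, 3), "* VOTEAVG")
print("treated line:  VOTE03 =", round(a + b, 3), "+", round(c, 3), "* VOTEAVG")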

The predicted values from the multiple regression are depicted in the following graph. Notice that the points in each experimental group are arrayed along (invisible) parallel regression lines.

[Figure: Scatterplot of the fitted values (FITS1) versus VOTEAVG, with separate markers for treatment and control precincts.]

How does multiple regression know how to space the two parallel lines? How does the computer choose which coefficients to attach to each independent variable? As in the case of regression of Y on a single variable, least squares regression selects the coefficients that minimize the sum of squared residuals. Fortunately, this algorithm is easy to implement algebraically. In order to see the mechanics at work, break multiple regression down into a series of bivariate regressions. In order to estimate the experimental effect (i.e., the coefficient on the TREATMEN variable), we perform the following operations:

1. Regress TREATMEN on VOTEAVG.
2. Calculate the residuals from this regression. Residuals are computed as the actual values of TREATMEN minus the predicted values of TREATMEN.
3. Regress VOTE03 on the residuals calculated in Step 2 to obtain the multiple regression estimate.

Let's give it a try.

Step 1:

Regression Analysis: TREATMEN versus VOTEAVG

The regression equation is
TREATMEN = 1.00 - 1.00 VOTEAVG

Predictor   Coef     SE Coef   T       P
Constant     1.005   1.094      0.92   0.367
VOTEAVG     -0.996   2.149     -0.46   0.647

S = 0.516746   R-Sq = 0.8%   R-Sq(adj) = 0.0%

Step 2: When performing this regression, ask Minitab to save residuals under the "storage" option. (Or you can do this manually by plugging in the values from the regression equation.) In this case, Minitab stores the residuals as RESI1.

Step 3:

Regression Analysis: VOTE03 versus RESI1

The regression equation is
VOTE03 = 0.307 + 0.0452 RESI1

Predictor   Coef      SE Coef   T       P
Constant    0.30661   0.01228   24.97   0.000
RESI1       0.04518   0.02466    1.83   0.078

S = 0.0649703   R-Sq = 11.4%   R-Sq(adj) = 8.0%

How does our pseudo-regression procedure compare to the real thing?

Regression Analysis: VOTE03 versus VOTEAVG, TREATMEN

The regression equation is
VOTE03 = - 0.310 + 1.17 VOTEAVG + 0.0452 TREATMEN

Predictor   Coef       SE Coef   T       P
Constant    -0.31046   0.08199   -3.79   0.001
VOTEAVG      1.1723    0.1591     7.37   0.000
TREATMEN     0.04518   0.01446    3.12   0.004

S = 0.0381032   R-Sq = 70.7%   R-Sq(adj) = 68.4%

Notice that the slope coefficient for the TREATMEN variable is the same as the slope for the RESI1 variable. Success. (On the other hand, the standard errors and t-ratios from the pseudo-regression are wrong, so beware. Eventually, you'll learn how to calculate all of the regression output, but so far you only know how to generate the slopes.)

What is the underlying theory behind multiple regression? The key idea is to isolate the component of the TREATMEN variable that is uncorrelated with VOTEAVG. This aim is accomplished by performing a bivariate regression and calculating residuals. Those residuals represent the component of TREATMEN that is not predicted by VOTEAVG. Can you now figure out how to calculate the multiple regression slope for VOTEAVG?
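Finally, for anyone who wants to replicate the three-step procedure without Minitab, here is a compact sketch (same hypothetical CSV). It confirms that the Step 3 slope equals the multiple regression coefficient on TREATMEN, while, as noted above, the standard errors from the pseudo-regression should not be trusted.

# Sketch of the three-step "partialling out" procedure described above.
import pandas as pd
import statsmodels.formula.api as smf

kc = pd.read_csv("kansas_city.csv")                              # hypothetical file name

step1 = smf.ols("TREATMEN ~ VOTEAVG", data=kc).fit()             # Step 1
kc["RESI1"] = step1.resid                                        # Step 2: residuals
step3 = smf.ols("VOTE03 ~ RESI1", data=kc).fit()                 # Step 3

full = smf.ols("VOTE03 ~ TREATMEN + VOTEAVG", data=kc).fit()     # the real thing
print(step3.params["RESI1"], full.params["TREATMEN"])            # both ~0.045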
