Example of Using SPSS to Generate a Simple Regression Analysis

Suppose the management team of a retail chain wants to develop a strategy for forecasting annual sales. The following data have been gathered from a random sample of existing stores:

STORE   SQUARE FOOTAGE   ANNUAL SALES ($)
1       1726.00           3681.00
2       1642.00           3895.00
3       2816.00           6653.00
4       5555.00           9543.00
5       1292.00           3418.00
6       2208.00           5563.00
7       1313.00           3660.00
8       1102.00           2694.00
9       3151.00           5468.00
10      1516.00           2898.00
11      5161.00          10674.00
12      4567.00           7585.00
13      5841.00          11760.00
14      3008.00           4085.00
We can enter the data into SPSS by typing it directly into the data editor, or by cutting and pasting:
Next, by clicking on ‘Variable View’, we can apply variable and value labels where appropriate:
Assuming, for now, that if a relationship exists between the two variables, it is linear in nature, we can generate a simple Scatterplot (or Scatter Diagram) for the data. This is accomplished with the command sequence:
Which yields the following (editable) scatterplot:
[Scatterplot: "Regression Analysis for Site Selection: Simple Scatterplot of Data"; Sales Revenue of Store (Y axis, 0 to 14000) vs. Square Footage of Store (X axis, 0 to 7000)]
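For readers who want to reproduce the plot outside SPSS, here is a minimal Python sketch (matplotlib assumed; the variable names are mine, not SPSS's):

    import matplotlib.pyplot as plt

    # Square footage and annual sales for the 14 sampled stores
    sqft  = [1726, 1642, 2816, 5555, 1292, 2208, 1313,
             1102, 3151, 1516, 5161, 4567, 5841, 3008]
    sales = [3681, 3895, 6653, 9543, 3418, 5563, 3660,
             2694, 5468, 2898, 10674, 7585, 11760, 4085]

    plt.scatter(sqft, sales)                    # one point per store
    plt.xlabel("Square Footage of Store")
    plt.ylabel("Sales Revenue of Store ($)")
    plt.title("Simple Scatterplot of Data")
    plt.show()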
We can generate a simple straight-line equation from the output produced when using the Enter method in regression:
Which yields:

Variables Entered/Removed(b)

Model   Variables Entered         Variables Removed   Method
1       Square Footage of Store   .                   Enter

a. All requested variables entered.
b. Dependent Variable: Sales Revenue of Store
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .954(a)   .910       .902                936.8500

a. Predictors: (Constant), Square Footage of Store
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   1.06E+08          1   106208119.7   121.009   .000(a)
Residual     10532255         12   877687.937
Total        1.17E+08         13

a. Predictors: (Constant), Square Footage of Store
b. Dependent Variable: Sales Revenue of Store
In the ANOVA table above, the Regression, Residual, and Total sums of squares are SSR, SSE, and SST, respectively. The coefficients b0 (the constant) and b1 (the slope) appear in the B column of the Coefficients table:

Coefficients(a)

Model 1                   B         Std. Error   Beta   t        Sig.   95% CI for B (Lower, Upper)
(Constant)                901.247   513.023             1.757    .104   (-216.534, 2019.027)
Square Footage of Store   1.686     .153         .954   11.000   .000   (1.352, 2.020)

a. Dependent Variable: Sales Revenue of Store
(B = unstandardized coefficient; Beta = standardized coefficient)

So then

Ŷi = 901.247 + 1.686 Xi

(noting that no direct interpretation of the Y intercept at 0 square footage is possible, so the intercept represents the portion of annual sales varying due to factors other than store size)

and where

SST = SSR (regression sum of squares) + SSE (error sum of squares)
    = sum of the squared differences between each observed value of Y and Y-bar
SSR = sum of the squared differences between each predicted value of Y and Y-bar
SSE = sum of the squared differences between each observed value of Y and its predicted value

Coefficient of Determination = SSR/SST = 0.91 (sample)
Standard Error of the Estimate = SYX = SQRT[ SSE / (n - 2) ] = 936.85
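All of these quantities can be checked outside SPSS; the following is a sketch in Python with numpy (the data are typed in directly, and the printed values should agree with the SPSS output up to rounding):

    import numpy as np

    sqft  = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313,
                      1102, 3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
    sales = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660,
                      2694, 5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

    b1, b0 = np.polyfit(sqft, sales, 1)       # slope ~ 1.686, intercept ~ 901.247

    pred = b0 + b1 * sqft                     # predicted values (Y-hat)
    sst = np.sum((sales - sales.mean())**2)   # total sum of squares
    sse = np.sum((sales - pred)**2)           # error sum of squares
    ssr = sst - sse                           # regression sum of squares

    r_sq = ssr / sst                          # coefficient of determination ~ .910
    s_yx = np.sqrt(sse / (len(sales) - 2))    # standard error of the estimate ~ 936.85
    print(b0, b1, r_sq, s_yx)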
Testing the General Assumptions of Regression and Residual Analysis

1. Normality of Error - similar to the t-test and ANOVA, regression is robust to departures from normality of the errors around the regression line. This assumption is often tested by simply plotting the Standardized Residuals (each residual divided by its standard error) on a histogram with a superimposed normal distribution, or on a normal probability plot. SPSS allows us to perform both functions automatically (while, incidentally, saving the residual values in the original data file if this option is toggled):
[Figures: Histogram of the Regression Standardized Residual (Dependent Variable: Sales Revenue of Store; Std. Dev = .96, Mean = 0.00, N = 14.00), and Normal P-P Plot of Regression Standardized Residual (Expected Cum Prob vs. Observed Cum Prob)]
Of course, the assessment of normality by visually scanning the data leaves some statisticians unsettled, so I usually add an appropriate test of normality (here, the Anderson-Darling test) conducted on the data:

Variable         n    A-D     p-value
Stand._Resid.    14   0.348   0.503
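The same check can be sketched in Python with scipy. Note that scipy.stats.anderson reports the A-D statistic alongside critical values rather than an exact p-value, so its output is read slightly differently than the table above:

    import numpy as np
    from scipy import stats

    sqft  = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313,
                      1102, 3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
    sales = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660,
                      2694, 5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

    b1, b0 = np.polyfit(sqft, sales, 1)
    resid = sales - (b0 + b1 * sqft)
    s_yx = np.sqrt(np.sum(resid**2) / (len(resid) - 2))
    std_resid = resid / s_yx                  # standardized residuals

    result = stats.anderson(std_resid, dist='norm')
    print(result.statistic)                   # A-D test statistic
    print(result.critical_values)             # reject normality if statistic exceeds these
    print(result.significance_level)          # ... at the corresponding levels (in %)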
2. Homoscedasticity - the assumption that the variability of the data around the regression line is constant for all values of X. In other words, the error must be independent of X. Generally, this assumption may be tested by plotting the X values against the raw residuals for Y. In SPSS, this must be done by producing a scatterplot from the saved variables:
This results in the residual values being automatically added to the data file:
Then, simply produce the requisite scatterplot as before:
[Scatterplot: Unstandardized Residual (Y axis, -2000 to 2000) vs. Square Footage of Store (X axis, 1000 to 6000)]
Notice how there is no 'fanning' pattern to the data, implying homoscedasticity.
Other authors, including those who wrote the SPSS routine, choose to plot the X values against the Studentized Residuals (Standardized Residuals Adjusted for their distance from the average X value) rather than the Unstandardized (raw) Residuals. SPSS will generate this plot automatically (select this under the ‘Plots’ panel):
[Scatterplot of Studentized Residuals and Square Footage (X): Studentized Residual (Y axis, -2.5 to 1.5) vs. Square Footage of Store (X axis, 1000 to 6000)]

Note the equivalence of results between the two plots. Statistically speaking, the correlation between the X values and the Residuals may be inferred to be 0.00. We can confirm this using the correlation utility in SPSS, which tests the null hypothesis that the Pearson rho for the population is equal to 0.00:
Correlations

                                                Square Footage   Unstandardized   Studentized
                                                of Store         Residual         Residual
Square Footage of Store   Pearson Correlation   1.000            .000             .015
                          Sig. (2-tailed)       .                1.000            .959
                          N                     14               14               14
Unstandardized Residual   Pearson Correlation   .000             1.000            .999**
                          Sig. (2-tailed)       1.000            .                .000
                          N                     14               14               14
Studentized Residual      Pearson Correlation   .015             .999**           1.000
                          Sig. (2-tailed)       .959             .000             .
                          N                     14               14               14

**. Correlation is significant at the 0.01 level (2-tailed).
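The same test can be run in Python with scipy. By construction, least-squares residuals are exactly uncorrelated with X, so the sample correlation should come out at .000 with a p-value of 1.000, just as in the table:

    import numpy as np
    from scipy import stats

    sqft  = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313,
                      1102, 3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
    sales = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660,
                      2694, 5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

    b1, b0 = np.polyfit(sqft, sales, 1)
    resid = sales - (b0 + b1 * sqft)          # unstandardized residuals

    r, p = stats.pearsonr(sqft, resid)        # H0: population rho = 0.00
    print(round(r, 3), round(p, 3))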
It should be noted that the distribution of the data also suggests that an assumption of linearity is reasonable at this point.

3. Independence of the Errors - assumes that no autocorrelation is present. This is generally evaluated by plotting the residuals in the order or sequence in which the original data were collected. This approach, when meaningful, uses the Durbin-Watson Statistic and associated tables of critical values. SPSS can generate this value when requested as part of the Model Summary:

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .954(a)   .910       .902                936.8500

Change Statistics:   R Square Change = .910   F Change = 121.009   df1 = 1   df2 = 12   Sig. F Change = .000
Durbin-Watson = 2.446

a. Predictors: (Constant), Square Footage of Store
b. Dependent Variable: Sales Revenue of Store
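The Durbin-Watson statistic is simple enough to compute directly: it is the sum of squared differences between successive residuals divided by the sum of squared residuals, with values near 2 indicating no first-order autocorrelation. A minimal Python sketch:

    import numpy as np

    sqft  = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313,
                      1102, 3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
    sales = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660,
                      2694, 5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

    b1, b0 = np.polyfit(sqft, sales, 1)
    resid = sales - (b0 + b1 * sqft)          # residuals in data-collection order

    dw = np.sum(np.diff(resid)**2) / np.sum(resid**2)
    print(dw)                                 # should match the 2.446 reported above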
A number of other statistics are also available in SPSS regarding Residual Analysis:
Residuals Statistics(a)

                                    Minimum     Maximum     Mean        Std. Deviation   N
Predicted Value                     2759.3672   10749.96    5826.9286   2858.2959        14
Std. Predicted Value                -1.073      1.722       .000        1.000            14
Standard Error of Predicted Value   250.7362    512.8126    345.3026    81.3831          14
Adjusted Predicted Value            2771.8208   10518.55    5804.4373   2830.7178        14
Residual                            -1888.14    1070.6108   -3.25E-13   900.0964         14
Std. Residual                       -2.015      1.143       .000        .961             14
Stud. Residual                      -2.092      1.288       .011        1.035            14
Deleted Residual                    -2033.82    1442.1392   22.4913     1049.3911        14
Stud. Deleted Residual              -2.512      1.329       -.014       1.111            14
Mahal. Distance                     .003        2.967       .929        .901             14
Cook's Distance                     .001        .355        .086        .103             14
Centered Leverage Value             .000        .228        .071        .069             14

a. Dependent Variable: Sales Revenue of Store
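Most of these diagnostics can be reproduced in Python with statsmodels; the sketch below uses statsmodels' attribute names, which differ from SPSS's labels (for example, SPSS's 'Centered Leverage Value' corresponds to the hat value minus 1/n):

    import numpy as np
    import statsmodels.api as sm

    sqft  = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313,
                      1102, 3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
    sales = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660,
                      2694, 5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

    model = sm.OLS(sales, sm.add_constant(sqft)).fit()
    infl = model.get_influence()

    print(infl.hat_matrix_diag - 1 / len(sales))  # centered leverage values
    print(infl.cooks_distance[0])                 # Cook's distance per store
    print(infl.resid_studentized_internal)        # studentized residuals
    print(infl.resid_studentized_external)        # studentized deleted residuals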
Inferences About the Model and Interval Estimates

We can determine whether a significant relationship exists between X and Y by testing whether the observed slope is significantly different from 0, the hypothesized slope of the regression line if no relationship existed. This can be done with a t-test, which divides the observed slope by the standard error of the slope (supplied by SPSS):
Coefficients(a)

Model 1                   B         Std. Error   Beta   t        Sig.   95% CI for B (Lower, Upper)
(Constant)                901.247   513.023             1.757    .104   (-216.534, 2019.027)
Square Footage of Store   1.686     .153         .954   11.000   .000   (1.352, 2.020)

a. Dependent Variable: Sales Revenue of Store
or with an ANOVA model, which provides identical results:
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   1.06E+08          1   106208119.7   121.009   .000(a)
Residual     10532255         12   877687.937
Total        1.17E+08         13

a. Predictors: (Constant), Square Footage of Store
b. Dependent Variable: Sales Revenue of Store
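Because t² = F in a regression with a single predictor, the two tests must agree. A quick check in Python, plugging in the slope and standard error as read (rounded) from the Coefficients table:

    from scipy import stats

    b1, se_b1, df = 1.686, 0.153, 12      # rounded values from the SPSS output

    t = b1 / se_b1                        # ~ 11.0
    p_t = 2 * stats.t.sf(abs(t), df)      # two-tailed p-value for the t-test

    F = t**2                              # ~ 121, matching the ANOVA table
    p_F = stats.f.sf(F, 1, df)            # p-value for the F-test

    print(t, F, p_t, p_F)                 # p_t and p_F are identical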
As the output confirms, t², as expected, equals F, and the p-values are therefore equal. Note that SPSS also provides the confidence interval associated with the slope. Finally, SPSS allows you to calculate and store both Confidence and Prediction Limits for the observed data. After you generate the scatterplot, double-click on the chart; this will take you to the chart editor:
Next, within the chart editor, click on 'Fit Options':
[Scatterplot: "Regression Analysis for Site Selection: Scatterplot of Data Including Confidence & Prediction Limits"; Sales Revenue of Store (Y axis, 2000 to 12000) vs. Square Footage of Store (X axis, 1000 to 6000); Rsq = 0.9098]
The saved limits for each of the 14 stores (LCL/UCL = lower/upper confidence limits; LPL/UPL = lower/upper prediction limits):

Store   LCL           UCL            LPL          UPL
1       3135.52558    4487.50548     1661.27256   5961.75850
2       2976.95430    4362.80609     1514.25297   5825.50741
3       5102.73145    6196.07384     3536.24581   7762.55948
4       9232.70820    11302.74446    7979.09247   12556.36019
5       2309.22155    3850.24435     897.92860    5261.53731
6       4028.95209    5219.51308     2497.98206   6750.48311
7       2349.56701    3880.71656     935.07592    5295.20765
8       1942.80866    3575.92595     560.87909    4957.85553
9       5663.35086    6765.16486     4100.00127   8328.51446
10      2737.79303    4177.06134     1293.06683   5621.78754
11      8677.59067    10529.18763    7362.03125   11844.74705
12      7827.42925    9376.22071     6418.64584   10785.00412
13      9632.63839    11867.28348    8422.94738   13076.97449
14      5426.83323    6519.44789     3860.07783   8086.20329
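These limits can also be reproduced in Python with statsmodels; in the sketch below, the mean_ci_* columns of summary_frame are the confidence limits (LCL/UCL) and the obs_ci_* columns are the prediction limits (LPL/UPL):

    import numpy as np
    import statsmodels.api as sm

    sqft  = np.array([1726, 1642, 2816, 5555, 1292, 2208, 1313,
                      1102, 3151, 1516, 5161, 4567, 5841, 3008], dtype=float)
    sales = np.array([3681, 3895, 6653, 9543, 3418, 5563, 3660,
                      2694, 5468, 2898, 10674, 7585, 11760, 4085], dtype=float)

    X = sm.add_constant(sqft)
    fit = sm.OLS(sales, X).fit()
    frame = fit.get_prediction(X).summary_frame(alpha=0.05)   # 95% limits

    print(frame[["mean_ci_lower", "mean_ci_upper",   # LCL, UCL
                 "obs_ci_lower", "obs_ci_upper"]])   # LPL, UPL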