17. Introduction to Regression: Parameter Estimation and Pitfalls
17.2
Regression
Often in engineering we are interested in the relationship between two variables:
– To understand which factors influence a variable of interest
– To infer the likely value of a variable from other information when actual measurements are unavailable
– To compare observed data with simulated or modelled data
Because of randomness and other unknown factors, the relationship may not be unique.
The relationship is presented in the form of a scattergram.
17.3 to 17.6
Scattergrams
[Figures: example scattergrams, one per slide]
17.7
Regression Analysis
• If the relationship between the variables is not unique, it requires a probabilistic description
• Regression analysis determines the mean and variance of one variable as a function of another variable
• If that function is linear, the technique is called linear regression
• More generally, regression may be nonlinear, but this is beyond the scope of this course
17.8
Linear Regression
• The range of possible values of Y increases with X
• However, knowing the value of X does not give perfect information about Y
• For a given X, the range of Y values can be described by a probability distribution
• The mean value of Y increases with increasing x
• If this relationship is linear, we have linear regression:
$$E(Y \mid X = x) = \alpha + \beta x$$
17.9
Linear Regression
$$E(Y \mid X = x) = \alpha + \beta x$$
• This is the regression equation; it represents the regression of Y on X
• $\alpha$ (intercept) and $\beta$ (slope) are constants known as the regression coefficients
• Both are estimated from the data
17.10
Linear Regression
$$E(Y \mid X = x) = \alpha + \beta x$$
• From the scatter, we expect the variance of Y to depend on X in general
• First, let's consider the case where $\mathrm{Var}(Y \mid X = x)$ is constant
• This conditional variance is estimated from the data
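To make the constant-variance case concrete, here is a minimal simulation sketch. It assumes the scatter about the line is normally distributed, which is an assumption added for illustration, and the parameter values are hypothetical rather than taken from any data in these slides.

```python
import random

def simulate_linear_model(alpha, beta, sigma, xs, seed=0):
    """Draw one y per x from Y | X=x ~ Normal(alpha + beta*x, sigma).

    Normal scatter is an illustrative assumption; sigma is the same for every x,
    which is the constant conditional variance case.
    """
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

def sample_mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Hypothetical parameters, chosen only for illustration.
ALPHA, BETA, SIGMA = 10.0, 2.0, 3.0
N = 20000

# Simulate many observations at two fixed x values.
y_low = simulate_linear_model(ALPHA, BETA, SIGMA, [5.0] * N, seed=1)
y_high = simulate_linear_model(ALPHA, BETA, SIGMA, [15.0] * N, seed=2)

mean_low, mean_high = sample_mean(y_low), sample_mean(y_high)
var_low, var_high = sample_variance(y_low), sample_variance(y_high)
print(f"x=5:  mean {mean_low:.2f}, variance {var_low:.2f}")
print(f"x=15: mean {mean_high:.2f}, variance {var_high:.2f}")
```

The conditional mean shifts along the line as x changes, while the spread about the line stays roughly the same at both x values.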
17.11
Linear Regression – Parameter Estimation
• The “best” straight line is the one that passes through the data points with the least error
• For each data point $(x_i, y_i)$, the error between the observed $y_i$ and the value on the straight line, $y_i' = \alpha + \beta x_i$, is $|y_i' - y_i|$
• For several data points, the total error is represented by the cumulative squared error:
$$\Delta^2 = \sum_{i=1}^{n} (y_i' - y_i)^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
• Obtain the straight line with the least squared error by minimising $\Delta^2$: this is the Method of Least Squares
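The minimisation step can be written out explicitly. The following is a sketch of the standard calculus argument (it is not shown on the slide): set both partial derivatives of $\Delta^2$ to zero and solve for the two coefficients, with $\bar{x}$ and $\bar{y}$ denoting the sample means.

```latex
\begin{align*}
\frac{\partial \Delta^2}{\partial \alpha}
  &= -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \\
\frac{\partial \Delta^2}{\partial \beta}
  &= -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \sum_{i=1}^{n} x_i y_i - \hat{\alpha}\, n \bar{x} - \hat{\beta} \sum_{i=1}^{n} x_i^2 = 0 \\
\text{substituting } \hat{\alpha}:\quad
\hat{\beta}
  &= \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}
\end{align*}
```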
17.12
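Linear Regression – Least Squares Parameter Estimates
Parameter estimates:
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \qquad \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$
where $\bar{x}$, $\bar{y}$ are the sample means of X and Y, and $n$ is the sample size.
Least squares regression equation:
$$E(Y \mid x) = \hat{\alpha} + \hat{\beta} x$$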
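The two estimates translate directly into code. The sketch below is one possible implementation; the function name `least_squares_fit` and the four test points are hypothetical, chosen to lie exactly on the line y = 1 + 2x so the result is easy to check by hand.

```python
def least_squares_fit(x, y):
    """Return (alpha_hat, beta_hat) for the least squares line y = alpha + beta*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimate, as in the formula above.
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Hypothetical points lying exactly on y = 1 + 2x.
alpha_hat, beta_hat = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(alpha_hat, beta_hat)
```

Because the points have no scatter, the fit recovers the intercept 1 and slope 2.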
17.13
Example
• Water resource management in agricultural regions uses management models to make predictions
  – They determine water allocations, cropping regimes, etc.
• An important input is the crop water use
• Management models run over periods of many years, so crop water use is needed for every year
• Crop water use is difficult to measure
  – It is not directly available annually
  – It can be related to crop yield, which is available annually
  – Higher crop yield = crops use more water
  – A study from the 1960s measured crop water use and compared it with crop yield over several farms over a number of years
17.14
Example – Regression Analysis

Sugar Cane Yield (t/hectare) [x]    Water Use (mm) [y]
150.0                               60.0
120.0                               62.0
129.4                               72.0
140.0                               82.0
149.0                               88.0
120.0                               92.0
137.7                               102.0
153.9                               103.0
158.9                               108.0
125.0                               116.0
200.0                               170.0

[Figure: scattergram of Water Use (mm) against Sugar Cane Yield (t/hectare)]
• Use a regression model to predict water use based on annual sugar cane yield:
$$E[\text{Water Use} \mid \text{Yield}] = \hat{\alpha} + \hat{\beta} \times \text{Yield}$$
Develop a regression model to predict crop water use based on sugar cane yield
17.15
Example 17.1
$$E[\text{Water Use} \mid \text{Yield}] = \hat{\alpha} + \hat{\beta} \times \text{Yield}, \qquad \hat{y} = \hat{\alpha} + \hat{\beta} x$$
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \qquad \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$
$$\Delta^2 = \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2, \qquad s_{Y|x}^2 = \frac{\Delta^2}{n-2}, \qquad r^2 = 1 - \frac{s_{Y|x}^2}{s_Y^2}$$

            Yield [x]    Water Use [y]   xy          x^2         y(pred)   [y(obs) - y(pred)]^2
            (t/hectare)  (mm)
            150.0        60.0            9000.0      22500.0     101.6     1731.9
            120.0        62.0            7440.0      14400.0     73.1      123.3
            129.4        72.0            9319.7      16754.8     82.1      101.5
            140.0        82.0            11480.0     19600.0     92.1      102.3
            149.0        88.0            13112.0     22201.0     100.7     160.4
            120.0        92.0            11040.0     14400.0     73.1      357.1
            137.7        102.0           14044.7     18959.5     89.9      145.9
            153.9        103.0           15849.1     23677.5     105.3     5.3
            158.9        108.0           17165.1     25260.8     110.1     4.5
            125.0        116.0           14500.0     15625.0     77.9      1455.1
            200.0        170.0           34000.0     40000.0     149.1     435.1
Sum         1583.9       1055.0          156950.7    233378.7              4622.4
Average     144.0        95.9
Variance                 1034.98

Beta(hat)          0.95
Alpha(hat)         -40.95
Conditional Var    513.60
R2                 0.50
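The hand calculation in this example can be checked with a few lines of code. The sketch below uses the yield and water use values as printed (rounded to one decimal place), so its results should be close to, but not identical to, the tabulated ones. Here $R^2$ is computed from sums of squares as 1 − SSE/SST, which is the form reported as "R Square" by spreadsheet regression tools; the ratio of sample variances in the formula above differs from it slightly because of the different divisors (n − 2 and n − 1). The function name is hypothetical.

```python
# Data as printed in the example (rounded to 0.1).
YIELD = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0, 137.7, 153.9, 158.9, 125.0, 200.0]
WATER_USE = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0, 102.0, 103.0, 108.0, 116.0, 170.0]

def fit_and_assess(x, y):
    """Least squares fit plus conditional variance and R^2 for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    beta = (sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar) / (
        sum(a * a for a in x) - n * x_bar ** 2
    )
    alpha = y_bar - beta * x_bar
    # Sum of squared errors about the fitted line (Delta^2 in the slides).
    sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares about the mean of y.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "alpha": alpha,
        "beta": beta,
        "sse": sse,
        "conditional_variance": sse / (n - 2),
        "r2": 1.0 - sse / sst,
    }

result = fit_and_assess(YIELD, WATER_USE)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```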
17.16
Example – Regression Analysis
[Figure: scattergram of Water Use (mm) against Sugar Cane Yield (t/hectare) for the example data, with fitted line y = 0.9505x - 40.954, $R^2$ = 0.5087]
• Regression model to predict water use based on annual sugar cane yield
17.17
Example – Regression Analysis in EXCEL
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.713248241
R Square             0.508723053
Adjusted R Square    0.454136725
Standard Error       22.66269606
Observations         11

ANOVA
              df    SS             MS             F
Regression    1     4786.528955    4786.528955    9.319606
Residual      9     4622.380136    513.5977929
Total         10    9408.909091

                                Coefficients     Standard Error    t Stat          P-value
Intercept                       -40.95410879     45.34972022       -0.90307302     0.390017
Sugar Cane Yield (t/hectare)    0.950471558      0.311343895       3.052802937     0.013731

• Menu path: Tools | Data Analysis | Regression Analysis
• Adjusted R Square takes into account the number of parameters; it is useful for multiple linear regression
• Standard Error = conditional standard deviation, $s_{Y|x}$, the sample estimate of the error standard deviation, $\sigma_e$
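Most of the numbers in this output follow from the same sums used in the hand calculation. The sketch below recomputes them with the standard simple-regression formulas, using the example data as printed (rounded to 0.1), so the results should agree with the spreadsheet output to about three significant figures. P-values are omitted because they need the t-distribution, which is not in the Python standard library. The function name and dictionary keys are hypothetical.

```python
import math

# Example data as printed (rounded to 0.1).
YIELD = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0, 137.7, 153.9, 158.9, 125.0, 200.0]
WATER_USE = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0, 102.0, 103.0, 108.0, 116.0, 170.0]

def regression_summary(x, y):
    """Recompute the main spreadsheet summary statistics for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))  # residual SS
    sst = sum((yi - y_bar) ** 2 for yi in y)                          # total SS
    ssr = sst - sse                                                   # regression SS
    mse = sse / (n - 2)           # residual mean square = conditional variance
    s = math.sqrt(mse)            # "Standard Error" in the regression statistics
    se_slope = s / math.sqrt(sxx)
    se_intercept = s * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    return {
        "r_square": 1.0 - sse / sst,
        "adjusted_r_square": 1.0 - mse / (sst / (n - 1)),
        "standard_error": s,
        "f": ssr / mse,           # regression has 1 degree of freedom
        "intercept": alpha,
        "slope": beta,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "t_intercept": alpha / se_intercept,
        "t_slope": beta / se_slope,
    }

summary = regression_summary(YIELD, WATER_USE)
for name, value in summary.items():
    print(f"{name:>18}: {value:.4f}")
```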
17.18
Coefficient of determination, $R^2$
• $R^2$ is a measure of how well the regression model matches (“fits”) the data
• It is the percentage of the variance in Y that is explained by X
• By knowing X, the variance in Y is reduced by $R^2 \times 100\%$
[Figure: scattergram of Water Use (mm) against Sugar Cane Yield (t/hectare) with fitted line y = 0.9505x - 40.954, $R^2$ = 0.5087, and the conditional distributions of water use drawn at x = 150 and x = 190]
• $\hat{y} \mid x = 150 \sim N(102, 22)$ and $\hat{y} \mid x = 190 \sim N(139, 22)$, where the parameters are the mean and standard deviation in mm
• SD(Y) = 32.2 mm
• By knowing x = 150, SD(Y | x = 150) = 22 mm
• By knowing x = 190, SD(Y | x = 190) = 22 mm
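The conditional distributions on this slide can be reproduced with the fitted line and the conditional standard deviation. The sketch below assumes normally distributed scatter with the same standard deviation at every x, and it ignores uncertainty in the estimated coefficients. The 120 mm threshold is a hypothetical value used only to show how such a distribution might be queried.

```python
from statistics import NormalDist

# Fitted coefficients and conditional standard deviation quoted in the slides.
ALPHA_HAT = -40.954
BETA_HAT = 0.9505
COND_SD = 22.66  # mm, "Standard Error" from the regression output

def conditional_distribution(yield_t_per_ha):
    """Normal distribution of water use (mm) given a sugar cane yield."""
    mean = ALPHA_HAT + BETA_HAT * yield_t_per_ha
    return NormalDist(mu=mean, sigma=COND_SD)

dist_150 = conditional_distribution(150.0)
dist_190 = conditional_distribution(190.0)

# Hypothetical question: how likely is water use above 120 mm at each yield?
THRESHOLD_MM = 120.0
p_exceed_150 = 1.0 - dist_150.cdf(THRESHOLD_MM)
p_exceed_190 = 1.0 - dist_190.cdf(THRESHOLD_MM)

print(f"x=150: mean {dist_150.mean:.1f} mm, P(Y > 120) = {p_exceed_150:.2f}")
print(f"x=190: mean {dist_190.mean:.1f} mm, P(Y > 120) = {p_exceed_190:.2f}")
```

Knowing the yield shifts the mean of the distribution but, under the constant-variance assumption, leaves its spread unchanged.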
17.19
Coefficient of determination, $R^2$
• $R^2$ is a measure of model fit: the percentage of the variance in Y that is explained by X
• Match the $R^2$ value to the graph
[Figures: three scattergrams labelled (a), (b) and (c)]
(1) (a) $R^2 \to 0$, (b) $R^2 \to 1$, (c) $0 < R^2 < 1$
(2) (a) $R^2 \to 1$, (b) $0 < R^2 < 1$, (c) $R^2 \to 0$
(3) (a) $R^2 \to 1$, (b) $R^2 \to 0$, (c) $0 < R^2 < 1$
17.20
Summary
• Regression analysis: a probabilistic relationship between two variables
• Linear regression: the relationship is linear
• Develop least squares estimates of the parameters
  – Slope and intercept
  – Conditional variance
• Use this relationship to predict values of y given x
17.21
Pitfalls of Regression
• Extrapolating beyond the observed data
• Influence of outliers
• A high $R^2$ does not always imply a good model
• A low $R^2$ does not always imply there is no relationship
• A relationship developed from a regression analysis does not necessarily imply any causal effect between the variables
  – Need to ensure there is a physical process behind it
17.22
Extrapolation
• Fitting your regression model $E(Y \mid X = x) = \hat{\alpha} + \hat{\beta} x$
• Then making predictions of y using x values outside the range of the observed data
[Figure: “Regression of Power on Working Days”, Power (kWh) against Working Days, with fitted line y = 15.518x - 100.52, $R^2$ = 0.6839]
17.23
Example
• Water resource management in agricultural regions uses management models to make predictions
  – They determine water allocations, cropping regimes, etc.
• An important input is the crop water use
• Management models run over periods of many years, so crop water use is needed for every year
• Crop water use is difficult to measure
  – It is not directly available annually
  – It can be related to crop yield, which is available annually
  – Higher crop yield = crops use more water
  – A 1960s study measured crop water use and compared it with crop yield over several farms over a number of years
• The regression model Water Use = F(Crop Yield) was applied in the 2000s
17.24
Data from 1960s
[Figure: scattergram of Water Use (mm) against Sugar Cane Yield (t/hectare), with fitted line y = 0.5352x + 92.661, $R^2$ = 0.5087]
• Using a regression model to predict water use based on annual sugar cane yield
17.25
Use of Model in 2000s
• Crop yields increased from 1960 to 2000, up to 300 t/hectare
• A prediction of water use is required for the increased crop yields
[Figure: the 1960s scattergram and fitted line y = 0.5352x + 92.661, $R^2$ = 0.5087, on a Sugar Cane Yield axis extended to 350 t/hectare, with a “?” marking the unknown water use at the higher yields]
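One simple safeguard is to have the prediction code report when a requested x lies outside the calibration range. The sketch below is an illustration of that idea, not a method from the slides: the function name is hypothetical, the coefficients are those printed on this slide's trendline, and the yields are the 1960s example data as printed.

```python
# Sugar cane yields from the 1960s example data (t/hectare), as printed.
OBSERVED_YIELD = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0,
                  137.7, 153.9, 158.9, 125.0, 200.0]

# Trendline coefficients as printed on this slide: y = 0.5352x + 92.661.
INTERCEPT = 92.661
SLOPE = 0.5352

def predict_with_range_check(x_new, x_observed, intercept, slope):
    """Return (prediction, is_extrapolation) for a straight-line model.

    is_extrapolation is True when x_new is outside the observed x range,
    where the fitted line has no supporting data.
    """
    prediction = intercept + slope * x_new
    is_extrapolation = not (min(x_observed) <= x_new <= max(x_observed))
    return prediction, is_extrapolation

for x_new in (150.0, 300.0):
    pred, extrapolated = predict_with_range_check(x_new, OBSERVED_YIELD, INTERCEPT, SLOPE)
    note = "OUTSIDE calibration range" if extrapolated else "within calibration range"
    print(f"yield {x_new:.0f} t/ha -> water use {pred:.1f} mm ({note})")
```

A flag like this does not make the extrapolated value any more reliable; it only makes the risk visible to whoever uses the prediction.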
17.26
Choose a suitable trendline to estimate water use with a sugar cane yield of 300 t/hectare
[Figures: four panels labelled (a), (b), (c) and (d), each plotting Water Use (mm) against Sugar Cane Yield (t/hectare) on axes from 0 to 350 t/hectare and 0 to 300 mm]
17.27
Trendline actually used: (a)
[Figure: panel (a), Water Use (mm) against Sugar Cane Yield (t/hectare)]
17.28
Influence of Outliers
[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare), showing the regression line with the largest value removed; annotation: ~40%]
• Outliers can exert a large influence on the regression equation
• Be wary of extrapolation
• The regression equation may not provide reliable predictions outside the range of the calibration data: need to think about physical processes!
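The influence of a single point can be checked by refitting without it. The sketch below uses the example data as printed (rounded to 0.1) and removes the observation with the largest yield, (200.0, 170.0); it is an illustration only and is not claimed to reproduce the exact figure on this slide. The function name is hypothetical.

```python
# Example data as printed (rounded to 0.1).
YIELD = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0, 137.7, 153.9, 158.9, 125.0, 200.0]
WATER_USE = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0, 102.0, 103.0, 108.0, 116.0, 170.0]

def fit_line(x, y):
    """Least squares (intercept, slope) for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Fit with all 11 observations.
alpha_all, beta_all = fit_line(YIELD, WATER_USE)

# Refit after dropping the observation with the largest yield.
largest = max(range(len(YIELD)), key=lambda i: YIELD[i])
x_trim = [v for i, v in enumerate(YIELD) if i != largest]
y_trim = [v for i, v in enumerate(WATER_USE) if i != largest]
alpha_trim, beta_trim = fit_line(x_trim, y_trim)

# Compare the two lines at an extrapolated yield of 300 t/hectare.
pred_all_300 = alpha_all + beta_all * 300.0
pred_trim_300 = alpha_trim + beta_trim * 300.0

print(f"all data:        slope {beta_all:.3f}, prediction at 300 = {pred_all_300:.0f} mm")
print(f"largest removed: slope {beta_trim:.3f}, prediction at 300 = {pred_trim_300:.0f} mm")
```

With these data, one observation out of eleven changes the slope substantially, and the gap between the two predictions grows the further the line is extrapolated.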
17.29
High $R^2$ does not imply a good model
• The data on the right will still have a reasonable $R^2$, but a straight line is not a good model for them
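A small numerical illustration of this point, using made-up data rather than the data in the figure: the points below lie exactly on the curve y = x², yet a straight-line fit still produces a high $R^2$. The residuals show the problem, because they are positive at both ends and negative in the middle instead of scattering randomly.

```python
def linear_fit_r2(x, y):
    """Least squares line and R^2 (1 - SSE/SST) for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1.0 - sse / sst

# Hypothetical data: an exact quadratic relationship, no scatter at all.
x = [float(i) for i in range(1, 11)]
y = [xi ** 2 for xi in x]

intercept, slope, r2 = linear_fit_r2(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"R^2 of the straight-line fit: {r2:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```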
17.30
Low R2 does not imply there is no relationship between X and Y
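A made-up example shows how this can happen. In the sketch below y is completely determined by x through y = x², but because the x values are symmetric about zero the least squares slope is zero and the straight-line $R^2$ is zero. The data and function name are hypothetical.

```python
def linear_r2(x, y):
    """R^2 (1 - SSE/SST) of the least squares straight line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return slope, 1.0 - sse / sst

# Hypothetical data: y depends exactly on x, but not linearly.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

slope, r2 = linear_r2(x, y)
print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

$R^2$ here measures only how well a straight line describes the data, so a scattergram is still needed to spot a nonlinear relationship.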
17.31
Summary
• Meaning of the coefficient of determination, $R^2$
• Pitfalls of regression
  – Dangers of extrapolating
  – Influence of outliers
  – Consider physical processes when assessing model adequacy
Next: More Regression
  – Probabilistic predictions using regression
  – Multiple linear regression
  – Nonlinear regression
  – Residual analysis
  – Verify error model assumptions