Introduction Linear Regression

  • Uploaded by: Keith Yang
  • October 2019

17. Introduction to Regression

  • Parameter Estimation
  • Pitfalls

17.2

Regression

  • Often in engineering we are interested in the relationship between two variables
    – To understand which factors influence a variable of interest
    – To infer the likely value of a variable from other information when actual measurements are absent
    – To compare observed data with simulated or modelled data
  • Often, due to randomness and other unknown factors, the relationship may not be unique
  • The relationship is presented in the form of a scattergram

17.3 – 17.6

Scattergrams

[Figures: four slides of example scattergrams]

17.7

Regression Analysis

  • If the relationship between these variables is not unique, it requires a probabilistic description
  • The technique of Regression Analysis determines the mean and variance of one variable as a function of another variable
  • If the function is simply a linear function, it is called Linear Regression
  • More generally, regression may be nonlinear, but this is beyond the scope of this course

17.8

Linear Regression

  • The range of possible values of Y increases with X
  • However, knowing the value of X does not give perfect information about Y
  • The range of values can be described by a probability distribution
  • The mean value of Y increases with increasing values of x
  • If this relationship is linear, we have linear regression: E (Y | X = x) = α + β x

17.9

Linear Regression

E (Y | X = x) = α + β x

  • This is the regression equation; it represents the regression of Y on X
  • α (intercept) and β (slope) are constants known as regression coefficients
  • They are estimated from the data

17.10

Linear Regression

E (Y | X = x) = α + β x

  • From the scatter, we expect the variance of Y to depend on X in general
  • First, let's consider the case where Var(Y | X = x) is constant
  • This variance is estimated from the data

17.11

Linear Regression – Parameter Estimation

  • The “best” straight line is the one that passes through the data points with the least error
  • For each data point (xi, yi), the error between the observed yi and the value from the straight line, yi' = α + β xi, is |yi' − yi|
  • For several data points, the total error is represented by the cumulative squared error:

$$\Delta^2 = \sum_{i=1}^{n} (y_i' - y_i)^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

  • The straight line with the least squared error is obtained by minimising Δ²: the Method of Least Squares

E (Y | X = x) = α + β x
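A sketch of the minimisation, assuming the usual calculus approach (the intermediate steps are not spelled out on the slide): setting the partial derivatives of the cumulative squared error with respect to α and β to zero gives two equations, and solving them yields the least squares estimates.

```latex
\begin{aligned}
\frac{\partial \Delta^2}{\partial \alpha}
  &= -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \\
\frac{\partial \Delta^2}{\partial \beta}
  &= -2 \sum_{i=1}^{n} x_i\,(y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \hat{\beta}
     = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}
            {\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}
\end{aligned}
```

The second result follows by substituting α = ȳ − β x̄ into the second equation and collecting the terms in β.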

17.12

Linear Regression – Least Squares Parameter Estimates

Parameter estimates:

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \qquad \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$

where x̄, ȳ = sample means of X and Y, and n = sample size.

Least squares regression equation:

E (Y | x) = α̂ + β̂ x
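The two estimates translate almost directly into code. The following is a minimal sketch in plain Python; the function name `least_squares_line` and the four check points are hypothetical, chosen so that the points lie exactly on y = 2 + 3x.

```python
def least_squares_line(xs, ys):
    """Return (alpha_hat, beta_hat) for the line y = alpha + beta * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    x_bar = sum(xs) / n                       # sample mean of X
    y_bar = sum(ys) / n                       # sample mean of Y
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denominator = sum_xx - n * x_bar ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    beta_hat = (sum_xy - n * x_bar * y_bar) / denominator
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical check data: points lying exactly on y = 2 + 3x
alpha_hat, beta_hat = least_squares_line([0.0, 1.0, 2.0, 3.0],
                                         [2.0, 5.0, 8.0, 11.0])
print(alpha_hat, beta_hat)
```

Because the check points have no scatter, the fitted intercept and slope should recover the generating line.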

17.13

Example

  • Water resource management in agricultural regions uses management models for predictions
    – Determine water allocations, cropping regimes, etc.
  • An important input is the crop water use
  • Management models run over periods of many years
  • Crop water use is needed for every year
  • Crop water use is difficult to measure
    – Not directly available annually
    – Can be related to crop yield, which is available annually
    – Higher crop yield = crops use more water
    – A study from the 1960s measured crop water use and compared it to crop yield over several farms over a number of years

17.14

Example – Regression Analysis

Sugar Cane Yield (t/hectare) [x]    Water Use (mm) [y]
150.0                                60.0
120.0                                62.0
129.4                                72.0
140.0                                82.0
149.0                                88.0
120.0                                92.0
137.7                               102.0
153.9                               103.0
158.9                               108.0
125.0                               116.0
200.0                               170.0

[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare)]

  • Using a regression model to predict water use based on annual sugar cane yield: E[Water Use | Yield] = α̂ + β̂ * Yield
  • Develop a regression model to predict crop water use based on sugar cane yield

17.15

E[Water Use | Yield] = α̂ + β̂ * Yield, i.e. ŷ = α̂ + β̂ * x

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \qquad \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$

$$\Delta^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2, \qquad s_{Y|x}^2 = \frac{\Delta^2}{n-2}, \qquad r^2 = 1 - \frac{s_{Y|x}^2}{s_Y^2}$$

            Sugar Cane Yield    Water Use
            (t/hectare) [x]     (mm) [y]      xy          x2          y(pred)    [y(obs)-y(pred)]2
            150.0               60.0          9000.0      22500.0     101.6      1731.9
            120.0               62.0          7440.0      14400.0     73.1       123.3
            129.4               72.0          9319.7      16754.8     82.1       101.5
            140.0               82.0          11480.0     19600.0     92.1       102.3
            149.0               88.0          13112.0     22201.0     100.7      160.4
            120.0               92.0          11040.0     14400.0     73.1       357.1
            137.7               102.0         14044.7     18959.5     89.9       145.9
            153.9               103.0         15849.1     23677.5     105.3      5.3
            158.9               108.0         17165.1     25260.8     110.1      4.5
            125.0               116.0         14500.0     15625.0     77.9       1455.1
            200.0               170.0         34000.0     40000.0     149.1      435.1
Sum         1583.9              1055.0        156950.7    233378.7               4622.4
Average     144.0               95.9
Variance                        1034.98

Beta(hat)          0.95
Alpha(hat)         -40.95
Conditional Var    513.60
R2                 0.50

Example 17.1
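The worked example can be retraced in a few lines of plain Python. This is a sketch that uses the rounded yields printed in the table, so the results should land close to the tabulated ones rather than match them digit for digit (the xy and x2 columns suggest the original calculation used unrounded yields). Computing R2 as 1 − Δ²/Σ(y − ȳ)² is an assumption about which variant to use; the variable names are illustrative.

```python
# Data as printed in the table (yields rounded to one decimal place)
x = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0, 137.7, 153.9, 158.9, 125.0, 200.0]
y = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0, 102.0, 103.0, 108.0, 116.0, 170.0]
n = len(x)

# Column sums and sample means
x_bar = sum(x) / n
y_bar = sum(y) / n
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Least squares estimates of slope and intercept
beta_hat = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
alpha_hat = y_bar - beta_hat * x_bar

# Predicted values, cumulative squared error, conditional variance
y_pred = [alpha_hat + beta_hat * xi for xi in x]
delta_sq = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
cond_var = delta_sq / (n - 2)

# R2 as the fraction of the total sum of squares explained by the line
total_ss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1.0 - delta_sq / total_ss

print(f"beta_hat  = {beta_hat:.4f}")
print(f"alpha_hat = {alpha_hat:.3f}")
print(f"cond_var  = {cond_var:.2f}")
print(f"r_squared = {r_squared:.4f}")
```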

17.16

Example – Regression Analysis

[Figure: scatter plot of Water Use (mm) against Sugar Cane Yield (t/hectare) for the same data, with fitted line y = 0.9505x - 40.954, R2 = 0.5087]

  • Regression model to predict water use based on annual sugar cane yield

17.17

Example – Regression Analysis in EXCEL

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.713248241
R Square             0.508723053
Adjusted R Square    0.454136725
Standard Error       22.66269606
Observations         11

ANOVA
              df    SS             MS             F
Regression    1     4786.528955    4786.528955    9.319606
Residual      9     4622.380136    513.5977929
Total         10    9408.909091

                                Coefficients     Standard Error    t Stat         P-value
Intercept                       -40.95410879     45.34972022       -0.90307302    0.390017
Sugar Cane Yield (t/hectare)    0.950471558      0.311343895       3.052802937    0.013731

  • Menu path: Tools | Data Analysis | Regression Analysis
  • Adjusted R Square takes into account the number of parameters; it is useful for multiple linear regression
  • Standard Error = conditional standard deviation, SY|X, the sample estimate of the error standard deviation, σe
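Most entries in the summary output follow from a handful of sums of squares. The sketch below recomputes them in plain Python from the rounded data in the example table, using the standard textbook formulas for simple linear regression (an assumption about what the spreadsheet does internally). P-values are left out because they need the Student t distribution, which is not in the standard library.

```python
import math

x = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0, 137.7, 153.9, 158.9, 125.0, 200.0]
y = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0, 102.0, 103.0, 108.0, 116.0, 170.0]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products about the means
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)              # "Total" SS
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# ANOVA table entries
ss_regression = beta_hat * s_xy
ss_residual = s_yy - ss_regression
ms_residual = ss_residual / (n - 2)                    # conditional variance
f_stat = ss_regression / ms_residual                   # 1 regression degree of freedom

# Regression statistics
r_square = ss_regression / s_yy
multiple_r = math.sqrt(r_square)
adjusted_r_square = 1.0 - (1.0 - r_square) * (n - 1) / (n - 2)
standard_error = math.sqrt(ms_residual)                # S_Y|X

# Standard errors and t statistics of the coefficients
se_beta = standard_error / math.sqrt(s_xx)
se_alpha = standard_error * math.sqrt(1.0 / n + x_bar ** 2 / s_xx)
t_beta = beta_hat / se_beta
t_alpha = alpha_hat / se_alpha

for name, value in [("Multiple R", multiple_r), ("R Square", r_square),
                    ("Adjusted R Square", adjusted_r_square),
                    ("Standard Error", standard_error), ("F", f_stat),
                    ("SE intercept", se_alpha), ("SE slope", se_beta),
                    ("t intercept", t_alpha), ("t slope", t_beta)]:
    print(f"{name:18s} {value:10.4f}")
```

Small differences from the spreadsheet figures are expected because the inputs here are rounded to one decimal place.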

17.18

Coefficient of determination, R2

  • R2 is a measure of how well the regression model matches (“fits”) the data
  • It is the % of variance in Y that is explained by X
  • By knowing X, the variance in Y is reduced by R2 x 100%

[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare) with fitted line y = 0.9505x - 40.954, R2 = 0.5087; the conditional distributions ŷ | x = 150 ~ N(102, 22) and ŷ | x = 190 ~ N(139, 22) are marked at yields of 150 and 190]

  • SD(Y) = 32.2 mm
  • By knowing x = 150, SD(Y | x = 150) = 22 mm
  • By knowing x = 190, SD(Y | x = 190) = 22 mm
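The two conditional distributions can be sketched with the standard library's `statistics.NormalDist`. The coefficients and the conditional standard deviation below are copied from the fitted line and the Excel output; treating the conditional distribution as normal with constant spread follows the N(·, 22) annotations, and the 95% range is an illustrative choice that ignores uncertainty in the estimated coefficients.

```python
from statistics import NormalDist

ALPHA_HAT = -40.954    # intercept of the fitted line (mm)
BETA_HAT = 0.9505      # slope of the fitted line (mm per t/hectare)
SD_Y_GIVEN_X = 22.66   # conditional standard deviation S_Y|X (mm)


def water_use_distribution(cane_yield):
    """Normal model of water use (mm) for a given sugar cane yield (t/hectare)."""
    mean = ALPHA_HAT + BETA_HAT * cane_yield
    return NormalDist(mu=mean, sigma=SD_Y_GIVEN_X)


for cane_yield in (150.0, 190.0):
    dist = water_use_distribution(cane_yield)
    low = dist.inv_cdf(0.025)    # lower end of the central 95% range
    high = dist.inv_cdf(0.975)   # upper end of the central 95% range
    print(f"yield {cane_yield:5.1f}: mean {dist.mean:6.1f} mm, "
          f"95% range {low:6.1f} to {high:6.1f} mm")
```

The mean shifts with yield while the spread stays the same, which is the constant conditional variance assumption in action.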

17.19

Coefficient of determination, R2

  • R2 is a measure of model fit
  • It is the % of variance in Y that is explained by X

Match the R2 value to the graph

[Figures: three scatter plots, panels (a), (b) and (c)]

(1) (a) R2 → 0, (b) R2 → 1, (c) 0 < R2 < 1
(2) (a) R2 → 1, (b) 0 < R2 < 1, (c) R2 → 0
(3) (a) R2 → 1, (b) R2 → 0, (c) 0 < R2 < 1

17.20

Summary

  • Regression Analysis: probabilistic relationship between two variables
  • Linear Regression: the relationship is linear
  • Develop least squares estimates of the parameters
    – Slope and intercept
    – Conditional variance
  • Use this relationship to predict values of y given x

17.21

Pitfalls of Regression

  • Extrapolating beyond observed data
  • Influence of outliers
  • High R2 does not always imply a good model
  • Low R2 does not always imply there is no relationship
  • A relationship developed from a regression analysis does not necessarily imply any causal effect between variables
    – Need to ensure there is a physical process behind it

17.22

Extrapolation

  • Fitting your regression model E (Y | X = x) = α̂ + β̂ x
  • Making predictions of y using x values outside the range of observed data

[Figure: Regression of Power on Working Days; Power (kWh) against Working Days, with fitted line y = 15.518x - 100.52, R2 = 0.6839]

17.23

Example

  • Water resource management in agricultural regions uses management models for predictions
    – Determine water allocations, cropping regimes, etc.
  • An important input is the crop water use
  • Management models run over periods of many years
  • Crop water use is needed for every year
  • Crop water use is difficult to measure
    – Not directly available annually
    – Can be related to crop yield, which is available annually
    – Higher crop yield = crops use more water
    – A 1960s study measured crop water use and compared it to crop yield over several farms over a number of years
  • Regression model of Water Use = F(Crop Yield) applied in the 2000s

17.24

Data from 1960s

[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare), with fitted line y = 0.5352x + 92.661, R2 = 0.5087]

  • Using regression model to predict water use based on annual sugar cane yield

17.25

Use of Model in 2000s

  • Increase in crop yields from 1960 to 2000, up to 300 t/hectare
  • Require a prediction of water use with increased crop yields

[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare), axis extended to 350 t/hectare, with fitted line y = 0.5352x + 92.661, R2 = 0.5087, and a "?" beyond the observed data]
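One simple guard against this situation is to report, alongside every prediction, whether the input lies outside the calibration range. The sketch below is a hypothetical helper, not part of the original study; the coefficients are the trendline values quoted on the slide, and the calibration range of 120 to 200 t/hectare is an assumption taken from the yields in the example data table.

```python
def predict_with_range_check(x_new, alpha_hat, beta_hat, x_min, x_max):
    """Return (prediction, extrapolated) for the line alpha_hat + beta_hat * x.

    `extrapolated` is True when x_new falls outside [x_min, x_max],
    the range of x values used to fit the line.
    """
    prediction = alpha_hat + beta_hat * x_new
    extrapolated = not (x_min <= x_new <= x_max)
    return prediction, extrapolated


# Trendline as quoted on the slide; calibration range assumed from the data table
ALPHA_HAT = 92.661
BETA_HAT = 0.5352
YIELD_MIN, YIELD_MAX = 120.0, 200.0

for cane_yield in (150.0, 300.0):
    water_use, extrapolated = predict_with_range_check(
        cane_yield, ALPHA_HAT, BETA_HAT, YIELD_MIN, YIELD_MAX)
    note = "EXTRAPOLATED, treat with caution" if extrapolated else "within data range"
    print(f"yield {cane_yield:5.1f} t/hectare -> {water_use:6.1f} mm ({note})")
```

The flag does not make the extrapolated value any more reliable; it only makes the extrapolation visible to whoever uses the number.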

17.26

Choose a suitable trendline to estimate water use with a sugar cane yield of 300 t/hectare

[Figures: four candidate trendlines, panels (a), (b), (c) and (d), each plotting Water Use (mm) against Sugar Cane Yield (t/hectare)]

17.27

Trendline actually used (a)

[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare)]

17.28

Influence of Outliers

[Figure: Water Use (mm) against Sugar Cane Yield (t/hectare), showing the regression line with the largest value removed; annotated "~40%"]

  • Outliers can exert a large influence on the regression equation
  • Be wary of extrapolation
  • The regression equation may not provide reliable predictions outside the range of the calibration data. Need to think about physical processes!
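The effect of a single point can be checked by fitting the line twice, once with all data and once with the largest observation removed. The sketch below does this in plain Python with the rounded values from the example data table; the helper name `fit_line` is illustrative, and which point the original figure dropped is an assumption (here, the observation with the largest water use).

```python
def fit_line(xs, ys):
    """Least squares (alpha_hat, beta_hat) for y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    return y_bar - beta * x_bar, beta


cane_yield = [150.0, 120.0, 129.4, 140.0, 149.0, 120.0,
              137.7, 153.9, 158.9, 125.0, 200.0]
water_use = [60.0, 62.0, 72.0, 82.0, 88.0, 92.0,
             102.0, 103.0, 108.0, 116.0, 170.0]

# Fit with every observation
alpha_all, beta_all = fit_line(cane_yield, water_use)

# Fit again with the largest water-use observation removed
drop = water_use.index(max(water_use))
yield_trim = cane_yield[:drop] + cane_yield[drop + 1:]
water_trim = water_use[:drop] + water_use[drop + 1:]
alpha_trim, beta_trim = fit_line(yield_trim, water_trim)

print(f"all points:      slope {beta_all:.3f}, intercept {alpha_all:.1f}")
print(f"largest removed: slope {beta_trim:.3f}, intercept {alpha_trim:.1f}")
```

With these data the slope drops sharply once the single largest point is removed, which is why predictions far beyond the bulk of the data depend so heavily on that one observation.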

17.29

High R2 does not imply a good model

The data on the right will still give a reasonable R2, but the regression is not a good model for it
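One way this can happen is a curved relationship fitted with a straight line. The sketch below uses hypothetical data (y = x² for x = 1 to 10, not the data in the figure) and plain Python; the helper name `fit_with_r_squared` is illustrative.

```python
def fit_with_r_squared(xs, ys):
    """Return (alpha_hat, beta_hat, r_squared) for a straight-line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return alpha, beta, r_squared


# Hypothetical curved data: y = x^2
xs = [float(i) for i in range(1, 11)]
ys = [x ** 2 for x in xs]

alpha, beta, r_squared = fit_with_r_squared(xs, ys)
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]

print(f"R2 = {r_squared:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

R2 comes out high, yet the residuals are positive at both ends and negative in the middle: a systematic pattern that a good model would not leave behind.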

17.30

Low R2 does not imply there is no relationship between X and Y
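A symmetric curve is a simple case: y is completely determined by x, yet a straight line explains none of the variance. The sketch below uses hypothetical data (y = (x − 5)² for x = 0 to 10, not the data in the figure) and plain Python; the helper name `fit_with_r_squared` is illustrative.

```python
def fit_with_r_squared(xs, ys):
    """Return (alpha_hat, beta_hat, r_squared) for a straight-line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return alpha, beta, r_squared


# Hypothetical data: an exact, but non-linear, relationship
xs = [float(i) for i in range(0, 11)]
ys = [(x - 5.0) ** 2 for x in xs]

alpha, beta, r_squared = fit_with_r_squared(xs, ys)
print(f"slope = {beta:.3f}, R2 = {r_squared:.3f}")
```

The fitted slope and R2 are both essentially zero even though knowing x gives y exactly, so a low R2 only rules out a useful linear relationship.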

17.31

Summary

  • Meaning of the Coefficient of Determination, R2
  • Pitfalls of Regression
    – Dangers of extrapolating
    – Influence of outliers
    – Consider physical processes when assessing model adequacy

Next: More Regression

    – Probabilistic predictions using regression
    – Multiple linear regression
    – Nonlinear regression
    – Residual analysis
    – Verify error model assumptions
