Regression

In statistics, linear regression is used for two things:

• to construct a simple formula that will predict a value or values for a variable given the value of another variable;

• to test whether and how a given variable is related to another variable or variables.

Note: correlation does not imply causation.

• Linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modelled by a least squares function, called a linear regression equation.

• This function is a linear combination of one or more model parameters, called regression coefficients.

• A linear regression equation with one independent variable represents a straight line when the predicted value (i.e. the dependent variable from the regression equation) is plotted against the independent variable: this is called a simple linear regression. However, note that "linear" does not refer to this straight line, but rather to the way in which the regression coefficients occur in the regression equation. The results are subject to statistical analysis.

(Figure: example of linear regression with one independent variable.)

Theoretical model

A linear regression model assumes, given a random sample (Yi, Xi1, …, Xip), i = 1, …, n, a possibly imperfect relationship between Yi, the regressand, and the regressors Xi1, …, Xip. A disturbance term εi, which is a random variable too, is added to this assumed relationship to capture the influence of everything else on Yi other than Xi1, …, Xip. Hence, the multiple linear regression model takes the following form:

Yi = β0 + β1Xi1 + β2Xi2 + … + βpXip + εi,   i = 1, …, n.

Note that the regressors are also called independent variables, exogenous variables, covariates, input variables or predictor variables. Similarly, regressands are also called dependent variables, response variables, measured variables, or predicted variables. Models which do not conform to this specification may be treated by nonlinear regression. A linear regression model need not be a linear function of the independent variable: linear in this context means that the conditional mean of Yi is linear in the parameters β. For example, the model

Yi = β1Xi + β2Xi² + εi

is linear in the parameters β1 and β2, but it is not linear in Xi², a nonlinear function of Xi. An illustration of this model is shown in the example below.

Data and estimation

It is important to distinguish the model formulated in terms of random variables and the observed values of these random variables. Typically, the observed values, or data, denoted by lower case letters, consist of n values (yi, xi1, …, xip), i = 1, …, n.

In general there are p + 1 parameters to be determined, β0, …, βp. In order to estimate the parameters it is often useful to use the matrix notation

Y = Xβ + ε,

where Y is a column vector that includes the observed values of Y1, …, Yn, ε includes the unobserved stochastic components ε1, …, εn, and the matrix X includes the observed values of the regressors:

    | 1  x11 ⋯ x1p |
X = | 1  x21 ⋯ x2p |
    | ⋮   ⋮       ⋮ |
    | 1  xn1 ⋯ xnp |

X includes, typically, a constant column, that is, a column which does not vary across observations, which is used to represent the intercept term β0.
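To make the notation concrete, here is a minimal Python sketch of this matrix formulation; the data and coefficient values are synthetic and purely illustrative:

```python
import numpy as np

# A minimal sketch of the matrix notation Y = X*beta + epsilon,
# using small synthetic data (the numbers are illustrative, not from the text).
rng = np.random.default_rng(0)

n, p = 50, 2                       # n observations, p regressors
x = rng.uniform(0, 10, size=(n, p))

# Design matrix X: a constant column of ones represents the intercept beta0.
X = np.column_stack([np.ones(n), x])

beta_true = np.array([1.5, 2.0, -0.7])   # (beta0, beta1, beta2)
eps = rng.normal(0, 1, size=n)           # disturbance term
y = X @ beta_true + eps                  # Y = X*beta + epsilon

# Least squares estimate of beta from the data.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                          # close to beta_true
```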

If there is any linear dependence among the columns of X, then the vector of parameters β cannot be estimated by least squares unless β is constrained, as, for example, by requiring the sum of some of its components to be 0. However, some linear combinations of the components of β may still be uniquely estimable in such cases. For example, the model

Yi = β0 + β1Xi + β2(2Xi) + εi

cannot be solved for β1 and β2 independently, as the matrix of observations has the reduced rank 2. In this case the model can be rewritten as

Yi = β0 + (β1 + 2β2)Xi + εi

and can be solved to give a value for the composite entity β1 + 2β2.
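A short sketch of this rank deficiency in Python (again with made-up numbers):

```python
import numpy as np

# Illustrative data: the second regressor is exactly twice the first,
# so the columns of X are linearly dependent.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x, 2 * x])

print(np.linalg.matrix_rank(X))   # 2, not 3: beta1 and beta2 are not separately estimable

# The composite coefficient beta1 + 2*beta2 is still estimable:
# refit with the single combined regressor x.
y = 1.0 + 3.0 * x                 # here beta1 + 2*beta2 = 3 by construction
X_reduced = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)
print(beta_hat)                   # approximately [1.0, 3.0]
```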

Note that to only perform a least squares estimation of β it is not necessary to consider the sample as random variables. It may even be conceptually simpler to consider the sample as fixed, observed values, as we have done thus far. However, in the context of hypothesis testing and confidence intervals, it will be necessary to interpret the sample as random variables that will produce estimators which are themselves random variables. Then it will be possible to study the distribution of the estimators and draw inferences.

Example: Demand for Homes

Find a linear demand equation that best fits the following data, and use it to predict annual sales of homes priced at $140,000.

x = Price (thousands of $)           160   180   200   220   240   260   280
y = Sales of new homes this year     126   103    82    75    82    40    20

Solution

      x      y       xy       x²
    160    126   20,160   25,600
    180    103   18,540   32,400
    200     82   16,400   40,000
    220     75   16,500   48,400
    240     82   19,680   57,600
    260     40   10,400   67,600
    280     20    5,600   78,400
Sums: Σx = 1,540   Σy = 528   Σxy = 107,280   Σx² = 350,000

Substituting these values in the formulas gives (n = 7):

slope = m = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
          = [7(107,280) − (1,540)(528)] / [7(350,000) − 1,540²] ≈ −0.7929

intercept = b = [Σy − m(Σx)] / n
              = [528 − (−0.7928571429)(1,540)] / 7 ≈ 249.9

Notice that we used the most accurate value, m = −0.7928571429, that we could obtain on our calculator in the formula for b, rather than the rounded value −0.7929. This illustrates the following important general guideline: when calculating, never round intermediate results. Rather, use the most accurate results obtainable, or have your calculator store them for you.

Thus our least squares line is y = −0.7929x + 249.9. We can now use this equation to predict the annual sales of homes priced at $140,000, as we were asked to do. Remembering that x is the price in thousands of dollars, we set x = 140 and solve for y, getting y ≈ 139. Thus our model predicts that approximately 139 homes will have been sold in the range $140,000–$159,000.

Before we go on... We must remember that these figures were for sales in a range of prices. For instance, it would be extremely unlikely that 139 homes would have been sold at exactly $140,000. On the other hand, the model does predict that, were we to place 139 homes on the market at $140,000, we could expect to sell them all.

(Figure: the original data, together with the least squares line.)
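As a check on this arithmetic, here is a minimal Python sketch of the same computation (numpy is assumed; the data are from the table above):

```python
import numpy as np

# Data from the homes example: price (thousands of $) and sales.
x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)
n = len(x)

# Least squares slope and intercept from the summation formulas in the text.
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b = (np.sum(y) - m * np.sum(x)) / n

print(m, b)            # approximately -0.7929 and 249.9
print(m * 140 + b)     # predicted sales at $140,000: approximately 139
```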

Examples

To illustrate the various goals of regression, we give an example. The following data set gives the average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

Height (m)    1.47   1.50   1.52   1.55   1.57   1.60   1.63   1.65   1.68   1.70   1.73   1.75   1.78   1.80   1.83
Weight (kg)  52.21  53.12  54.48  55.84  57.20  58.57  59.93  61.29  63.11  64.47  66.28  68.10  69.92  72.19  74.46

A plot of weight against height shows that the data cannot be modeled by a straight line, so a regression is performed by modeling the data by a parabola:

Yi = β0 + β1Xi + β2Xi² + εi,

where the dependent variable Yi is weight and the independent variable Xi is height. Place the observations xi and xi², i = 1, …, n, in the matrix X:

    | 1  x1  x1² |
X = | 1  x2  x2² |
    | ⋮   ⋮     ⋮ |
    | 1  xn  xn² |

The values of the parameters are found by solving the normal equations

(XᵀX)β̂ = Xᵀy.

Element ij of the normal equation matrix XᵀX is formed by summing the products of column i and column j of X. Element i of the right-hand side vector Xᵀy is formed by summing the products of column i of X with the column of dependent variable values.

Thus, the normal equations are

| 15.00   24.76    41.05 |   | β0 |   |  931.17 |
| 24.76   41.05    68.37 | · | β1 | = | 1548.25 |
| 41.05   68.37   114.35 |   | β2 |   | 2585.54 |

Solving gives (value ± standard deviation)

β̂0 = 129 ± 16,   β̂1 = −143 ± 20,   β̂2 = 62 ± 6.1.

The calculated (fitted) values are given by

ŷi = β̂0 + β̂1xi + β̂2xi².
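A minimal Python sketch that forms and solves these normal equations for the height/weight data:

```python
import numpy as np

# Height (m) and weight (kg) data from the table above.
x = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
              1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
y = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
              63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Design matrix for the parabola Yi = b0 + b1*Xi + b2*Xi^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations (X'X) beta = X'y, solved directly.
XtX = X.T @ X          # element ij sums products of columns i and j of X
Xty = X.T @ y          # element i sums products of column i of X with y
beta_hat = np.linalg.solve(XtX, Xty)

print(beta_hat)        # approximately [128.81, -143.16, 61.96]
```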

Linear least squares

Linear least squares, or ordinary least squares (OLS), is an important computational problem that arises in applications when it is desired to fit a linear mathematical model to measurements obtained from experiments. The goals of linear least squares are to extract predictions from the measurements and to reduce the effect of measurement errors. Mathematically, it can be stated as the problem of finding an approximate solution to an overdetermined system of linear equations. In statistics, it corresponds to the maximum likelihood estimate for a linear regression with normally distributed errors. The first objective of regression analysis is to best fit the data by estimating the parameters of the model. Of the different criteria that can be used to define what constitutes a best fit, the least squares criterion is a very powerful one. This estimate (or estimator, if we are in the context of a random sample) is given by

β̂ = (XᵀX)⁻¹Xᵀy.
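In code the inverse (XᵀX)⁻¹ is rarely formed explicitly; an orthogonalization-based least squares solver returns the same OLS estimate more stably. A sketch using numpy's lstsq on the height/weight data above:

```python
import numpy as np

x = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
              1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
y = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
              63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = np.column_stack([np.ones_like(x), x, x**2])

# Same OLS estimate as (X'X)^(-1) X'y, computed without forming the inverse.
beta_hat, residual_ss, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [128.81, -143.16, 61.96]
```

For well-conditioned problems the two routes agree to many digits; for nearly collinear regressors the solver route is markedly more reliable.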

Regression inference

The estimates can be used to test various hypotheses.

Denote by σ² the variance of the error term (recall we assume that εi ~ N(0, σ²) for every i). An unbiased estimate of σ² is given by

σ̂² = S / (n − p),

where S = Σ ε̂i² is the sum of squared residuals and p counts all estimated parameters, including the intercept. The relation between the estimate and the true value is:

(n − p) σ̂² / σ² ~ χ²(n − p),

where χ²(n − p) denotes the chi-square distribution with n − p degrees of freedom. The solution to the normal equations can be written as

β̂ = (XᵀX)⁻¹Xᵀy.

This shows that the parameter estimators are linear combinations of the dependent variable. It follows that, if the observational errors are normally distributed, the parameter estimators will follow a joint normal distribution. Under the assumptions here, the estimated parameter vector is exactly distributed,

β̂ ~ N(β, σ²(XᵀX)⁻¹),

where N denotes the multivariate normal distribution. The standard error of a parameter estimator is given by

σ̂(β̂j) = σ̂ √([(XᵀX)⁻¹]jj).

The 100(1 − α)% confidence interval for the parameter βj is computed as follows:

β̂j ± t(α/2, n − p) · σ̂(β̂j).

The residuals can be expressed as

ε̂ = y − Xβ̂ = (I − X(XᵀX)⁻¹Xᵀ) y.

The matrix H = X(XᵀX)⁻¹Xᵀ is known as the hat matrix and has the useful property that it is idempotent. Using this property it can be shown that, if the errors are normally distributed, the residuals will follow a normal distribution with covariance matrix σ²(I − H). Studentized residuals are useful in testing for outliers.

Given a value of the independent variable, xd, the predicted response is calculated as

ŷd = xdᵀβ̂.

Writing the elements of xd as xdj, the 100(1 − α)% mean response confidence interval for the prediction is given, using error propagation theory, by:

ŷd ± t(α/2, n − p) · σ̂ √(xdᵀ(XᵀX)⁻¹xd).

The 100(1 − α)% predicted response confidence interval for the data is given by:

ŷd ± t(α/2, n − p) · σ̂ √(1 + xdᵀ(XᵀX)⁻¹xd).
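Putting these inference formulas together, a sketch in Python (numpy plus scipy for the t quantile; the parabola fit above is reused, and p counts all estimated parameters):

```python
import numpy as np
from scipy import stats

x = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
              1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
y = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
              63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = np.column_stack([np.ones_like(x), x, x**2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased estimate of the error variance from the sum of squared residuals.
resid = y - X @ beta_hat
S = resid @ resid
sigma2_hat = S / (n - p)

# Standard errors and 95% confidence intervals for each beta_j.
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - p)
for bj, sej in zip(beta_hat, se):
    print(f"{bj:10.4f} +/- {t_crit * sej:8.4f}")

# Mean response CI and prediction interval at a new height, e.g. 1.60 m.
xd = np.array([1.0, 1.60, 1.60**2])
yd = xd @ beta_hat
half_mean = t_crit * np.sqrt(sigma2_hat * (xd @ XtX_inv @ xd))
half_pred = t_crit * np.sqrt(sigma2_hat * (1 + xd @ XtX_inv @ xd))
print(yd, half_mean, half_pred)
```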

Statistical Terminology

• Correlation Coefficient

  ○ A correlation coefficient is a number between −1 and 1 which measures the degree to which two variables are linearly related.

  ○ If there is a perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other.

  ○ If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of −1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value.

  ○ A correlation coefficient of 0 means that there is no linear relationship between the variables.
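For example, a small sketch computing the (Pearson) correlation coefficient; the data are illustrative:

```python
import numpy as np

# Illustrative data: sales fall as price rises, so r should be strongly negative.
x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# Pearson correlation coefficient: covariance scaled by both standard deviations.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
print(r)                        # about -0.95: strong negative linear relationship
print(np.corrcoef(x, y)[0, 1])  # same value via the library routine
```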

• Scatter Plots

A scatter plot is similar to a line graph in that it uses horizontal and vertical axes to plot data points. However, scatter plots have a very specific purpose: they show how much one variable is affected by another. The relationship between two variables is called their correlation. Scatter plots usually consist of a large body of data. The closer the data points come to making a straight line when plotted, the higher the correlation between the two variables, or the stronger the relationship. If the data points make a straight line going from the origin out to high x- and y-values, the variables are said to have a positive correlation. If the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation.



• Least Squares

The method of least squares is a criterion for fitting a specified model to observed data. For example, it is the most commonly used method of defining a straight line through a set of points on a scatterplot.

• Regression Equation

A regression equation allows us to express the relationship between two (or more) variables algebraically. It indicates the nature of the relationship between two (or more) variables. In particular, it indicates the extent to which you can predict some variables by knowing others, or the extent to which some are associated with others. A linear regression equation is usually written

Y = a + bX + e

where:
  Y is the dependent variable
  a is the intercept
  b is the slope or regression coefficient
  X is the independent variable (or covariate)
  e is the error term

The equation will specify the average magnitude of the expected change in Y given a change in X.

• Regression Line

A regression line is a line drawn through the points on a scatterplot to summarise the relationship between the variables being studied. When it slopes down (from top left to bottom right), this indicates a negative or inverse relationship between the variables; when it slopes up (from bottom left to top right), a positive or direct relationship is indicated. The regression line often represents the regression equation on a scatterplot.

• Simple Linear Regression

Simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.

• Multiple Regression

Multiple linear regression aims to find a linear relationship between a response variable and several possible predictor variables.

• Multiple Regression Correlation Coefficient

The multiple regression correlation coefficient, R², is a measure of the proportion of variability explained by, or due to, the regression (linear relationship) in a sample of paired data. It is a number between zero and one, and a value close to zero suggests a poor model. A very high value of R² can arise even though the relationship between the two variables is nonlinear. The fit of a model should never simply be judged from the R² value.
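For example, a minimal sketch of the R² computation for a straight-line fit (illustrative data):

```python
import numpy as np

# Illustrative data and a straight-line fit.
x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)
m, b = np.polyfit(x, y, 1)
y_hat = m * x + b

# R^2: proportion of the variability in y explained by the regression.
ss_res = np.sum((y - y_hat)**2)      # residual sum of squares
ss_tot = np.sum((y - y.mean())**2)   # total sum of squares about the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)                     # about 0.91 for this data
```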

• Stepwise Regression

A 'best' regression model is sometimes developed in stages. A list of several potential explanatory variables is available, and this list is repeatedly searched for variables which should be included in the model. The best explanatory variable is used first, then the second best, and so on. This procedure is known as stepwise regression.
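A toy sketch of the forward-selection idea described above (synthetic data; real stepwise procedures also apply significance tests for variable entry and removal):

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the column whose inclusion
    most reduces the residual sum of squares, until k columns are chosen."""
    n = len(y)
    chosen = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = np.column_stack([np.ones(n)] + [X[:, c] for c in chosen + [j]])
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ beta)**2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

# Synthetic example: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=100)
print(forward_select(X, y, 2))   # expected: [0, 2] (in order of usefulness)
```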

• Dummy Variable (in regression)

In regression analysis we sometimes need to modify the form of non-numeric variables, for example sex or marital status, to allow their effects to be included in the regression model. This can be done through the creation of dummy variables, whose role is to identify each level of the original variable separately.
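For example, a minimal sketch of dummy (indicator) coding for a made-up categorical variable:

```python
import numpy as np

# Made-up categorical variable with three levels.
marital_status = np.array(["single", "married", "divorced", "married", "single"])

# One dummy column per level: 1 if the observation has that level, else 0.
levels = ["single", "married", "divorced"]
dummies = np.column_stack([(marital_status == lvl).astype(float) for lvl in levels])
print(dummies)

# In a model with an intercept, one level is dropped as the reference
# category to avoid exact linear dependence among the columns of X.
X = np.column_stack([np.ones(len(marital_status)), dummies[:, 1:]])
```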



• Accuracy

Accurate means "capable of providing a correct reading or measurement." In physical science it means "correct." A measurement is accurate if it correctly reflects the size of the thing being measured.

• Precision

Precise means "exact, as in performance, execution, or amount." In physical science it means "repeatable, reliable, getting the same measurement each time."

Accuracy is the degree of closeness to the true value, while precision is the degree of reproducibility.
