Econometrics test, for 11/Feb/2009

1. About the error term, and what does it mean.

When a mathematical model is specified, it assumes an exact (deterministic) relationship between variables. However, the relationship between economic variables is generally inexact. To allow for this, one must modify the deterministic model by adding the error term (u). The error term is a stochastic (random) variable that represents all the factors not accounted for in the econometric model. It stands as a surrogate for all the omitted or neglected variables that have not been included in the model. For each individual in the sample, it represents his or her personal deviation from the expected value of Y given X.

Some of the reasons why such variables may not have been entered in the model include:
   1. Vagueness of theory: the theory determining the behaviour might be incomplete.
   2. Unavailability of data: some important variable might have been excluded for lack of reliable data.
   3. Core variables vs. peripheral variables: some variables may account for so little of the variance of Y, or their influence may be so irregular, that it does not pay to include them in the model.
   4. Intrinsic randomness of human behaviour.
   5. Poor proxy variables: the proxies used in place of variables that cannot be observed directly may introduce errors of measurement.
   6. Principle of parsimony: in attempting to keep the model moderately simple, some variables considered of less importance might be left out.
   7. Wrong functional form: the true relationship may not have the form assumed in the model.

The classical normal linear regression model assumes that each ui follows a normal distribution with mean 0 and variance σ², and that the covariance between any two error terms is 0. Under normality, zero covariance means the error terms are independently distributed.
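As an illustration of the role of the error term, the following is a minimal simulation sketch using only the Python standard library. The parameter values (beta1 = 2, beta2 = 0.5, sigma = 3), the sample size and the function name simulate_observations are assumptions chosen for the example, not values from the course material.

```python
import random
import statistics

def simulate_observations(beta1, beta2, sigma, xs, seed=0):
    """Generate Y_i = beta1 + beta2 * X_i + u_i, with each u_i drawn
    independently from a normal distribution N(0, sigma^2)."""
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, sigma) for _ in xs]
    ys = [beta1 + beta2 * x + u for x, u in zip(xs, errors)]
    return ys, errors

# Hypothetical parameter values, chosen only for illustration.
xs = [float(x) for x in range(1, 1001)]
ys, errors = simulate_observations(beta1=2.0, beta2=0.5, sigma=3.0, xs=xs)

# Each u_i is the deviation of Y_i from its expected value E(Y | X_i).
deviations = [y - (2.0 + 0.5 * x) for x, y in zip(xs, ys)]

print("mean of u:", round(statistics.fmean(errors), 3))
print("standard deviation of u:", round(statistics.pstdev(errors), 3))
```

With a large number of draws, the sample mean of u should be close to 0 and its standard deviation close to sigma, which is what the classical assumptions describe.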
2. PRF in relation to SRF. Discuss the difference between them.

A Population Regression Function (PRF) is an expression stating that the expected value of the distribution of Y, given Xi, is functionally related to Xi. It tells how the mean value of Y varies with X for the entire population. The PRF is what we are trying to estimate in a linear regression: by defining a functional form and regressing, we estimate the unknown coefficients that define the relationship between X and Y.

However, it is rarely possible to determine the PRF directly. Data for the entire population are usually difficult to obtain, which forces investigators to resort to sampling: obtain data from a sample and use it to estimate the relationship in the population. The result is called the Sample Regression Function (SRF). Due to sampling fluctuations, the SRF generally differs from the PRF, and different samples give different SRFs.

When regressing the sample data to produce the SRF, we obtain estimates of the true parameters (β). One method of obtaining these estimates is OLS (Ordinary Least Squares). The Gauss-Markov Theorem shows that, under the assumptions of the classical linear regression model, the OLS estimators are the Best Linear Unbiased Estimators (BLUE): among all linear unbiased estimators, they have the smallest variance. If we further assume that the error terms follow a normal distribution, the estimators are normally distributed as well, and we can then use a t test to check whether a true parameter is statistically different from a hypothesised value (for instance, zero).
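The difference between the PRF and the SRF can be sketched with a small simulation. Here the PRF is assumed to be known, E(Y | X) = 10 + 0.8 X with normal errors, which is never the case in practice; these values, and the helper names ols and draw_sample, are hypothetical. The sketch draws repeated samples, estimates an SRF from each by two-variable OLS, and computes the t statistic for the null hypothesis β2 = 0 on the last sample.

```python
import math
import random

def ols(xs, ys):
    """Two-variable OLS for Y = b1 + b2*X + residual.
    Returns (b1, b2, se_b2), where se_b2 is the standard error of b2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    rss = sum((y - b1 - b2 * x) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n - 2)  # unbiased estimate of the error variance
    se_b2 = math.sqrt(sigma2_hat / sxx)
    return b1, b2, se_b2

# Hypothetical PRF (an assumption for illustration): E(Y | X) = 10 + 0.8 X.
BETA1, BETA2, SIGMA = 10.0, 0.8, 5.0
rng = random.Random(42)

def draw_sample(n):
    """Draw n observations from the assumed population."""
    xs = [rng.uniform(0, 100) for _ in range(n)]
    ys = [BETA1 + BETA2 * x + rng.gauss(0, SIGMA) for x in xs]
    return xs, ys

# Each sample yields a different SRF; the slopes scatter around BETA2.
slopes = []
for _ in range(200):
    xs, ys = draw_sample(50)
    b1, b2, se_b2 = ols(xs, ys)
    slopes.append(b2)

mean_slope = sum(slopes) / len(slopes)

# t statistic for H0: beta2 = 0, computed on the last sample drawn.
t_stat = b2 / se_b2
print("average estimated slope:", round(mean_slope, 3))
print("range of estimated slopes:", round(min(slopes), 3), "to", round(max(slopes), 3))
print("t statistic for H0: beta2 = 0:", round(t_stat, 2))
```

The individual slopes differ from sample to sample (sampling fluctuation), while their average lies close to the population value, which is what unbiasedness means.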
3. Functional forms: what do they mean, why are they important.

When doing a linear regression, the equation must be linear in the parameters, but not necessarily in the variables. Depending on the form of equation we choose, the fit between the data and the predictions from the regression will differ. We consider four kinds of equations. All are linear in the parameters, but each applies a different transformation to the variables.

The first kind of equation is the log-linear (log-log) model, in which both Y and X are in natural logarithms:

lnYi = lnβ1 + β2 lnXi + ui

The term lnβ1 is a constant and can be written as α, so the model is linear in the parameters α and β2. The slope coefficient β2 measures the elasticity of Y with respect to X: the percentage change in Y for a given (small) percentage change in X. Because β2 is the same at every point, this is a constant elasticity model.

The second kind of models are the semilog models, which can be either log-lin or lin-log. The log-lin model has the general functional form:

ln Yt = β1 + β2 t + ut

where t stands for time. Time is used as the independent variable because this functional form is used to find the rate of growth of an economic variable, which is given by the slope coefficient β2. β2 therefore measures the relative change in Y for a given absolute change in the value of t. As for the lin-log model, it has the functional form:

Yi = β1 + β2 lnXi + ui

Models with this functional form give us the absolute change in Y for a relative change in X: this ratio equals β2, so a 1% change in X changes Y by about β2/100 units.

We then have the reciprocal models. These are useful because they confine the values of Y to an asymptote: they have a limit built in.
Yi = β1 + β2 (1/Xi) + ui

As X increases, 1/X approaches 0, meaning that Y approaches β1 as its limiting (asymptotic) value. This is an upper limit when β2 is negative and a lower limit when β2 is positive.

Finally, we consider the logarithmic reciprocal model:

lnYi = β1 - β2 (1/Xi) + ui

In this functional form (with β2 positive), Y initially increases at an increasing rate and then increases at a decreasing rate, as in, for instance, short-run production functions.

When doing a linear regression, we are not sure at the start what the functional form of the regression should be. However, the process of deciding should follow some guidelines:
   1. The underlying theory might suggest a specific functional form.
   2. The coefficients of the model should satisfy a-priori expectations: for instance, they should be positive or negative where the theory expects them to be.
   3. The statistical significance of the coefficients.
   4. The overall fit of the model; this should not, however, be over-emphasised.
   5. Find out the rate of change of the regressand with respect to the regressor, as well as the elasticity of the regressand with respect to the regressor.
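The interpretation of the coefficients in three of these forms can be checked with a short sketch. The data are simulated from assumed relationships (an elasticity of 1.5, a growth rate of 3% per period, and an asymptote of 20); all of these values and the helper name ols_intercept_slope are hypothetical. In each case the variables are transformed first and then an ordinary two-variable OLS is applied, since every model is linear in the parameters.

```python
import math
import random

def ols_intercept_slope(xs, ys):
    """Two-variable OLS: return (intercept, slope) of ys regressed on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

rng = random.Random(1)

# Log-linear model: Y = 3 * X^1.5 * exp(u), so ln Y = ln 3 + 1.5 ln X + u.
xs = [rng.uniform(1, 50) for _ in range(500)]
ys = [3.0 * x ** 1.5 * math.exp(rng.gauss(0, 0.1)) for x in xs]
alpha, elasticity = ols_intercept_slope([math.log(x) for x in xs],
                                        [math.log(y) for y in ys])

# Log-lin model: Y_t = 100 * exp(0.03 t + u), so ln Y_t = ln 100 + 0.03 t + u.
ts = list(range(1, 41))
yt = [100.0 * math.exp(0.03 * t + rng.gauss(0, 0.02)) for t in ts]
_, growth = ols_intercept_slope(ts, [math.log(y) for y in yt])

# Reciprocal model: Y = 20 - 30 * (1/X) + u; Y rises towards 20 as X grows.
xr = [rng.uniform(1, 50) for _ in range(500)]
yr = [20.0 - 30.0 / x + rng.gauss(0, 0.5) for x in xr]
asymptote, recip_slope = ols_intercept_slope([1.0 / x for x in xr], yr)

print("estimated elasticity (log-log slope):", round(elasticity, 3))
print("estimated growth rate (log-lin slope):", round(growth, 4))
print("estimated asymptote (reciprocal intercept):", round(asymptote, 2))
```

In this sketch the slope of the log-log regression recovers the elasticity, the slope of the log-lin regression recovers the growth rate, and the intercept of the reciprocal regression recovers the limiting value of Y.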
4. Essay on R². Discuss the differences between the two different kinds of R².

In a Classical Linear Regression Model, for the two-variable case, r² is a measure of the goodness of fit of the regression to a set of data; for the multiple regression model, the notation is R², and the use is the same: to describe how well a regression fits the data. R² (and r²) is called the coefficient of determination. Overall, R² is simply an extension of r².

To calculate r², one needs the variation of each observed value of Y about the sample mean of Y. The sum of the squares of these deviations gives the Total Sum of Squares (TSS). The TSS is the sum of the Explained Sum of Squares (ESS), the variation of the fitted values about the mean, which is explained by the regression, and the Residual Sum of Squares (RSS), the sum of the squared residuals (the variation of the actual Y values about the regression line):

TSS = ESS + RSS

Since r² measures the goodness of fit, it is defined as

r² = ESS/TSS

Consequently, what r² measures is the proportion of the variation in the dependent variable that is explained by the explanatory variable in the regression model. Since it is a proportion, it lies between 0 and 1, and the higher the value of r², the better the fit of the regression.

It is also possible to regress a dependent variable on more than one explanatory variable. For these cases, we can determine R², the multiple coefficient of determination. This gives the proportion of the variation in the dependent variable that is explained jointly by the explanatory variables, however many there are. As in the two-variable case, R² lies between 0 and 1, and the regression is said to have a better fit the higher the value of R². However, care must be taken when comparing the values of R².
One of the characteristics of this index is that it never decreases when variables are added to the equation. It is therefore quite possible to include meaningless variables in a regression and, on the strength of a higher R², to consider the model better than one with fewer explanatory variables when this is not actually the case. So, to compare two models for the same dependent variable, we must use the adjusted R², a version of the coefficient of determination that accounts for the number of explanatory variables in the regression. Besides using the adjusted R², we must also make sure that the dependent variable and the sample size are the same; otherwise it is not possible to compare the goodness of fit of the two regressions.

Finally, it should be noted that the objective of regression analysis is not to obtain the highest value of R² at the expense of an accurate, relevant description of reality. When deciding between different solutions to a regression problem, the underlying theory (functional forms, signs of coefficients) should take precedence over the coefficient of determination.
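The decomposition of the total sum of squares and the behaviour of the two measures can be sketched as follows. The data are simulated, and the second regressor is pure noise by construction; the data-generating values and the helper names solve and fit_stats are assumptions for the example. The adjusted measure is computed with the usual textbook formula, 1 - (1 - R²)(n - 1)/(n - k), where k is the number of estimated parameters including the intercept; this formula is not stated in the notes above.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_stats(columns, ys):
    """OLS of ys on an intercept plus the given regressor columns.
    Returns (tss, ess, rss, r2, adjusted_r2)."""
    n = len(ys)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])  # number of parameters, intercept included
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(ys) / n
    tss = sum((y - y_bar) ** 2 for y in ys)
    ess = sum((f - y_bar) ** 2 for f in fitted)
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    r2 = ess / tss
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return tss, ess, rss, r2, adjusted_r2

rng = random.Random(7)
n = 60
x = [rng.uniform(0, 10) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]  # irrelevant regressor
y = [1.0 + 2.0 * xi + rng.gauss(0, 3) for xi in x]

small = fit_stats([x], y)         # Y on X only
large = fit_stats([x, noise], y)  # Y on X and the irrelevant regressor

print("TSS - (ESS + RSS), small model:", small[0] - (small[1] + small[2]))
print("R2 small / large:", round(small[3], 4), round(large[3], 4))
print("adjusted R2 small / large:", round(small[4], 4), round(large[4], 4))
```

Adding the irrelevant regressor cannot lower R², whereas the adjusted R² carries a penalty for the extra parameter and may therefore fall.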