Week 1 (For: Professor Albert Satorra Brucart)

Akhil Ilango

January 2019

## 1. Practice of Computation

Here, we have population data with a discrete uniform distribution. The domain of this distribution is $s = (0, 1, 2, \ldots, K)$. All the values are equally likely to be observed, each with probability $\frac{1}{K+1}$, since there are $K + 1$ terms. Let's take $K = 100$. We would like to compute the mean and variance of the given data.

```r
K <- 100
x1 <- seq(0, K)

cat('Mean by R function:', mean(x1), '\n')
## Mean by R function: 50
m_a <- (min(x1) + max(x1))/2
cat('Mean by distribution formula:', m_a, '\n')
## Mean by distribution formula: 50
m_b <- sum(x1)/length(x1)
cat('Mean by estimate formula:', m_b, '\n', '\n')
## Mean by estimate formula: 50

cat('Variance by R function:', var(x1), '\n')   # using R function
## Variance by R function: 858.5
var_s_a <- sum((x1 - m_b)**2)/K                 # by formula 1
cat('Variance by Formula 1:', var_s_a, '\n')
## Variance by Formula 1: 858.5
var_s_b <- (K+1)*(K+2)/12                       # by formula 2
cat('Variance by Formula 2:', var_s_b, '\n')
## Variance by Formula 2: 858.5

A1 <- sum(x1**2)
A2 <- K*(K+1)*(2*K+1)/6   # the formula is for natural numbers
A1 == A2
## [1] TRUE
```

First, we compare the mean obtained in three ways:

- **Using the R function.** We use the function `mean`: $\hat{\mu} = \texttt{mean}(x)$.
- **Using the distribution formula.** We know that for a discrete uniform distribution $x \sim U[a, b]$ we have $\mu = \frac{a+b}{2}$. If these values are given to us, we can use them directly. Since $a$ and $b$ are the end points, one way to recover them from the data is to take the minimum and maximum values: $a = \min(x)$, $b = \max(x)$. Note that this gives the true value and not an estimate, because we are using information about the shape of the true distribution.
- **Using the sample estimate.** An unbiased estimate for any given sample is
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$
where $n$ is the sample size and $x = x_1, x_2, \ldots, x_n$.

Overall, we see that all three methods give us the same answer.

Second, we compare the variance obtained in three ways:

- **Using the R function.** We use the function `var`: $\hat{\sigma}^2 = \texttt{var}(x)$.
- **Using Formula 1.** Here, we use the first formula given in the question,
$$\hat{\sigma}^2 = \frac{1}{K}\sum_{i=0}^{K} (x_i - \hat{\mu})^2.$$
This is an unbiased estimator of the variance: the sample size is $K + 1$, so dividing by $K$ is the usual $n - 1$ correction.
- **Using Formula 2.** Here, we use an expansion of the above formula, also given in the question:
$$\hat{\sigma}^2 = \frac{(K+1)(K+2)}{12}.$$
This is obtained by using the general formula for the sum of squares of a sequence, $\sum_{i=1}^{N} i^2 = \frac{N(N+1)(2N+1)}{6}$.

Overall, we see that all three methods give us the same answer.
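As a quick sanity check on the closed-form expression (a minimal sketch, not part of the original exercise), we can confirm that $(K+1)(K+2)/12$ matches `var` for several values of $K$:

```r
# Sketch: check that (K+1)*(K+2)/12 equals var(0:K) for several K
sapply(c(10, 50, 100, 1000),
       function(K) isTRUE(all.equal(var(seq(0, K)), (K+1)*(K+2)/12)))
```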

## 2. Simulation

Since this exercise requires sampling many times on multiple occasions, let us define a function. This helps us keep our code compact.

```r
# defining a function for sampling
ex2 <- function(data, iterations, n_sample, how){
  set.seed(1)
  mu <- rep(0, iterations)

  # method 1
  # for (i in 1:iterations){
  #   x <- sample(data, n_sample, replace = how)
  #   mu[i] <- mean(x)
  # }

  # method 2
  x <- as.matrix(replicate(iterations, {sample(data, n_sample, replace = how)}))
  mu <- colMeans(x)

  # return the vector of estimated means
  return(mu)
}
```

In this function, we use `set.seed(1)` to fix a reference point for random number generation. This gives us the same random numbers every time we run the exercise, which is useful for reproducibility of the results. The function takes as arguments the data, the number of iterations, the size of each sample, and a logical variable indicating whether to sample with or without replacement. It then samples from the data in each iteration, computes the sample mean of each sample, and returns the vector of sample means.

(a) In this exercise, we are concerned with the variance of the sample mean and the variance of the population. From a given population, we draw many samples to study this.

```r
data <- 30:100
n_sample <- 10
iterations <- 100
mu_rep <- mu_norep <- rep(0, iterations)
mu_rep <- ex2(data, iterations, n_sample, TRUE)
mu_norep <- ex2(data, iterations, n_sample, FALSE)

cat('Variance of sample mean (with replacement):', var(mu_rep), '\n')
## Variance of sample mean (with replacement): 48.22492
cat('Variance of sample mean (without replacement):', var(mu_norep), '\n')
## Variance of sample mean (without replacement): 36.07392
cat('Variance of population:', var(data), '\n')
## Variance of population: 426
cat('(Variance of population)/(Sample size):', var(data)/n_sample)
## (Variance of population)/(Sample size): 42.6
```

The variables `data` and `n_sample` are defined as given in the question. We take the number of iterations to be 100. To obtain the estimates of the sample mean, we call the function `ex2` that we defined earlier. We know that for a sample mean,

$$E[\hat{\mu}] = E\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N} E(x_i) = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu,$$

so it is an unbiased estimate of the population mean. For the variance, when the draws are independent (sampling with replacement),

$$\mathrm{Var}[\hat{\mu}] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}(x_i) = \frac{1}{N^2}\sum_{i=1}^{N} \sigma^2 = \frac{\sigma^2}{N}.$$

Therefore, the variance of the sample mean is not the same as the variance of the population; it is smaller by a factor of $N$. This is also the result we get in the computation.

(b)

```r
set.seed(1)
n_sample <- 500
x <- sample(data, n_sample, replace = TRUE)
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   30.00   48.00   63.00   64.68   81.25  100.00
hist(x, main='Histogram of x (n=500)')
```

[Figure: Histogram of x (n=500); x ranges from 30 to 100.]

```r
n_sample <- 5000
x <- sample(data, n_sample, replace = TRUE)
hist(x, main='Histogram of x (n=5000)')
```

[Figure: Histogram of x (n=5000).]

```r
n_sample <- 500
iterations <- 100
mu_rep <- ex2(data, iterations, n_sample, TRUE)
cat('\n', 'For 100 iterations:', '\n')
##
## For 100 iterations:
cat('Variance of sample mean (with replacement):', var(mu_rep), '\n')
## Variance of sample mean (with replacement): 0.7739996
cat('Variance of population:', var(data), '\n')
## Variance of population: 426
cat('(Variance of population)/(Sample size):', var(data)/n_sample, '\n')
## (Variance of population)/(Sample size): 0.852

iterations <- 10^4
mu_rep <- ex2(data, iterations, n_sample, TRUE)
cat('\n', 'For 10^4 iterations:', '\n')
##
## For 10^4 iterations:
cat('Variance of sample mean (with replacement):', var(mu_rep), '\n')
## Variance of sample mean (with replacement): 0.8380109
cat('Variance of population:', var(data), '\n')
## Variance of population: 426
cat('(Variance of population)/(Sample size):', var(data)/n_sample)
## (Variance of population)/(Sample size): 0.852
```

As argued in the previous exercise, the variance of the sample mean is not equal to the variance of the data. We also observe that the simulated variance of the sample mean settles down toward the theoretical value as the number of iterations grows:

$$\mathrm{Var}[\hat{\mu}] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] \xrightarrow{p} \frac{\sigma^2}{N}.$$

The histograms of x themselves do not show a normal distribution, because we are sampling with replacement with equal probability of drawing any of the values: the draws follow an approximately discrete uniform distribution. This can be seen more clearly as the sample size increases.

(c)

```r
n_sample <- 10
iterations <- 100
mu_rep <- ex2(data, iterations, n_sample, TRUE)
hist(mu_rep, main='Histogram of mu_rep (100 iterations)')
```

[Figure: Histogram of mu_rep (100 iterations); mu_rep ranges roughly from 40 to 80.]

```r
qqnorm(scale(mu_rep))
abline(0, 1)
```

[Figure: Normal Q-Q plot of the scaled sample means, 100 iterations.]

```r
iterations <- 10^4
mu_rep <- ex2(data, iterations, n_sample, TRUE)
hist(mu_rep,                    # histogram
     breaks = 50,
     col = 'grey',              # column color
     border = "blue",
     prob = TRUE,               # show densities instead of frequencies
     main = 'Histogram of mu_rep (10^4 iterations)')
lines(density(mu_rep),          # density plot
      lwd = 2, col = 2)
abline(v = mean(mu_rep), col = 3, lwd = 2)
abline(v = median(mu_rep), col = 4, lwd = 2)
abline(v = c(mean(mu_rep) + 2*sqrt(var(mu_rep)),
             mean(mu_rep) - 2*sqrt(var(mu_rep))),
       col = 5, lty = 2, lwd = 2)
legend(x = "topright",          # location of legend within plot area
       c("Density plot", "Mean", "Median", "2 sigma"),
       col = c(2, 3, 4, 5), lwd = c(2, 2, 2, 2))
```

[Figure: Histogram of mu_rep (10^4 iterations) with density plot, mean, median, and 2-sigma lines marked.]

```r
qqnorm(scale(mu_rep))
abline(0, 1)
```

[Figure: Normal Q-Q plot of the scaled sample means, 10^4 iterations; the points lie close to the 45-degree line.]

Since our sampling is independent in each iteration and we are drawing from the same distribution (uniformly from the same data values), our procedure satisfies the i.i.d. assumption. Therefore, applying the Lindeberg-Lévy Central Limit Theorem, we have, approximately in large samples,

$$\hat{\mu} \sim N\!\left(\mu, \frac{\sigma^2}{N}\right).$$

We see this result more clearly as we increase the number of iterations. This shows that the estimator is asymptotically normal.
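As a quick numerical companion to the Q-Q plots (a sketch, assuming `data`, `n_sample = 10`, and the 10^4-iteration `mu_rep` from above are still in scope), we can compare the simulated spread of the sample means with the CLT prediction $\sigma/\sqrt{N}$:

```r
# Sketch: the simulated sd of the sample means should be close to
# the theoretical sigma/sqrt(n_sample), with n_sample = 10 here.
sigma <- sd(data)        # population standard deviation, about 20.6
sigma/sqrt(n_sample)     # theoretical sd of the sample mean, about 6.5
sd(mu_rep)               # simulated sd over 10^4 iterations
```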

## 3.

This is an exercise to practice using R to do some calculations.

```r
C <- 25000
r <- 0.05/12
n <- 360   # 600
A <- (r*C*(1+r)^n)/((1+r)^n - 1)
cat('Monthly payment (30 years): $', A, '\n')
## Monthly payment (30 years): $ 134.2054

C <- 25000
r <- 0.05/12
n <- 600
A <- (r*C*(1+r)^n)/((1+r)^n - 1)
cat('Monthly payment (50 years): $', A, '\n')
## Monthly payment (50 years): $ 113.5347
```

This exercise also introduces the use of functions. Below, we define a function named `payment`, which takes a value of `n` as input and returns the monthly payment. Writing functions is useful when the formula is complicated and the same calculation has to be performed many times.

```r
n <- 15*12
r <- 6/1200
C <- 200000
A <- r*((1+r)**n)*C/(((1+r)**n) - 1)
cat('Monthly payment (15 years): $', A, '\n')
## Monthly payment (15 years): $ 1687.714

n <- (20:50)*12
payment <- function(n){ r*((1+r)**n)*C/(((1+r)**n) - 1) }
plot(20:50, payment(n), type = "l",
     xlab = 'years', ylab = 'monthly payment (in $)')
```

[Figure: Monthly payment (in $) against mortgage length in years (20 to 50); the payment declines as the term lengthens.]
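As a cross-check on the annuity formula (a sketch, not part of the original exercise), we can simulate the loan balance month by month and confirm that the computed payment drives it to zero:

```r
# Sketch: brute-force amortization check of the 15-year payment.
# Each month the balance accrues interest r and is reduced by the payment.
balance <- 200000
r <- 6/1200
for (month in 1:(15*12)) {
  balance <- balance*(1 + r) - 1687.714
}
round(balance, 2)   # approximately 0 (up to rounding of the payment)
```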

## 4. Plotting

Let us generate data as given in the question.

```r
mu_x <- 1.7
sig_x <- 0.3
mu_e <- 0
sig_e <- 0.8
N <- 6000
x <- rnorm(n = N, mean = mu_x, sd = sig_x)
e <- rnorm(n = N, mean = mu_e, sd = sig_e)
y <- 12 + 3*x + e
```

First, we visualize the scatterplot of the generated data. Then, we plot two families of data points to explore the usage of plots in R: the first half of `x` is treated as one family and the other half as the second family. We use different symbols, colors, and a legend to highlight the difference.

```r
plot(x = x, y = y, type = 'p', pch = 1, cex = 0.5)
```

[Figure: Scatter plot of y against x; x ranges roughly from 0.5 to 2.5 and y from 14 to 20.]

```r
# plotting families
plot(x = c(x[1:3000], x[3001:6000]), y = c(y[1:3000], y[3001:6000]),
     type = 'p', pch = c(1, 2), cex = c(0.5, 0.5), col = c('black', 'red'),
     xlab = 'x', ylab = 'y', main = 'Scatter Plot')
legend("topleft", c("FamilyBlack", "FamilyRed"),
       pch = c(1, 2), col = c(1, 2), title = "Legend")
```

[Figure: "Scatter Plot" of y against x with a legend distinguishing FamilyBlack (circles) and FamilyRed (triangles).]

Here, we have made a scatterplot using the `plot` function. `type` gives the type of plot (e.g., 'p' gives a scatter plot, 'l' gives lines, etc.). The argument `pch` gives the plotting symbol; using different values gives different symbols (e.g., `pch=1` gives circles, `pch=2` gives triangles, etc.), which can be useful to differentiate families among the plotted points. Similarly, the argument `cex` sets the symbol size, and `col` gives the color of the symbols. The value can be given as a number (which has a pre-defined meaning in R, e.g., `col=1` gives black, `col=2` gives red, etc.) or as a name in quotes. `xlab`, `ylab`, and `main` give the label of the x-axis, the label of the y-axis, and the plot title, respectively. A legend is defined in the second plot to help the reader understand it; adding the legend is a good exercise.

4.2 We use the `lm` function from R to fit a linear regression model on the data. We get an object as output, which contains a lot of information; an object is like a collection of information. We extract the required information (e.g., coefficient estimates, their standard errors, etc.) using the `$` symbol, as shown below. The information contained in such an object can also be studied by looking at the variables window in R.

```r
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.88087 -0.53367  0.00344  0.54541  2.72196
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.07827    0.05895  204.88   <2e-16 ***
## x            2.95517    0.03425   86.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7961 on 5998 degrees of freedom
## Multiple R-squared:  0.5538, Adjusted R-squared:  0.5538
## F-statistic:  7445 on 1 and 5998 DF,  p-value: < 2.2e-16

beta <- lm(y~x)$coef
plot(x = x, y = y, type = 'p', pch = 1, cex = 0.5, col = 'grey',
     xlab = 'x', ylab = 'y', main = 'Scatter Plot')
# abline(a = beta[1], b = beta[2], col = 'black', lty = 1, lwd = 1)
abline(lm(y~x), col = 'black', lty = 1, lwd = 1)
```

[Figure: "Scatter Plot" of y against x in grey with the fitted regression line in black.]

We can change the line type using `lty` (e.g., `lty=1` gives a solid line, `lty=2` gives a dashed line, etc.) and the line width using `lwd`. We draw the line in black and the points in grey for good visualization.
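To make the object-extraction point above concrete (a minimal sketch; the accessors below are standard components of `lm` and `summary.lm` objects), the pieces of the fit can be pulled out individually:

```r
# Sketch: extracting components from the fitted lm object
fit <- lm(y ~ x)
coef(fit)                         # coefficient estimates (same as fit$coefficients)
summary(fit)$coefficients[, 2]    # standard errors of the estimates
summary(fit)$r.squared            # R-squared
head(residuals(fit))              # first few residuals
```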

4.3 Let us define a function that takes a sample from the given population and plots the regression line over the scatter plot. Here, the arguments are the population data and the sample size. The plot also includes a rough 95% confidence band, drawn here as the fitted line shifted up and down by two standard errors. We combine the concepts of sub-plots and loops. We use the `par` function to set the style of our subplots; in particular, `mfrow = c(m, n)` says that we will arrange the plots in a matrix with m rows and n columns.

```r
ex3 <- function(x, y, n){
  loc <- sample(x = 1:N, size = n, replace = TRUE)
  xi <- x[loc]
  yi <- y[loc]
  ifit <- lm(yi ~ xi)
  betai <- ifit$coefficients
  sd <- summary(ifit)$coefficients[2, 2]
  title <- paste("Sample ", i)
  plot(x = xi, y = yi, type = 'p', pch = 1, cex = 0.5, col = 'gray',
       xlab = 'x', ylab = 'y', main = title, axes = FALSE)
  abline(a = betai[1], b = betai[2], col = 'black')
  abline(a = beta[1], b = beta[2], col = 'red')
  abline(a = betai[1] + 2*sd, b = betai[2], col = 'black', lty = 3)
  abline(a = betai[1] - 2*sd, b = betai[2], col = 'black', lty = 3)
}

set.seed(1)
iterations <- 9
i <- 1
n <- 20
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2, cex.main = 1,
    mar = c(2, 2, 1, 0), oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0))
# mar, oma, mgp define the margins. See
# https://www.rdocumentation.org/packages/graphics/versions/3.5.1/topics/par
while (i <= iterations){
  ex3(x, y, n)   # create the plot by calling the function
  # draw the x-axis only on the last row of subplots, for neatness
  if (i %in% c(7, 8, 9)){
    axis(1, col = "grey40", col.axis = "grey20", at = seq(0.9, 2.4, 0.3))
  }
  # draw the y-axis only on the first column of subplots, for neatness
  if (i %in% c(1, 4, 7)){
    axis(2, col = "grey40", col.axis = "grey20", at = seq(15, 19, 1))
  }
  box(col = "grey60")
  i <- i + 1
}
mtext("x axis", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("y axis", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 20", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
```

[Figure: "For n = 20", a 3x3 grid of scatter plots, Sample 1 through Sample 9, each with the sample regression line (black), the full-sample line (red), and dotted bands; shared axes labelled "x axis" and "y axis".]
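The band drawn by `ex3` is a simplification: it shifts the whole fitted line by two standard errors of the slope. A pointwise confidence band is usually obtained with `predict`; a minimal sketch, assuming `x`, `y`, and `N` from above, follows.

```r
# Sketch: pointwise 95% confidence band for the regression line,
# computed with predict() instead of shifting the fitted line
set.seed(1)
loc <- sample(1:N, size = 20, replace = TRUE)
xi <- x[loc]; yi <- y[loc]
ifit <- lm(yi ~ xi)
grid <- data.frame(xi = seq(min(xi), max(xi), length.out = 100))
band <- predict(ifit, newdata = grid, interval = "confidence")
plot(xi, yi, pch = 1, cex = 0.5, col = 'gray', xlab = 'x', ylab = 'y')
lines(grid$xi, band[, "fit"], col = 'black')
lines(grid$xi, band[, "lwr"], col = 'black', lty = 3)
lines(grid$xi, band[, "upr"], col = 'black', lty = 3)
```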

4.4 We see that the variance of the OLS estimates increases as the sample size decreases. This can be seen by plotting the confidence bands around the estimates. Below, we repeat the plot sampling 500 data points instead of 20; the effect is evident when comparing the two graphs. Note that the axis ranges are the same in both graphs, so they are visually comparable.

```r
n <- 500
i <- 1
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2, cex.main = 1,
    mar = c(2, 2, 1, 0), oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0))
while (i <= iterations){
  ex3(x, y, n)
  if (i %in% c(7, 8, 9)){
    axis(1, col = "grey40", col.axis = "grey20", at = seq(0.9, 2.4, 0.3))
  }
  if (i %in% c(1, 4, 7)){
    axis(2, col = "grey40", col.axis = "grey20", at = seq(15, 19, 1))
  }
  box(col = "grey60")
  i <- i + 1
}
mtext("x axis", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("y axis", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 500", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
```

[Figure: "For n = 500", the same 3x3 grid of scatter plots, Sample 1 through Sample 9; the bands around the sample regression lines are visibly narrower than in the n = 20 case.]
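Under standard OLS assumptions the slope standard error shrinks roughly like $1/\sqrt{n}$, so going from n = 20 to n = 500 should narrow the bands by about a factor of five. A minimal numerical sketch of this, assuming `x`, `y`, and `N` from above (the helper `se_for_n` is introduced here purely for illustration):

```r
# Sketch: average slope standard error at two sample sizes;
# the ratio should be roughly sqrt(500/20) = 5
set.seed(1)
se_for_n <- function(n){
  loc <- sample(1:N, size = n, replace = TRUE)
  summary(lm(y[loc] ~ x[loc]))$coefficients[2, 2]
}
se20  <- mean(replicate(100, se_for_n(20)))
se500 <- mean(replicate(100, se_for_n(500)))
c(se20 = se20, se500 = se500, ratio = se20/se500)
```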

Also, the estimates appear biased in the small samples, though theoretically this should not be the case if the same assumptions hold. One explanation could be a violation of the exogeneity or homoskedasticity assumption on $\epsilon$ within a small sample. Let us compare the distribution of the dependent variable across samples, defining two more functions for this.

```r
ex34 <- function(x, y, n){
  loc <- sample(x = 1:N, size = n, replace = TRUE)
  xi <- x[loc]
  yi <- y[loc]
  title <- paste("Sample ", i)
  plot(density(yi), col = 'gray',
       xlab = 'y', ylab = 'Prob', main = title, axes = FALSE)
}

ex34q <- function(x, y, n){
  loc <- sample(x = 1:N, size = n, replace = TRUE)
  xi <- x[loc]
  yi <- y[loc]
  title <- paste("Sample ", i)
  qqnorm(scale(yi), col = 'gray',
         xlab = 'y', ylab = 'Prob', main = title, axes = FALSE)
  abline(0, 1)
}

set.seed(1)
i <- 1
n <- 20
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2, cex.main = 1,
    mar = c(1, 1, 1, 0), oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0))
while (i <= iterations){
  ex34(x, y, n)
  box(col = "grey60")
  i <- i + 1
}
mtext("y", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Probability Density", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 20", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
```

[Figure: "For n = 20", a 3x3 grid of density plots of y, Sample 1 through Sample 9; axes labelled "y" and "Probability Density".]

```r
set.seed(1)
i <- 1
n <- 20
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2, cex.main = 1,
    mar = c(1, 1, 1, 0), oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0))
while (i <= iterations){
  ex34q(x, y, n)
  box(col = "grey60")
  i <- i + 1
}
mtext("Theoretical quantiles", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Sample quantiles", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 20", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
```

[Figure: "For n = 20", a 3x3 grid of normal Q-Q plots of the scaled y values, Sample 1 through Sample 9; axes labelled "Theoretical quantiles" and "Sample quantiles".]

```r
set.seed(1)
i <- 1
n <- 500
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2, cex.main = 1,
    mar = c(1, 1, 1, 0), oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0))
while (i <= iterations){
  ex34(x, y, n)
  box(col = "grey60")
  i <- i + 1
}
mtext("y", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Probability Density", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 500", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
```


[Figure: "For n = 500", a 3x3 grid of density plots of y, Sample 1 through Sample 9.]

```r
set.seed(1)
i <- 1
n <- 500
par(pty = "m", mfrow = c(3, 3), cex.lab = 0.2, cex.main = 1,
    mar = c(1, 1, 1, 0), oma = c(1, 1, 1.5, 1), mgp = c(2, 0.5, 0))
while (i <= iterations){
  ex34q(x, y, n)
  box(col = "grey60")
  i <- i + 1
}
mtext("Theoretical quantiles", side = 1, outer = TRUE, cex = 0.7, col = "grey20")
mtext("Sample quantiles", side = 2, outer = TRUE, cex = 0.7, col = "grey20")
mtext("For n = 500", side = 3, outer = TRUE, cex = 1.2, col = "grey20")
```


[Figure: "For n = 500", a 3x3 grid of normal Q-Q plots of the scaled y values, Sample 1 through Sample 9.]

The property described by the CLT can be observed here: with the larger sample size, the sample quantiles line up more closely with the theoretical normal quantiles.
