Divya Krishnamohan Student ID: 200292988
CORRELATION & REGRESSION EXERCISE
Biodiversity & Conservation Master’s Skills (BLGY5000)
Q1
The following is a list of hypothetical examples of the types of analysis for which each method might be used (the dependent variable is marked with an asterisk*; details of the variables are given in parentheses):
Pearson’s product moment correlation: Testing for an association between leaf area and root starch concentration in a clonal tree species (both variables continuous and normally distributed).
Rank correlation (Spearman and/or Kendall): Testing for an association between the diversity of flowering plant species (ranked) and the number of visiting pollinator species within the study area (normal distribution of variables not required).
Linear regression: Testing the relationship between the degree of nuptial shading* (continuous measure, residuals normally distributed) and the availability of mates in a species of fish (linearly related to dependent variable).
Logistic regression: Testing the effect of genetic distance on the sex* of sterile offspring (binomial distribution) in five hybrid species pairs.
Analysis of covariance: Testing the effect of soil type and soil permeability (ranked) on the rate of ant re-colonisation* (residuals normally distributed), with litter depth as a continuous covariate.
Q2
Dr. William Kunin collected data on the abundance of 33 species of North American ducks and geese, together with the number of chewing lice (Mallophaga) species recorded on each. The following analysis uses the data for 32 of these species; the species with code name “snwgos” is excluded because the data relevant to this analysis are missing. To determine whether the number of Mallophaga species associated with a duck species is affected by how common the host is, the diversity of Mallophaga is analysed against two measures of duck abundance:
i. Number of sites used in the Christmas Bird Count at which the species was recorded (abbreviated as CBC circles).
ii. Number seen in the Christmas Bird Count (abbreviated as CBC number).
a) Testing for a significant relationship between the two measures of host abundance:
Before a correlation is performed, the data must meet the assumptions of the chosen test. When its assumptions are met, a parametric test such as Pearson’s product-moment correlation has greater statistical power than its nonparametric counterparts, Kendall’s tau_b and Spearman’s rank correlation. Pearson’s correlation assumes that both variables are normally distributed. Histograms of CBC number and CBC circles reveal a strong skew in the former and a moderate skew in the latter (refer Fig. 1). A logarithmic transformation is usually applied to correct a strong skew, and a square root transformation to correct a moderate skew. Accordingly, CBC number was log10 transformed and CBC circles was square root transformed, and normality was assessed with the Shapiro-Wilk test, which was used because there are fewer than 50 cases (refer Fig. 2, Table 1).
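The effect of the two transformations on skew can be illustrated with a minimal sketch. It is not the procedure used for this exercise: the function name sample_skewness and the data values are hypothetical, and a moment-based skewness coefficient stands in for the Shapiro-Wilk test, which the Python standard library does not provide.

```python
import math

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed counts, NOT the CBC data from this exercise.
raw_counts = [1000, 3000, 10000, 30000, 100000, 300000, 1000000]
sqrt_counts = [math.sqrt(v) for v in raw_counts]
log_counts = [math.log10(v) for v in raw_counts]

skew_raw = sample_skewness(raw_counts)
skew_sqrt = sample_skewness(sqrt_counts)
skew_log = sample_skewness(log_counts)

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  log10: {skew_log:.2f}")
```

For these illustrative values the square root reduces the skew only partly, while the logarithm removes nearly all of it, which is consistent with reserving the logarithm for strongly skewed variables.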
[Figure: histograms of the untransformed variables (y-axis: Frequency). CBC number: Mean = 194135.47, Std. Dev. = 358003.762, N = 32. CBC circles: Mean = 375.66, Std. Dev. = 262.151, N = 32.]
Fig. 1. Histograms showing the distribution of untransformed data – CBC numbers and CBC circles.
[Figure: histograms of the transformed variables (y-axis: Frequency). log10 (CBC number): Mean = 4.8193, Std. Dev. = 0.70313, N = 32. square root (CBC circles): Mean = 18.0935, Std. Dev. = 7.05973, N = 32.]
Fig. 2 Histograms showing the distribution of log transformed CBC numbers and square root transformed CBC circles.
Table 1. Shapiro-Wilk test of normality on untransformed and transformed variables (CBC number and CBC circles).
Tests of Normality (Shapiro-Wilk)
Variable              Statistic   df   Sig.
CBC number            .513        32   .000
log10 (CBC number)    .948        32   .124
CBC circles           .922        32   .024
sqrt (CBC circles)    .947        32   .121
The transformations have brought both variables close to normality (Shapiro-Wilk P > 0.05 for each transformed variable; Table 1). The results of a Pearson’s product-moment correlation are described in the table below.
Table 2. Pearson’s product-moment correlation for sqrt (CBC circles) and log10 (CBC number).
Correlations
                                          sqrt(CBC circles)   log(CBC number)
sqrt(CBC circles)   Pearson Correlation   1                   .625(**)
                    Sig. (2-tailed)                           .000
                    N                     32                  32
log(CBC number)     Pearson Correlation   .625(**)            1
                    Sig. (2-tailed)       .000
                    N                     32                  32
** Correlation is significant at the 0.01 level (2-tailed).
The Pearson’s correlation reveals a significant positive correlation between the two transformed variables (r=0.625, N=32, P<0.001). However, no literature was found to support applying two different transformations to the variables in a single correlation analysis. A more conservative, nonparametric test, Spearman’s rank correlation, which makes no assumption about the normality of the distributions, was therefore also used to determine whether there truly is a significant correlation between CBC number and CBC circles.
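The significance of a Pearson coefficient follows from converting r to a t statistic with N - 2 degrees of freedom. The sketch below applies that standard conversion to the reported r and N; the function name correlation_t_statistic is hypothetical, and the final step from t to P still needs a t table or a statistics package.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# r and N as reported in Table 2.
t_value = correlation_t_statistic(0.625, 32)
degrees_of_freedom = 32 - 2

print(f"t = {t_value:.3f} on {degrees_of_freedom} df")
```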
Table 3. Spearman’s rank correlation for CBC circles and CBC number.
Correlations (Spearman's rho)
                                        CBC circles   CBC number
CBC circles   Correlation Coefficient   1.000         .540(**)
              Sig. (2-tailed)           .             .001
              N                         32            32
CBC number    Correlation Coefficient   .540(**)      1.000
              Sig. (2-tailed)           .001          .
              N                         32            32
** Correlation is significant at the 0.01 level (2-tailed).
The results of a Spearman’s rank correlation (rho=0.540, P=0.001) concur with the Pearson’s correlation test, i.e., there is a significant and positive correlation between CBC circles and CBC number.
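Spearman’s rho is the Pearson correlation of the ranks of the two variables, which is why it needs no assumption of normality. The following is a minimal sketch of that idea under stated assumptions: the function names are hypothetical and the data are illustrative values, not the CBC data.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranked data."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical values, NOT the CBC data from this exercise.
circles = [12, 45, 80, 150, 300]
numbers = [90, 400, 2500, 2500, 60000]
rho = spearman_rho(circles, numbers)

print(f"rho = {rho:.3f}")
```

Because only the ranks enter the calculation, the very large final value in the illustrative numbers list has no more influence than any other observation.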
b) Separately testing the effect of each measure of host abundance on the diversity of Mallophaga species:
Because the aim is to test whether a measure of host abundance predicts Mallophaga diversity, and the variables are counts treated here as continuous, a linear regression analysis was chosen as the most appropriate test.
Assumptions of Linear Regression Analysis:
Independent (x) variables are measured without error – This is assumed to be true.
Errors in dependent (y) variable are normally distributed – A normality test of the residuals of the dependent variable, Mallophaga species, gives no evidence of a departure from normality (Shapiro-Wilk=0.966, df=32, P=0.395).
Variance in the dependent variable is constant – A residual plot of Mallophaga species variable reveals a homoscedastic variance. (Refer Fig. 3)
[Figure: residual plot matrix for the dependent variable Mallophaga spp (panels: Observed, Predicted, Std. Residual; Model: Intercept).]
Fig. 3. Residual plot of the Mallophaga species variable showing a homoscedastic distribution of variance.
Relationship between dependent and independent variables is linear – A scatter plot of the data reveals that in an untransformed state, the relationship is not linear in the case of Mallophaga and CBC numbers, as well as Mallophaga and CBC circles. (Refer Fig. 4)
[Figure: two scatter plots of Mallophaga species (y-axis, 0 to 7) against CBC number (x-axis, 0 to 1500000) and against CBC circles.]
Fig. 4. Scatter plots showing the relationship between dependent variable Mallophaga species and independent variables CBC number and CBC circles.
A logarithmic transformation of the independent variable CBC number helps to increase linearity. (Refer Fig. 5)
[Figure: scatter plot of Mallophaga species (y-axis, 0 to 7) against log(CBC number) (x-axis, 3.00 to 6.50).]
Fig. 5. Scatter plot showing the relationship between dependent variable Mallophaga species and log transformed independent variable, CBC number.
NB: Transforming CBC circles, or even the Mallophaga species variable, does not increase the linearity of the relationship between these two variables.
Analysis of dependent variable Mallophaga species with independent variable CBC circles:
A linear regression analysis of these two variables is not appropriate, as CBC circles does not have a linear relationship with Mallophaga species. Moreover, a linear regression is informative only if the two variables are correlated. A Pearson’s correlation between Mallophaga species and square root transformed CBC circles (both variables are normally distributed) reveals no significant correlation between them (r=0.241, N=32, P=0.184).
Linear regression analysis of dependent variable Mallophaga species with independent variable CBC number:
A linear regression of Mallophaga species on CBC number (log transformed to increase linearity) reveals a significant association between the two variables (F=4.502, df=1, 30, P=0.042). CBC number explains 13% of the variance in the dependent variable (R2=0.130).
Backward multiple regression analysis of dependent variable Mallophaga species with independent variables CBC circles and CBC number:
Multiple regression allows more than one independent variable to be included in a model, and the backward method helps determine which of the variables should be retained in the final model of ‘best fit’. The procedure first fits a model containing all the factors together, and then takes a step “backwards” by removing the factor assessed as contributing least to the model, repeating until every remaining factor meets the retention criterion. The choice of the best-fit model depends on two values: the R2 value (the proportion of variation in the dependent variable explained by the model) and the significance of the model, denoted by the P value of the test statistic.
The following table summarises the models tested by a backward multiple regression; also indicated is the criterion for factor removal.
Table 4. Summary of variables entered and removed in the backward multiple regressions performed on variables Mallophaga, CBC circles and log CBC number.
Variables Entered/Removed (b)
Model   Variables Entered                     Variables Removed   Method
1       CBC circles, log (CBC number) (a)     .                   Enter
2       .                                     CBC circles         Backward (criterion: Probability of F-to-remove >= .100)
a. All requested variables entered.
b. Dependent Variable: Mallophaga sp
The backward multiple regression analysis compared two possible models. Model 1 has Mallophaga species as the dependent variable and log CBC number and CBC circles as independent variables; Model 2 has Mallophaga species as the dependent variable and log CBC number as the only independent variable. Model 1 had an R2 value marginally higher than Model 2 (R2=0.133 and R2=0.130, respectively). However, Model 2 was significant (F=4.502, df=1, 30, P=0.042), while Model 1 was non-significant (F=2.232, df=2, 29, P=0.125); refer Table 5. The adjusted R2 values, which give a more conservative estimate of the variation explained because they correct R2 for the number of independent variables in the model, indicate that Model 2 explains 10.1% of the variation and Model 1 only 7.4%.
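The adjusted R2 values can be recovered from the sums of squares in Table 5 with the standard formula, adjusted R2 = 1 - (SS residual / (n - k - 1)) / (SS total / (n - 1)), where k is the number of independent variables. The sketch below is illustrative only; the function name adjusted_r_squared is hypothetical.

```python
def adjusted_r_squared(ss_residual, ss_total, n, n_predictors):
    """Adjusted R^2 = 1 - (SS_res / (n - k - 1)) / (SS_tot / (n - 1))."""
    return 1.0 - (ss_residual / (n - n_predictors - 1)) / (ss_total / (n - 1))

N_SPECIES = 32     # number of host species analysed
SS_TOTAL = 88.875  # total sum of squares, Table 5

# Residual sums of squares for each model, Table 5.
adj_model_1 = adjusted_r_squared(77.020, SS_TOTAL, N_SPECIES, n_predictors=2)
adj_model_2 = adjusted_r_squared(77.279, SS_TOTAL, N_SPECIES, n_predictors=1)

print(f"Model 1: {adj_model_1:.3f}  Model 2: {adj_model_2:.3f}")
```

The penalty for the second independent variable in Model 1 outweighs its very small gain in explained variation, which is why the ordering of the two models reverses after adjustment.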
From these statistics it is evident that in spite of a marginally lower R2 value, Model 2 is actually the model representing the best fit.
Table 5. The ANOVA table for the backward multiple regression.
ANOVA (c)
Model                Sum of Squares   df   Mean Square   F       Sig.
1       Regression   11.855           2    5.927         2.232   .125(a)
        Residual     77.020           29   2.656
        Total        88.875           31
2       Regression   11.596           1    11.596        4.502   .042(b)
        Residual     77.279           30   2.576
        Total        88.875           31
a. Predictors: (Constant), CBC circles, log (CBC number)
b. Predictors: (Constant), log (CBC number)
c. Dependent Variable: Mallophaga sp
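The R2 and F values quoted for the two models follow directly from the entries of Table 5: R2 is the regression sum of squares divided by the total, and F is the regression mean square divided by the residual mean square. A minimal sketch, with a hypothetical helper name anova_summary:

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Return R^2 and the F ratio from the sums of squares of one model."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "r_squared": ss_regression / (ss_regression + ss_residual),
        "f_ratio": ms_regression / ms_residual,
    }

# Sums of squares and degrees of freedom from Table 5.
model_1 = anova_summary(11.855, 77.020, 2, 29)
model_2 = anova_summary(11.596, 77.279, 1, 30)

for name, model in (("Model 1", model_1), ("Model 2", model_2)):
    print(f"{name}: R2 = {model['r_squared']:.3f}, F = {model['f_ratio']:.3f}")
```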
A possible explanation for this outcome is the inter-correlation of the independent variables within Model 1. ‘Multicollinearity’ (or ‘collinearity’) arises when two or more independent variables are highly correlated with one another, and it is undesirable because it makes the coefficient estimates unstable. While the problem is usually said to arise between variables with a correlation above 0.5, the possibility of collinearity weakening the regression analysis can be assessed with collinearity diagnostics, the results of which are given in Table 6. The eigenvalues indicate how many distinct dimensions there are among the independent variables. Of the three eigenvalues in Model 1, one is close to zero (0.007). This could indicate that the variables are fairly highly inter-correlated, so that small changes in the data values may lead to large changes in the estimates of the coefficients.
Table 6. Results of the collinearity diagnostics test on Models 1 and 2.
Collinearity Diagnostics (a)
                                                    Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   log CBC number   CBC circles
1       1           2.789        1.000             .00          .00              .02
        2           .204         3.695             .02          .01              .65
        3           .007         20.069            .98          .99              .33
2       1           1.990        1.000             .01          .01
        2           .010         13.999            .99          .99
a. Dependent Variable: Mallophaga spp
The condition index (the square root of the ratio of the largest eigenvalue to each successive eigenvalue) indicates a possible problem with collinearity if it is greater than 15, as seen in Model 1, dimension 3 (20.069). Further, the backward multiple regression analysis shows that the excluded variable, CBC circles, makes no significant additional contribution to the model once log CBC number is included (t=0.312, P=0.757, where t tests the null hypothesis that the regression coefficient is zero).
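The condition indices can be sketched from the eigenvalues as defined above. The function name condition_indices is hypothetical, and because the eigenvalues are entered as printed (rounded to three decimals) the result for the third dimension comes out close to, but not exactly, the tabulated 20.069.

```python
import math

def condition_indices(eigenvalues):
    """Square root of the largest eigenvalue divided by each eigenvalue."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / value) for value in eigenvalues]

# Model 1 eigenvalues as printed (rounded) in Table 6.
model_1_eigenvalues = [2.789, 0.204, 0.007]
model_1_indices = condition_indices(model_1_eigenvalues)

# Flag dimensions above the threshold of 15 used in the text.
flagged = [index > 15 for index in model_1_indices]

for dimension, (index, flag) in enumerate(zip(model_1_indices, flagged), start=1):
    print(f"dimension {dimension}: condition index {index:.2f}, flagged = {flag}")
```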
Table 7. The coefficients table generated by a linear regression analysis of Mallophaga species and log transformed CBC number.
Coefficients (a)
                   Unstandardized Coefficients   Standardized Coefficients
                   B        Std. Error           Beta                        t       Sig.
(Constant)         -1.005   1.996                                            -.503   .618
log(CBC number)    .870     .410                 .361                        2.122   .042
a. Dependent Variable: Mallophaga sp
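The unstandardized coefficients in Table 7 define the fitted line, Mallophaga species = -1.005 + 0.870 x log(CBC number). The sketch below applies it under two assumptions that are labelled in the code: the logarithm is taken to be base 10, as in Fig. 2, and the function name predict_mallophaga_species is hypothetical.

```python
import math

INTERCEPT = -1.005  # B for (Constant), Table 7
SLOPE = 0.870       # B for log(CBC number), Table 7

def predict_mallophaga_species(cbc_number):
    """Predicted number of Mallophaga species (assumes a base-10 logarithm)."""
    return INTERCEPT + SLOPE * math.log10(cbc_number)

# Illustrative host abundance inside the observed range of CBC number.
predicted = predict_mallophaga_species(100_000)

print(f"Predicted Mallophaga species: {predicted:.3f}")
```

Since the model explains only about 13% of the variance, any such prediction carries wide uncertainty and should not be extrapolated beyond the observed range of host abundance.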
Conclusion: A backward multiple regression on the data determined that the model of best fit comprises the dependent variable Mallophaga species and the independent variable CBC number (log transformed to increase linearity). The independent variable CBC circles was excluded from the model, possibly as a consequence of collinearity with CBC number. Further, the relationship of CBC circles with the dependent variable was non-linear, and the two were not significantly correlated. The analysis of the selected model revealed that Mallophaga diversity is significantly related to duck host abundance as measured by CBC number (F=4.502, df=1, 30, P=0.042). The coefficients table (refer Table 7) may be used to predict the diversity of Mallophaga for a given abundance of duck hosts. It should be noted, however, that the regression is a weak one: the R2 value of the model indicates that only 13% of the variance in Mallophaga diversity is explained. It is evident that other, more influential factors affect Mallophaga diversity; however, these lie outside the scope of this analysis.