mydataindex = read.table("./Hymenoptera.csv", header = TRUE, sep = ",", na.strings = "NA", dec = ".", strip.white = TRUE)mydataindex
Year Hymenoptera SunAPR1 1970 NA 128.72 1971 NA 115.73 1972 NA 130.74 1973 NA 150.05 1974 NA 149.56 1975 NA 125.77 1976 NA 152.68 1977 NA 157.49 1978 NA 110.410 1979 NA 121.511 1980 100.0 166.612 1981 96.1 129.013 1982 94.5 173.114 1983 97.3 136.015 1984 99.6 217.116 1985 100.7 128.117 1986 98.3 127.818 1987 96.2 154.219 1988 94.5 129.620 1989 93.4 132.121 1990 95.2 206.522 1991 95.7 148.823 1992 98.8 124.424 1993 100.7 113.725 1994 101.8 166.126 1995 102.8 173.127 1996 105.3 131.628 1997 104.5 165.329 1998 98.1 120.030 1999 95.7 154.931 2000 95.1 136.732 2001 94.4 134.033 2002 90.8 193.534 2003 88.9 192.035 2004 88.0 135.836 2005 90.7 147.437 2006 92.8 159.938 2007 93.5 215.639 2008 89.7 153.040 2009 82.9 172.041 2010 82.3 201.342 2011 84.9 218.343 2012 84.2 129.444 2013 83.2 168.445 2014 80.1 146.446 2015 78.0 220.747 2016 77.6 168.3
'data.frame': 47 obs. of 3 variables: $ Year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ... $ Hymenoptera: num NA NA NA NA NA NA NA NA NA NA ... $ SunAPR : num 129 116 131 150 150 ...
1.1 Run a linear regression: Hymenoptera~Year
Call:lm(formula = Hymenoptera ~ Year, data = mydataindex)Residuals: Min 1Q Median 3Q Max -7.123 -3.793 -1.203 3.017 11.097 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1152.11166 145.81814 7.901 2.73e-09 ***Year -0.53001 0.07298 -7.262 1.76e-08 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Residual standard error: 4.74 on 35 degrees of freedom (10 observations deleted due to missingness)Multiple R-squared: 0.6011, Adjusted R-squared: 0.5897 F-statistic: 52.74 on 1 and 35 DF, p-value: 1.758e-08
1.2 Run a Polynomial regression: Hymenoptera~ Year + I(Year^2)
Call:lm(formula = Hymenoptera ~ Year + I(Year^2), data = mydataindex)Residuals: Min 1Q Median 3Q Max -5.5962 -2.5437 -0.1513 1.7272 7.4874 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.298e+05 2.134e+04 -6.083 6.72e-07 ***Year 1.306e+02 2.136e+01 6.112 6.16e-07 ***I(Year^2) -3.281e-02 5.347e-03 -6.137 5.72e-07 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Residual standard error: 3.312 on 34 degrees of freedom (10 observations deleted due to missingness)Multiple R-squared: 0.8107, Adjusted R-squared: 0.7996 F-statistic: 72.83 on 2 and 34 DF, p-value: 5.124e-13
1.3 Scatterplot(Hymenoptera~Year) + a Linear Regression + a Polynomial Regression
scatterplot(Hymenoptera ~ Year, regLine = FALSE, smooth = FALSE, boxplots = FALSE, xlim = c(1980, 2020), ylim = c(70, 110), ylab = "Hymenoptera, index value", cex = 1.3, pch = 16, data = mydataindex, col = "green")xx <- seq(1980, 2020, length = 10)lines(xx, predict(RegModel.1, data.frame(Year = xx)), col = "blue", lwd = 2.5, lty = 6)lines(xx, predict(RegModel.2, data.frame(Year = xx)), col = "red", lwd = 2.5, lty = 6)legend("bottomleft", legend = c("Linear Regression: Hymenoptera ~ Year", "Polynomial regression: Hymenoptera ~ Year + I(Year^2)"), col = c("blue", "red"), lty = c(6, 6), cex = 0.9)
2.1 GAM: gam(Hymenoptera~s(Year)
- Generalized Additive Models(GAM) is a linear predictor additive models, assuming that the error structures follow Poisson and binomial distributed. Details can refere the Chapter 18 in Crawley (2013)
Family: gaussian Link function: identity Formula:Hymenoptera ~ s(Year)Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 93.1432 0.3408 273.3 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Approximate significance of smooth terms: edf Ref.df F p-value s(Year) 8.351 8.883 48.07 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1R-sq.(adj) = 0.922 Deviance explained = 94%GCV = 5.7515 Scale est. = 4.2979 n = 37
2.2 Diagnostics: gam(Hymenoptera~s(Year)
Method: GCV Optimizer: magicSmoothing parameter selection converged after 7 iterations.The RMS GCV score gradient at convergence was 1.756672e-06 .The Hessian was positive definite.Model rank = 10 / 10 Basis dimension (k) checking results. Low p-value (k-index<1) mayindicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(Year) 9.00 8.35 0.67 0.015 *---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.3 GAM: gam(Hymenoptera~s(Year,k=14)
- The k value controls the “wiggliness” of the response curve, i.e., how tightly the response curve follows the individual data points. A low k value makes the response smoother, while a high k value results in a response curve that follows the individual observation points closely.
Family: gaussian Link function: identity Formula:Hymenoptera ~ s(Year, k = 14)Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 93.1432 0.2988 311.7 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Approximate significance of smooth terms: edf Ref.df F p-value s(Year) 11.11 12.37 45.66 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1R-sq.(adj) = 0.94 Deviance explained = 95.8%GCV = 4.9125 Scale est. = 3.3044 n = 37
2.4 Diagnostics: gam(Hymenoptera~s(Year,k=14))
Method: GCV Optimizer: magicSmoothing parameter selection converged after 9 iterations.The RMS GCV score gradient at convergence was 3.356308e-05 .The Hessian was positive definite.Model rank = 14 / 14 Basis dimension (k) checking results. Low p-value (k-index<1) mayindicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(Year) 13.0 11.1 0.78 0.05 *---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5 Polynomial model Vs GAM model.
Analysis of Variance TableModel 1: Hymenoptera ~ Year + I(Year^2)Model 2: Hymenoptera ~ s(Year) Res.Df RSS Df Sum of Sq F Pr(>F) 1 34.000 373.05 2 27.649 118.83 6.3508 254.22 9.3136 1.02e-05 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance TableModel 1: Hymenoptera ~ Year + I(Year^2)Model 2: Hymenoptera ~ s(Year, k = 14) Res.Df RSS Df Sum of Sq F Pr(>F) 1 34.000 373.05 2 24.888 82.24 9.1116 290.81 9.6588 3.362e-06 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results:
- shows that the GAM model (gamyear_K14 or gamyear ) has much lower residual sum of squares (RSS) than the polynomial model (RegModel.2).
- the difference is highly significant, indicating that the response curve of the GAM model follows more tightly the observation points.
par(mfrow = c(1, 1))xx = seq(1980, 2016, length = 10)prd = data.frame(Year = xx)err <- predict(RegModel.2, newdata = prd, se.fit = TRUE)prd$lci <- err$fit - 1.96 * err$se.fitprd$fit <- err$fitprd$uci <- err$fit + 1.96 * err$se.fitplot(Hymenoptera ~ Year, pch = 16, data = mydataindex, xlim = c(1980, 2016), cex = 1.3, col = "green")lines(xx, prd$lci, lty = 5, col = "red")lines(xx, prd$fit, lty = 5, col = "blue")lines(xx, prd$uci, lty = 5, col = "red")grid()
2.6 GAM: gam(Hymenoptera~s(SunAPR))
Family: gaussian Link function: identity Formula:Hymenoptera ~ s(SunAPR)Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 93.143 1.157 80.47 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Approximate significance of smooth terms: edf Ref.df F p-value s(SunAPR) 1 1 4.769 0.0356 *---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1R-sq.(adj) = 0.0948 Deviance explained = 12%GCV = 52.399 Scale est. = 49.566 n = 37
2.7 Diagnostics: gam(Hymenoptera~s(SunAPR))
Method: GCV Optimizer: magicSmoothing parameter selection converged after 14 iterations.The RMS GCV score gradient at convergence was 5.263863e-06 .The Hessian was positive definite.Model rank = 10 / 10 Basis dimension (k) checking results. Low p-value (k-index<1) mayindicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(SunAPR) 9 1 0.79 0.07 .---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.8 GAM:gam(Hymenoptera~s(Year,k=14)+s(SunAPR)
gamyearsun_02 <- gam(Hymenoptera ~ s(Year, k = 14) + s(SunAPR), data = mydataindex)summary(gamyearsun_02)
Family: gaussian Link function: identity Formula:Hymenoptera ~ s(Year, k = 14) + s(SunAPR)Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 93.143 0.302 308.5 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Approximate significance of smooth terms: edf Ref.df F p-value s(Year) 11.13 12.37 38.764 <2e-16 ***s(SunAPR) 1.00 1.00 0.246 0.625 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1R-sq.(adj) = 0.938 Deviance explained = 95.9%GCV = 5.2283 Scale est. = 3.3735 n = 37
2.9 Diagnostics:gam(Hymenoptera~s(Year,k=14)+s(SunAPR)
Method: GCV Optimizer: magicSmoothing parameter selection converged after 17 iterations.The RMS GCV score gradient at convergence was 6.360075e-07 .The Hessian was positive definite.Model rank = 23 / 23 Basis dimension (k) checking results. Low p-value (k-index<1) mayindicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(Year) 13.0 11.1 0.78 0.045 *s(SunAPR) 9.0 1.0 1.07 0.655 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.10 GAM: gamyearsun_02 Vs gamyear
Analysis of Deviance TableModel 1: Hymenoptera ~ s(Year, k = 14) + s(SunAPR)Model 2: Hymenoptera ~ s(Year) Resid. Df Resid. Dev Df Deviance F Pr(>F) 1 22.627 80.54 2 27.117 118.83 -4.4901 -38.294 2.5281 0.06331 .---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.11 GAM: gamyearsun_02 Vs gamsun_01
Analysis of Deviance TableModel 1: Hymenoptera ~ s(Year, k = 14) + s(SunAPR)Model 2: Hymenoptera ~ s(SunAPR) Resid. Df Resid. Dev Df Deviance F Pr(>F) 1 22.627 80.54 2 35.000 1734.82 -12.373 -1654.3 39.633 2.238e-12 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gamdatacsv = read.table("./gamdata.csv", header = TRUE, sep = ",", na.strings = "NA", dec = ".", strip.white = TRUE)head(gamdatacsv, 30)
resp wave hill1 19 0.0 7.042 10 0.1 9.043 13 0.2 5.484 6 0.3 5.245 20 0.4 8.606 12 0.5 7.327 9 0.6 8.768 9 0.7 10.009 9 0.8 9.0010 10 0.9 7.8011 14 1.0 6.1612 9 1.1 6.0813 11 1.2 5.5214 15 1.3 7.9615 11 1.4 8.4816 14 1.5 8.4017 11 1.6 9.4418 12 1.7 7.8419 15 1.8 7.8820 7 1.9 5.7221 13 2.0 6.9622 8 2.1 9.4823 11 2.2 7.4024 8 2.3 9.3225 4 2.4 9.9626 15 2.5 7.9227 7 2.6 8.7228 3 2.7 5.3229 5 2.8 9.5230 5 2.9 5.40
'data.frame': 126 obs. of 3 variables: $ resp: int 19 10 13 6 20 12 9 9 9 10 ... $ wave: num 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ... $ hill: num 7.04 9.04 5.48 5.24 8.6 7.32 8.76 10 9 7.8 ...
- resp: the response variable
- wave and hill: explanatory variables
4.1 scatterplot Matrix: gamdatacsv
5.1 glmodel.A1: resp ~ wave + hill
glmodel.A1 <- glm(resp ~ wave + hill, family = poisson(link = log), data = gamdatacsv)summary(glmodel.A1)
Call:glm(formula = resp ~ wave + hill, family = poisson(link = log), data = gamdatacsv)Deviance Residuals: Min 1Q Median 3Q Max -4.1532 -0.7368 -0.0819 0.6189 3.6487 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.334230 0.161659 14.439 <2e-16 ***wave -0.017758 0.007982 -2.225 0.0261 * hill 0.002789 0.019940 0.140 0.8888 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1(Dispersion parameter for poisson family taken to be 1) Null deviance: 193.89 on 125 degrees of freedomResidual deviance: 188.88 on 123 degrees of freedomAIC: 699.01Number of Fisher Scoring iterations: 4
5.2 Basic Diagnostic Plot: resp ~ wave + hill
- The Diagnostic test for glmodel.A1 found that
- Residual deviance = 188.88 on 123 Degrees of freedom
- The ratio of Residual deviance/Degrees of freedom = 188.88 / 123 = 1.53561 which is greater than 1.2, indicating that the model is overdispersed . That is, the variance is increasing with the mean.
- Transformation of Poisson to quasiPoisson is a method to solve the Overdispersionproblem.
5.3 Has overdispersion problem in glm(resp ~ wave + hill)
\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{188.88}{123} = 1.53561 >> 1 \]
glmodel.B1 <- glm(resp ~ wave + hill, family = quasipoisson(link = log), data = gamdatacsv)summary(glmodel.B1)
Call:glm(formula = resp ~ wave + hill, family = quasipoisson(link = log), data = gamdatacsv)Deviance Residuals: Min 1Q Median 3Q Max -4.1532 -0.7368 -0.0819 0.6189 3.6487 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.334230 0.196143 11.901 <2e-16 ***wave -0.017758 0.009685 -1.834 0.0691 . hill 0.002789 0.024193 0.115 0.9084 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1(Dispersion parameter for quasipoisson family taken to be 1.472122) Null deviance: 193.89 on 125 degrees of freedomResidual deviance: 188.88 on 123 degrees of freedomAIC: NANumber of Fisher Scoring iterations: 4
6.1 Overdispersion problem in glm(resp ~ wave + hill)
- not so serious if using quasipoisson
\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{188.88}{123} = 1.53561 > 1.472122 \]
6.2 Calculate an ANOVA Table
Analysis of Deviance Table (Type II tests)Response: respError estimate based on Pearson residuals Sum Sq Df F value Pr(>F) wave 4.956 1 3.3665 0.06895 .hill 0.020 1 0.0133 0.90840 Residuals 181.070 123 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Deviance Table (Type II tests)Response: respError estimate based on Pearson residuals Sum Sq Df F value Pr(>F) wave 4.956 1 3.3665 0.06895 .hill 0.020 1 0.0133 0.90840 Residuals 181.070 123 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gam_resp_wave_hill_01 <- gam(resp ~ s(wave) + s(hill), data = gamdatacsv)summary(gam_resp_wave_hill_01)
Family: gaussian Link function: identity Formula:resp ~ s(wave) + s(hill)Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.4524 0.2416 39.12 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Approximate significance of smooth terms: edf Ref.df F p-value s(wave) 6.056 7.208 5.278 2.42e-05 ***s(hill) 3.377 4.184 20.625 5.21e-14 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1R-sq.(adj) = 0.476 Deviance explained = 51.6%GCV = 8.0183 Scale est. = 7.3543 n = 126
gam_resp_wave_hill_02 <- gam(resp ~ s(wave) + s(hill), family = poisson(link = log), data = gamdatacsv)summary(gam_resp_wave_hill_02)
Family: poisson Link function: log Formula:resp ~ s(wave) + s(hill)Parametric coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.20920 0.03001 73.63 <2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(wave) 5.805 6.949 29.57 0.000125 ***s(hill) 3.593 4.452 66.89 4.52e-13 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1R-sq.(adj) = 0.465 Deviance explained = 51.4%UBRE = -0.087376 Scale est. = 1 n = 126
7.1 Anova Test
Analysis of Deviance TableModel 1: resp ~ s(wave) + s(hill)Model 2: resp ~ wave + hill Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 113.6 94.195 2 126.0 188.880 -12.4 -94.685 9.356e-15 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Deviance TableModel 1: resp ~ s(wave) + s(hill)Model 2: resp ~ s(wave) + s(hill) Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 113.60 94.19 2 113.61 849.92 -0.0080872 -755.72 < 2.2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1