Introduction to Generalized Additive Models (GAM) with RStudio (2024)

   Year Hymenoptera SunAPR
1  1970          NA  128.7
2  1971          NA  115.7
3  1972          NA  130.7
4  1973          NA  150.0
5  1974          NA  149.5
6  1975          NA  125.7
7  1976          NA  152.6
8  1977          NA  157.4
9  1978          NA  110.4
10 1979          NA  121.5
11 1980       100.0  166.6
12 1981        96.1  129.0
13 1982        94.5  173.1
14 1983        97.3  136.0
15 1984        99.6  217.1
16 1985       100.7  128.1
17 1986        98.3  127.8
18 1987        96.2  154.2
19 1988        94.5  129.6
20 1989        93.4  132.1
21 1990        95.2  206.5
22 1991        95.7  148.8
23 1992        98.8  124.4
24 1993       100.7  113.7
25 1994       101.8  166.1
26 1995       102.8  173.1
27 1996       105.3  131.6
28 1997       104.5  165.3
29 1998        98.1  120.0
30 1999        95.7  154.9
31 2000        95.1  136.7
32 2001        94.4  134.0
33 2002        90.8  193.5
34 2003        88.9  192.0
35 2004        88.0  135.8
36 2005        90.7  147.4
37 2006        92.8  159.9
38 2007        93.5  215.6
39 2008        89.7  153.0
40 2009        82.9  172.0
41 2010        82.3  201.3
42 2011        84.9  218.3
43 2012        84.2  129.4
44 2013        83.2  168.4
45 2014        80.1  146.4
46 2015        78.0  220.7
47 2016        77.6  168.3

'data.frame': 47 obs. of 3 variables:
 $ Year       : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
 $ Hymenoptera: num NA NA NA NA NA NA NA NA NA NA ...
 $ SunAPR     : num 129 116 131 150 150 ...
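The import step is not shown in the original; a minimal sketch that would produce the listing and structure above, assuming the data live in a CSV file (the file name mydataindex.csv is hypothetical):

# Read the yearly index data (file name is an assumption; adjust the path as needed)
mydataindex <- read.csv("mydataindex.csv", header = TRUE)

# Print the full data frame and inspect its structure
mydataindex
str(mydataindex)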

1.1 Run a linear regression: Hymenoptera ~ Year

Call:
lm(formula = Hymenoptera ~ Year, data = mydataindex)

Residuals:
    Min      1Q  Median      3Q     Max
 -7.123  -3.793  -1.203   3.017  11.097

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 1152.11166  145.81814   7.901 2.73e-09 ***
Year          -0.53001    0.07298  -7.262 1.76e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 35 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared: 0.6011,  Adjusted R-squared: 0.5897
F-statistic: 52.74 on 1 and 35 DF,  p-value: 1.758e-08
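A sketch of the call behind the summary above (the object name RegModel.1 is an assumption; the formula and data come from the Call line):

# Simple linear trend of the Hymenoptera index over the years
RegModel.1 <- lm(Hymenoptera ~ Year, data = mydataindex)
summary(RegModel.1)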


1.2 Run a polynomial regression: Hymenoptera ~ Year + I(Year^2)

Call:
lm(formula = Hymenoptera ~ Year + I(Year^2), data = mydataindex)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5962 -2.5437 -0.1513  1.7272  7.4874

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.298e+05  2.134e+04  -6.083 6.72e-07 ***
Year         1.306e+02  2.136e+01   6.112 6.16e-07 ***
I(Year^2)   -3.281e-02  5.347e-03  -6.137 5.72e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.312 on 34 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared: 0.8107,  Adjusted R-squared: 0.7996
F-statistic: 72.83 on 2 and 34 DF,  p-value: 5.124e-13
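A sketch of the quadratic fit behind the summary above; RegModel.2 is the name the text uses later for this polynomial model:

# Second-order polynomial trend; I() protects the ^ operator inside the formula
RegModel.2 <- lm(Hymenoptera ~ Year + I(Year^2), data = mydataindex)
summary(RegModel.2)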


2.1 GAM: gam(Hymenoptera ~ s(Year))

  • A generalized additive model (GAM) keeps the additive structure of a linear predictor but lets individual terms be smooth functions of the covariates; the error structure can follow any exponential-family distribution (e.g., Gaussian, Poisson, or binomial). See Chapter 18 of Crawley (2013) for details. A sketch of the fitting call is given below.
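The summary below matches the output style of mgcv::gam() (GCV smoothness selection, edf, k-index), so a minimal sketch assuming mgcv, with the object name gamyear that the text refers to later:

library(mgcv)  # assumed: the GCV/edf/k-index output shown here is mgcv-style

# GAM with a smooth function of Year; rows with missing Hymenoptera are dropped by default
gamyear <- gam(Hymenoptera ~ s(Year), data = mydataindex)
summary(gamyear)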
Family: gaussian
Link function: identity

Formula:
Hymenoptera ~ s(Year)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  93.1432     0.3408   273.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value
s(Year) 8.351  8.883 48.07  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.922   Deviance explained = 94%
GCV = 5.7515  Scale est. = 4.2979    n = 37

2.2 Diagnostics: gam(Hymenoptera ~ s(Year))

[Figure: diagnostic plots for gam(Hymenoptera ~ s(Year))]

Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 7 iterations.
The RMS GCV score gradient at convergence was 1.756672e-06.
The Hessian was positive definite.
Model rank = 10 / 10

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

          k'  edf k-index p-value
s(Year) 9.00 8.35    0.67   0.015 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
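The convergence and basis-dimension report above (together with the diagnostic plots) is what mgcv's gam.check() prints; a minimal sketch:

# Arrange the four diagnostic plots on one page, then run the residual and k checks
par(mfrow = c(2, 2))
gam.check(gamyear)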

2.3 GAM: gam(Hymenoptera ~ s(Year, k = 14))

  • The k argument sets the basis dimension of the smooth term, i.e., an upper limit on how "wiggly" the response curve can be and therefore how tightly it can follow the individual data points. A low k forces a smoother response, while a high k allows a curve that tracks the individual observations more closely. A sketch of the refit with k = 14 follows below.
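A sketch of the refit with a larger basis dimension, producing the summary below; gamyear_K14 is the object name used later in the text:

# Allow up to 14 basis functions for the smooth of Year
gamyear_K14 <- gam(Hymenoptera ~ s(Year, k = 14), data = mydataindex)
summary(gamyear_K14)

# Optional: visualise the estimated smooth with partial residuals
plot(gamyear_K14, residuals = TRUE, pch = 16)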
Family: gaussian
Link function: identity

Formula:
Hymenoptera ~ s(Year, k = 14)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  93.1432     0.2988   311.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value
s(Year) 11.11  12.37 45.66  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.94   Deviance explained = 95.8%
GCV = 4.9125  Scale est. = 3.3044    n = 37

2.4 Diagnostics: gam(Hymenoptera ~ s(Year, k = 14))

[Figure: diagnostic plots for gam(Hymenoptera ~ s(Year, k = 14))]

Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 9 iterations.
The RMS GCV score gradient at convergence was 3.356308e-05.
The Hessian was positive definite.
Model rank = 14 / 14

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

          k'  edf k-index p-value
s(Year) 13.0 11.1    0.78    0.05 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.5 Polynomial model vs. GAM model

Analysis of Variance Table

Model 1: Hymenoptera ~ Year + I(Year^2)
Model 2: Hymenoptera ~ s(Year)
  Res.Df    RSS     Df Sum of Sq      F   Pr(>F)
1 34.000 373.05
2 27.649 118.83 6.3508    254.22 9.3136 1.02e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of Variance Table

Model 1: Hymenoptera ~ Year + I(Year^2)
Model 2: Hymenoptera ~ s(Year, k = 14)
  Res.Df    RSS     Df Sum of Sq      F    Pr(>F)
1 34.000 373.05
2 24.888  82.24 9.1116    290.81 9.6588 3.362e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
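A hedged guess at the comparisons behind the two tables above: mgcv gam objects inherit from lm/glm, so anova() can be asked for an F test across the polynomial and GAM fits (not necessarily the author's exact calls):

# Polynomial model vs. GAM with the default basis dimension
anova(RegModel.2, gamyear, test = "F")

# Polynomial model vs. GAM with k = 14
anova(RegModel.2, gamyear_K14, test = "F")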

Results:

  • The ANOVA tables show that the GAM models (gamyear_K14 and gamyear) have a much lower residual sum of squares (RSS) than the polynomial model (RegModel.2).
  • The difference is highly significant, indicating that the GAM response curve follows the observation points much more closely.


2.6 GAM: gam(Hymenoptera ~ s(SunAPR))

Family: gaussian
Link function: identity

Formula:
Hymenoptera ~ s(SunAPR)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   93.143      1.157   80.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value
s(SunAPR)   1      1 4.769  0.0356 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.0948   Deviance explained = 12%
GCV = 52.399  Scale est. = 49.566    n = 37
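A sketch of the fit on April sunshine alone behind the summary above, using the object name gamsun_01 that appears in Section 2.11:

# Smooth of April sunshine only; edf = 1 above means the smooth collapsed to a straight line
gamsun_01 <- gam(Hymenoptera ~ s(SunAPR), data = mydataindex)
summary(gamsun_01)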


2.7 Diagnostics: gam(Hymenoptera ~ s(SunAPR))

[Figure: diagnostic plots for gam(Hymenoptera ~ s(SunAPR))]

Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 14 iterations.
The RMS GCV score gradient at convergence was 5.263863e-06.
The Hessian was positive definite.
Model rank = 10 / 10

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

          k' edf k-index p-value
s(SunAPR)  9   1    0.79    0.07 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.8 GAM: gam(Hymenoptera ~ s(Year, k = 14) + s(SunAPR))

Family: gaussian
Link function: identity

Formula:
Hymenoptera ~ s(Year, k = 14) + s(SunAPR)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   93.143      0.302   308.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
            edf Ref.df      F p-value
s(Year)   11.13  12.37 38.764  <2e-16 ***
s(SunAPR)  1.00   1.00  0.246   0.625
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.938   Deviance explained = 95.9%
GCV = 5.2283  Scale est. = 3.3735    n = 37
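A sketch of the two-smooth model summarised above, using the name gamyearsun_02 from Sections 2.10 and 2.11:

# Additive model with smooths of both Year (k = 14) and April sunshine
gamyearsun_02 <- gam(Hymenoptera ~ s(Year, k = 14) + s(SunAPR), data = mydataindex)
summary(gamyearsun_02)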


2.9 Diagnostics: gam(Hymenoptera ~ s(Year, k = 14) + s(SunAPR))

[Figure: diagnostic plots for gam(Hymenoptera ~ s(Year, k = 14) + s(SunAPR))]

Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 17 iterations.
The RMS GCV score gradient at convergence was 6.360075e-07.
The Hessian was positive definite.
Model rank = 23 / 23

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

            k'  edf k-index p-value
s(Year)   13.0 11.1    0.78   0.045 *
s(SunAPR)  9.0  1.0    1.07   0.655
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.10 GAM: gamyearsun_02 vs. gamyear

Analysis of Deviance Table

Model 1: Hymenoptera ~ s(Year, k = 14) + s(SunAPR)
Model 2: Hymenoptera ~ s(Year)
  Resid. Df Resid. Dev      Df Deviance      F  Pr(>F)
1    22.627      80.54
2    27.117     118.83 -4.4901  -38.294 2.5281 0.06331 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.11 GAM: gamyearsun_02 vs. gamsun_01

Analysis of Deviance Table

Model 1: Hymenoptera ~ s(Year, k = 14) + s(SunAPR)
Model 2: Hymenoptera ~ s(SunAPR)
  Resid. Df Resid. Dev      Df Deviance      F    Pr(>F)
1    22.627      80.54
2    35.000    1734.82 -12.373  -1654.3 39.633 2.238e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
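A hedged sketch of the nested-model comparisons behind the two deviance tables above (anova.gam with an F test; the model names follow the section headings):

# Full model (Year and SunAPR smooths) against the Year-only and SunAPR-only models
anova(gamyearsun_02, gamyear, test = "F")
anova(gamyearsun_02, gamsun_01, test = "F")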
(first 30 of 126 rows shown)
   resp wave  hill
1    19  0.0  7.04
2    10  0.1  9.04
3    13  0.2  5.48
4     6  0.3  5.24
5    20  0.4  8.60
6    12  0.5  7.32
7     9  0.6  8.76
8     9  0.7 10.00
9     9  0.8  9.00
10   10  0.9  7.80
11   14  1.0  6.16
12    9  1.1  6.08
13   11  1.2  5.52
14   15  1.3  7.96
15   11  1.4  8.48
16   14  1.5  8.40
17   11  1.6  9.44
18   12  1.7  7.84
19   15  1.8  7.88
20    7  1.9  5.72
21   13  2.0  6.96
22    8  2.1  9.48
23   11  2.2  7.40
24    8  2.3  9.32
25    4  2.4  9.96
26   15  2.5  7.92
27    7  2.6  8.72
28    3  2.7  5.32
29    5  2.8  9.52
30    5  2.9  5.40

'data.frame': 126 obs. of 3 variables:
 $ resp: int 19 10 13 6 20 12 9 9 9 10 ...
 $ wave: num 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ...
 $ hill: num 7.04 9.04 5.48 5.24 8.6 7.32 8.76 10 9 7.8 ...
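The second example switches to a count data set (gamdatacsv) with a response resp and two continuous predictors, wave and hill. A minimal, hypothetical loading step that would produce the listing above (the file name gamdata.csv is an assumption):

# Count response with two continuous predictors (file name is an assumption)
gamdatacsv <- read.csv("gamdata.csv", header = TRUE)
head(gamdatacsv, 30)
str(gamdatacsv)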

5.1 glmodel.A1: resp ~ wave + hill

Call:
glm(formula = resp ~ wave + hill, family = poisson(link = log),
    data = gamdatacsv)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.1532  -0.7368  -0.0819   0.6189   3.6487

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.334230   0.161659  14.439   <2e-16 ***
wave        -0.017758   0.007982  -2.225   0.0261 *
hill         0.002789   0.019940   0.140   0.8888
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 193.89  on 125  degrees of freedom
Residual deviance: 188.88  on 123  degrees of freedom
AIC: 699.01

Number of Fisher Scoring iterations: 4
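A sketch of the Poisson GLM behind the summary above, using the object name glmodel.A1 from the heading:

# Poisson regression with a log link for the count response
glmodel.A1 <- glm(resp ~ wave + hill, family = poisson(link = "log"), data = gamdatacsv)
summary(glmodel.A1)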

5.2 Basic Diagnostic Plot: resp ~ wave + hill

[Figure: basic diagnostic plots for glmodel.A1 (resp ~ wave + hill)]

  • The diagnostic check for glmodel.A1 found that:
    • Residual deviance = 188.88 on 123 degrees of freedom.
    • The ratio residual deviance / degrees of freedom = 188.88 / 123 = 1.536, which exceeds the rule-of-thumb threshold of 1.2, indicating that the model is overdispersed, i.e., the variance increases faster than the mean.
    • Refitting the Poisson model with a quasi-Poisson family is one way to handle the overdispersion (a sketch of the dispersion calculation follows below).
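A short sketch of the dispersion check described above; both the deviance-based ratio and the Pearson-based estimate (the one quasi-Poisson reports) can be computed directly from the fitted object:

# Deviance-based dispersion ratio: residual deviance / residual degrees of freedom
deviance(glmodel.A1) / df.residual(glmodel.A1)                            # ~1.536

# Pearson-based dispersion estimate (what the quasipoisson summary reports)
sum(residuals(glmodel.A1, type = "pearson")^2) / df.residual(glmodel.A1)  # ~1.472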

5.3 Overdispersion in glm(resp ~ wave + hill)

\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{188.88}{123} = 1.53561 \gg 1 \]
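The quasi-Poisson refit whose summary follows; a sketch with the same linear predictor but the dispersion now estimated from the data (the object name glmodel.A2 is an assumption):

# Same model, but the dispersion parameter is estimated rather than fixed at 1
glmodel.A2 <- glm(resp ~ wave + hill, family = quasipoisson(link = "log"), data = gamdatacsv)
summary(glmodel.A2)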

Call:
glm(formula = resp ~ wave + hill, family = quasipoisson(link = log),
    data = gamdatacsv)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.1532  -0.7368  -0.0819   0.6189   3.6487

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.334230   0.196143  11.901   <2e-16 ***
wave        -0.017758   0.009685  -1.834   0.0691 .
hill         0.002789   0.024193   0.115   0.9084
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.472122)

    Null deviance: 193.89  on 125  degrees of freedom
Residual deviance: 188.88  on 123  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

6.1 Overdispersion problem in glm(resp ~ wave + hill)

  • The problem is not so serious once a quasi-Poisson family is used: the deviance-based ratio (1.536) is only slightly larger than the estimated dispersion parameter (1.472).

\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{188.88}{123} = 1.53561 > 1.472122 \]

6.2 Calculate an ANOVA Table

Analysis of Deviance Table (Type II tests)

Response: resp
Error estimate based on Pearson residuals

           Sum Sq  Df F value  Pr(>F)
wave        4.956   1  3.3665 0.06895 .
hill        0.020   1  0.0133 0.90840
Residuals 181.070 123
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
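The Type II table above, with its F tests and Pearson-based error estimate, looks like the output of car::Anova() on the quasi-Poisson fit; a hedged sketch, not necessarily the author's exact call:

library(car)  # assumed: Anova() with Type II F tests matches the table shown

Anova(glmodel.A2, test.statistic = "F")

The two GAM summaries that follow refit resp with smooth terms in place of the linear ones, first with a Gaussian and then with a Poisson family; sketches of those calls are given after the summaries.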
Family: gaussian
Link function: identity

Formula:
resp ~ s(wave) + s(hill)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.4524     0.2416   39.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df      F  p-value
s(wave) 6.056  7.208  5.278 2.42e-05 ***
s(hill) 3.377  4.184 20.625 5.21e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.476   Deviance explained = 51.6%
GCV = 8.0183  Scale est. = 7.3543    n = 126
Family: poisson
Link function: log

Formula:
resp ~ s(wave) + s(hill)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.20920    0.03001   73.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df Chi.sq  p-value
s(wave) 5.805  6.949  29.57 0.000125 ***
s(hill) 3.593  4.452  66.89 4.52e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.465   Deviance explained = 51.4%
UBRE = -0.087376  Scale est. = 1         n = 126
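A hedged sketch of the two fits summarised above (the object names gam.gauss and gam.pois are assumptions):

# Gaussian GAM: identity link; GCV is used because the scale parameter is estimated
gam.gauss <- gam(resp ~ s(wave) + s(hill), data = gamdatacsv)

# Poisson GAM: log link; UBRE is used because the scale parameter is known (1)
gam.pois <- gam(resp ~ s(wave) + s(hill), family = poisson(link = "log"), data = gamdatacsv)

summary(gam.gauss)
summary(gam.pois)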

7.1 Anova Test

Analysis of Deviance Table

Model 1: resp ~ s(wave) + s(hill)
Model 2: resp ~ wave + hill
  Resid. Df Resid. Dev    Df Deviance  Pr(>Chi)
1     113.6     94.195
2     126.0    188.880 -12.4  -94.685 9.356e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of Deviance Table

Model 1: resp ~ s(wave) + s(hill)
Model 2: resp ~ s(wave) + s(hill)
  Resid. Df Resid. Dev         Df Deviance  Pr(>Chi)
1    113.60      94.19
2    113.61     849.92 -0.0080872  -755.72 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
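A hedged sketch of comparisons like those above: the first contrasts the Poisson GAM with the Poisson GLM, the second the Poisson GAM with the Gaussian GAM (the object names are the assumptions introduced earlier):

# Smooth terms vs. purely linear terms (both Poisson)
anova(gam.pois, glmodel.A1, test = "Chisq")

# Same additive formula, different families; deviances are not strictly comparable across families
anova(gam.pois, gam.gauss, test = "Chisq")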

FAQs

What is the GAM function in RStudio?

gam() is used to fit generalized additive models, specified by giving a symbolic description of the additive predictor and a description of the error distribution. There are two common implementations in R: gam::gam() uses the backfitting algorithm to combine different smoothing or fitting methods, while mgcv::gam() (the style of output shown in this tutorial) fits penalized regression splines with automatic smoothness selection.

What is a generalized additive model (GAM)?

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.
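In symbols (my notation, not the article's), a GAM with two predictors can be written as

\[ g\big(\mathbb{E}[y]\big) = \beta_0 + f_1(x_1) + f_2(x_2), \]

where g is a link function and f1, f2 are smooth functions estimated from the data.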

What is the difference between a GAM and a GLM?

Generalized Additive Models (GAMs) are an extension of Generalized Linear Models (GLMs) in such a way that predictor variables can be modeled non-parametrically in addition to linear and polynomial terms for other predictors.

What is the GCV score in R?

The GCV score reported by mgcv is the minimised generalised cross-validation (GCV) score of the fitted GAM. GCV is used for smoothness selection in the mgcv package for R: smoothing parameters are chosen to minimise an estimate of prediction error, and when the scale parameter φ is unknown, standard cross-validation or GCV is used to estimate that prediction error.
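For a fitted mgcv model, the minimised GCV (or UBRE) criterion is stored on the model object; a small sketch using the gamyear fit from earlier:

# Minimised GCV/UBRE criterion of the fit (GCV here, since the Gaussian scale was estimated)
gamyear$gcv.ubre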

How to predict using a GAM in R?

As with most models in R, you can make predictions from a GAM object with the predict() function. Running predict() on a fitted model without further arguments yields a vector of predictions for each data point in the data set used to fit the model; supplying newdata gives predictions at new covariate values.
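A sketch of prediction from the gamyear model fitted earlier (the new years are made up for illustration):

# Predictions with standard errors at a few chosen years
newyears <- data.frame(Year = c(1990, 2000, 2010))
predict(gamyear, newdata = newyears, type = "response", se.fit = TRUE)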

When to use a GAM model?

Unlike linear models, GAMs can capture non-linear patterns by combining multiple smooth functions of predictor variables. GAMs are particularly valuable when investigating intricate dependencies, making them a crucial tool for data analysis and predictive modeling.

What is the formula for the GAM function?

The by argument can be used to specify a variable by which the smooth should be multiplied. For example, gam(y ~ s(x, by = z)) specifies a model E(y) = f(x)·z, where f is a smooth function.

Is linear regression a GAM?

Unlike linear regression, where each predictor term in the additive model is assumed to vary linearly with the predictand (unless specified otherwise by the developer), GAM is a nonparametric tool that uses the data to automatically estimate the appropriate functional (curvature) relationship for each predictor ...

What are the assumptions of a GAM?


GLMs and GAMs make assumptions about the distribution of the response variable, the independence of observations, and an appropriate link function choice. However, GAMs additionally assume smoothness of the functions used to model the predictor-response relationships.