19  GAMs: Generalized Additive Models

So far, we have learned ways to model continuous, logical, and count response variables as functions of quantitative and categorical predictors. We started with linear models - where both the response and predictor variables are quantitative and the relationship between them is linear. What about nonlinear relationships?

So far, we have considered…

19.1 Non-linear, non-monotonic relationships

It’s not true that all interesting predictor-response relationships are linear or monotonic. One example is in some data on bat migration: how does the probability of bats leaving on their migratory journey depend on air humidity?

bats <- read.csv('https://ndownloader.figshare.com/files/9348010')

Another dataset (our example for the day) – ozone levels as a function of temperature, day, and humidity:

ozone <- read_csv('https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv') |>
  mutate(Date = lubridate::mdy(paste(Month, Day_of_month, '2018')),
         Day = lubridate::yday(Date))
gf_point(ozone_reading ~ Temperature_Sandburg, 
         data=ozone, alpha=0.4, ylab = 'Ozone')
gf_point(ozone_reading ~ Day, 
         data=ozone, alpha=0.4, ylab = 'Ozone')
gf_point(ozone_reading ~ Humidity, 
         data=ozone, alpha=0.4, ylab = 'Ozone')

19.2 Smooth terms

We can fit a model where the relationship between the response and the predictor is a ``smooth” – no linearity or monotonicity requirement.

19.2.1 Basis functions

  • A smooth term is constructed as the sum of several parts, or basis functions. Each of these functions has a relatively simple shape, but scaled and added together, they can produce nearly any ``wiggly” shape.
  • Increasing the dimension of the basis (more functions added together) can allow more wiggliness.
  • Goal: allow enough wiggliness to fit the data well, without overfitting (smooth goes through every point in the data, or follows ``trends” that are spurious)

We will fit smooth models to data using the function gam() from the package mgcv. It includes many options for basis functions (types of smooths) - see ?mgcv::gam or [https://rsconnect.calvin.edu:3939/content/28/] for details.

19.3 Fitting GAMs

An excellent resource: https://converged.yt/mgcv-workshop/.

19.3.1 Choosing model formulation

Which terms should be modelled as smooth terms? Explore the data!

  • Pros:
  • Cons:

19.3.2 Model formula

Let’s fit a simple GAM for the ozone data as a function of radiation, temperature and wind. Note the s() function for specifying a smooth, which takes as input:

  • a variable name (or more than one, for advanced users)
  • k
  • bs

How do we choose? For some exploration, see: [https://rsconnect.calvin.edu:3939/content/28/].

We can also fit the model and smooths by different methods and with options:

  • method = 'GCV.Cp'
  • method = 'REML'
  • method = 'ML'
  • select = TRUE (or FALSE)
library(mgcv)
oz.gam <- gam(ozone_reading ~ s(Day, k = 7, bs = 'cc') +
                s(Wind_speed, k = 5, bs = 'tp') +
                s(Temperature_Sandburg, 
                  k = 5, bs = 'tp'),
              data = ozone,
              method = 'ML', 
              select = TRUE)
summary(oz.gam)

Family: gaussian 
Link function: identity 

Formula:
ozone_reading ~ s(Day, k = 7, bs = "cc") + s(Wind_speed, k = 5, 
    bs = "tp") + s(Temperature_Sandburg, k = 5, bs = "tp")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.3740     0.3163   35.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                          edf Ref.df     F p-value    
s(Day)                  3.641      5  4.93 2.1e-05 ***
s(Wind_speed)           1.180      4  0.77  0.0696 .  
s(Temperature_Sandburg) 3.283      4 45.87 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.697   Deviance explained =   71%
-ML =  604.6  Scale est. = 20.309    n = 203

19.4 Model Assessment

In addition to what you already know (…which all still holds, except linearity expectation!) mgcv has some nice model checking functions built in.

par(mar=c(4,4,2,2))
gam.check(oz.gam)

Method: ML   Optimizer: outer newton
full convergence after 10 iterations.
Gradient range [-0.000846257,0.001371855]
(score 604.6047 & scale 20.3091).
Hessian positive definite, eigenvalue range [0.0003470333,101.5511].
Model rank =  14 / 14 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                          k'  edf k-index p-value  
s(Day)                  5.00 3.64    0.86    0.02 *
s(Wind_speed)           4.00 1.18    1.14    0.95  
s(Temperature_Sandburg) 4.00 3.28    1.10    0.93  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

19.4.1 Concurvity

Like collinearity and multicollinearity, but for smooths…values of 0 indicate no problem, and 1 a huge problem (total lack of identifiability – same information in multiple predictors).

Overall, does the model have problems with concurvity?

concurvity(oz.gam, full=TRUE)
                 para    s(Day) s(Wind_speed) s(Temperature_Sandburg)
worst    2.191674e-20 0.6717847     0.2733692               0.6505342
observed 2.191674e-20 0.1609172     0.1601073               0.6453322
estimate 2.191674e-20 0.2421204     0.2119838               0.5270197

Or alternatively, which specific pairs of terms cause problems?

concurvity(oz.gam, full=FALSE)
$worst
                                para       s(Day) s(Wind_speed)
para                    1.000000e+00 6.612020e-32  1.595828e-29
s(Day)                  1.243525e-31 1.000000e+00  2.476773e-01
s(Wind_speed)           2.467911e-29 2.476773e-01  1.000000e+00
s(Temperature_Sandburg) 2.081406e-20 6.398944e-01  2.054144e-01
                        s(Temperature_Sandburg)
para                               2.081412e-20
s(Day)                             6.398944e-01
s(Wind_speed)                      2.054144e-01
s(Temperature_Sandburg)            1.000000e+00

$observed
                                para       s(Day) s(Wind_speed)
para                    1.000000e+00 1.710467e-32  1.312129e-30
s(Day)                  1.243525e-31 1.000000e+00  8.351370e-02
s(Wind_speed)           2.467911e-29 9.567723e-02  1.000000e+00
s(Temperature_Sandburg) 2.081406e-20 6.230468e-02  7.925646e-02
                        s(Temperature_Sandburg)
para                               7.645727e-23
s(Day)                             6.353482e-01
s(Wind_speed)                      8.030772e-02
s(Temperature_Sandburg)            1.000000e+00

$estimate
                                para       s(Day) s(Wind_speed)
para                    1.000000e+00 1.411658e-32  9.541351e-31
s(Day)                  1.243525e-31 1.000000e+00  1.685925e-01
s(Wind_speed)           2.467911e-29 7.801012e-02  1.000000e+00
s(Temperature_Sandburg) 2.081406e-20 1.970558e-01  8.331840e-02
                        s(Temperature_Sandburg)
para                               1.703578e-22
s(Day)                             4.902313e-01
s(Wind_speed)                      9.916508e-02
s(Temperature_Sandburg)            1.000000e+00

19.5 Model Selection

19.5.1 Shrinkage and Penalties

With GAMs, in a sense, some model selection is (or can be) done during model fitting - what smooth is best? Or is the relationship a line? A flat line? Using shrinkage basis or including select=TRUE allows for this.

19.5.2 P-value selection

Cautions: p-values are approximate! Successfulness of the procedure best when fitting method is: ML (1st choice), REML (2nd choice).

anova(oz.gam)

Interpretation as usual. Note that anova() (not Anova()) works here - especially for GAMs, it does not do sequential tests; and Anova() doesn’t handle GAMs.

19.5.3 Information criteria

  • Conditional/Approximate - bias
  • Fitting method:
    • REML-based IC scores can be used to compare models with different random effects but not different predictors. (IF select = TRUE and using a shrinkage basis.)
    • ML-based IC scores can be used to compare models with different fixed effects (regular predictors) and different family, but not different random effects
library(MuMIn)
oz.ml <- update(oz.gam, method='ML', na.action='na.fail')
head(dredge(oz.ml, rank='AIC'),2)
Global model call: gam(formula = ozone_reading ~ s(Day, k = 7, bs = "cc") + s(Wind_speed, 
    k = 5, bs = "tp") + s(Temperature_Sandburg, k = 5, bs = "tp"), 
    data = ozone, na.action = "na.fail", method = "ML", select = TRUE)
---
Model selection table 
  (Int) s(Day,7,"cc") s(Tmp_Snd,5,"tp") s(Wnd_spd,5,"tp") df   logLik    AIC
8 11.37             +                 +                 + 11 -589.011 1201.6
4 11.37             +                 +                   10 -591.448 1203.0
  delta weight
8  0.00  0.666
4  1.38  0.334
Models ranked by AIC(x)