Chapter 18 GAMs: Generalized Additive Models

So far, we have learned ways to model continuous, logical, and count response variables as functions of quantitative and categorical predictors. We started with linear models - where both the response and predictor variables are quantitative and the relationship between them is linear. What about nonlinear relationships?

So far, we have considered…

  • Categorical predictor variables. Making use of indicator variables for (all but one of the) categories, we can model a situation where each value of the predictor variable has a different effect on the response. But…
    • How many categories?
    • What about periodicity?
  • GLMs. In logistic, Poisson, etc. regression, the action of the link function results in a relationship between the predictors and the response variable that is linear on the scale of the link function ( = scale of the RHS of the equation), but non-linear on the scale of the response variables (LHS). But…
    • Nonlinear, but monotonic

18.1 Non-linear, non-monotonic relationships

It’s not true that all interesting predictor-response relationships are linear or monotonic. One example is in some data on bat migration: how does the probability of bats leaving on their migratory journey depend on air humidity?

Another dataset (our example for the day) – ozone levels as a function of solar radiation (rad), temperature (temp) and wind speed (wind):

18.2 Smooth terms

We can fit a model where the relationship between the response and the predictor is a ``smooth" – no linearity or monotonicity requirement.

18.2.1 Basis functions

  • A smooth term is constructed as the sum of several parts, or basis functions. Each of these functions has a relatively simple shape, but scaled and added together, they can produce nearly any ``wiggly" shape.
  • Increasing the dimension of the basis (more functions added together) can allow more wiggliness.
  • Goal: allow enough wiggliness to fit the data well, without overfitting (smooth goes through every point in the data, or follows ``trends" that are spurious)

We will fit smooth models to data using the function gam() from the package mgcv. It includes many options for basis functions (types of smooths) - see ?mgcv::gam or [https://rsconnect.calvin.edu:3939/content/28/] for details.

18.3 Fitting GAMs

An excellent resource: https://converged.yt/mgcv-workshop/.

18.3.1 Choosing model formulation

Which terms should be modelled as smooth terms? Explore the data!

  • Pros:
  • Cons:

18.3.2 Model formula

Let’s fit a simple GAM for the ozone data as a function of radiation, temperature and wind. Note the s() function for specifying a smooth, which takes as input:

  • a variable name (or more than one, for advanced users)
  • k
  • bs

How do we choose? For some exploration, see: [https://rsconnect.calvin.edu:3939/content/28/].

We can also fit the model and smooths by different methods and with options:

  • method = 'GCV.Cp'
  • method = 'REML'
  • method = 'ML'
  • select = TRUE (or FALSE)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ozone ~ s(rad, k = 5, bs = "tp") + s(wind, k = 5, bs = "tp") + 
##     s(temp, k = 5, bs = "tp")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    42.10       1.65    25.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(rad)  1.66      4  3.26 0.00047 ***
## s(wind) 2.91      4 13.78 3.0e-11 ***
## s(temp) 2.37      4 14.46 8.5e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.726   Deviance explained = 74.3%
## -REML = 482.89  Scale est. = 303.89    n = 111

18.4 Model Assessment

In addition to what you already know (…which all still holds, except linearity expectation!) mgcv has some nice model checking functions built in.

## 
## Method: REML   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-0.000192,1.33e-05]
## (score 483 & scale 304).
## Hessian positive definite, eigenvalue range [4.35e-05,55.1].
## Model rank =  13 / 13 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k'  edf k-index p-value   
## s(rad)  4.00 1.66    0.95    0.27   
## s(wind) 4.00 2.91    0.96    0.22   
## s(temp) 4.00 2.37    0.78    0.01 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

18.4.1 Concurvity

Like collinearity and multicollinearity, but for smooths…values of 0 indicate no problem, and 1 a huge problem (total lack of identifiability – same information in multiple predictors).

Overall, does the model have problems with concurvity?

##              para s(rad) s(wind) s(temp)
## worst    8.11e-20  0.280   0.344   0.426
## observed 8.11e-20  0.202   0.318   0.357
## estimate 8.11e-20  0.172   0.238   0.343

Or alternatively, which specific pairs of terms cause problems?

## $worst
##             para   s(rad)  s(wind)  s(temp)
## para    1.00e+00 4.14e-28 7.60e-20 1.72e-28
## s(rad)  3.92e-28 1.00e+00 1.24e-01 2.69e-01
## s(wind) 7.60e-20 1.24e-01 1.00e+00 3.24e-01
## s(temp) 1.78e-28 2.69e-01 3.24e-01 1.00e+00
## 
## $observed
##             para   s(rad)  s(wind)  s(temp)
## para    1.00e+00 1.15e-30 5.47e-23 2.84e-30
## s(rad)  3.92e-28 1.00e+00 1.03e-01 1.60e-01
## s(wind) 7.60e-20 5.43e-02 1.00e+00 2.97e-01
## s(temp) 1.78e-28 1.95e-01 3.00e-01 1.00e+00
## 
## $estimate
##             para   s(rad)  s(wind)  s(temp)
## para    1.00e+00 2.39e-30 8.37e-22 7.83e-31
## s(rad)  3.92e-28 1.00e+00 7.78e-02 2.00e-01
## s(wind) 7.60e-20 3.93e-02 1.00e+00 2.45e-01
## s(temp) 1.78e-28 1.62e-01 2.20e-01 1.00e+00

18.5 Model Selection

18.5.1 Shrinkage and Penalties

With GAMs, in a sense, some model selection is (or can be) done during model fitting - what smooth is best? Or is the relationship a line? A flat line? Using shrinkage basis or including select=TRUE allows for this.

18.5.2 P-value selection

Cautions: p-values are approximate! Successfulness of the procedure best when fitting method is: ML (1st choice), REML (2nd choice).

Interpretation as usual. Note that anova() (not Anova()) works here - especially for GAMs, it does not do sequential tests; and Anova() doesn’t handle GAMs.

18.5.3 Information criteria

  • Conditional/Approximate - bias
  • Fitting method:
    • REML-based IC scores can be used to compare models with different random effects but not different predictors. (IF select = TRUE and using a shrinkage basis.)
    • ML-based IC scores can be used to compare models with different fixed effects (regular predictors) and different family, but not different random effects
## Global model call: gam(formula = ozone ~ s(rad, k = 5, bs = "tp") + s(wind, k = 5, 
##     bs = "tp") + s(temp, k = 5, bs = "tp"), data = oz, na.action = "na.fail", 
##     method = "ML", select = TRUE)
## ---
## Model selection table 
##   (Int) s(rad,5,"tp") s(tmp,5,"tp") s(wnd,5,"tp") df logLik AIC delta
## 8  42.1             +             +             + 10   -471 962  0.00
## 7  42.1                           +             +  8   -477 971  9.32
##   weight
## 8  0.991
## 7  0.009
## Models ranked by AIC(x)