Chapter 3 Model Selection Using Information Criteria

So far, we have learned to fit models with multiple predictors, both quantitative and categorical, and to assess whether required conditions are met for linear regression to be an appropriate model for a dataset.

One missing piece is: if I have an appropriate model with a set of multiple predictors, how can I choose which predictors are worth retaining in a “best” model for the data (and which have no relationship, or only a weak one, with the response, and so should be discarded)?

3.1 Data and Model

Today we will recreate part of the analysis from “Vertebrate community composition and diversity declines along a defaunation gradient radiating from rural villages in Gabon,” by Sally Koerner and colleagues. They investigated the relationship between rural villages, hunting, and wildlife in Gabon, asking how ape abundance depends on distance from villages, village size, and vegetation characteristics. They shared their data at Dryad.org, and we can read it in and fit a regression model like this:
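A minimal sketch of the code behind the output below. The file name `defaunation_data.csv` and the model object name `ape_mod` are assumptions for illustration; substitute the actual names from your Dryad download.

```r
# Read in the defaunation data (file name here is hypothetical)
defaun <- read.csv('defaunation_data.csv')

# Full model: ape relative abundance as a function of vegetation,
# land use, distance from village, and village size
ape_mod <- lm(RA_Apes ~ Veg_DBH + Veg_Canopy + Veg_Understory +
                Veg_Rich + Veg_Stems + Veg_liana + LandUse +
                Distance + NumHouseholds,
              data = defaun)

summary(ape_mod)
logLik(ape_mod)  # log-likelihood; printed (rounded) as the last line below
```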

## 
## Call:
## lm(formula = RA_Apes ~ Veg_DBH + Veg_Canopy + Veg_Understory + 
##     Veg_Rich + Veg_Stems + Veg_liana + LandUse + Distance + NumHouseholds, 
##     data = defaun)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.986 -0.942 -0.036  0.824  6.383 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     5.75252   13.37221    0.43    0.674  
## Veg_DBH        -0.09317    0.07311   -1.27    0.225  
## Veg_Canopy      0.67009    2.06254    0.32    0.750  
## Veg_Understory -1.69124    2.07130   -0.82    0.429  
## Veg_Rich        0.36196    0.48036    0.75    0.465  
## Veg_Stems      -0.09721    0.16907   -0.57    0.575  
## Veg_liana      -0.15850    0.25303   -0.63    0.542  
## LandUseNeither  1.69675    2.05894    0.82    0.425  
## LandUsePark    -2.94719    2.22271   -1.33    0.208  
## Distance        0.30263    0.11987    2.52    0.025 *
## NumHouseholds  -0.00211    0.04346   -0.05    0.962  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.73 on 13 degrees of freedom
## Multiple R-squared:  0.544,  Adjusted R-squared:  0.193 
## F-statistic: 1.55 on 10 and 13 DF,  p-value: 0.226
## [1] -50.8

3.2 Calculations

  • Information criteria allow us to balance the conflicting goals of having a model that fits the data as well as possible (which pushes us toward models with more predictors) and parsimony (choosing the simplest model, with the fewest predictors, that works for the data and research question). The basic idea is that we minimize the quantity \(-(2LogLikelihood - penalty) = -2LogLikelihood + penalty\)

  • AIC is computed as \(-2LogLikelihood + 2k\), where \(k\) is the number of coefficients being estimated (don’t forget \(\sigma\)!). Smaller AIC is better.

  • BIC is computed as \(-2LogLikelihood + \ln(n)k\), where \(n\) is the number of observations (rows) in the dataset and \(k\) is the number of coefficients being estimated. Smaller BIC is better.

  • Verify that the BIC for this model is 139.65.
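As a worked check: the full model estimates 11 regression coefficients (including the intercept) plus \(\sigma\), so \(k = 12\), and \(n = 24\) rows (13 residual df + 11 coefficients). Using the rounded log-likelihood printed above:

```r
ll <- -50.8   # log-likelihood of the full model (rounded, from output above)
k <- 12       # 11 regression coefficients + sigma
n <- 24       # rows in the dataset

(aic <- -2 * ll + 2 * k)        # 125.6
(bic <- -2 * ll + log(n) * k)   # about 139.74
```

The small discrepancy from 139.65 comes from rounding the log-likelihood to one decimal place; the unrounded value gives 139.65.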

3.3 Decisions with ICs

The following rules of thumb (not laws, just common rules of thumb) may help you make decisions with ICs:

  • A model with lower IC by at least 3 units is notably better.
  • If two or more models have ICs within 3 IC units of each other, there is not a lot of difference between them. Here, we usually choose the model with the fewest predictors.
  • In some cases, if the research question is to measure the influence of some particular predictor on the response, but the IC does not strongly support including that predictor in the best model (IC difference less than 3), you might want to keep it in anyway and then discuss the situation honestly, for example, “AIC does not provide strong support for including predictor x in the best model, but the model including predictor x indicates that as x increases the response decreases slightly. More research would be needed…”

3.4 All-possible-subsets Selection

The model we just fitted is our full model, with all predictors of potential interest included. How can we use information criteria to choose the best model from possible models with subsets of the predictors?

We can use the dredge() function from the MuMIn package to get and display ICs for all these models.

Before using dredge(), we need to make sure our dataset has no missing values, and we also need to set the na.action option for our model (this can be done in the call to lm(..., na.action = 'na.fail')).
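Putting those pieces together, a sketch of the workflow (the model object name `ape_mod` is an assumption; note that dredge() ranks by AICc unless we ask for BIC):

```r
library(MuMIn)

# dredge() cannot handle missing values, so drop incomplete rows first
defaun <- na.omit(defaun)

# Refit the full model with na.action set, as dredge() requires
ape_mod <- lm(RA_Apes ~ Veg_DBH + Veg_Canopy + Veg_Understory +
                Veg_Rich + Veg_Stems + Veg_liana + LandUse +
                Distance + NumHouseholds,
              data = defaun,
              na.action = 'na.fail')

# Fit all possible subsets of the full model, ranked by BIC
dredge(ape_mod, rank = 'BIC')
```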

Model  (Intercept)  Distance  LandUse  NumHouseholds  Veg_Canopy  Veg_DBH   Veg_liana  Veg_Rich  Veg_Stems  Veg_Understory  df  logLik  BIC    delta   weight
258    8.753        0.195     NA       NA             NA          NA        NA         NA        NA         -2.988          4   -53.9   120.5  0       0.3284
2      -0.6912      0.2303    NA       NA             NA          NA        NA         NA        NA         NA              3   -55.8   121.1  0.6241  0.2404
274    11.44        0.1848    NA       NA             NA          -0.04551  NA         NA        NA         -3.144          5   -53.38  122.7  2.146   0.1123
322    11.9         0.2033    NA       NA             NA          NA        NA         -0.1939   NA         -3.11           5   -53.55  123    2.491   0.09449
290    9.805        0.1884    NA       NA             NA          NA        -0.09802   NA        NA         -2.952          5   -53.67  123.2  2.727   0.08399
386    9.49         0.1976    NA       NA             NA          NA        NA         NA        -0.03113   -2.904          5   -53.82  123.5  3.03    0.0722
266    7.783        0.1896    NA       NA             0.2771      NA        NA         NA        NA         -2.964          5   -53.88  123.7  3.144   0.0682
  • What is the best model according to BIC, for this dataset?

3.5 Which IC should I use?

AIC and BIC may give different best models, especially if the dataset is large. You may want to just choose one to use a priori (before making calculations). You might prefer BIC if you want to err on the “conservative” side, as it is more likely to select a “smaller” model with fewer predictors. This is because its penalty, \(\ln(n)k\), exceeds AIC’s \(2k\) whenever \(n \geq 8\), and the gap grows with sample size.

3.6 Quantities derived from AIC

  • \(\Delta AIC\) is the AIC for a given model, minus the AIC of the best model in the set. (Same for \(\Delta BIC\).)
  • Akaike weights are values (ranging from 0 to 1) that measure the weight of evidence suggesting that a model is the best one (given that there is one best model in the set).
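For model \(i\), the weight is \(w_i = e^{-\Delta_i/2} / \sum_j e^{-\Delta_j/2}\). As a check, we can recompute the weights from the delta column of the dredge() table above (here using the seven models shown; they reproduce the printed weight column, so the models not shown carry essentially no weight):

```r
# Delta (BIC) values for the seven models shown in the dredge() table
delta <- c(0, 0.6241, 2.146, 2.491, 2.727, 3.03, 3.144)

rel_lik <- exp(-delta / 2)          # relative likelihood of each model
weights <- rel_lik / sum(rel_lik)   # normalize so the weights sum to 1
round(weights, 4)
# [1] 0.3284 0.2404 0.1123 0.0945 0.0840 0.0722 0.0682
```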

3.7 Important Caution

Very important: ICs can ONLY be compared between models that have the same response variable and are fitted to the exact same rows of data.