Chapter 3 Model Selection Using Information Criteria

So far, we have learned to fit models with multiple predictors, both quantitative and categorical, and to assess whether required conditions are met for linear regression to be an appropriate model for a dataset.

One missing piece is: if I have an appropriate model with a set of multiple predictors, how can I choose which predictors are worth retaining in a “best” model for the data (and which have no relationship, or only a weak one, with the response, and so should be discarded)?

3.1 Data and Model

Today we will recreate part of the analysis from “Vertebrate community composition and diversity declines along a defaunation gradient radiating from rural villages in Gabon,” by Sally Koerner and colleagues. They investigated the relationship between rural villages, hunting, and wildlife in Gabon, asking how ape abundance depends on distance from villages, village size, and vegetation characteristics. They shared their data at Dryad.org, and we can read it in and fit a regression model like this:
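A minimal sketch of the code behind the output below. The file name `defaunation_data.csv` and the model object name `ape_mod` are assumptions for illustration; substitute the actual names from your Dryad download.

```r
# Read in the defaunation data (file name here is hypothetical)
defaun <- read.csv('defaunation_data.csv')

# Full model: ape relative abundance as a function of vegetation,
# land use, distance from village, and village size
ape_mod <- lm(RA_Apes ~ Veg_DBH + Veg_Canopy + Veg_Understory +
                Veg_Rich + Veg_Stems + Veg_liana + LandUse +
                Distance + NumHouseholds,
              data = defaun)

summary(ape_mod)
logLik(ape_mod)  # log-likelihood; printed (rounded) as the last line below
```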

## 
## Call:
## lm(formula = RA_Apes ~ Veg_DBH + Veg_Canopy + Veg_Understory + 
##     Veg_Rich + Veg_Stems + Veg_liana + LandUse + Distance + NumHouseholds, 
##     data = defaun)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.986 -0.942 -0.036  0.824  6.383 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     5.75252   13.37221    0.43    0.674  
## Veg_DBH        -0.09317    0.07311   -1.27    0.225  
## Veg_Canopy      0.67009    2.06254    0.32    0.750  
## Veg_Understory -1.69124    2.07130   -0.82    0.429  
## Veg_Rich        0.36196    0.48036    0.75    0.465  
## Veg_Stems      -0.09721    0.16907   -0.57    0.575  
## Veg_liana      -0.15850    0.25303   -0.63    0.542  
## LandUseNeither  1.69675    2.05894    0.82    0.425  
## LandUsePark    -2.94719    2.22271   -1.33    0.208  
## Distance        0.30263    0.11987    2.52    0.025 *
## NumHouseholds  -0.00211    0.04346   -0.05    0.962  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.73 on 13 degrees of freedom
## Multiple R-squared:  0.544,  Adjusted R-squared:  0.193 
## F-statistic: 1.55 on 10 and 13 DF,  p-value: 0.226
## [1] -50.8

3.2 Calculations

  • Information criteria allow us to balance the conflicting goals of having a model that fits the data as well as possible (which pushes us toward models with more predictors) and parsimony (choosing the simplest model, with the fewest predictors, that works for the data and research question). The basic idea is that we minimize the quantity \(-(2LogLikelihood - penalty) = -2LogLikelihood + penalty\)

  • AIC is computed as \(-2LogLikelihood + 2k\), where \(k\) is the number of coefficients being estimated (don’t forget \(\sigma\)!). Smaller AIC is better.

  • BIC is computed as \(-2LogLikelihood + \ln(n)k\), where \(n\) is the number of observations (rows) in the dataset and \(k\) is the number of coefficients being estimated. Smaller BIC is better.

  • Verify that the BIC for this model is 139.65.
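As a worked check: the full model estimates 11 regression coefficients (including the intercept) plus \(\sigma\), so \(k = 12\), and \(n = 24\) rows (13 residual df + 11 coefficients). Using the rounded log-likelihood printed above:

```r
ll <- -50.8   # log-likelihood of the full model (rounded, from output above)
k <- 12       # 11 regression coefficients + sigma
n <- 24       # rows in the dataset

(aic <- -2 * ll + 2 * k)        # 125.6
(bic <- -2 * ll + log(n) * k)   # about 139.74
```

The small discrepancy from 139.65 comes from rounding the log-likelihood to one decimal place; the unrounded value gives 139.65.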

3.3 Decisions with ICs

The following rules of thumb (not laws, just common rules of thumb) may help you make decisions with ICs:

  • A model with lower IC by at least 3 units is notably better.
  • If two or more models have ICs within 3 IC units of each other, there is not a lot of difference between them. Here, we usually choose the model with the fewest predictors.
  • In some cases, if the research question is to measure the influence of some particular predictor on the response, but the IC does not strongly support including that predictor in the best model (IC difference less than 3), you might want to keep it in anyway and then discuss the situation honestly, for example, “AIC does not provide strong support for including predictor x in the best model, but the model including predictor x indicates that as x increases the response decreases slightly. More research would be needed…”

3.4 All-possible-subsets Selection

The model we just fitted is our full model, with all predictors of potential interest included. How can we use information criteria to choose the best model from possible models with subsets of the predictors?

We can use the dredge() function from the MuMIn package to get and display ICs for all these models.

Before using dredge(), we need to make sure our dataset has no missing values, and we also need to set the na.action option for our model (this can be done in the call to lm(..., na.action = 'na.fail')).
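Putting those pieces together, a sketch of the workflow (the model object name `ape_mod` is an assumption; note that dredge() ranks by AICc unless we ask for BIC):

```r
library(MuMIn)

# dredge() cannot handle missing values, so drop incomplete rows first
defaun <- na.omit(defaun)

# Refit the full model with na.action set, as dredge() requires
ape_mod <- lm(RA_Apes ~ Veg_DBH + Veg_Canopy + Veg_Understory +
                Veg_Rich + Veg_Stems + Veg_liana + LandUse +
                Distance + NumHouseholds,
              data = defaun,
              na.action = 'na.fail')

# Fit all possible subsets of the full model, ranked by BIC
dredge(ape_mod, rank = 'BIC')
```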

Model  (Intercept)  Distance  LandUse  NumHouseholds  Veg_Canopy  Veg_DBH   Veg_liana  Veg_Rich  Veg_Stems  Veg_Understory  df  logLik  BIC    delta   weight
258    8.753        0.195     NA       NA             NA          NA        NA         NA        NA         -2.988          4   -53.9   120.5  0       0.3284
2      -0.6912      0.2303    NA       NA             NA          NA        NA         NA        NA         NA              3   -55.8   121.1  0.6241  0.2404
274    11.44        0.1848    NA       NA             NA          -0.04551  NA         NA        NA         -3.144          5   -53.38  122.7  2.146   0.1123
322    11.9         0.2033    NA       NA             NA          NA        NA         -0.1939   NA         -3.11           5   -53.55  123    2.491   0.09449
290    9.805        0.1884    NA       NA             NA          NA        -0.09802   NA        NA         -2.952          5   -53.67  123.2  2.727   0.08399
386    9.49         0.1976    NA       NA             NA          NA        NA         NA        -0.03113   -2.904          5   -53.82  123.5  3.03    0.0722
266    7.783        0.1896    NA       NA             0.2771      NA        NA         NA        NA         -2.964          5   -53.88  123.7  3.144   0.0682
  • What is the best model according to BIC, for this dataset?

3.5 Which IC should I use?

AIC and BIC may give different best models, especially if the dataset is large. You may want to just choose one to use a priori (before making calculations). You might prefer BIC if you want to err on the “conservative” side, as it is more likely to select a “smaller” model with fewer predictors. This is because its penalty, \(\ln(n)k\), exceeds AIC’s \(2k\) whenever \(n \geq 8\), and the gap grows with sample size.

3.6 Quantities derived from AIC

  • \(\Delta AIC\) is the AIC for a given model, minus the AIC of the best model in the set. (Same for \(\Delta BIC\).)
  • Akaike weights are values (ranging from 0 to 1) that measure the weight of evidence suggesting that a model is the best one (given that there is one best model in the set).
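For model \(i\), the weight is \(w_i = e^{-\Delta_i/2} / \sum_j e^{-\Delta_j/2}\). As a check, we can recompute the weights from the delta column of the dredge() table above (here using the seven models shown; they reproduce the printed weight column, so the models not shown carry essentially no weight):

```r
# Delta (BIC) values for the seven models shown in the dredge() table
delta <- c(0, 0.6241, 2.146, 2.491, 2.727, 3.03, 3.144)

rel_lik <- exp(-delta / 2)          # relative likelihood of each model
weights <- rel_lik / sum(rel_lik)   # normalize so the weights sum to 1
round(weights, 4)
# [1] 0.3284 0.2404 0.1123 0.0945 0.0840 0.0722 0.0682
```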

3.7 Important Caution

Very important: ICs can ONLY be compared between models that have the same response variable and are fitted to the exact same rows of data.