Chapter 7 Model Averaging

So far, we have used AIC and/or BIC for model selection, to decide which variables to keep in a “best” model and which to exclude. But we have already seen a number of cases where there is not one model that is clearly superior to all the others. In those cases, we have decided on the smallest model (with the fewest predictors), but there are other options when many competing models have similar scores.

One option is not to choose – instead, keep them all, and make predictions via a weighted average of all the models. The models with the best IC scores get more weight. This can be a good option if the main goal is accurate prediction (rather than deciding definitively which predictors are “good” ones and which are not).

How can we do it? Let’s explore an example.

7.1 Data: School Survey on Crime and Safety

The data for this example are from a survey of U.S. schools, the 2000 School Survey on Crime and Safety. There is information about the study at

http://catalog.data.gov/dataset/2000-school-survey-on-crime-and-safety,

which says the study “is a cross-sectional survey of the nation’s public schools designed to provide estimates of school crime, discipline, disorder, programs and policies. SSOCS is administered to public primary, middle, high, and combined school principals in the spring of even-numbered school years…Public schools were sampled in the spring of 2000 to participate in the study.”

The dataset you will use is available online at:

http://sldr.netlify.com/data/sscrime.csv

It contains a number of variables:

  • VisitorCheckIn: Whether visitors to the school must check in to gain entry to the school.
  • LockedGates: Whether there are locked gates at the entry to the school.
  • MetalDetectors: Whether there is a metal detector at the entrance to the school.
  • DrugSniffDog: Whether a drug-sniffing dog is randomly brought into the school to carry out inspections.
  • DrugTesting: Whether any drug testing of students occurs.
  • UniformsRequired: Whether students are required to wear uniforms.
  • DressCode: Whether a strict dress code is enforced.
  • Lockers: Whether students have lockers.
  • StudentIDBadges: Whether students are required to wear ID badges.
  • StaffIDBadges: Whether teachers and other staff are required to wear ID badges.
  • SecurityCameras: Whether there are security cameras on the premises.
  • OfficialRiotPlan: Whether the school has a written plan in place for how to deal with a riot or large-scale fight.
  • ViolenceReductionProgram: Whether the school has a Violence Reduction Program in place.
  • Security: Whether security officers are present on the premises.
  • TrainingHours: Average amount of time (in hours) that teachers and staff have devoted to training related to violence reduction.
  • AttacksWithoutWeapon: Number of attacks that have occurred at the school, not involving a weapon.
  • Thefts: Number of thefts.
  • Vandalism: Number of incidents of vandalism.
  • ViolentIncidentsTotal: Number of violent incidents of all types that have occurred at the school.
  • Enrollment: Number of students enrolled in the school (categorical).
  • NEnrollment: Number of students enrolled in the school (numeric).
  • SurveyRespondent: The identity of the person who filled out the survey.
  • Location: Whether the location of the school is Urban, Rural, etc.

7.2 Modelling number of violent incidents per school

We will fit a model for the total number of violent incidents as a function of a number of predictors. These are count data, so we will fit a negative binomial regression model:
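A sketch of one way to fit this model with glmmTMB. The dataset name ssc matches the one in the output below; the model object name nb.mod is my own choice:

```r
library(glmmTMB)

# Read in the survey data used throughout this chapter
ssc <- read.csv('http://sldr.netlify.com/data/sscrime.csv')

# Saturated negative binomial model for the count of violent incidents.
# na.action = 'na.fail' is set because dredge() (used below) requires it.
nb.mod <- glmmTMB(ViolentIncidentsTotal ~ TrainingHours + Location +
                    SecurityCameras + DressCode + UniformsRequired +
                    NEnrollment,
                  data = ssc,
                  family = nbinom2(link = "log"),
                  na.action = 'na.fail')
```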

I will use AIC and the dredge() function to compare all possible subsets of my saturated model and figure out which variables should be included in the best model. I chose AIC in this case because it is perhaps more widely used than BIC (that’s not a good reason unless you really have no better one, but there you have it) and because with the relatively small sample size here, I don’t feel a particular need to use BIC for its larger penalty term.
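A sketch of the all-subsets comparison, assuming the saturated model shown in the "Global model call" below has been saved as nb.mod (an assumed name); the selection-table name mod.sel2 matches the object used with model.avg() later in this chapter:

```r
library(MuMIn)

# Compare all 2^6 = 64 subsets of the saturated model, ranked by AIC.
# dredge() refuses to run unless the global model was fitted with
# na.action = 'na.fail', so that rows with missing data cannot silently
# differ between candidate models.
mod.sel2 <- dredge(nb.mod, rank = "AIC")

head(mod.sel2, 8)   # the 8 top-ranked models, as in the table below
```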

## Global model call: glmmTMB(formula = ViolentIncidentsTotal ~ TrainingHours + Location + 
##     SecurityCameras + DressCode + UniformsRequired + NEnrollment, 
##     data = ssc, family = nbinom2(link = "log"), ziformula = ~0, 
##     dispformula = ~1)
## ---
## Model selection table 
##    cnd((Int)) dsp((Int)) cnd(DrC) cnd(Lct) cnd(NEn) cnd(ScC) cnd(TrH)
## 4        3.61          +        +        +                           
## 36       3.58          +        +        +                           
## 12       3.63          +        +        +                 +         
## 8        3.57          +        +        + 2.54e-05                  
## 20       3.61          +        +        +                   0.000892
## 44       3.61          +        +        +                 +         
## 40       3.53          +        +        + 3.41e-05                  
## 16       3.59          +        +        + 3.14e-05        +         
##    cnd(UnR) df logLik  AIC delta weight
## 4            6  -1745 3503  0.00  0.287
## 36        +  7  -1745 3504  1.27  0.152
## 12           7  -1745 3504  1.45  0.139
## 8            7  -1745 3505  1.80  0.117
## 20           7  -1745 3505  2.00  0.106
## 44        +  8  -1745 3506  2.71  0.074
## 40        +  8  -1745 3506  2.92  0.067
## 16           8  -1745 3506  3.15  0.060
## Models ranked by AIC(x)

Because the first 7 or so models all have AIC scores within about 3 units of each other, it is hard to choose one best model here. In this situation, one way to choose is to pick the model that includes the smallest number of predictors while still achieving an AIC that is among the best. Another option would be to use model averaging.

7.3 Model Averaging

What if we wanted to use model averaging to find the best model, instead? We might choose this route because there are several models with AIC scores close to one another, which thus fit the data approximately equally well. Instead of picking one, we make predictions (and compute coefficients) that are a weighted average over all the models, with weights determined by the IC scores.
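The IC weights here are Akaike weights: for model i with AIC difference delta_i = AIC_i - AIC_min, the weight is w_i = exp(-delta_i/2) / sum_j exp(-delta_j/2). A quick check against the delta column of the selection table in Section 7.2:

```r
# AIC differences (delta) for the top 8 models from the table above
delta <- c(0.00, 1.27, 1.45, 1.80, 2.00, 2.71, 2.92, 3.15)

# Akaike weights: the relative likelihood of each model, normalized to sum to 1
w <- exp(-delta / 2) / sum(exp(-delta / 2))
round(w, 3)  # approximately reproduces the weight column: 0.287, 0.152, ...
```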

One note of caution before we start: as we will see in the summary output below, an averaged model reports two sets of coefficients (a “full average” and a “conditional average”), and it matters which set you use.

To do model averaging, we use the package MuMIn (function model.avg()).

7.3.1 Getting the Averaged Model

The following code gets the averaged model. If we used the default (fit = FALSE), it would be a bit faster, but we would then not be able to get predictions from the resulting model.
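A sketch of that code, where mod.sel2 is the dredge() selection table and the averaged-model name avg.mod is my own choice:

```r
library(MuMIn)

# get.models() refits every candidate model in the selection table
# (subset = NA keeps them all), so the averaged model can later be used
# with predict(). Calling model.avg(mod.sel2) directly would use the
# default fit = FALSE and be faster, but could not give predictions.
avg.mod <- model.avg(get.models(mod.sel2, subset = NA))

summary(avg.mod)
```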

## 
## Call:
## model.avg(object = get.models(object = mod.sel2, subset = NA))
## 
## Component model call: 
## glmmTMB(formula = ViolentIncidentsTotal ~ <64 unique rhs>, data = 
##      ssc, family = nbinom2(link = "log"), ziformula = ~0, dispformula = 
##      ~1)
## 
## Component models: 
##        df logLik AICc delta weight
## 12      6  -1745 3503  0.00   0.24
## 126     7  -1745 3504  1.35   0.12
## 124     7  -1745 3505  1.52   0.11
## 123     7  -1745 3505  1.88   0.09
## 125     7  -1745 3505  2.08   0.08
## 1246    8  -1745 3506  2.87   0.06
## 1236    8  -1745 3506  3.08   0.05
## 1234    8  -1745 3506  3.31   0.05
## 1256    8  -1745 3506  3.44   0.04
## 1245    8  -1745 3507  3.60   0.04
## 1235    8  -1745 3507  3.96   0.03
## 12346   9  -1744 3507  4.48   0.03
## 12456   9  -1745 3508  4.96   0.02
## 12356   9  -1745 3508  5.18   0.02
## 12345   9  -1745 3508  5.40   0.02
## 123456 10  -1744 3510  6.58   0.01
## 2       5  -1753 3516 13.35   0.00
## 26      6  -1752 3517 13.82   0.00
## 24      6  -1753 3517 14.44   0.00
## 246     7  -1752 3518 14.89   0.00
## 25      6  -1753 3518 15.41   0.00
## 23      6  -1753 3518 15.41   0.00
## 236     7  -1752 3519 15.84   0.00
## 256     7  -1752 3519 15.89   0.00
## 234     7  -1753 3519 16.49   0.00
## 245     7  -1753 3519 16.50   0.00
## 2346    8  -1752 3520 16.83   0.00
## 2456    8  -1752 3520 16.97   0.00
## 235     7  -1753 3520 17.49   0.00
## 2356    8  -1752 3521 17.92   0.00
## 2345    8  -1753 3522 18.57   0.00
## 23456   9  -1752 3522 18.92   0.00
## 136     5  -1757 3523 20.43   0.00
## 1346    6  -1756 3524 21.41   0.00
## 13      4  -1758 3525 21.62   0.00
## 16      4  -1759 3525 22.29   0.00
## 1356    6  -1757 3525 22.45   0.00
## 134     5  -1758 3526 22.66   0.00
## 1       3  -1760 3526 22.69   0.00
## 13456   7  -1756 3526 23.48   0.00
## 135     5  -1758 3527 23.60   0.00
## 146     5  -1758 3527 23.78   0.00
## 14      4  -1760 3527 24.17   0.00
## 156     5  -1759 3527 24.26   0.00
## 15      4  -1760 3528 24.62   0.00
## 1345    6  -1758 3528 24.71   0.00
## 1456    6  -1758 3529 25.80   0.00
## 145     5  -1760 3529 26.17   0.00
## 36      4  -1766 3540 36.77   0.00
## 346     5  -1765 3540 37.10   0.00
## 6       3  -1767 3540 37.33   0.00
## 46      4  -1767 3541 38.20   0.00
## 356     5  -1766 3542 38.71   0.00
## 3456    6  -1765 3542 39.14   0.00
## 56      4  -1767 3542 39.20   0.00
## (Null)  2  -1769 3543 39.74   0.00
## 3       3  -1768 3543 40.07   0.00
## 456     5  -1766 3543 40.18   0.00
## 34      4  -1768 3544 40.54   0.00
## 4       3  -1769 3544 40.63   0.00
## 5       3  -1769 3545 41.55   0.00
## 35      4  -1768 3545 41.95   0.00
## 345     5  -1768 3546 42.53   0.00
## 45      4  -1769 3546 42.55   0.00
## 
## Term codes: 
##        cond(DressCode)         cond(Location)      cond(NEnrollment) 
##                      1                      2                      3 
##  cond(SecurityCameras)    cond(TrainingHours) cond(UniformsRequired) 
##                      4                      5                      6 
## 
## Model-averaged coefficients:  
## (full average) 
##                             Estimate Std. Error Adjusted SE z value
## cond((Int))                 3.60e+00   1.32e-01    1.32e-01   27.25
## cond(DressCodeyes)          3.76e-01   9.61e-02    9.64e-02    3.91
## cond(LocationRural)        -5.60e-01   1.33e-01    1.33e-01    4.20
## cond(LocationTown)         -5.87e-01   1.54e-01    1.55e-01    3.80
## cond(LocationUrban Fringe) -1.02e-01   1.18e-01    1.18e-01    0.87
## cond(UniformsRequiredyes)   5.69e-02   1.37e-01    1.38e-01    0.41
## cond(SecurityCamerasyes)   -2.38e-02   6.45e-02    6.47e-02    0.37
## cond(NEnrollment)           8.83e-06   2.81e-05    2.82e-05    0.31
## cond(TrainingHours)        -2.16e-04   2.51e-02    2.52e-02    0.01
##                            Pr(>|z|)    
## cond((Int))                  <2e-16 ***
## cond(DressCodeyes)           <2e-16 ***
## cond(LocationRural)          <2e-16 ***
## cond(LocationTown)           <2e-16 ***
## cond(LocationUrban Fringe)     0.39    
## cond(UniformsRequiredyes)      0.68    
## cond(SecurityCamerasyes)       0.71    
## cond(NEnrollment)              0.75    
## cond(TrainingHours)            0.99    
##  
## (conditional average) 
##                             Estimate Std. Error Adjusted SE z value
## cond((Int))                 3.60e+00   1.32e-01    1.32e-01   27.25
## cond(DressCodeyes)          3.77e-01   9.50e-02    9.53e-02    3.96
## cond(LocationRural)        -5.60e-01   1.33e-01    1.33e-01    4.20
## cond(LocationTown)         -5.87e-01   1.54e-01    1.55e-01    3.80
## cond(LocationUrban Fringe) -1.02e-01   1.18e-01    1.18e-01    0.87
## cond(UniformsRequiredyes)   1.66e-01   1.92e-01    1.93e-01    0.86
## cond(SecurityCamerasyes)   -7.41e-02   9.60e-02    9.63e-02    0.77
## cond(NEnrollment)           3.05e-05   4.55e-05    4.56e-05    0.67
## cond(TrainingHours)        -8.26e-04   4.91e-02    4.93e-02    0.02
##                            Pr(>|z|)    
## cond((Int))                  <2e-16 ***
## cond(DressCodeyes)           <2e-16 ***
## cond(LocationRural)          <2e-16 ***
## cond(LocationTown)           <2e-16 ***
## cond(LocationUrban Fringe)     0.39    
## cond(UniformsRequiredyes)      0.39    
## cond(SecurityCamerasyes)       0.44    
## cond(NEnrollment)              0.50    
## cond(TrainingHours)            0.99    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you are trying to get model-averaged coefficients from the summary output above, be sure to use the “full average” ones, not the “conditional average” ones. The conditional average for a predictor averages only over the models that include that predictor; the full average treats the coefficient as 0 in every model that excludes it.

7.3.2 Getting Predictions from the Averaged Model
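A sketch of how predictions might be obtained (avg.mod and new.data are assumed names, and the predictor values here are made up purely for illustration):

```r
# The new data must supply values for ALL predictors in the saturated
# model, even ones that dropped out of the top-ranked models.
new.data <- data.frame(TrainingHours = 2,
                       Location = "Rural",
                       SecurityCameras = "yes",
                       DressCode = "no",
                       UniformsRequired = "no",
                       NEnrollment = 500)

# se.fit = TRUE is passed through to the component models' predict methods
avg.preds <- predict(avg.mod, newdata = new.data, se.fit = TRUE)
```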

The resulting predictions are a list with entries fit and se.fit, just like the ones we are used to. (So you could make predictions for a newdata dataset and use them for prediction plots, for example. Be careful: your “new” dataset must include values for all candidate predictors in the full model, not just the ones in the top-ranked models.)

Comparing with the predictions from our previous “best” model:

So they are pretty comparable, but a little different. The differences tend to be larger when many models with similar IC scores contribute to the averaged model; when one model carries almost all the weight, the single best model and the averaged model give almost the same results. It also makes sense that there will be a bit more uncertainty in the averaged model.