18  Other Model Selection Approaches

The way we have been doing model selection thus far is definitely not the only way. What other options are out there? Many - let’s consider a few.

This section also includes some miscellaneous R notes (making summary tables, and sources of inspiration for cool figures).

18.0.1 Rationale

Until now, we have focused on using information criteria for model selection, in order to get very familiar with one coherent framework for choosing variables across model types. But:

  • In some fields, using hypothesis tests for variable selection is preferred
  • For datasets that are large and/or models that are complex, dredge() can be a challenge (taking a very long time to run and perhaps timing out on the server)
  • Using hypothesis tests for selection is quite common, so we should know how it’s done!

18.0.2 Hypotheses

Basically, for each (fixed effect) variable in a model, we’d like to test:

\[H_0: \text{all } \beta\text{s for this variable are 0; it's not a good predictor}\] \[H_1: \text{ at least one } \beta\text{ is non-zero; it's a good predictor}\]

We want to test these hypotheses given that all the other predictors in the current full model are included. Because of this condition, and because there are multiple \(\beta\)s for categorical predictors with more than 2 categories, we cannot generally just use the p-values from the model summary() output.

Instead, we use Anova() from the package car. lm() example:

iris_mod <- lm(Petal.Length ~ Petal.Width + Species + Sepal.Length, data = iris)
summary(iris_mod)

Call:
lm(formula = Petal.Length ~ Petal.Width + Species + Sepal.Length, 
    data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.76508 -0.15779  0.01102  0.13378  0.66548 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.45957    0.22387  -6.520 1.09e-09 ***
Petal.Width        0.50641    0.11528   4.393 2.15e-05 ***
Speciesversicolor  1.73146    0.12762  13.567  < 2e-16 ***
Speciesvirginica   2.30468    0.19839  11.617  < 2e-16 ***
Sepal.Length       0.55873    0.04583  12.191  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2664 on 145 degrees of freedom
Multiple R-squared:  0.9778,    Adjusted R-squared:  0.9772 
F-statistic:  1600 on 4 and 145 DF,  p-value: < 2.2e-16
library(car)
Anova(iris_mod)
Anova Table (Type II tests)

Response: Petal.Length
              Sum Sq  Df F value    Pr(>F)    
Petal.Width   1.3691   1  19.296 2.147e-05 ***
Species      13.6137   2  95.936 < 2.2e-16 ***
Sepal.Length 10.5454   1 148.627 < 2.2e-16 ***
Residuals    10.2880 145                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that Anova() reports one p-value for each predictor (excellent!). If the p-value is small, that gives evidence against \(H_0\), and we’d conclude we should keep the predictor in the model. Many people use \(\alpha = 0.05\) as the “dividing line” between “small” and “large” p-values and thus “statistically significant” and “non-significant” test results, but remember the p-value is a probability - there’s no magical difference between 0.049 and 0.051.
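To see what a single row of the Anova() table is testing, it can help to fit the model with and without one predictor and compare the two fits directly. The sketch below does this for Species using base R only; the object names full_mod and reduced_mod are made up for illustration. Here anova() is given two models to compare, which is a different use from calling it on a single model. For a model like this one, with no interactions, the F statistic should agree with the Species row above.

```r
# Full model, as in the example above
full_mod <- lm(Petal.Length ~ Petal.Width + Species + Sepal.Length, data = iris)

# Same model with Species (both of its coefficients) removed
reduced_mod <- update(full_mod, . ~ . - Species)

# F test of H0: both Species coefficients are 0, given the other predictors
species_test <- anova(reduced_mod, full_mod)
species_test
```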

Warning: be careful with your capitalization! The R function anova() does something similar to Anova(), but NOT the same: it does sequential rather than marginal tests, so its results depend on the order in which the predictors are listed. Avoid it for this purpose.
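A quick way to see the difference is to fit the same model with the predictors listed in two different orders. In this sketch (object names are illustrative), the sequential anova() result for Petal.Width changes with the order, while the marginal test does not. It uses drop1() from base R for the marginal tests so that it runs without extra packages; for a model without interactions these should match what Anova() reports.

```r
# Same predictors, listed in two different orders
mod_a <- lm(Petal.Length ~ Petal.Width + Species + Sepal.Length, data = iris)
mod_b <- lm(Petal.Length ~ Sepal.Length + Species + Petal.Width, data = iris)

# Sequential tests: Petal.Width is tested first in mod_a but last in mod_b
seq_a <- anova(mod_a)["Petal.Width", "Sum Sq"]
seq_b <- anova(mod_b)["Petal.Width", "Sum Sq"]
c(listed_first = seq_a, listed_last = seq_b)   # these differ

# Marginal tests: each predictor is tested given all of the others
marg_a <- drop1(mod_a, test = "F")["Petal.Width", "Sum of Sq"]
marg_b <- drop1(mod_b, test = "F")["Petal.Width", "Sum of Sq"]
c(mod_a = marg_a, mod_b = marg_b)              # these agree
```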

18.1 Backward selection

How do we use p-value-based selection to arrive at a best model? There are many options and much controversy about different approaches; here I’ll suggest one. None of these methods are guaranteed to arrive at a model that is theoretically “best” in some specific way, but they do give a framework to guide decision-making and are computationally quick. The premise is that we’d like a simple algorithm to implement, and we will begin with a full model including all the predictors that we think should or could reasonably be important (not just throwing in everything possible).

18.1.1 Algorithm

  • Obtain p-values for all predictors in full model
  • Remove the predictor with the largest p-value that you judge to be “not small” or “not significant”
  • Re-compute p-values for the new, smaller model
  • Repeat until all p-values are “significant”

18.1.2 Example

Let’s consider a logistic regression to predict whether a person in substance-abuse treatment is homeless.

home_mod0 <- glm(homeless ~ sex + substance + i1 + cesd + 
                  racegrp + age,
                data = HELPrct, family = binomial(link = 'logit'))
Anova(home_mod0)
Analysis of Deviance Table (Type II tests)

Response: homeless
          LR Chisq Df Pr(>Chisq)   
sex         3.3837  1   0.065846 . 
substance   2.2326  2   0.327483   
i1         10.6261  1   0.001115 **
cesd        1.1751  1   0.278351   
racegrp     3.1811  3   0.364541   
age         0.4392  1   0.507491   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remove age, which has the largest p-value (0.51):

home_mod <- update(home_mod0, .~. - age)
Anova(home_mod)
Analysis of Deviance Table (Type II tests)

Response: homeless
          LR Chisq Df Pr(>Chisq)    
sex         3.2184  1   0.072817 .  
substance   2.6874  2   0.260877    
i1         11.0836  1   0.000871 ***
cesd        1.1300  1   0.287773    
racegrp     3.1561  3   0.368180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remove racegrp, which now has the largest p-value (0.37):

home_mod <- update(home_mod, .~. - racegrp)
Anova(home_mod)
Analysis of Deviance Table (Type II tests)

Response: homeless
          LR Chisq Df Pr(>Chisq)    
sex         3.5883  1  0.0581886 .  
substance   3.6342  2  0.1624971    
i1         11.0174  1  0.0009026 ***
cesd        1.6055  1  0.2051242    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remove cesd (a score indicating depression level)

home_mod <- update(home_mod, .~. - cesd)
Anova(home_mod)
Analysis of Deviance Table (Type II tests)

Response: homeless
          LR Chisq Df Pr(>Chisq)    
sex         2.7855  1   0.095118 .  
substance   3.6743  2   0.159272    
i1         13.2405  1   0.000274 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remove substance

home_mod <- update(home_mod, .~. - substance)
Anova(home_mod)
Analysis of Deviance Table (Type II tests)

Response: homeless
    LR Chisq Df Pr(>Chisq)    
sex   2.9647  1     0.0851 .  
i1   25.7861  1  3.814e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remove sex

home_mod <- update(home_mod, .~. - sex)
Anova(home_mod)
Analysis of Deviance Table (Type II tests)

Response: homeless
   LR Chisq Df Pr(>Chisq)    
i1   27.187  1  1.847e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

18.1.3 Can’t this be automated?

Strangely, functions that automate p-value-based backward selection are not widely available.
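That said, the algorithm is short enough to sketch by hand. The helper below, backward_p(), is hypothetical (it is not from any package); it uses drop1() from base R to get one marginal p-value per predictor, removes the predictor with the largest p-value above alpha, and repeats. The example uses lm() and the built-in mtcars data only so that it runs without extra packages.

```r
# Hypothetical helper: backward selection using marginal p-values from drop1()
backward_p <- function(model, alpha = 0.05, test = "Chisq") {
  repeat {
    # stop if only the intercept is left
    if (length(attr(terms(model), "term.labels")) == 0) break
    tests <- drop1(model, test = test)
    # the first row is "<none>"; the last column holds the p-values
    p_vals <- tests[-1, ncol(tests)]
    names(p_vals) <- rownames(tests)[-1]
    # stop once every remaining predictor has a "small" p-value
    if (max(p_vals, na.rm = TRUE) <= alpha) break
    worst <- names(p_vals)[which.max(p_vals)]
    message("Removing ", worst, " (p = ", signif(max(p_vals, na.rm = TRUE), 3), ")")
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Self-contained example with a built-in dataset
full_mod <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
small_mod <- backward_p(full_mod, alpha = 0.05, test = "F")
formula(small_mod)
```

With the packages and data used in this chapter loaded, backward_p(home_mod0) would be expected to follow the same sequence of removals as the worked example, since for a model without interactions the likelihood ratio tests from drop1(test = "Chisq") should match those from Anova(). Treat it as a starting point rather than a tested tool: a fixed alpha cutoff removes the judgment that the manual approach leaves room for.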

18.1.4 Stepwise IC-based selection

Another option may be to use backward stepwise selection (same algorithm as above), but using AIC or BIC as the criterion at each stage instead of p-values. If the IC value is better (by any amount) without a variable, it gets dropped. Variables are dropped one by one until no further IC improvement is possible.

This evaluates many fewer models than dredge(), so it should be much faster, but it may not find the best of all possible models.

For example, for our model using AIC (note: this may not work for all model types):

library(MASS)
stepAIC(home_mod0)
Start:  AIC=606.26
homeless ~ sex + substance + i1 + cesd + racegrp + age

            Df Deviance    AIC
- racegrp    3   589.44 603.44
- substance  2   588.49 604.49
- age        1   586.70 604.70
- cesd       1   587.43 605.43
<none>           586.26 606.26
- sex        1   589.64 607.64
- i1         1   596.88 614.88

Step:  AIC=603.44
homeless ~ sex + substance + i1 + cesd + age

            Df Deviance    AIC
- age        1   589.85 601.85
- substance  2   592.45 602.45
- cesd       1   591.10 603.10
<none>           589.44 603.44
- sex        1   593.20 605.20
- i1         1   600.00 612.00

Step:  AIC=601.85
homeless ~ sex + substance + i1 + cesd

            Df Deviance    AIC
- cesd       1   591.46 601.46
- substance  2   593.49 601.49
<none>           589.85 601.85
- sex        1   593.44 603.44
- i1         1   600.87 610.87

Step:  AIC=601.46
homeless ~ sex + substance + i1

            Df Deviance    AIC
- substance  2   595.13 601.13
<none>           591.46 601.46
- sex        1   594.24 602.24
- i1         1   604.70 612.70

Step:  AIC=601.13
homeless ~ sex + i1

       Df Deviance    AIC
<none>      595.13 601.13
- sex   1   598.10 602.10
- i1    1   620.92 624.92

Call:  glm(formula = homeless ~ sex + i1, family = binomial(link = "logit"), 
    data = HELPrct)

Coefficients:
(Intercept)      sexmale           i1  
    0.92969     -0.39976     -0.02657  

Degrees of Freedom: 452 Total (i.e. Null);  450 Residual
Null Deviance:      625.3 
Residual Deviance: 595.1    AIC: 601.1

Note that we might still want to remove one more variable than stepAIC() does! In the last step above, you can see that removing sex would raise the AIC by only about 1 unit (from 601.13 to 602.10). So according to our \(\Delta AIC \sim 3\) threshold, we would take sex out too.
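One way to make that check without reading through the whole trace is to ask drop1() how much the AIC would change if each remaining predictor were removed from the selected model. The sketch below uses step() from base R (which performs the same kind of search as stepAIC()) and the built-in mtcars data so that it is self-contained; the model and object names are illustrative and are not the homelessness example.

```r
full_mod <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward stepwise selection by AIC, without printing the trace
step_mod <- step(full_mod, direction = "backward", trace = 0)

# AIC of the selected model ("<none>") and of each model with one term removed
drops <- drop1(step_mod)
delta_aic <- drops$AIC[-1] - drops$AIC[1]
names(delta_aic) <- rownames(drops)[-1]

# Small increases (say, under about 3) flag candidates for further removal
delta_aic
```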

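To use BIC instead, set the input k to the BIC penalty multiplier, which is the natural log of the sample size: k = log(nrow(data)). Be careful: the call below uses log10() rather than log(), so its penalty is about 2.66 per parameter (n = 453) instead of the BIC value of about 6.12, and the values in its trace are not true BIC values. (stepAIC() labels the criterion "AIC" in its output whatever value of k is used.)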

stepAIC(home_mod0, k = log10(nrow(HELPrct)))
Start:  AIC=612.82
homeless ~ sex + substance + i1 + cesd + racegrp + age

            Df Deviance    AIC
- racegrp    3   589.44 608.03
- substance  2   588.49 609.74
- age        1   586.70 610.60
- cesd       1   587.43 611.34
<none>           586.26 612.82
- sex        1   589.64 613.55
- i1         1   596.88 620.79

Step:  AIC=608.03
homeless ~ sex + substance + i1 + cesd + age

            Df Deviance    AIC
- substance  2   592.45 605.73
- age        1   589.85 605.79
- cesd       1   591.10 607.04
<none>           589.44 608.03
- sex        1   593.20 609.14
- i1         1   600.00 615.94

Step:  AIC=605.73
homeless ~ sex + i1 + cesd + age

       Df Deviance    AIC
- age   1   593.49 604.11
- cesd  1   594.20 604.82
<none>      592.45 605.73
- sex   1   596.48 607.11
- i1    1   611.94 622.57

Step:  AIC=604.11
homeless ~ sex + i1 + cesd

       Df Deviance    AIC
- cesd  1   595.13 603.10
<none>      593.49 604.11
- sex   1   597.25 605.22
- i1    1   615.70 623.66

Step:  AIC=603.1
homeless ~ sex + i1

       Df Deviance    AIC
<none>      595.13 603.10
- sex   1   598.10 603.41
- i1    1   620.92 626.23

Call:  glm(formula = homeless ~ sex + i1, family = binomial(link = "logit"), 
    data = HELPrct)

Coefficients:
(Intercept)      sexmale           i1  
    0.92969     -0.39976     -0.02657  

Degrees of Freedom: 452 Total (i.e. Null);  450 Residual
Null Deviance:      625.3 
Residual Deviance: 595.1    AIC: 601.1
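The choice of logarithm matters for the size of the penalty: BIC uses the natural log, log(n), while the call above uses log10(n). This is a small arithmetic check using the deviances printed in the last step of the trace; it assumes n = 453 (the 452 total degrees of freedom plus one) and counts the intercept as a parameter, which is consistent with the AIC values in the output.

```r
n <- 453   # assumed from "452 Total" degrees of freedom in the output

# Deviances from the last step of the trace, and number of coefficients
dev   <- c(sex_i1 = 595.13, i1_only = 598.10)
n_par <- c(sex_i1 = 3,      i1_only = 2)      # including the intercept

# Penalty per parameter under each choice of k
c(AIC = 2, log10_n = log10(n), log_n = log(n))

# Criterion value = deviance + k * (number of parameters)
crit_log10 <- dev + log10(n) * n_par   # reproduces the values in the trace
crit_bic   <- dev + log(n)   * n_par   # natural log, as BIC requires
round(rbind(crit_log10, crit_bic), 2)
```

With the natural-log penalty, the model without sex has the smaller value, so a search using k = log(nrow(HELPrct)) would be expected to drop sex as well.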

To get less verbose output, set trace = 0 – but then you won’t know whether it would make sense to perhaps remove additional variables…

stepAIC(home_mod0, k = log10(nrow(HELPrct)), trace = 0)

Call:  glm(formula = homeless ~ sex + i1, family = binomial(link = "logit"), 
    data = HELPrct)

Coefficients:
(Intercept)      sexmale           i1  
    0.92969     -0.39976     -0.02657  

Degrees of Freedom: 452 Total (i.e. Null);  450 Residual
Null Deviance:      625.3 
Residual Deviance: 595.1    AIC: 601.1

18.2 Summary tables

You may want to compute and display summary tables for your projects. Here are a few examples of how to do it.

18.2.1 Mean (or sd, median, IQR, etc.) by groups

Compute the mean and sd (could use any other summary stats you want, though) for several quantitative variables, by groups.

Example: find mean and sd of iris flower Petal.Length and Petal.Width by Species and display results in a pretty table. The dataset is called iris.

Make a small table for each variable being summarized (one row per species), then stick them together.

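library(knitr)   # kable()
library(mosaic)  # df_stats() and tally(), as used in these notes
library(dplyr)   # mutate(), bind_rows(), arrange()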

length_stats <- iris |> 
  df_stats(Petal.Length ~ Species, mean, sd, long_names = FALSE) |>
  mutate(variable = 'Petal Length')

width_stats <- iris |> 
  df_stats(Petal.Width ~ Species, mean, sd, long_names = FALSE) |>
  mutate(variable = 'Petal Width')

my_table <- bind_rows(length_stats, width_stats)

kable(my_table)
| response     | Species    |  mean |        sd | variable     |
|:-------------|:-----------|------:|----------:|:-------------|
| Petal.Length | setosa     | 1.462 | 0.1736640 | Petal Length |
| Petal.Length | versicolor | 4.260 | 0.4699110 | Petal Length |
| Petal.Length | virginica  | 5.552 | 0.5518947 | Petal Length |
| Petal.Width  | setosa     | 0.246 | 0.1053856 | Petal Width  |
| Petal.Width  | versicolor | 1.326 | 0.1977527 | Petal Width  |
| Petal.Width  | virginica  | 2.026 | 0.2746501 | Petal Width  |

What if we want to round all table entries to 2 digits after the decimal?

kable(my_table, digits = 2)
| response     | Species    | mean |   sd | variable     |
|:-------------|:-----------|-----:|-----:|:-------------|
| Petal.Length | setosa     | 1.46 | 0.17 | Petal Length |
| Petal.Length | versicolor | 4.26 | 0.47 | Petal Length |
| Petal.Length | virginica  | 5.55 | 0.55 | Petal Length |
| Petal.Width  | setosa     | 0.25 | 0.11 | Petal Width  |
| Petal.Width  | versicolor | 1.33 | 0.20 | Petal Width  |
| Petal.Width  | virginica  | 2.03 | 0.27 | Petal Width  |

What if we want the column order to be Variable, Species, mean, sd, and sort by Species and then Variable?

my_table <- my_table |>
  dplyr::select(variable, Species, mean, sd) |>
  arrange(Species, variable)
kable(my_table, digits = 2)
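| variable     | Species    | mean |   sd |
|:-------------|:-----------|-----:|-----:|
| Petal Length | setosa     | 1.46 | 0.17 |
| Petal Width  | setosa     | 0.25 | 0.11 |
| Petal Length | versicolor | 4.26 | 0.47 |
| Petal Width  | versicolor | 1.33 | 0.20 |
| Petal Length | virginica  | 5.55 | 0.55 |
| Petal Width  | virginica  | 2.03 | 0.27 |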

What if we actually want a column for mean length, sd length, etc. and one row per species?

library(tidyverse)
my_table2 <- my_table |>
  pivot_wider(names_from = variable, 
              values_from = c("mean", "sd"),
              names_sep = ' ')
kable(my_table2, digits = 2, align = 'c')
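|  Species   | mean Petal Length | mean Petal Width | sd Petal Length | sd Petal Width |
|:----------:|:-----------------:|:----------------:|:---------------:|:--------------:|
|   setosa   |       1.46        |       0.25       |      0.17       |      0.11      |
| versicolor |       4.26        |       1.33       |      0.47       |      0.20      |
| virginica  |       5.55        |       2.03       |      0.55       |      0.27      |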

18.2.2 Proportions in categories by groups

You may also want to make a table of the proportion of observations in each category by groups, potentially for many variables.

For just one variable, we can use tally:

tally(~substance | sex, data = HELPrct, format = 'prop') |>
  kable(caption = 'Proportion using each substance', digits = 2)
Proportion using each substance

|         | female | male |
|:--------|-------:|-----:|
| alcohol |   0.34 | 0.41 |
| cocaine |   0.38 | 0.32 |
| heroin  |   0.28 | 0.27 |
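For reference, the same kind of table can be built in base R, which makes explicit that the proportions are computed within each group (each column sums to 1). This sketch uses the built-in mtcars data, with gear as the category and am as the grouping variable; these stand-ins are chosen only so that the code runs without extra packages.

```r
# Counts of gear (rows) by transmission type am (columns)
counts <- table(gear = mtcars$gear, am = mtcars$am)

# margin = 2 divides each column by its total: proportions within each group
props <- prop.table(counts, margin = 2)
round(props, 2)

# Check: the proportions within each group add up to 1
colSums(props)
```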

For many variables we can use a loop. For example, we might want to know the proportion homeless and housed and proportion using each substance, both by sex, from the HELPrct dataset. Above we were using the function knitr::kable() to make tables, but we can use pander::pander() too:

# select only variables needed for the table
# make the first variable the groups one
cat_data <- HELPrct |> dplyr::select(sex, substance, homeless) 

for (i in 2:ncol(cat_data)) {
  # proportions of variable i within each level of the grouping variable
  tally(~cat_data[, i] | cat_data[, 1], format = 'prop') |>
    pander::pander(caption = paste('Proportion in each',
                                   names(cat_data)[i]))
  # can rename variables in cat_data if you want better captions
}

18.3 Figures

We’ve made a lot of figures in this class, and almost all have been kind of mediocre. To aim for awesome, here are a couple of great references for inspiration, ideas, and best practices: