Chapter 6 Regression for Count Data

So far, we have fitted regression models for continuous-valued quantitative response variables. What if our response variable is really count data – discrete quantitative values limited to zero and positive integers?

6.1 Data Source

The dataset used here is beevisits, from:

Felicity Muth, Jacob S. Francis, and Anne S. Leonard. 2017. Bees use the taste of pollen to determine which flowers to visit. Biology Letters 12(7): DOI: 10.1098/rsbl.2016.0356. http://rsbl.royalsocietypublishing.org/content/roybiolett/12/7/20160356.full.pdf.

knitr::include_graphics('images/BeeExperiment.png')

6.2 A bad idea: multiple linear regression model

summary(beevisits)

##  colony                    beename        treatment  target.colour
##  W:126   giantbeeYsucroseyellow:  3   Quinine  :90   blue  :138   
##  X: 66   o51Wsucroseblue       :  3   Cellulose:90   yellow:132   
##  Y: 78   o54Wquinineblue       :  3   Sucrose  :90                
##          o58Wcelluloseyellow   :  3                               
##          o60Wcelluloseblue     :  3                               
##          o63Wquinineblue       :  3                               
##          (Other)               :252                               
##     novisits          flower  
##  Min.   : 0.00   familiar:90  
##  1st Qu.: 1.00   novel   :90  
##  Median : 3.00   target  :90  
##  Mean   : 4.44                
##  3rd Qu.: 7.00                
##  Max.   :21.00                
##

bee.lm <- lm(novisits ~ flower + treatment +
               colony, data=beevisits)
summary(bee.lm)

## 
## Call:
## lm(formula = novisits ~ flower + treatment + colony, data = beevisits)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.567 -2.111 -0.285  1.567 11.577 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.111      0.444    4.76  3.3e-06 ***
## flowernovel          -1.678      0.439   -3.83  0.00016 ***
## flowertarget          5.456      0.439   12.44  < 2e-16 ***
## treatmentCellulose    2.688      0.439    6.12  3.3e-09 ***
## treatmentSucrose      2.724      0.441    6.17  2.6e-09 ***
## colonyX              -0.832      0.449   -1.85  0.06511 .  
## colonyY              -1.849      0.425   -4.35  2.0e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.94 on 263 degrees of freedom
## Multiple R-squared:  0.576,  Adjusted R-squared:  0.567 
## F-statistic: 59.6 on 6 and 263 DF,  p-value: <2e-16

6.3 Problems with the linear model

What problems do we have with this model (its appropriateness, or goodness of fit to the data)?

Non-constant error variance
Non-normality of residuals
Some predicted values are less than 0 – it’s impossible that a bee could visit a flower a negative number of times!

6.4 Poisson Regression

Detailed notes on the model equation were filled in here in class

6.4.1 Fitting the Model

prm <- glm(novisits ~ flower + treatment +
               colony, data=beevisits,
           family=poisson(link='log')) 
summary(prm)

## 
## Call:
## glm(formula = novisits ~ flower + treatment + colony, family = poisson(link = "log"), 
##     data = beevisits)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.994  -1.312  -0.386   0.779   4.702  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.7905     0.0869    9.09  < 2e-16 ***
## flowernovel         -0.7507     0.1044   -7.19  6.5e-13 ***
## flowertarget         0.9994     0.0692   14.45  < 2e-16 ***
## treatmentCellulose   0.6982     0.0791    8.83  < 2e-16 ***
## treatmentSucrose     0.7095     0.0797    8.91  < 2e-16 ***
## colonyX             -0.1752     0.0718   -2.44    0.015 *  
## colonyY             -0.4372     0.0731   -5.98  2.2e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1236.33  on 269  degrees of freedom
## Residual deviance:  542.05  on 263  degrees of freedom
## AIC: 1261
## 
## Number of Fisher Scoring iterations: 5

6.4.2 Conditions

What conditions must hold for this model to be appropriate?

Response variable (y) contains count data
Linearity: \(log(\lambda_{i})\) is a linear function of the covariates \(x_1\), \(x_2\), … \(x_n\). (Later we will consider how to check this when some covariates are quantitative…in brief: plot log(rate) as a function of each covariate and look for (no non-)linear trend.)
Mean = Variance
Independence (of residuals)
There is not a condition specifying a PDF that the residuals should follow.

6.4.3 Model Assessment

Which conditions can we check already?

acf(resid(prm, type='response'), main='bee.lm Residuals')

What does type='response' mean?

This means that the residuals are reported on the same scale as the response variable (that is, in units of counts, rather than log(counts)).

beevisits <- beevisits %>%
  mutate(pm.resids = resid(prm, type='response'),
         pm.fitted = fitted(prm))

gf_point(pm.resids ~ pm.fitted, data=beevisits) %>%
  gf_labs(x='Fitted Values', y='Residuals')

This trumpet pattern is what we expect! Why?

If the variance equals the mean, then as the mean (fitted value) goes up, then the variance (spread of residuals) will also be larger.

But how can we decide if it’s the right amount of increase in variance with increase in fitted value? We can compute Pearson residuals, which are scaled by the expected variance. The Pearson residuals should have approximately constant variance as a function of fitted values, and a standard deviation of about 1.

beevisits <- beevisits %>%
  mutate(pm.pearson.resids = resid(prm, type='pearson'))

gf_point( pm.pearson.resids ~ pm.fitted, data=beevisits) %>%
  gf_labs(x='Fitted Values', y='Pearson Residuals')

A more complex solution: we could divide the fitted values into bins (choice of bin size is somewhat arbitrary; generally, you want them as small as possible, but still containing enough observations per bin to get good mean and variance estimates). In each bin, compute the mean and the variance of the residuals. Plot these means vs these variances and see if the slope is 1 (and intercept 0).

First, split the fitted values into 5 bins:

beevisits <- beevisits %>%
  mutate(fitted.bins = cut(fitted(prm), breaks=5))

head(beevisits)

##   colony                beename treatment target.colour novisits flower
## 1      Y giantbeeYsucroseyellow   Sucrose        yellow       12 target
## 2      W        o51Wsucroseblue   Sucrose          blue        8 target
## 3      W        o54Wquinineblue   Quinine          blue       13 target
## 4      W    o58Wcelluloseyellow Cellulose        yellow        9 target
## 5      W      o60Wcelluloseblue Cellulose          blue       11 target
## 6      W        o63Wquinineblue   Quinine          blue        5 target
##   lm.resids lm.fitted pm.resids pm.fitted pm.pearson.resids fitted.bins
## 1     3.559      8.44     4.136      7.86             1.475 (7.57,9.88]
## 2    -2.291     10.29    -4.177     12.18            -1.197 (9.88,12.2]
## 3     5.433      7.57     7.011      5.99             2.865 (5.27,7.57]
## 4    -1.255     10.26    -3.039     12.04            -0.876 (9.88,12.2]
## 5     0.745     10.26    -1.039     12.04            -0.300 (9.88,12.2]
## 6    -2.567      7.57    -0.989      5.99            -0.404 (5.27,7.57]

Next, compute the means and variances in each bin, view the result, and plot:

binned.bees <- beevisits %>%
  group_by(fitted.bins) %>%
  summarise(mean = mean(pm.fitted, na.rm=TRUE),
            var = var(pm.resids))

binned.bees

## # A tibble: 5 x 3
##   fitted.bins   mean   var
##   <fct>        <dbl> <dbl>
## 1 (0.661,2.97]  1.77  3.34
## 2 (2.97,5.27]   4.28 10.6 
## 3 (5.27,7.57]   5.99 21.3 
## 4 (7.57,9.88]   7.83 12.8 
## 5 (9.88,12.2]  11.5  13.5

gf_point(var ~ mean, data=binned.bees) %>%
  gf_labs(x='Mean Fitted Value', 
          y='Variance of Residuals') %>%
   gf_abline(slope=1, intercept=0)

That’s a bit of a pain, and still a judgment call at the end.

How else can we check the Mean = Variance condition?

6.4.4 Checking for overdispersion using overdispersion factor

We can estimate the overdispersion factor. This satisfies

\[ \text{residual variance} = \text{overdispersion factor} * \text{mean}\]

So if it’s a lot larger than 1, we have a problem. The function overdisp_fun() in package s245 computes an estimate:

require(s245)
overdisp_fun(prm)

## [1] 2.05

Here, it’s about 2: not too terrible, but still, residual variance is double what it should be according to our model. (Usually if the overdispersion is larger than 2 we will prefer a different model that better accounts for the fact that \(\text{mean} \neq \text{variance}\).)

6.5 Accounting for overdispersion: Negative Binomial Models

Finally, we could simply fit a model that allows for a more permissive mean-variance relationship (where the variance can be larger or smaller than the mean, by different amounts), and see if it is a better fit to the data according it our IC. The negative binomial distributions (type I and II) do this:

require(glmmTMB)
nbm1 <- glmmTMB(novisits ~ flower + treatment +
               colony, data=beevisits,
           family=nbinom1(link='log'))

nbm2 <- glmmTMB(novisits ~ flower + treatment +
               colony, data=beevisits,
           family=nbinom2(link='log'))

summary(nbm1)

##  Family: nbinom1  ( log )
## Formula:          novisits ~ flower + treatment + colony
## Data: beevisits
## 
##      AIC      BIC   logLik deviance df.resid 
##     1183     1211     -583     1167      262 
## 
## 
## Overdispersion parameter for nbinom1 family (): 1.07 
## 
## Conditional model:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.7173     0.1229    5.84  5.3e-09 ***
## flowernovel         -0.7414     0.1450   -5.11  3.2e-07 ***
## flowertarget         1.0384     0.0981   10.58  < 2e-16 ***
## treatmentCellulose   0.7156     0.1117    6.41  1.5e-10 ***
## treatmentSucrose     0.7432     0.1115    6.67  2.6e-11 ***
## colonyX             -0.1150     0.0997   -1.15  0.24904    
## colonyY             -0.3839     0.1015   -3.78  0.00016 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(nbm2)

##  Family: nbinom2  ( log )
## Formula:          novisits ~ flower + treatment + colony
## Data: beevisits
## 
##      AIC      BIC   logLik deviance df.resid 
##     1207     1236     -596     1191      262 
## 
## 
## Overdispersion parameter for nbinom2 family (): 4.69 
## 
## Conditional model:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.7354     0.1178    6.24  4.2e-10 ***
## flowernovel         -0.7255     0.1269   -5.72  1.1e-08 ***
## flowertarget         1.0438     0.0998   10.46  < 2e-16 ***
## treatmentCellulose   0.7619     0.1123    6.79  1.2e-11 ***
## treatmentSucrose     0.7694     0.1142    6.73  1.6e-11 ***
## colonyX             -0.1681     0.1084   -1.55     0.12    
## colonyY             -0.5153     0.1076   -4.79  1.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can use AIC or BIC-based model selection to decide which model fits best, between Poisson, NB1, and NB2 models.

AIC(prm, nbm1, nbm2)

##      df  AIC
## prm   7 1261
## nbm1  8 1183
## nbm2  8 1207

We have a clear winner: NB1!

6.6 Accounting for overdispersion: quasi-Poisson Model

Or yet another option…in our class, we will not use quasi-Poisson models as much, as they are fitted via quasi-likelihood and this complicates model selection.

qprm <- glm(novisits ~ flower + treatment +
               colony, data=beevisits,
           family=quasipoisson(link='log'))
summary(qprm)

## 
## Call:
## glm(formula = novisits ~ flower + treatment + colony, family = quasipoisson(link = "log"), 
##     data = beevisits)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.994  -1.312  -0.386   0.779   4.702  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.790      0.124    6.35  9.1e-10 ***
## flowernovel          -0.751      0.149   -5.02  9.3e-07 ***
## flowertarget          0.999      0.099   10.10  < 2e-16 ***
## treatmentCellulose    0.698      0.113    6.17  2.6e-09 ***
## treatmentSucrose      0.710      0.114    6.22  1.9e-09 ***
## colonyX              -0.175      0.103   -1.71    0.089 .  
## colonyY              -0.437      0.105   -4.18  4.0e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.05)
## 
##     Null deviance: 1236.33  on 269  degrees of freedom
## Residual deviance:  542.05  on 263  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

We should not compare AIC with QAIC to compare two models, so to decide between Poisson and quasi-Poisson (if you ever use it) you must rely on the model assessment plots and the overdispersion factor to decide which is better.

6.7 Model selection with dredge() and (Q)AIC, BIC

Poisson and negative binomial models are fitted via maximum likelihood, so AIC or BIC may be used for model selection.

How can we use model selection criteria in a case where the likelihood can’t be computed exactly?

The quasi (log) likelihood is an approximation to the likelihood. It has some properties in common with a (log) likelihood, but is not a likelihood; we resort to using it in cases where the (log) likelihood can not even be evaluated.

With some code/mathematical gymnastics, we can use principles of quasi-likelihood to estimate QAIC (quasi-AIC, or AIC based on quasi-likelihood) in R for model selection.

6.7.1 Review: all subsets selection with dredge()

What if, instead of comparing two (or a few) specific models, we want to compare all possible models that contain some combination of a set of candidate covariates? The package MuMIn contains a function, dredge(), that takes as input a “full” model (one containing all the candidate covariates). It then computes AIC (or other model selection criteria) for all possible models containing subsets of the covariate and reports the results in a ranked table. For example, for our Poisson regression model, we could do:

require(MuMIn)
#have to make sure na.action is 'na.fail' for input model
prm <- update(prm, na.action='na.fail')
dredge(prm, rank='AIC')

## Global model call: glm(formula = novisits ~ flower + treatment + colony, family = poisson(link = "log"), 
##     data = beevisits, na.action = "na.fail")
## ---
## Model selection table 
##   (Intrc) colny flowr trtmn df logLik  AIC delta weight
## 8   0.790     +     +     +  7   -623 1261     0      1
## 7   0.643           +     +  5   -642 1295    34      0
## 4   1.310     +     +        5   -677 1364   104      0
## 3   1.156           +        3   -695 1396   136      0
## 6   1.124     +           +  5   -899 1807   547      0
## 5   0.977                 +  3   -918 1841   581      0
## 2   1.644     +              3   -953 1911   650      0
## 1   1.490                    1   -970 1943   682      0
## Models ranked by AIC(x)

dredge(prm, rank='BIC')

## Global model call: glm(formula = novisits ~ flower + treatment + colony, family = poisson(link = "log"), 
##     data = beevisits, na.action = "na.fail")
## ---
## Model selection table 
##   (Intrc) colny flowr trtmn df logLik  BIC delta weight
## 8   0.790     +     +     +  7   -623 1286   0.0      1
## 7   0.643           +     +  5   -642 1313  26.8      0
## 4   1.310     +     +        5   -677 1382  96.6      0
## 3   1.156           +        3   -695 1407 121.2      0
## 6   1.124     +           +  5   -899 1825 539.5      0
## 5   0.977                 +  3   -918 1852 566.3      0
## 2   1.644     +              3   -953 1922 636.0      0
## 1   1.490                    1   -970 1946 660.7      0
## Models ranked by BIC(x)

6.7.2 Review: IC “weights”

Note the last two columns of the dredge output: the “delta” (or \(\delta\)) AIC or BIC values and the “weights”. The \(\delta\)s are obtained by simply subtracting the best model’s IC value from that of each other model.

We already mentioned a rule of thumb: that the \(\delta IC\) should be at least 3 or to provide reasonable evidence that one model is really better than the other. Another way of measuring the differences between models is to use model weights. Theoretically, these measure the relative likelihoods of different models; you can think of them as giving the probability that a given model is the best-fitting one in the set of models examined, according to the IC being used.

Model weights are computed simply according to: \[ e^\frac{-\delta IC}{2}\]

And they sum to one for all models in a dredge().

6.7.3 Extending dredge() to quasi-Likelihood

With (rather a lot of) effort to define some custom functions, we can do the same thing for a quasi-Poisson model using quasi-AIC. Note: This idea is based on notes by Ben Bolker provided with the R package bbmle.

require(MuMIn)

# modify a glm() output object so that
# it contains a quasipoisson fit but the 
# AIC (likelihood) from the equivalent regular Poisson model
x.quasipoisson <- function(...) {
res <- quasipoisson(...)
res$aic <- poisson(...)$aic
res
}

# function to extract the overdispersion parameter
# from a quasi model
dfun <- function(object) {
with(object,sum((weights * residuals^2)[weights > 0])/df.residual)
}

# function that modifies MuMIn::dredge() 
# for use with quasi GLM
qdredge <- function(model, family='x.quasipoisson', na.action=na.fail, chat = dfun(model), rank='QAIC', ...){
  model2 <- update(model, family=family, na.action=na.action)
  (dt <- dredge(model2, rank=rank, chat=chat, ...))
}

#do "dredge" for model selection
qdredge(qprm)

## Global model call: glm(formula = novisits ~ flower + treatment + colony, family = family, 
##     data = beevisits, na.action = na.action)
## ---
## Model selection table 
##   (Intrc) colny flowr trtmn df logLik rank delta weight
## 8   0.790     +     +     +  7   -623  625   0.0  0.999
## 7   0.643           +     +  5   -642  640  14.6  0.001
## 4   1.310     +     +        5   -677  674  48.7  0.000
## 3   1.156           +        3   -695  687  62.2  0.000
## 6   1.124     +           +  5   -899  890 265.0  0.000
## 5   0.977                 +  3   -918  904 279.6  0.000
## 2   1.644     +              3   -953  939 313.6  0.000
## 1   1.490                    1   -970  952 327.1  0.000
## Models ranked by rank(x, chat = 2.04707063774377)

Note: the qdredge() function is also provided for you in the package s245. So if you do want to use this, all you need to do is use s245::qdredge() instead of dredge().

6.8 Offsets

In our model, the response variable was the number of flowers visited by a bee, and each bee was in the same experimental setting for the same amount of time, so there was no need to account for effort or time spent in each case.

This is not always true: consider, for example:

Dolphin surveys
Bird counts
Consider the schools and crime example from homework as well

In this case, it would be natural to adjust for effort. The intuitive way to do it would be to use counts per unit effort as the response variable:

\[ log(\frac{\lambda_i}{effort}) = \beta_0 + \dots \]

But notice that this is equivalent to including \(log(effort)\) as an ``offset" on the right hand side of the regression equation:

\[ log(\lambda_i) = \beta_0 + \dots + log(effort)\]

This is how we specify models with offsets in R:

offset.mod <- glm(counts ~ predictor1 + predictor2 +
                    offset(log(effort)), family=poisson)

Note: if you use dredge() with a model with an offset, be sure to specify the offset as a “fixed” term, i.e. a term that must be included in all models:

dredge(offset.mod, fixed = 'offset(log(effort))')

6.9 Prediction Plots

Once you have a “best” model, how do you interpret it?

When we went to great lengths to make prediction plots for a linear regression model so we could “see” the slope of the predicted relationship, you may have wondered: why bother? I can just look at the slope estimate and get the same insight!

Now, with a more complicated model equation with a link function, it’s not so easy. Now, we will really appreciate those prediction plots!

With the link function and the Poisson distribution, it is more challenging to interpret the coefficients of the model directly. The easiest way to understand the effects of different predictors is to look at plots of model predictions. However, as always we don’t want to plot fitted(model) as a function of each predictor in a model with more than one predictor; predictors other than the one we are interested in will influence the predictions, introducing extra variation. Instead, we will construct a new (fake) dataset to make predictions for, in which all predictors but the one we are interested in are held constant. For example, using our quasi-Poisson model and looking at the effect of colony (predicted values with 95% CIs):

newdata <- data.frame(colony=c('W', 'X', 'Y'),
                      flower='familiar',
                      target.colour='blue',
                      treatment='Cellulose')

pred = predict(qprm, newdata=newdata,
                 type='response',
                 se.fit=TRUE)

newdata <- newdata %>%
  mutate(preds = pred$fit,
         CIlow = pred$fit - 1.96*pred$se.fit,
         CIup = pred$fit + 1.96*pred$se.fit)

gf_point(preds ~ colony, data=newdata) %>%
  gf_labs(x='Colony', y='Predicted\nN. Visits') %>%
  gf_errorbar(CIlow + CIup ~ colony, data=newdata)

If we had a quantitative predictor instead, the process would be similar, except when we define newdata, we would have to choose a range and granularity for which to make predictions. For example, imagine if bee length in mm was one of our predictors, and we wanted predictions for lengths between 0.5 and 3 mm (with a value every 0.05mm). We might begin:

newdata <- expand.grid(bee.length = seq(from=0.05, by=0.05, to=3),
                       colony=c('W'),
                       flower='familiar',
                      target.colour='blue',
                      treatment='Cellulose')
head(newdata)

##   bee.length colony   flower target.colour treatment
## 1       0.05      W familiar          blue Cellulose
## 2       0.10      W familiar          blue Cellulose
## 3       0.15      W familiar          blue Cellulose
## 4       0.20      W familiar          blue Cellulose
## 5       0.25      W familiar          blue Cellulose
## 6       0.30      W familiar          blue Cellulose

Then, when we make the plot, we would need to use gf\_ribbon() instead of gf\_errorbar() to show the CI.

We can also use pred_plot(fitted_model, 'variable_name') to make these plots with a lot less code (and if you need to know the values at which other predictors are fixed, use get_fixed(dataset)).