Regression for Counts and Rates

Brandon M. Greenwell, PhD

University of Cincinnati

Count data

  • A type of data in which the observations take on non-negative integer values \(\left\{0, 1, 2, \dots\right\}\)

  • Examples include:

    • The number of patients who come to the ER of Children’s Hospital between 9PM and 1AM.
    • The number of shoppers in Kenwood Towne Centre on a calendar day.
    • The number of Google searches (in a week) for flights to Shanghai right before Lunar New Year.

The Poisson distribution

  • The simplest distribution for modeling counts is the Poisson distribution:

    • \(Y \sim \mathrm{Poi}\left(\mu\right)\)
    • \(f\left(y\right) = \mathrm{P}\left(Y = y\right) = \frac{\exp\left(-\mu\right)\mu^y}{y!}\), for \(y = 0, 1, 2, \dots\)
    • \(\mathrm{E}\left(Y\right) = \mathrm{Var}\left(Y\right) = \mu\), where \(\mu \in \left(0, \infty\right)\)
    • Fact: \(\sum_i Y_i \sim \mathrm{Poi}\left(\sum_i \mu_i\right)\) (aggregated data)
  • An interesting characteristic of the Poisson is that the mean and variance are equal to each other

The Poisson distribution

Show R code
# Install required package(s)
pkgs <- c("faraway", "investr", "mgcv", "performance", "pscl")
lib <- installed.packages()[, "Package"]
install.packages(setdiff(pkgs, lib))

# Y ~ Poisson(lambda = 0.5)
set.seed(2004)  # for reproducibility
par(mfrow = c(2, 2))
for (lambda in c(0.5, 2, 5, 15)) {
  y <- dpois(0:35, lambda = lambda)
  barplot(y, xlab = "y", ylab = "P(Y = y)", names = 0:35, main = paste("E(Y) =", lambda), 
          col = "dodgerblue2", border = "dodgerblue2", las = 1)
}

The Poisson distribution

Show R code
y <- rpois(10000, lambda = 200)
hist(y, br = 50)

The Poisson distribution

  • If the count is some number out of some possible total, then the response would be more appropriately modeled as a binomial r.v.

  • However, for small \(p\) and large \(n\), the Poisson distribution provides a reasonable approximation to the binomial; For example, in modeling the incidence of rare forms of cancer, the number of people affected is a small proportion of the population in a given geographical area

Show R code
c("Binomial" = pbinom(5, size = 8, p = 0.7),
  "Poisson" = ppois(5, lambda = 8 * 0.7))
 Binomial   Poisson 
0.4482262 0.5118609 
Show R code
c("Binomial" = pbinom(5, size = 100, p = 0.05),
  "Poisson" = ppois(5, lambda = 100 * 0.05))
 Binomial   Poisson 
0.6159991 0.6159607 

Galápagos islands data

There are 30 Galapagos islands and 7 variables in the data set. The relationship between the number of plant species (\(Y\)) and several geographic variables is of interest. The original data set contained several missing values which have been filled for convenience. See the faraway::galamiss data set for the original version.

Galápagos islands data

Galápagos islands data

We’ll remove Endemics since we won’t be using it!

Show R code
# Load the Galapagos Islands data
data(gala, package = "faraway")
gala$Endemics <- NULL

# Print structure of data frame
str(gala)
'data.frame':   30 obs. of  6 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...

Galápagos islands data

Summary of data frame

Show R code
summary(gala)
    Species            Area             Elevation          Nearest     
 Min.   :  2.00   Min.   :   0.0100   Min.   :  25.00   Min.   : 0.20  
 1st Qu.: 13.00   1st Qu.:   0.2575   1st Qu.:  97.75   1st Qu.: 0.80  
 Median : 42.00   Median :   2.5900   Median : 192.00   Median : 3.05  
 Mean   : 85.23   Mean   : 261.7087   Mean   : 368.03   Mean   :10.06  
 3rd Qu.: 96.00   3rd Qu.:  59.2375   3rd Qu.: 435.25   3rd Qu.:10.03  
 Max.   :444.00   Max.   :4669.3200   Max.   :1707.00   Max.   :47.40  
     Scruz           Adjacent      
 Min.   :  0.00   Min.   :   0.03  
 1st Qu.: 11.03   1st Qu.:   0.52  
 Median : 46.65   Median :   2.59  
 Mean   : 56.98   Mean   : 261.10  
 3rd Qu.: 81.08   3rd Qu.:  59.24  
 Max.   :290.20   Max.   :4669.32  

Galápagos islands data

Show R code
pairs(gala)

Galápagos islands data

Show R code
pairs(~ log(Species) + log(Area) + log(Elevation) + 
  log(Nearest) + log(Scruz + 0.1) + log(Adjacent), data = gala)

Galápagos islands data

Let’s start with a linear model

Show R code
gala.ols <- lm(log(Species) ~ log(Area) + log(Elevation) + log(Nearest) + I(log(Scruz + 0.1)) + log(Adjacent), data = gala)
summary(gala.ols)

Call:
lm(formula = log(Species) ~ log(Area) + log(Elevation) + log(Nearest) + 
    I(log(Scruz + 0.1)) + log(Adjacent), data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4563 -0.5192 -0.1059  0.4632  1.3351 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          5.10569    1.64880   3.097  0.00493 ** 
log(Area)            0.50350    0.09942   5.064 3.53e-05 ***
log(Elevation)      -0.37384    0.32242  -1.159  0.25767    
log(Nearest)        -0.06564    0.11475  -0.572  0.57262    
I(log(Scruz + 0.1)) -0.08255    0.09517  -0.867  0.39433    
log(Adjacent)       -0.02488    0.04596  -0.541  0.59327    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7877 on 24 degrees of freedom
Multiple R-squared:  0.7899,    Adjusted R-squared:  0.7461 
F-statistic: 18.05 on 5 and 24 DF,  p-value: 1.941e-07

Galápagos islands data

Residual analysis

Show R code
par(mfrow = c(2, 2))
plot(gala.ols, which = 1:4)

Poisson regression

  • Need a way to link the mean response \(\mathrm{E}\left(Y\right) = \mu \in \left(0, \infty\right)\) to the linear predictor \(\eta = \boldsymbol{x}^\top\boldsymbol{p}\)

  • In Poisson we regression, we default to \[ \log\left(\mu\right) = \boldsymbol{x}^\top\boldsymbol{p} \]

  • Maximum likelihood estimation provides a convenient estimate of \(\boldsymbol{\beta}\)

Galápagos islands data

Try a Poisson regression

Show R code
summary(gala.poi <- glm(Species ~ ., data = gala, family = poisson))

Call:
glm(formula = Species ~ ., family = poisson, data = gala)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: 889.68

Number of Fisher Scoring iterations: 5

Galápagos islands data

Check mean-variance relationship

Show R code
mu <- fitted(gala.poi)  # fitted fitted.values
y <- gala.poi$y  # observed response values
plot(log(mu), log((y - mu)^2), xlab = expression(hat(mu)),
     ylab = expression((y-hat(mu))^2))
abline(0, 1)

Galápagos islands data

Crude check for overdispersion

Show R code
performance::check_overdispersion(gala.poi)
# Overdispersion test

       dispersion ratio =  31.749
  Pearson's Chi-Squared = 761.979
                p-value = < 0.001
Overdispersion detected.

Galápagos islands data

Similar to before, can use quasipoisson() family to correct fo overdispersion:

Show R code
gala.quasipoi <- glm(Species ~ ., family = quasipoisson(link = "log"),  data = gala)
summary(gala.quasipoi)

Call:
glm(formula = Species ~ ., family = quasipoisson(link = "log"), 
    data = gala)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.1548079  0.2915901  10.819 1.03e-10 ***
Area        -0.0005799  0.0001480  -3.918 0.000649 ***
Elevation    0.0035406  0.0004925   7.189 1.98e-07 ***
Nearest      0.0088256  0.0102622   0.860 0.398292    
Scruz       -0.0057094  0.0035251  -1.620 0.118380    
Adjacent    -0.0006630  0.0001653  -4.012 0.000511 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 31.74921)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Rates and offsets

  • The number of observed events may depend on a size variable that determines the number of opportunities for the events to occur

    • For example, the number of burglaries reported in different cities
  • In other cases, the size variable may be time

    • For example, the number of customers served by a sales worker (must take account of the differing amounts of time worked)

Rates and offsets

An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities (ca) observed. The number (cells), in hundreds of cells exposed in each run, differs. The dose amount (doseamt) and the rate (doserate) at which the dose is applied are the predictors of interest. The hypothesized model is as follows:

\[ \begin{align} \log\left(\mathtt{ca}/\mathtt{cells}\right) &= \boldsymbol{x}^\top\boldsymbol{\beta} \\ \implies \log\left(\mathtt{ca}\right) &= \log\left(\mathtt{cells}\right) + \boldsymbol{x}^\top\boldsymbol{\beta} \end{align} \]

Rates and offsets

Show R code
dicentric <- faraway::dicentric
dicentric$dosef <- factor(dicentric$doseamt)
fit <- glm(ca ~ offset(log(cells)) + log(doserate)*dosef, family = poisson, 
           data = dicentric)
summary(fit)

Call:
glm(formula = ca ~ offset(log(cells)) + log(doserate) * dosef, 
    family = poisson, data = dicentric)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -2.74671    0.03426 -80.165  < 2e-16 ***
log(doserate)           0.07178    0.03518   2.041 0.041299 *  
dosef2.5                1.62542    0.04946  32.863  < 2e-16 ***
dosef5                  2.76109    0.04349  63.491  < 2e-16 ***
log(doserate):dosef2.5  0.16122    0.04830   3.338 0.000844 ***
log(doserate):dosef5    0.19350    0.04243   4.561  5.1e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4753.00  on 26  degrees of freedom
Residual deviance:   21.75  on 21  degrees of freedom
AIC: 209.16

Number of Fisher Scoring iterations: 4

zero-inflated outcomes

The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group, and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.

Zero-inflated outcomes

Our sample consists of We have data on N=250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).

The data can be read in as follows:

fish <- read.csv("https://stats.idre.ucla.edu/stat/data/fish.csv")

Zero-inflated outcomes

# Retain only variables of interest and print summary
fish <- fish[, c("count", "child", "persons", "camper")]
summary(fish)
     count             child          persons          camper     
 Min.   :  0.000   Min.   :0.000   Min.   :1.000   Min.   :0.000  
 1st Qu.:  0.000   1st Qu.:0.000   1st Qu.:2.000   1st Qu.:0.000  
 Median :  0.000   Median :0.000   Median :2.000   Median :1.000  
 Mean   :  3.296   Mean   :0.684   Mean   :2.528   Mean   :0.588  
 3rd Qu.:  2.000   3rd Qu.:1.000   3rd Qu.:4.000   3rd Qu.:1.000  
 Max.   :149.000   Max.   :3.000   Max.   :4.000   Max.   :1.000  

Zero-inflated outcomes

Show R code
pairs(log(count + 0.1) ~ ., data = fish)

Zero-inflated outcomes

Too many zeros?

Show R code
barplot(table(fish$count))

Zero-inflated outcomes

fish.poi <- glm(count ~ ., data = fish, family = poisson)
summary(fish.poi)

Call:
glm(formula = count ~ ., family = poisson, data = fish)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.98183    0.15226  -13.02   <2e-16 ***
child       -1.68996    0.08099  -20.87   <2e-16 ***
persons      1.09126    0.03926   27.80   <2e-16 ***
camper       0.93094    0.08909   10.45   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2958.4  on 249  degrees of freedom
Residual deviance: 1337.1  on 246  degrees of freedom
AIC: 1682.1

Number of Fisher Scoring iterations: 6
performance::check_zeroinflation(fish.poi)
# Check for zero-inflation

   Observed zeros: 142
  Predicted zeros: 95
            Ratio: 0.67
Model is underfitting zeros (probable zero-inflation).

The hurdle model

In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros (i.e., the zeroes that were not simply a result of bad luck or lack of fishing skill). In particular, we’d like to estimate the effect of party size on catching zero fish.

We can accomplish this in several ways, but popular choices include:

  1. The zero-inflated Poisson (or negative binomial) model
  2. The hurdle model

The hurdle model

In this example, we’ll use a simple hurdle model, which essentially fits two separate models:

  1. \(\mathrm{P}\left(Y = 0\right)\) via a logistic regression
  2. \(\mathrm{P}\left(Y = j\right)\), \(j = 1, 2, \dots\) via a truncated Poisson regression

The hurdle model

You can fit hurdle models using the hurdle() function from package pscl:

Show R code
suppressMessages(library(pscl))

summary(fish.hur <- hurdle(count ~ child + camper | persons, data = fish))

Call:
hurdle(formula = count ~ child + camper | persons, data = fish)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.8590 -0.7384 -0.6782 -0.1234 23.9679 

Count model coefficients (truncated poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.64668    0.08278  19.892   <2e-16 ***
child       -0.75918    0.09004  -8.432   <2e-16 ***
camper       0.75166    0.09112   8.249   <2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.7808     0.3240  -2.410   0.0160 *
persons       0.1993     0.1161   1.716   0.0862 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 14 
Log-likelihood: -1047 on 5 Df

The hurdle model

Check the logit part directly:

z <- fish
z$count <- ifelse(z$count == 0, 0, 1)
glm(count ~ persons, data = z, family = binomial)

Call:  glm(formula = count ~ persons, family = binomial, data = z)

Coefficients:
(Intercept)      persons  
    -0.7808       0.1993  

Degrees of Freedom: 249 Total (i.e. Null);  248 Residual
Null Deviance:      341.9 
Residual Deviance: 339  AIC: 343

The hurdle model

Interpretation of the previous model

  • The expected log number of the fish caught reduces by 0.759 for every additional chile (all else held constant)

  • Being a camper increases the expected log number of fish caught by 0.752 (all else held constant)

  • The log odds of catching at least one fish increases by 0.199 for every additional person

The hurdle model

Predicting new observations

newobs <- data.frame("child" = 0, "persons" = 3, "camper" = 1)
head(predict(fish.hur, newdata = newobs, type = "prob"))
          0            1           2           3           4          5
1 0.5456108 8.310743e-05 0.000457296 0.001677504 0.004615207 0.01015801
           6          7          8          9         10         11         12
1 0.01863138 0.02929101 0.04029322 0.04926936 0.05422062 0.05424495 0.04974685
          13         14         15         16         17          18         19
1 0.04211237 0.03310314 0.02428653 0.01670448 0.01081363 0.006611296 0.00382931
           20          21           22           23           24          25
1 0.002107066 0.001104196 0.0005523459 0.0002642839 0.0001211845 5.33451e-05
            26           27           28           29           30          31
1 2.257921e-05 9.203065e-06 3.617112e-06 1.372624e-06 5.035212e-07 1.78749e-07
            32           33           34           35           36          37
1 6.147253e-08 2.050004e-08 6.635342e-09 2.086329e-09 6.377754e-10 1.89694e-10
            38           39           40           41           42           43
1 5.493606e-11 1.550174e-11 4.264891e-12 1.144752e-12 2.999507e-13 7.676599e-14
            44           45           46           47           48           49
1 1.920011e-14 4.695466e-15 1.123333e-15 2.630256e-16 6.030375e-17 1.354365e-17
           50           51           52           53           54           55
1 2.98094e-18 6.432364e-19 1.361303e-19 2.826614e-20 5.760501e-21 1.152617e-21
            56           57           58           59           60          61
1 2.265085e-22 4.373178e-23 8.297681e-24 1.547719e-24 2.838759e-25 5.12137e-26
            62           63           64           65           66           67
1 9.090379e-27 1.587921e-27 2.730464e-28 4.622859e-29 7.708223e-30 1.266097e-30
            68           69           70           71           72           73
1 2.049017e-31 3.268015e-32 5.137756e-33 7.963475e-34 1.217188e-34 1.834942e-35
           74           75           76           77           78           79
1 2.72884e-36 4.004095e-37 5.798001e-38 8.286576e-39 1.169144e-39 1.628653e-40
            80           81           82          83           84          85
1 2.240402e-41 3.043887e-42 4.085095e-43 5.41641e-44 7.096101e-45 9.18731e-46
            86           87           88           89           90           91
1 1.175648e-46 1.487119e-47 1.859733e-48 2.299578e-49 2.811856e-50 3.400472e-51
            92           93           94           95          96           97
1 4.067606e-52 4.813306e-53 5.635119e-54 6.527803e-55 7.48313e-56 8.489832e-57
            98           99          100          101          102          103
1 9.533679e-58 1.059773e-58 1.166273e-59 1.270769e-60 1.371052e-61 1.464887e-62
           104          105          106         107          108          109
1 1.550095e-63 1.624638e-64 1.686701e-65 1.73477e-66 1.767688e-67 1.784706e-68
           110          111          112          113          114          115
1 1.785507e-69 1.770215e-70 1.739384e-71 1.693966e-72 1.635262e-73 1.564865e-74
          116          117          118          119          120          121
1 1.48459e-75 1.396395e-76 1.302308e-77 1.204354e-78 1.104487e-79 1.004529e-80
           122          123          124          125          126          127
1 9.061297e-82 8.107235e-83 7.195129e-84 6.334555e-85 5.532648e-86 4.794208e-87
           128          129          130          131          132          133
1 4.121871e-88 3.516351e-89 2.976709e-90 2.500648e-91 2.084809e-92 1.725052e-93
           134          135          136          137          138          139
1 1.416722e-94 1.154884e-95 9.345169e-97 7.506787e-98 5.986356e-99 4.73953e-100
            140           141           142           143           144
1 3.725587e-101 2.907791e-102 2.253525e-103 1.734258e-104 1.325375e-105
            145           146           147           148           149
1 1.005908e-106 7.582161e-108 5.676272e-109 4.220744e-110 3.117383e-111