
Veterinary Epidemiologic Research: Count and Rate Data – Poisson Regression and Risk Ratios

As noted in section 18.4.1 of the book Veterinary Epidemiologic Research, logistic regression is widely used for binary data, with the estimates reported as odds ratios (OR). While the OR is appropriate for case-control studies, risk ratios (RR) are preferred for cohort studies because the RR estimates probabilities directly. Moreover, the rare-event assumption behind using the OR as an approximation of the RR is often forgotten, and when the outcome is common the OR overestimates the true effect.

One approach to get the RR is to fit a generalised linear model (GLM) with a binomial distribution and a log link, but these models sometimes fail to converge (Zou, 2004). Another option is to use a Poisson regression with no exposure or offset specified (McNutt, 2003); it gives estimates with very little bias but confidence intervals that are too wide. Using robust standard errors, however, gives correct confidence intervals (Greenland, 2004; Zou, 2004).

We use data on culling of dairy cows to demonstrate this.

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
temp)
load(unz(temp, "ver2_data_R/culling.rdata"))
unlink(temp)

table(culling$culled)

  0   1 
255 466 
### recoding number of lactation
culling$lact <- with(culling, ifelse(lact_c3 > 1, 2, lact_c3))

The logistic regression:

log.reg <- glm(culled ~ lact, family = binomial("logit"), data = culling)
summary(log.reg)

Call:
glm(formula = culled ~ lact, family = binomial("logit"), data = culling)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.546  -1.199   0.849   0.849   1.156  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.734      0.299   -2.45    0.014 *  
lact           0.784      0.171    4.59  4.4e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 4

cbind(exp(coef(log.reg)), exp(confint(log.reg)))
Waiting for profiling to be done...
                 2.5 % 97.5 %
(Intercept) 0.48 0.267  0.864
lact        2.19 1.568  3.065

We can compute the OR and RR by hand:

with(culling, table(culled, lact))
      lact
culled   1   2
     0  97 158
     1 102 364
## OR is 2.19:
(364 / 158) / (102 / 97)
[1] 2.19
## or the ratio between the cross-products
(364 * 97) / (158 * 102)
[1] 2.19
## or with epicalc
library(epicalc)
with(culling, cc(culled, lact, graph = FALSE))

       lact
culled    1   2 Total
  0      97 158   255
  1     102 364   466
  Total 199 522   721

OR =  2.19 
Exact 95% CI =  1.54, 3.1  
Chi-squared = 21.51, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0 

# RR is 1.36:
(364 / 522) / (102 / 199)
[1] 1.36
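As an aside (not part of the original analysis), if you have the baseline risk you can also convert the OR into an approximate RR with the correction of Zhang and Yu (1998), RR = OR / ((1 - p0) + p0 * OR); a quick check reproduces the value above:

## approximate conversion of OR to RR given the baseline risk (Zhang & Yu, 1998)
p0 <- 102 / 199   # risk of culling among first-lactation cows
OR <- 2.19
round(OR / ((1 - p0) + p0 * OR), 2)
[1] 1.36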

The GLM with binomial distribution and log link:

log.reg2 <- glm(culled ~ lact, family = binomial(link = "log"), data = culling)
summary(log.reg2)

Call:
glm(formula = culled ~ lact, family = binomial(link = "log"), 
    data = culling)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.546  -1.199   0.849   0.849   1.156  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 5

cbind(exp(coef(log.reg2)), exp(confint(log.reg2)))
Waiting for profiling to be done...
                  2.5 % 97.5 %
(Intercept) 0.377  0.28  0.488
lact        1.360  1.18  1.588

The modified Poisson regression with robust error variance described by Zou (2004), fit as a GEE with each individual observation treated as its own cluster:

## create id for each observation
culling$id <- 1:length(culling$herd)
library(geepack)
zou.mod <- geeglm(culled ~ lact, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling)
summary(zou.mod)

Call:
geeglm(formula = culled ~ lact, family = poisson(link = "log"), 
    data = culling, id = id, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err Wald Pr(>|W|)    
(Intercept)  -0.9762  0.1412 47.8  4.8e-12 ***
lact          0.3078  0.0749 16.9  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Estimated Scale Parameters:
            Estimate Std.err
(Intercept)    0.354   0.021

Correlation: Structure = exchangeable  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha        0       0
Number of clusters:   721   Maximum cluster size: 1 
## exponentiated coefficients
exp(coef(zou.mod))
(Intercept)        lact 
      0.377       1.360 
## getting confidence intervals
library(doBy)
zou.mod.coefci <- esticon(zou.mod, diag(2))
zou.mod.expci <- exp(cbind(zou.mod.coefci$Estimate, zou.mod.coefci$Lower, zou.mod.coefci$Upper))
rownames(zou.mod.expci) <- names(coef(zou.mod))
colnames(zou.mod.expci) <- c("RR", "Lower RR", "Upper RR")
zou.mod.expci
               RR Lower RR Upper RR
(Intercept) 0.377    0.286    0.497
lact        1.360    1.175    1.576

Or the same using glm() and then getting robust standard errors:

pois.reg <- glm(culled ~ lact, family = poisson(link = "log"), data = culling)
library(sandwich) # to get robust estimators
library(lmtest) # to test coefficients
coeftest(pois.reg, vcov = sandwich)

z test of coefficients:

            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

## RR
exp(coef(pois.reg))
(Intercept)        lact 
      0.377       1.360 
## CI
0.3078 + qnorm(0.05 / 2) * 0.0749 # lower 95% bound, log scale
[1] 0.161
exp(0.3078 + qnorm(0.05 / 2) * 0.0749) # lower 95% CI of the RR
[1] 1.17
0.3078 + qnorm(1 - 0.05 / 2) * 0.0749 # upper 95% bound, log scale
[1] 0.455
exp(0.3078 + qnorm(1 - 0.05 / 2) * 0.0749) # upper 95% CI of the RR
[1] 1.58
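The same intervals can also be computed for all coefficients at once by pulling the robust standard errors straight from the sandwich variance matrix (a small convenience sketch, using only the packages already loaded above):

## robust standard errors from the sandwich variance matrix, for all coefficients
est <- coef(pois.reg)
se <- sqrt(diag(sandwich(pois.reg)))
round(exp(cbind(RR = est, lower = est - qnorm(0.975) * se,
                upper = est + qnorm(0.975) * se)), 3)
## should reproduce the esticon() intervals obtained above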

So no excuse to use odds ratios when risk ratios are more appropriate!

Addition: Plotting the Risk Ratios

zou.mod2 <- geeglm(culled ~ lact + johnes, family = poisson(link = "log"),
                   id = id, corstr = "exchangeable", data = culling)
zou.mod2.coefci <- esticon(zou.mod2, diag(3))
zou.mod2.expci <- exp(cbind(zou.mod2.coefci$Estimate, zou.mod2.coefci$Lower,
                           zou.mod2.coefci$Upper))
rownames(zou.mod2.expci) <- names(coef(zou.mod2))
colnames(zou.mod2.expci) <- c("RR", "LowerCI", "UpperCI")
zou.df <- as.data.frame(zou.mod2.expci)
zou.df$var <- row.names(zou.df)
library(ggplot2)
ggplot(zou.df[2:3, ], aes(y = RR, x = reorder(var, RR))) +
  geom_point() +
  geom_pointrange(aes(ymin = LowerCI, ymax = UpperCI)) + 
  scale_x_discrete(labels = c("Lactation", "Johnes")) + 
  scale_y_log10(breaks = seq(1, 2, by = 0.1)) +
  xlab("") + 
  geom_hline(yintercept = 1, linetype = "dotdash", colour = "blue") +
  coord_flip()

Thanks to Tal Galili for suggesting this addition.


Veterinary Epidemiologic Research: Count and Rate Data – Zero Counts

Continuing with the examples from the book Veterinary Epidemiologic Research, we look today at modelling counts when the number of zeros is higher or lower than expected from a Poisson or negative binomial distribution. When there is an excess of zero counts, you can fit either a zero-inflated model or a hurdle model. If zero counts are not possible, a zero-truncated model can be used.

Zero-inflated models

Zero-inflated models manage an excess of zero counts by simultaneously fitting a binary model and a Poisson (or negative binomial) model. In R, you can fit zero-inflated models (and hurdle models) with the library pscl. We use the fec dataset, which gives the fecal egg counts of gastrointestinal nematodes from 313 cows in 38 dairy herds, where half of the observations are zero counts. The predictors in the model are lactation (primiparous vs. multiparous), access to pasture, manure spread on heifer pasture, and manure spread on cow pasture.

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
temp)
load(unz(temp, "ver2_data_R/fec.rdata"))
unlink(temp)

library(pscl)
mod3 <- zeroinfl(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod3)
 
Call:
zeroinfl(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec, 
    dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.5055 -0.4537 -0.3624 -0.1426 27.3064 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.36949    0.13117  18.064  < 2e-16 ***
lact        -1.15847    0.10664 -10.864  < 2e-16 ***
past_lact    0.53666    0.14233   3.771 0.000163 ***
man_heif    -0.99849    0.14216  -7.024 2.16e-12 ***
man_lact     1.07858    0.15789   6.831 8.43e-12 ***
Log(theta)  -1.35981    0.05425 -25.065  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9303     0.2932  -3.173  0.00151 ** 
lact          0.8701     0.3083   2.822  0.00477 ** 
past_lact    -1.8003     0.3989  -4.513  6.4e-06 ***
man_heif     -0.7702     0.4669  -1.650  0.09903 .  
man_lact    -12.2380   167.9185  -0.073  0.94190    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 0.2567 
Number of iterations in BFGS optimization: 47 
Log-likelihood: -5239 on 11 Df

We can assess whether the zero-inflated model fits the data better than a Poisson or negative binomial model with a Vuong test. A test statistic greater than 1.96 indicates superiority of model 1 over model 2, and a value below -1.96 indicates superiority of model 2; if the value lies between -1.96 and 1.96, neither model is preferred.

### fit same model with negative binomial
library(MASS)
mod3.nb <- glm.nb(fec ~ lact + past_lact + man_heif + man_lact, data = fec)
### Vuong test
vuong(mod3, mod3.nb)
Vuong Non-Nested Hypothesis Test-Statistic: 3.308663 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
in this case:
model1 > model2, with p-value 0.0004687128

### alpha
1/mod3$theta
[1] 3.895448

The Vuong statistic is 3.3 (p < 0.001), indicating that the first model, the zero-inflated one, is superior to the regular negative binomial. Note that R does not estimate \alpha but its inverse, \theta ; here \alpha is 3.9, suggesting a negative binomial is preferable to a Poisson model.
The parameter modelled in the binary part is the probability of a zero count: having lactating cows on pasture reduced the probability of a zero count (\beta = -1.8) and increased the expected count when it was non-zero (\beta = 0.54).
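If you want to see that probability on the data scale, the predict() method for zeroinfl objects can return the fitted probability of a structural zero for each observation (type = "zero"); a quick sketch:

## fitted probability of a structural zero (the zero-inflation component)
p.zero <- predict(mod3, type = "zero")
summary(p.zero)
## average fitted zero probability with and without lactating cows on pasture
tapply(p.zero, fec$past_lact, mean)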

Hurdle models

A hurdle model also has two components, but it is based on the assumption that all zero counts arise from a single process and that non-zero counts are determined by a different process. The odds of a non-zero count vs. a zero count are modelled by a binomial model, while the distribution of the non-zero counts is modelled by a zero-truncated model. We refit the fec dataset above with a negative binomial hurdle model:

mod4 <- hurdle(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod4)

Call:
hurdle(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec, 
    dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.4598 -0.3914 -0.3130 -0.1479 23.6774 

Count model coefficients (truncated negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1790     0.4801   2.456  0.01407 *  
lact         -1.1349     0.1386  -8.187 2.69e-16 ***
past_lact     0.5813     0.1782   3.261  0.00111 ** 
man_heif     -0.9854     0.1832  -5.379 7.50e-08 ***
man_lact      1.0139     0.1998   5.075 3.87e-07 ***
Log(theta)   -2.9111     0.5239  -5.556 2.76e-08 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.13078    0.10434  -1.253  0.21006    
lact        -0.84485    0.09728  -8.684  < 2e-16 ***
past_lact    0.84113    0.11326   7.427 1.11e-13 ***
man_heif    -0.35576    0.13582  -2.619  0.00881 ** 
man_lact     0.85947    0.15337   5.604 2.10e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 0.0544
Number of iterations in BFGS optimization: 25 
Log-likelihood: -5211 on 11 Df

We can compare zero-inflated and hurdle models by their log-likelihood. The hurdle model fits better:

logLik(mod4)
'log Lik.' -5211.18 (df=11)
logLik(mod3)
'log Lik.' -5238.622 (df=11)
### Null model for model 3:
mod3.null <- update(mod3, . ~ 1)
logLik(mod3.null)
'log Lik.' -5428.732 (df=3)
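Both models use the same number of parameters (11 df), so the log-likelihoods can be compared directly; an AIC comparison (the generic AIC() works here since both objects carry logLik methods) gives the same ranking:

## AIC = -2 * logLik + 2 * df; from the log-likelihoods above,
## the hurdle model (mod4) comes out with the lower AIC
AIC(mod3, mod4)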

Zero-truncated model

Sometimes zero counts are simply not possible. In a zero-truncated model, the probability of a zero count is computed from a Poisson (or negative binomial) distribution and subtracted from 1; the probabilities of the non-zero counts are then rescaled by this difference so they total 1. We use the daisy2 dataset and look at the number of services required for conception (which cannot be zero…), with parity, days from calving to first service, and presence/absence of vaginal discharge as predictors.
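Before turning to the data, here is a small numerical illustration of that rescaling for a Poisson distribution (the mean of 1.5 is arbitrary and only for illustration):

## zero-truncated Poisson: P(Y = y | Y > 0) = dpois(y, lambda) / (1 - dpois(0, lambda))
lambda <- 1.5                 # arbitrary mean, for illustration only
y <- 1:10
p.trunc <- dpois(y, lambda) / (1 - dpois(0, lambda))
round(p.trunc, 3)
sum(dpois(y, lambda))         # untruncated probabilities of 1 to 10: less than 1
sum(p.trunc)                  # close to 1 after rescaling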

temp <- tempfile()
download.file(
  "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/daisy2.rdata"))
unlink(temp)

library(VGAM)
mod5 <- vglm(spc ~ parity + cf + vag_disch, family = posnegbinomial(), data = daisy2)
summary(mod5)
 
Call:
vglm(formula = spc ~ parity + cf + vag_disch, family = posnegbinomial(), 
    data = daisy2)

Pearson Residuals:
               Min       1Q    Median      3Q    Max
log(munb)  -1.1055 -0.90954  0.071527 0.82039 3.5591
log(size) -17.4891 -0.39674 -0.260103 0.82155 1.4480

Coefficients:
                Estimate Std. Error z value
(Intercept):1  0.1243178 0.08161432  1.5232
(Intercept):2 -0.4348170 0.10003096 -4.3468
parity         0.0497743 0.01213893  4.1004
cf            -0.0040649 0.00068602 -5.9254
vag_disch      0.4704433 0.10888570  4.3205

Number of linear predictors:  2 

Names of linear predictors: log(munb), log(size)

Dispersion Parameter for posnegbinomial family:   1

Log-likelihood: -10217.68 on 13731 degrees of freedom

Number of iterations: 5 

As cows get older, the number of services required increases, and the longer the first service was delayed, the fewer services were required.
The first intercept is the usual intercept. The second intercept is the log of the negative binomial size parameter, the inverse of the overdispersion parameter \alpha .
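Exponentiating the count-part coefficients gives ratios on the expected-number-of-services scale (a quick sketch; coef() on the vglm object also returns the two intercepts, which are dropped here):

## ratio of the expected number of services per unit change of each predictor
round(exp(coef(mod5)[c("parity", "cf", "vag_disch")]), 3)
##    parity        cf vag_disch 
##     1.051     0.996     1.601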

Veterinary Epidemiologic Research: Count and Rate Data – Poisson & Negative Binomial Regressions

Still going through the book Veterinary Epidemiologic Research and today it’s chapter 18, modelling count and rate data. I’ll have a look at Poisson and negative binomial regressions in R.
We use count regression when the outcome we are measuring is a count of the number of times an event occurs in an individual or in a group of individuals. We will use a dataset holding information on outbreaks of tuberculosis in dairy and beef cattle, cervids and bison in Canada between 1985 and 1994.

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
temp)
load(unz(temp, "ver2_data_R/tb_real.rdata"))
unlink(temp)

library(Hmisc)
tb_real<- upData(tb_real, labels = c(farm_id = 'Farm identification',
                            type = 'Type of animal',
                            sex = 'Sex', age = 'Age category',
                            reactors = 'Number of pos/reactors in the group',
                            par = 'Animal days at risk in the group'),
                 levels = list(type = list('Dairy' = 1, 'Beef' = 2,
                                 'Cervid' = 3, 'Other' = 4),
                   sex = list('Female' = 0, 'Male' = 1),
                   age = list('0-12 mo' = 0, '12-24 mo' = 1, '>24 mo' = 2)))

A histogram of the count of TB reactors is shown below:

hist(tb_real$reactors, breaks = 0:30 - 0.5)

A Poisson regression assumes that the mean and the variance are equal.
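A quick look at the raw counts (ignoring the offset and the predictors) already hints that this assumption is unlikely to hold here:

## compare the mean and the variance of the raw counts
c(mean = mean(tb_real$reactors), variance = var(tb_real$reactors))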

A Poisson regression model with type of animal, sex and age as predictors, and the log of the animal-days at risk as an offset, is:

mod1 <- glm(reactors ~ type + sex + age + offset(log(par)), family = poisson, data = tb_real)
(mod1.sum <- summary(mod1))

Call:
glm(formula = reactors ~ type + sex + age + offset(log(par)), 
    family = poisson, data = tb_real)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.5386  -0.8607  -0.3364  -0.0429   8.7903  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -11.6899     0.7398 -15.802  < 2e-16 ***
typeBeef      0.4422     0.2364   1.871 0.061410 .  
typeCervid    1.0662     0.2334   4.569 4.91e-06 ***
typeOther     0.4384     0.6149   0.713 0.475898    
sexMale      -0.3619     0.1954  -1.852 0.064020 .  
age12-24 mo   2.6734     0.7217   3.704 0.000212 ***
age>24 mo     2.6012     0.7136   3.645 0.000267 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 409.03  on 133  degrees of freedom
Residual deviance: 348.35  on 127  degrees of freedom
AIC: 491.32

Number of Fisher Scoring iterations: 8

### here's another way of writing the offset in the formula:
mod1b <- glm(reactors ~ type + sex + age, offset = log(par), family = poisson, data = tb_real)

As a quick check for overdispersion: the residual deviance should be roughly equal to the residual degrees of freedom if the Poisson errors assumption is met. Here we have 348.4 on 127 degrees of freedom, so there is overdispersion, due to the clustering of animals within herds, which was not taken into account.
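A rough dispersion statistic, the Pearson chi-square divided by the residual degrees of freedom, makes the same point (values well above 1 indicate overdispersion):

## Pearson-based dispersion statistic
sum(residuals(mod1, type = "pearson")^2) / df.residual(mod1)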

The incidence rate and confidence interval can be obtained with:

cbind(exp(coef(mod1)), exp(confint(mod1)))
Waiting for profiling to be done...
                                2.5 %       97.5 %
(Intercept) 8.378225e-06 1.337987e-06 2.827146e-05
typeBeef    1.556151e+00 9.923864e-01 2.517873e+00
typeCervid  2.904441e+00 1.866320e+00 4.677376e+00
typeOther   1.550202e+00 3.670397e-01 4.459753e+00
sexMale     6.963636e-01 4.687998e-01 1.010750e+00
age12-24 mo 1.448939e+01 4.494706e+00 8.867766e+01
age>24 mo   1.347980e+01 4.284489e+00 8.171187e+01

As mentioned in the post on logistic regression, the profile likelihood-based CI is preferable to the Wald CI because of the Hauck-Donner effect (overestimation of the SE) (see also the abstract by H. Stryhn at ISVEE X).
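For comparison, the Wald-based intervals (the ones affected by the Hauck-Donner effect) can be obtained with confint.default():

## Wald confidence intervals, to compare with the profile likelihood ones above
signif(exp(confint.default(mod1)), 3)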

The deviance and Pearson goodness-of-fit test statistics can be done with:

# deviance gof
pchisq(deviance(mod1), df.residual(mod1), lower = FALSE)
[1] 1.630092e-22
#Pearson statistic
mod1.pearson <- residuals(mod1, type = "pearson")
1-pchisq(sum(mod1.pearson^2), mod1$df.residual)
[1] 0

Diagnostics can be obtained as usual (see previous posts) but we can also use the Anscombe residuals:

library(wle)
mod1.ansc <- residualsAnscombe(tb_real$reactors, mod1$fitted.values, family = poisson())

plot(predict(mod1, type = "response"), mod1.ansc)
plot(cooks.distance(mod1), mod1.ansc)

Diagnostic plot – Anscombe residuals vs. predicted counts

Diagnostic plot – Anscombe residuals vs. Cook’s distance

Negative binomial regression models are for count data for which the variance is not constrained to equal the mean. We can refit the above model as a negative binomial model using full maximum likelihood estimation:

library(MASS)
mod2 <- glm.nb(reactors ~ type + sex + age + offset(log(par)), data = tb_real)
(mod2.sum <- summary(mod2))

Call:
glm.nb(formula = reactors ~ type + sex + age + offset(log(par)), 
    data = tb_real, init.theta = 0.5745887328, link = log)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.77667  -0.74441  -0.45509  -0.09632   2.70012  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -11.18145    0.92302 -12.114  < 2e-16 ***
typeBeef      0.60461    0.62282   0.971 0.331665    
typeCervid    0.66572    0.63176   1.054 0.291993    
typeOther     0.80003    0.96988   0.825 0.409442    
sexMale      -0.05748    0.38337  -0.150 0.880819    
age12-24 mo   2.25304    0.77915   2.892 0.003832 ** 
age>24 mo     2.48095    0.75283   3.296 0.000982 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Negative Binomial(0.5746) family taken to be 1)

    Null deviance: 111.33  on 133  degrees of freedom
Residual deviance:  99.36  on 127  degrees of freedom
AIC: 331.47

Number of Fisher Scoring iterations: 1


              Theta:  0.575 
          Std. Err.:  0.143 

 2 x log-likelihood:  -315.472 

### to fit the model with a fixed value of the dispersion parameter theta (2 for example):
mod2b <- glm(reactors ~ type + sex + age + offset(log(par)),
             family = negative.binomial(2), data = tb_real)
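A likelihood ratio test between the Poisson and the negative binomial fits gives a formal check of the overdispersion (a sketch; note that theta is tested on the boundary of its parameter space, so the p-value is conservative):

## likelihood ratio test of Poisson (mod1) vs negative binomial (mod2)
lr <- as.numeric(2 * (logLik(mod2) - logLik(mod1)))
pchisq(lr, df = 1, lower.tail = FALSE)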