# Veterinary Epidemiologic Research: Modelling Survival Data – Non-Parametric Analyses

Next topic from Veterinary Epidemiologic Research: chapter 19, modelling survival data. We start with non-parametric analyses where we make no assumptions about either the distribution of survival times or the functional form of the relationship between a predictor and survival. There are 3 non-parametric methods to describe time-to-event data: actuarial life tables, Kaplan-Meier method, and Nelson-Aalen method.
We use data on occurrence of calf pneumonia in calves raised in 2 different housing systems. Calves surviving to 150 days without pneumonia are considered censored at that time.

temp <- tempfile()
download.file(
  "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/calf_pneu.rdata"))
unlink(temp)

library(Hmisc)
calf_pneu <- upData(calf_pneu, labels = c(calf = 'Calf id',
stock = 'Stocking method',
days = 'Time to onset of pneumonia or censoring',
pn = 'Pneumonia'),
levels = list(stock = list('batch' = 0, 'continuous' = 1)))


Actuarial Life Table

To create a life table, we use the function lifetab from package KMsurv, after counting the events and censored observations at each time point and grouping them by time interval (with gsummary from package nlme).

library(KMsurv)
interval <- seq(from = 30, to = 165, by = 15)
interval <- floor(calf_pneu$days/15)
interval.censor <- data.frame(interval, calf_pneu$pn)
library(nlme)
pneumonia <- gsummary(interval.censor, sum, groups = interval)
total <- gsummary(interval.censor, length, groups = interval)
lt.data <- cbind(pneumonia[ , 1:2], total[ , 2])
length <- length(lt.data$interval)
lt.data[length + 1, ]$interval <- NA
nevent <- lt.data[ , 2]
nlost <- lt.data[ , 3] - lt.data[ , 2]
(life.table <- lifetab(lt.data$interval, 24, nlost, nevent))

      nsubs nlost nrisk nevent      surv        pdf     hazard    se.surv
1-3      24     0  24.0      1 1.0000000 0.02083333 0.02127660 0.00000000
3-4      23     0  23.0      1 0.9583333 0.04166667 0.04444444 0.04078938
4-5      22     0  22.0      1 0.9166667 0.04166667 0.04651163 0.05641693
5-6      21     0  21.0      3 0.8750000 0.12500000 0.15384615 0.06750772
6-7      18     1  17.5      2 0.7500000 0.08571429 0.12121212 0.08838835
7-8      15     6  12.0      3 0.6642857 0.16607143 0.28571429 0.09686316
8-10      6     0   6.0      1 0.4982143 0.04151786 0.09090909 0.11032937
10-NA     5     5   2.5      0 0.4151786         NA         NA 0.11915934
NA-3      0    NA    NA     NA 0.4151786         NA         NA 0.11915934
          se.pdf  se.hazard
1-3   0.02039469 0.02127178
3-4   0.04078938 0.04443347
4-5   0.04078938 0.04649905
5-6   0.06750772 0.08855994
6-7   0.05792828 0.08555236
7-8   0.08649471 0.16326531
8-10  0.03899969 0.09053265
10-NA         NA         NA
NA-3          NA         NA


Kaplan-Meier Method

To compute the Kaplan-Meier estimator we use the function survfit from package survival. It takes as argument a Surv object, which gives the time variable and the event of interest. You get the Kaplan-Meier estimate with the summary of the survfit object. We can then plot the estimates to show the Kaplan-Meier survivor function.

library(survival)
km.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu)
summary(km.sf)

Call: survfit(formula = Surv(days, pn == 1) ~ 1, data = calf_pneu)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   27     24       1    0.958  0.0408        0.882        1.000
   49     23       1    0.917  0.0564        0.813        1.000
   72     22       1    0.875  0.0675        0.752        1.000
   79     21       2    0.792  0.0829        0.645        0.972
   89     19       1    0.750  0.0884        0.595        0.945
   90     18       1    0.708  0.0928        0.548        0.916
  101     17       1    0.667  0.0962        0.502        0.885
  113     15       2    0.578  0.1019        0.409        0.816
  117      9       1    0.514  0.1089        0.339        0.778
  123      6       1    0.428  0.1198        0.247        0.741

plot(km.sf, xlab = "time (days)", ylab = "cumulative survival probability",
     conf.int = TRUE)


Nelson-Aalen Method

A "hazard" is the probability of failure at a point in time, given that the calf had survived up to that point in time.
The cumulative hazard, the Nelson-Aalen estimate, can be calculated by transforming the Fleming-Harrington estimate of survival:

fh.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu, type = "fleming")

plot(stepfun(fh.sf$time, c(0, -log(fh.sf$surv))), do.points = FALSE,
     xlab = "time (days)", ylab = "cumulative hazard", main = "",
     ylim = c(0, 1.5))
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$upper))), lty = 5, do.points = FALSE)
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$lower))), lty = 5, do.points = FALSE)


Tests of the Overall Survival Curve

Several tests are available to assess whether the overall survivor functions in 2 or more groups are equal. The simplest is the log-rank test, which assigns equal weight to each time point and is equivalent to a standard Mantel-Haenszel test. There is also the Peto-Peto-Prentice test, which weights the stratum-specific estimates by the overall survival experience and so reduces the influence of different censoring patterns between the groups. To run these tests, we apply the survdiff function to the Surv object. The argument rho gives the weights according to $S(t)^\rho$ and may be any numeric value. The default, rho = 0, gives the log-rank test; rho = 1 gives the Peto & Peto modification of the Gehan-Wilcoxon test. A rho larger than zero gives greater weight to the first part of the survival curves, a rho smaller than zero to the later part.
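Before running the tests, the estimators themselves are easy to check by hand. A minimal sketch reproducing the first few Kaplan-Meier and Nelson-Aalen values from the risk sets (the numbers are taken from the summary(km.sf) output above):

```r
## risk sets at the first three event times (27, 49 and 72 days),
## taken from the summary(km.sf) output
n.risk  <- c(24, 23, 22)
n.event <- c(1, 1, 1)

## Kaplan-Meier: running product of the conditional survival probabilities
km <- cumprod(1 - n.event / n.risk)
round(km, 3)  # 0.958 0.917 0.875, matching summary(km.sf)

## Nelson-Aalen: running sum of the hazard contributions d_i / n_i
na <- cumsum(n.event / n.risk)
```

This makes the relationship between the two explicit: Kaplan-Meier multiplies the conditional survival probabilities, while Nelson-Aalen accumulates the hazard increments.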
survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 0) # rho is optional

Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu,
    rho = 0)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12        4     6.89      1.21      2.99
stock=continuous 12        8     5.11      1.63      2.99

 Chisq= 3  on 1 degrees of freedom, p= 0.084

survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 1) # rho=1 asks for Peto-Peto test

Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu,
    rho = 1)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12     2.89     5.25      1.06      3.13
stock=continuous 12     6.41     4.05      1.38      3.13

 Chisq= 3.1  on 1 degrees of freedom, p= 0.0766


Finally we can compare the survivor functions with a stock R plot or with ggplot2. With ggplot2, you get the necessary data from the survfit object and create a new data frame from it. The baseline data (time = 0) are not there, so you create them yourself:

(km.stock <- survfit(Surv(days, pn == 1) ~ stock, data = calf_pneu))

Call: survfit(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu)

                 records n.max n.start events median 0.95LCL 0.95UCL
stock=batch           12    12      12      4     NA     123      NA
stock=continuous      12    12      12      8    113      79      NA

plot(km.stock, conf.int = FALSE, col = c("blue4", "darkorange"),
     xlab = "time (days)", ylab = "cumulative survival probability")
legend("bottomleft", inset = .05, c("batch", "continuous"),
       text.col = c("blue4", "darkorange"))

km.df <- data.frame(
  time    = km.stock$time,
  n.risk  = km.stock$n.risk,
  n.event = km.stock$n.event,
  surv    = km.stock$surv,
  strata  = gsub("stock=", "", summary(km.stock, censored = T)$strata),
  upper   = km.stock$upper,
  lower   = km.stock$lower
)
zeros <- data.frame(time = 0, surv = 1,
                    strata = gsub("stock=", "", levels(summary(km.stock)$strata)),
                    upper = 1, lower = 1)
library(plyr)
km.df <- rbind.fill(zeros, km.df)
km.df$strata <- ordered(km.df$strata, levels = c("batch", "continuous"))

library(ggplot2)
ggplot(km.df, aes(time, surv, colour = strata)) +
  geom_step(size = 0.6) +
  xlim(0, 150) + ylim(0, 1) +
  xlab("time (days)") + ylab("cumulative survival probability") +
  labs(colour = "stock")


# Veterinary Epidemiologic Research: Count and Rate Data – Poisson Regression and Risk Ratios

As noted in section 18.4.1 of the book Veterinary Epidemiologic Research, logistic regression is widely used for binary data, with the estimates reported as odds ratios (OR). While the OR is appropriate for case-control studies, risk ratios (RR) are preferred for cohort studies, as the RR provides estimates of probabilities directly. Moreover, the rare-outcome assumption behind using the OR as an approximation to the RR is often forgotten; when the outcome is common, the OR will overestimate the true effect. One approach to get the RR is to fit a generalised linear model (GLM) with a binomial distribution and a log link, but these models may sometimes fail to converge (Zou, 2004). Another option is to use a Poisson regression with no exposure or offset specified (McNutt, 2003). It gives estimates with very little bias but confidence intervals that are too wide. However, using robust standard errors gives correct confidence intervals (Greenland, 2004; Zou, 2004). We use data on culling of dairy cows to demonstrate this.

temp <- tempfile()
download.file(
  "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/culling.rdata"))
unlink(temp)

table(culling$culled)

0   1
255 466
### recoding number of lactation
culling$lact <- with(culling, ifelse(lact_c3 > 1, 2, lact_c3))

The logistic regression:

log.reg <- glm(culled ~ lact, family = binomial("logit"), data = culling)
summary(log.reg)

Call:
glm(formula = culled ~ lact, family = binomial("logit"), data = culling)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.546  -1.199   0.849   0.849   1.156

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.734      0.299   -2.45    0.014 *
lact           0.784      0.171    4.59  4.4e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 4

cbind(exp(coef(log.reg)), exp(confint(log.reg)))
Waiting for profiling to be done...
                 2.5 % 97.5 %
(Intercept) 0.48 0.267  0.864
lact        2.19 1.568  3.065

We could compute the OR and RR by hand:

with(culling, table(culled, lact))
      lact
culled   1   2
     0  97 158
     1 102 364

## OR is 2.19:
(364 / 158) / (102 / 97)
[1] 2.19
## or the ratio of the cross-products:
(364 * 97) / (158 * 102)
[1] 2.19
## or with epicalc:
library(epicalc)
with(culling, cc(culled, lact, graph = FALSE))

       lact
culled    1   2 Total
  0      97 158   255
  1     102 364   466
  Total 199 522   721

OR =  2.19
Exact 95% CI =  1.54, 3.1
Chi-squared = 21.51, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0

## RR is 1.36:
(364 / 522) / (102 / 199)
[1] 1.36

The GLM with binomial distribution and log link:

log.reg2 <- glm(culled ~ lact, family = binomial(link = "log"), data = culling)
summary(log.reg2)

Call:
glm(formula = culled ~ lact, family = binomial(link = "log"), data = culling)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.546  -1.199   0.849   0.849   1.156

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 5

cbind(exp(coef(log.reg2)), exp(confint(log.reg2)))
Waiting for profiling to be done...
                  2.5 % 97.5 %
(Intercept) 0.377  0.28  0.488
lact        1.360  1.18  1.588

The modified Poisson with robust estimator, as described by Zou (2004): a GEE with each individual observation treated as its own cluster.

## create an id for each observation
culling$id <- 1:length(culling$herd)
library(geepack)
zou.mod <- geeglm(culled ~ lact, family = poisson(link = "log"), id = id,
                  corstr = "exchangeable", data = culling)
summary(zou.mod)

Call:
geeglm(formula = culled ~ lact, family = poisson(link = "log"),
    data = culling, id = id, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err Wald Pr(>|W|)
(Intercept)  -0.9762  0.1412 47.8  4.8e-12 ***
lact          0.3078  0.0749 16.9  4.0e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimated Scale Parameters:
            Estimate Std.err
(Intercept)    0.354   0.021

Correlation: Structure = exchangeable  Link = identity
Estimated Correlation Parameters:
      Estimate Std.err
alpha        0       0

Number of clusters:   721   Maximum cluster size: 1

## exponentiated coefficients
exp(coef(zou.mod))
(Intercept)        lact
      0.377       1.360

## getting confidence intervals
library(doBy)
zou.mod.coefci <- esticon(zou.mod, diag(2))
zou.mod.expci <- exp(cbind(zou.mod.coefci$Estimate, zou.mod.coefci$Lower,
                           zou.mod.coefci$Upper))
rownames(zou.mod.expci) <- names(coef(zou.mod))
colnames(zou.mod.expci) <- c("RR", "Lower RR", "Upper RR")
zou.mod.expci
RR Lower RR Upper RR
(Intercept) 0.377    0.286    0.497
lact        1.360    1.175    1.576


Or the same using glm() and then getting robust standard errors:

pois.reg <- glm(culled ~ lact, family = poisson(link = "log"), data = culling)
library(sandwich) # to get robust estimators
library(lmtest) # to test coefficients
coeftest(pois.reg, vcov = sandwich)

z test of coefficients:

Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## RR
exp(coef(pois.reg))
(Intercept)        lact
0.377       1.360
## CI
0.3078 + qnorm(0.05 / 2) * 0.0749 # lower 95% CI
[1] 0.161
exp(0.3078 + qnorm(0.05 / 2) * 0.0749)
[1] 1.17
0.3078 + qnorm(1 - 0.05 / 2) * 0.0749 # upper 95% CI
[1] 0.455
exp(0.3078 + qnorm(1 - 0.05 / 2) * 0.0749)
[1] 1.58


So no excuse to use odds ratios when risk ratios are more appropriate!

As a second example, we add johnes to the model and plot the risk ratios with their confidence intervals:

zou.mod2 <- geeglm(culled ~ lact + johnes, family = poisson(link = "log"),
                   id = id, corstr = "exchangeable", data = culling)
zou.mod2.coefci <- esticon(zou.mod2, diag(3))
zou.mod2.expci <- exp(cbind(zou.mod2.coefci$Estimate, zou.mod2.coefci$Lower,
                            zou.mod2.coefci$Upper))
rownames(zou.mod2.expci) <- names(coef(zou.mod2))
colnames(zou.mod2.expci) <- c("RR", "LowerCI", "UpperCI")
zou.df <- as.data.frame(zou.mod2.expci)
zou.df$var <- row.names(zou.df)
library(ggplot2)
ggplot(zou.df[2:3, ], aes(y = RR, x = reorder(var, RR))) +
geom_point() +
geom_pointrange(aes(ymin = LowerCI, ymax = UpperCI)) +
scale_x_discrete(labels = c("Lactation", "Johnes")) +
scale_y_log10(breaks = seq(1, 2, by = 0.1)) +
xlab("") +
geom_hline(yintercept = 1, linetype = "dotdash", colour = "blue") +
coord_flip()


Thanks to Tal Galili for suggesting this addition.

# Veterinary Epidemiologic Research: Count and Rate Data – Zero Counts

Continuing with the examples from the book Veterinary Epidemiologic Research, we look today at modelling counts when the number of zeros may be higher or lower than expected from a Poisson or negative binomial distribution. When there is an excess of zero counts, you can fit either a zero-inflated model or a hurdle model. If zero counts are not possible, a zero-truncated model can be used.

Zero-inflated models

Zero-inflated models manage an excess of zero counts by simultaneously fitting a binary model and a Poisson (or negative binomial) count model. In R, you can fit zero-inflated models (and hurdle models) with the package pscl. We use the fec dataset, which gives the fecal egg counts of gastro-intestinal nematodes from 313 cows in 38 dairy herds; half of the observations are zero counts. The predictors in the model are lactation (primiparous vs. multiparous), access to pasture, manure spread on heifer pasture, and manure spread on cow pasture.

temp <- tempfile()
download.file(
  "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
  temp)
load(unz(temp, "ver2_data_R/fec.rdata"))
unlink(temp)

library(pscl)
mod3 <- zeroinfl(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod3)

Call:
zeroinfl(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec,
dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-0.5055 -0.4537 -0.3624 -0.1426 27.3064

Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.36949    0.13117  18.064  < 2e-16 ***
lact        -1.15847    0.10664 -10.864  < 2e-16 ***
past_lact    0.53666    0.14233   3.771 0.000163 ***
man_heif    -0.99849    0.14216  -7.024 2.16e-12 ***
man_lact     1.07858    0.15789   6.831 8.43e-12 ***
Log(theta)  -1.35981    0.05425 -25.065  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9303     0.2932  -3.173  0.00151 **
lact          0.8701     0.3083   2.822  0.00477 **
past_lact    -1.8003     0.3989  -4.513  6.4e-06 ***
man_heif     -0.7702     0.4669  -1.650  0.09903 .
man_lact    -12.2380   167.9185  -0.073  0.94190
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta = 0.2567
Number of iterations in BFGS optimization: 47
Log-likelihood: -5239 on 11 Df


We can assess whether the zero-inflated model fits the data better than a plain Poisson or negative binomial model with a Vuong test. A test statistic greater than 1.96 indicates superiority of model 1 over model 2, a value below -1.96 favours model 2, and a value between -1.96 and 1.96 means neither model is preferred.
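The statistic itself is simple: it is the standardized mean of the pointwise differences in log-likelihood between the two models. A minimal sketch with made-up values (the m below are hypothetical, not taken from the fitted models):

```r
## Vuong statistic: m_i = log f1(y_i) - log f2(y_i) for each observation;
## the statistic is sqrt(n) * mean(m) / sd(m), compared against N(0, 1)
set.seed(42)
m <- rnorm(313, mean = 0.05, sd = 0.3)  # hypothetical pointwise differences
v <- sqrt(length(m)) * mean(m) / sd(m)
```

A positive v favours model 1, a negative v favours model 2.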

### fit same model with negative binomial
library(MASS)
mod3.nb <- glm.nb(fec ~ lact + past_lact + man_heif + man_lact, data = fec)
### Vuong test
vuong(mod3, mod3.nb)
Vuong Non-Nested Hypothesis Test-Statistic: 3.308663
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model1 > model2, with p-value 0.0004687128

### alpha
1/mod3$theta
[1] 3.895448


The Vuong statistic is 3.3 (p < 0.001), indicating the first model, the zero-inflated one, is superior to the regular negative binomial. Note that R does not estimate $\alpha$ but its inverse, $\theta$. Here $\alpha$ is 3.9, suggesting a negative binomial model is preferable to a Poisson model.
The parameter modelled in the binary part is the probability of a zero count. For example, having lactating cows on pasture reduced the probability of a zero count ($\beta$ = -1.8) and increased the expected count when it was non-zero ($\beta$ = 0.54).
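Under a zero-inflated model, the overall probability of a zero is a mixture of structural zeros and sampling zeros from the count distribution: $P(Y = 0) = \pi + (1 - \pi) P_{NB}(0)$. A small sketch of this arithmetic (pi.zero and mu are made-up illustrative values; theta is the one reported by zeroinfl above):

```r
pi.zero <- 0.3     # hypothetical probability of a structural zero
mu      <- 10      # hypothetical negative binomial mean
theta   <- 0.2567  # theta from the zeroinfl() output above
p0.nb   <- dnbinom(0, size = theta, mu = mu)  # sampling zeros from the NB part
p.zero  <- pi.zero + (1 - pi.zero) * p0.nb    # overall P(Y = 0)
```

The overall zero probability is always at least pi.zero, since the count part contributes extra zeros on top of the structural ones.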

Hurdle models

A hurdle model also has 2 components, but it is based on the assumption that zero counts arise from a single process and non-zero counts are determined by a different process. The odds of a non-zero vs. a zero count are modelled by a binomial model, while the distribution of the non-zero counts is modelled by a zero-truncated model. We refit the fec data above with a negative binomial hurdle model:

mod4 <- hurdle(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod4)

Call:
hurdle(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec,
dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-0.4598 -0.3914 -0.3130 -0.1479 23.6774

Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.1790     0.4801   2.456  0.01407 *
lact         -1.1349     0.1386  -8.187 2.69e-16 ***
past_lact     0.5813     0.1782   3.261  0.00111 **
man_heif     -0.9854     0.1832  -5.379 7.50e-08 ***
man_lact      1.0139     0.1998   5.075 3.87e-07 ***
Log(theta)   -2.9111     0.5239  -5.556 2.76e-08 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.13078    0.10434  -1.253  0.21006
lact        -0.84485    0.09728  -8.684  < 2e-16 ***
past_lact    0.84113    0.11326   7.427 1.11e-13 ***
man_heif    -0.35576    0.13582  -2.619  0.00881 **
man_lact     0.85947    0.15337   5.604 2.10e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 0.0544
Number of iterations in BFGS optimization: 25
Log-likelihood: -5211 on 11 Df


We can compare zero-inflated and hurdle models by their log-likelihood. The hurdle model fits better:

logLik(mod4)
'log Lik.' -5211.18 (df=11)
logLik(mod3)
'log Lik.' -5238.622 (df=11)
### Null model for model 3:
mod3.null <- update(mod3, . ~ 1)
logLik(mod3.null)
'log Lik.' -5428.732 (df=3)
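Both models estimate the same number of parameters (11 df), so comparing log-likelihoods here is equivalent to comparing AICs, AIC = -2 logLik + 2 df. A quick check with the values reported above:

```r
## AIC by hand from the reported log-likelihoods (both models have 11 df)
aic <- function(loglik, df) -2 * loglik + 2 * df
aic(-5211.18, 11)  # hurdle model
aic(-5238.62, 11)  # zero-inflated model: larger AIC, so worse fit
```

With equal degrees of freedom, the model with the higher log-likelihood (the hurdle model) necessarily has the lower AIC.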


Zero-truncated model

Sometimes zero counts are not possible. In a zero-truncated model, the probability of a zero count is computed from a Poisson (or negative binomial) distribution and subtracted from 1; the remaining probabilities are rescaled by this difference so they sum to 1. We use the daisy2 dataset and look at the number of services required for conception (which cannot be zero…) with the predictors parity, days from calving to first service, and presence/absence of vaginal discharge.
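The rescaling can be sketched directly: each Poisson probability for y > 0 is divided by 1 - P(Y = 0), so the truncated probabilities again sum to 1 (lambda here is an arbitrary illustrative mean, not an estimate from the data):

```r
lambda  <- 1.8                      # hypothetical Poisson mean
y       <- 1:50
## zero-truncated Poisson: P(Y = y | Y > 0) = dpois(y) / (1 - dpois(0))
p.trunc <- dpois(y, lambda) / (1 - dpois(0, lambda))
sum(p.trunc)  # essentially 1 (the mass beyond y = 50 is negligible)
```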

temp <- tempfile()
download.file(
  "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/daisy2.rdata"))
unlink(temp)

library(VGAM)
mod5 <- vglm(spc ~ parity + cf + vag_disch, family = posnegbinomial(), data = daisy2)
summary(mod5)

Call:
vglm(formula = spc ~ parity + cf + vag_disch, family = posnegbinomial(),
data = daisy2)

Pearson Residuals:
Min       1Q    Median      3Q    Max
log(munb)  -1.1055 -0.90954  0.071527 0.82039 3.5591
log(size) -17.4891 -0.39674 -0.260103 0.82155 1.4480

Coefficients:
Estimate Std. Error z value
(Intercept):1  0.1243178 0.08161432  1.5232
(Intercept):2 -0.4348170 0.10003096 -4.3468
parity         0.0497743 0.01213893  4.1004
cf            -0.0040649 0.00068602 -5.9254
vag_disch      0.4704433 0.10888570  4.3205

Number of linear predictors:  2

Names of linear predictors: log(munb), log(size)

Dispersion Parameter for posnegbinomial family:   1

Log-likelihood: -10217.68 on 13731 degrees of freedom

Number of iterations: 5


As cows get older, the number of services required increases, and the longer the first service was delayed, the fewer services were required.
The first intercept is the usual intercept. The second intercept is the log of the size parameter $k$, the inverse of the overdispersion parameter $\alpha$ (see the linear predictors log(munb) and log(size) in the output above).
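Since the count part is modelled on the log scale, exponentiating the coefficients gives multiplicative effects on the expected number of services. A quick sketch using the estimates reported above:

```r
## rate ratios from the vglm() estimates above
coefs <- c(parity = 0.0497743, cf = -0.0040649, vag_disch = 0.4704433)
round(exp(coefs), 3)
## parity    1.051: each additional parity multiplies the expected count by ~1.05
## cf        0.996: each extra day from calving to first service lowers it slightly
## vag_disch 1.601: vaginal discharge increases the expected count by ~60%
```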