# Factor Analysis in Epidemiology

This is a short (15′!) presentation given for the Canadian Veterinary Epidemiology Club (CVEC) on February 26, 2016, about new epi tools.

## What is Factor Analysis?

Factor analysis is a group of statistical methods used to identify the structure of data with the help of latent (not observed) variables. These latent variables, called factors, are identified by looking at clusters of correlated variables (the correlation between 2 variables proceeds from the similarity of their relation with the latent variables).

In brief, we try to explain the correlation (covariance) between a large set of observed variables with a smaller set of unobserved latent variables, a.k.a. the factors. A factor is therefore a cluster of related observed variables, representing an underlying dimension of a construct that is as distinct as possible from the other factors.

There are two main types of factor analysis: exploratory (EFA) and confirmatory (CFA).

### Exploratory Factor Analysis (EFA)

It describes a group of variables by a linear combination of the underlying common factors.

Each observed variable is potentially a measure of every factor, and the goal is to determine the relationships between observed variables and factors that are the strongest. We don’t know a priori how many factors are needed to explain the relation among variables.

### Confirmatory Factor Analysis (CFA)

We try to verify the ability of our theoretical model to explain the common variance between several variables with latent variables that we determined a priori. Here we have prior knowledge about the underlying structure of the construct under investigation. See Structural Equation Modelling (SEM).

## Why would you want to use it?

• When you are more interested in identifying the structure underlying a particular phenomenon than testing a specific hypothesis.
• You want to reduce the number of variables to a few factors explaining a large proportion of the original variables’ variance (data reduction).
• You want to identify groups of highly correlated variables and choose one among them to be used in subsequent analyses.

## What are the 2 main methods?

### Principal Component Analysis (PCA)

“The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set” (Jolliffe 2002).

Objective: data reduction.

### Factor Analysis

In factor analysis the goal is to explain the covariance among variables (the common variance between one variable and at least one other, a factor).

Objective: identifying data structure or causal modelling.
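To make the contrast between the two objectives concrete, here is a minimal sketch in base R. It uses simulated data: the variable names x1 to x6 and the two-factor structure behind them are assumptions made for illustration, not data from the presentation. PCA (prcomp) summarises how much variance a few components retain, while factor analysis (factanal) tries to recover the latent factors driving the correlations.

```r
set.seed(42)
n <- 500
# Two simulated latent factors (assumed structure, for illustration only)
f1 <- rnorm(n)
f2 <- rnorm(n)
dat <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.9 * f1 + rnorm(n, sd = 0.4),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.9 * f2 + rnorm(n, sd = 0.4)
)

# PCA: data reduction, proportion of total variance kept by each component
pca <- prcomp(dat, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 2)

# Exploratory factor analysis: loadings of each variable on 2 common factors
efa <- factanal(dat, factors = 2, rotation = "varimax")
print(efa$loadings, cutoff = 0.3)
```

With this simulated structure, x1 to x3 should load mainly on one factor and x4 to x6 on the other, while the first two principal components carry most of the variance.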

## Assumptions

• Within a set of observed variables there is a set of underlying factors (latent variables) that can explain the relationships among variables. The observed variables are dependent upon these latent variables but are also subject to random variation,
• No outliers,
• From the Pearson correlation coefficient:
  • Large sample size (15-20 observations per variable),
  • Continuous distribution and sufficient variation within the variables,
  • Linear relationship among the correlated variables.

## Examples

### PCA for data reduction

In this study, we started with ~30 variables that were potentially highly correlated. We had first to identify clusters of correlated variables to reduce the likelihood of overfitting the model. PCA was used to discard redundant variables.

We further used Lasso (least absolute shrinkage and selection operator) regression on the clusters identified by PCA.

### Multiple Factor Analysis (MFA)

MFA is dedicated to datasets where variables are structured into groups, studying several sets of variables simultaneously. If you have a set of variables, some may be related to each other, and you might want to study the relationships between the groups of variables, not only between the variables.

In the above study, we wanted to see if we could identify groups of herds according to certain factors. Cow-level data were aggregated to provide herd-level information. These aggregated data formed groups of variables related to management, reproduction, milk production, and health.

Unfortunately, there were no relationships between herds. However, you can continue the analysis and run a PCA on specific groups of variables.

### Factor analysis

Here’s an example, on Q-methodology.

## Conclusion

Factor analyses are additional tools that allow epidemiologists to carry out a data-driven exploration of their dataset, addressing the impact of exposures and their context.

# Bias in Observational Studies – Sensitivity Analysis with R package episensr

When it’s time to interpret the results of your observational study, you have to judge whether the effect measure you obtained is the truth, whether it’s due to bias (systematic error, which affects the effect measure’s validity), or whether it’s due to chance (random error, which affects the effect measure’s precision) (Rothman and Greenland, 2008, pp115-134). Every study has some random error due to its limited sample size, and is susceptible to systematic errors as well, from selection bias to the presence of (un)known confounders or information bias (measurement error, including misclassification).

Bias analysis, or sensitivity analysis, tries to quantify the direction, magnitude, and uncertainty of the bias affecting an estimate of association (Greenland and Lash, 2008, pp345-380; Lash et al., 2009).

Very often bias is evaluated qualitatively without any quantitative assessment of its magnitude. This might be due to the very few tools available to epidemiologists. Hence this R package, episensr, which is built following the book by Lash et al. I will illustrate its use with 3 studies from this book.

First you have to install and load the package.

install.packages("episensr")
# or get it from Github repo with
# library(devtools)
# install_github("dhaine/episensr")

library(episensr)


### Selection Bias

We will use a case-control study by Stang et al. on the relation between mobile phone use and uveal melanoma. The observed odds ratio for the association between regular mobile phone use vs. no mobile phone use with uveal melanoma incidence is 0.71 [95% CI 0.51-0.97]. But there was a substantial difference in participation rates between cases and controls (94% vs 55%, respectively) and so selection bias could have an impact on the association estimate. The 2X2 table for this study is the following:

|          | Regular Use | No Use |
|----------|-------------|--------|
| Cases    | 136         | 107    |
| Controls | 297         | 165    |

We use the function selection as shown below.

selection(matrix(c(136, 107, 297, 165),
dimnames = list(c("UM+", "UM-"), c("Mobile+", "Mobile-")),
nrow = 2, byrow = TRUE),
selprob = c(.94, .85, .64, .25))

Observed Data:
---------------------------------------------------
Outcome   : UM+
Comparing : Mobile+ vs. Mobile-

Mobile+ Mobile-
UM+     136     107
UM-     297     165

Data Corrected for Selected Proportions:
---------------------------------------------------

Mobile+  Mobile-
UM+ 144.6809 125.8824
UM- 464.0625 660.0000

95% conf. interval
Observed Relative Risk: 0.7984    0.6518   0.9780
Observed Odds Ratio: 0.7061    0.5144   0.9693

[,1]
Selection Bias Corrected Relative Risk: 1.4838
Selection Bias Corrected Odds Ratio: 1.6346

[,1]
Selection probability among cases exposed: 0.94
Selection probability among cases unexposed: 0.85
Selection probability among noncases exposed: 0.64
Selection probability among noncases unexposed: 0.25


The 2X2 table is provided as a matrix and selection probabilities given with the argument selprob, a vector with the 4 probabilities (guided by the participation rates in cases and controls) in the following order: among cases exposed, among cases unexposed, among noncases exposed, and among noncases unexposed. The output shows the observed 2X2 table, the same table corrected for the selection proportions, the observed odds ratio (and relative risk) followed by the corrected ones, and the input parameters.
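The arithmetic behind this correction is simple enough to check by hand. The sketch below uses base R only; it is not the package's internal code, and the helper odds_ratio is a name introduced here for illustration. Each observed cell is divided by its selection probability, and the odds ratio is recomputed.

```r
# Observed 2x2 table (cases in row 1, controls in row 2)
obs <- matrix(c(136, 107, 297, 165), nrow = 2, byrow = TRUE,
              dimnames = list(c("UM+", "UM-"), c("Mobile+", "Mobile-")))
# Selection probabilities, laid out in the same cell order as the table
selprob <- matrix(c(.94, .85, .64, .25), nrow = 2, byrow = TRUE)

# Each observed cell is divided by its selection probability
corrected <- obs / selprob

# Hypothetical helper: cross-product ratio of a 2x2 table
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

or_obs  <- odds_ratio(obs)
or_corr <- odds_ratio(corrected)
# Equivalent shortcut: divide the observed OR by the selection-bias factor
bias_factor <- odds_ratio(selprob)

round(c(observed = or_obs, corrected = or_corr,
        shortcut = or_obs / bias_factor), 4)
```

The corrected table and odds ratio agree with the episensr output above (144.68, 125.88, 464.06, 660 and 1.6346).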

### Uncontrolled Confounders

We will use data from a cross-sectional study by Tyndall et al. on the association between male circumcision and the risk of acquiring HIV, which might be confounded by religion. The code to account for unmeasured or unknown confounders is the following, where the 2X2 table is given as a matrix. We choose a risk ratio implementation, provide a vector defining the prevalence of the confounder, religion, among the exposed and the unexposed, and give the risk ratio associating the confounder with the disease.

confounders(matrix(c(105, 85, 527, 93),
dimnames = list(c("HIV+", "HIV-"), c("Circ+", "Circ-")),
nrow = 2, byrow = TRUE),
implement = "RR",
p = c(.8, .05),
RR.cd = .63)

Observed Data:
--------------
Outcome   : HIV+
Comparing : Circ+ vs. Circ-

Circ+ Circ-
HIV+   105    85
HIV-   527    93

Data, Counfounder +:
--------------------

Circ+ Circ-
HIV+  75.1705 2.728
HIV- 430.4295 6.172

Data, Counfounder -:
--------------------

Circ+  Circ-
HIV+ 29.8295 82.272
HIV- 96.5705 86.828

Crude and Unmeasured Confounder Specific Measures of Exposure-Outcome Relationship:
-----------------------------------------------------------------------------------

95% conf. interval
Crude Relative Risk: 0.3479    0.2757    0.439
Relative Risk, Confounder +: 0.4851
Relative Risk, Confounder -: 0.4851

------------------------------------------------------

Standardized Morbidity Ratio     SMRrr: 0.4851    RR adjusted using SMR estimate: 0.7173
Mantel-Haenszel                   MHrr: 0.4851     RR adjusted using MH estimate: 0.7173

Bias Parameters:
----------------

p(Confounder+|Exposure+): 0.8
p(Confounder+|Exposure-): 0.05
RR(Confounder-Outcome): 0.63


The output gives the crude 2X2 table, the 2X2 tables by levels of the confounder, the crude relative risk and confounder specific measures of association between exposure and outcome, and the relationship adjusted for the unknown confounder, using a standardized morbidity ratio (SMR) or a Mantel-Haenszel (MH) estimate of the risk ratio. Finally, the bias parameters are shown.
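The adjusted risk ratio can be reproduced with a few lines of base R. This is a sketch of the standard external-adjustment formula for a binary confounder, not the package's own code; the object names (p1, p0, rr_cd, bias) are chosen here for illustration.

```r
# Observed 2x2 table (HIV+ in row 1, HIV- in row 2)
obs <- matrix(c(105, 85, 527, 93), nrow = 2, byrow = TRUE,
              dimnames = list(c("HIV+", "HIV-"), c("Circ+", "Circ-")))

p1    <- 0.80  # prevalence of the confounder among the exposed
p0    <- 0.05  # prevalence of the confounder among the unexposed
rr_cd <- 0.63  # risk ratio linking the confounder to the disease

# Crude risk ratio: risk among exposed / risk among unexposed
rr_crude <- (obs[1, 1] / sum(obs[, 1])) / (obs[1, 2] / sum(obs[, 2]))

# Bias due to the confounder, then the adjusted risk ratio
bias   <- (rr_cd * p1 + (1 - p1)) / (rr_cd * p0 + (1 - p0))
rr_adj <- rr_crude / bias

round(c(crude = rr_crude, bias = bias, adjusted = rr_adj), 4)
```

The three values correspond to the crude relative risk (0.3479), the "RR adjusted using SMR estimate" line (0.7173, the ratio of crude to adjusted) and the confounder-adjusted relative risk (0.4851) in the output above.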

### Probabilistic Sensitivity Analysis for Exposure Misclassification

We use a study on the effect of smoking during pregnancy on breast cancer risk (Fink and Lash), where we assume nondifferential misclassification of the exposure, smoking, with probability density functions for sensitivities (Se) and specificities (Sp) among cases and noncases equal to uniform distributions with a minimum of 0.7 and a maximum of 0.95 for sensitivities (0.9 and 0.99 respectively for specificities). As usual, the 2X2 table is provided as a matrix. We choose to correct for exposure misclassification with the argument type = "exposure". We ask for 10000 replications (default is 1000). The Se and Sp for cases (seca, spca) are given as a list with its first element referring to the type of distribution (choice between uniform, triangular and trapezoidal) and the second element giving the distribution parameters (min and max for uniform distribution). By not providing information on the noncases (seexp, spexp), we are specifying a nondifferential misclassification.

smoke.nd <- probsens(matrix(c(215, 1449, 668, 4296),
dimnames = list(c("BC+", "BC-"), c("Smoke+", "Smoke-")),
nrow = 2, byrow = TRUE),
type = "exposure",
reps = 10000,
seca.parms = list("uniform", c(.7, .95)),
spca.parms = list("uniform", c(.9, .99)))

Observed Data:
--------------
Outcome   : BC+
Comparing : Smoke+ vs. Smoke-

Smoke+ Smoke-
BC+    215   1449
BC-    668   4296

Observed Measures of Exposure-Outcome Relationship:
-----------------------------------------------------------------------------------

95% conf. interval
Observed Relative Risk: 0.9654    0.8524   1.0934
Observed Odds Ratio: 0.9542    0.8092   1.1252

Median 2.5th percentile
Relative Risk -- systematic error: 0.9432           0.8816
Odds Ratio -- systematic error: 0.9254           0.8477
Relative Risk -- systematic and random error: 0.9372           0.8181
Odds Ratio -- systematic and random error: 0.9178           0.7671
97.5th percentile
Relative Risk -- systematic error:            0.9612
Odds Ratio -- systematic error:            0.9488
Relative Risk -- systematic and random error:            1.0662
Odds Ratio -- systematic and random error:            1.0884

Bias Parameters:
----------------

Se|Cases: uniform ( 0.7 0.95 )
Sp|Cases: uniform ( 0.9 0.99 )
Se|No-cases: ( )
Sp|No-cases: ( )


The output gives the 2X2 table, the observed measures of association, the corrected measures of association, and the input bias parameters.
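To see what happens in each replication, here is a simplified sketch in base R of the systematic-error part only: draw Se and Sp from their uniform priors, back-calculate the 2X2 table, and summarise the corrected odds ratios. It is an illustration under the nondifferential assumption, not the package's internal code; correct_cells is a hypothetical helper and the seed is arbitrary, so percentiles will differ slightly from the output above.

```r
set.seed(123)
reps <- 10000
se <- runif(reps, 0.70, 0.95)  # sensitivity, same for cases and noncases
sp <- runif(reps, 0.90, 0.99)  # specificity, same for cases and noncases

# Observed cells: exposed/unexposed cases (a0, b0) and noncases (c0, d0)
a0 <- 215; b0 <- 1449; c0 <- 668; d0 <- 4296

# Hypothetical helper: back-calculate the table for given Se and Sp
correct_cells <- function(se, sp) {
  A1 <- (a0 - (1 - sp) * (a0 + b0)) / (se - (1 - sp))
  C1 <- (c0 - (1 - sp) * (c0 + d0)) / (se - (1 - sp))
  list(A1 = A1, B1 = (a0 + b0) - A1, C1 = C1, D1 = (c0 + d0) - C1)
}

cells <- correct_cells(se, sp)
# Discard draws leading to impossible (non-positive) cell counts
valid <- with(cells, A1 > 0 & B1 > 0 & C1 > 0 & D1 > 0)
or_sim <- with(cells, (A1 * D1) / (B1 * C1))

quantile(or_sim[valid], probs = c(0.5, 0.025, 0.975))
```

The full analysis additionally incorporates random error in the "systematic and random error" rows.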

We saved the probsens analysis in a new variable, smoke.nd. We can see the elements of this object with the function str():

str(smoke.nd)

List of 4
 $ obs.data    : num [1:2, 1:2] 215 668 1449 4296
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "BC+" "BC-"
  .. ..$ : chr [1:2] "Smoke+" "Smoke-"
 $ obs.measures: num [1:2, 1:3] 0.965 0.954 0.852 0.809 1.093 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] " Observed Relative Risk:" " Observed Odds Ratio:"
  .. ..$ : chr [1:3] "     " "95% conf." "interval"
 $ adj.measures: num [1:4, 1:3] 0.943 0.925 0.937 0.918 0.882 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "           Relative Risk -- systematic error:" "              Odds Ratio -- systematic error:" "Relative Risk -- systematic and random error:" "   Odds Ratio -- systematic and random error:"
  .. ..$ : chr [1:3] "Median" "2.5th percentile" "97.5th percentile"
 $ sim.df      :'data.frame':	10000 obs. of  12 variables:
  ..$ seca   : num [1:10000] 0.878 0.75 0.928 0.796 0.841 ...
  ..$ seexp  : num [1:10000] 0.878 0.75 0.928 0.796 0.841 ...
  ..$ spca   : num [1:10000] 0.958 0.948 0.971 0.906 0.969 ...
  ..$ spexp  : num [1:10000] 0.958 0.948 0.971 0.906 0.969 ...
  ..$ A1     : num [1:10000] 173.5 183.1 185.5 82.6 201.6 ...
  ..$ B1     : num [1:10000] 1490 1481 1479 1581 1462 ...
  ..$ C1     : num [1:10000] 550 584 583 284 634 ...
  ..$ D1     : num [1:10000] 4414 4380 4381 4680 4330 ...
  ..$ corr.RR: num [1:10000] 0.951 0.944 0.957 0.891 0.955 ...
  ..$ corr.OR: num [1:10000] 0.935 0.927 0.943 0.86 0.941 ...
  ..$ tot.RR : num [1:10000] 0.917 0.855 0.895 0.882 0.883 ...
  ..$ tot.OR : num [1:10000] 0.892 0.813 0.863 0.848 0.848 ...


smoke.nd is a list of 4 elements storing different pieces of information about the analysis: smoke.nd$obs.data holds the observed 2X2 table, smoke.nd$obs.measures the observed measures of association, smoke.nd$adj.measures the adjusted measures of association, and smoke.nd$sim.df is a data frame with the simulated variables from each replication, such as the Se and Sp, the 4 cells of the adjusted 2X2 table, and the adjusted measures. We can plot the Se prior distribution (not forgetting to discard the draws that led to negative adjustments).

hist(smoke.nd$sim.df[!is.na(smoke.nd$sim.df$corr.RR), ]$seca,
breaks = seq(0.65, 1, 0.01),
col = "lightgreen",
main = NULL,
xlab = "Sensitivity for Cases")


Additional functions can be found in the reference manual and more will be added in the coming months.

# Veterinary Epidemiologic Research: Modelling Survival Data – Parametric and Frailty Models

Last post on modelling survival data from Veterinary Epidemiologic Research: parametric analyses. The Cox proportional hazards model described in the last post makes no assumption about the shape of the baseline hazard, which is an advantage if you have no idea about what that shape might be. With a parametric survival model, the survival time is assumed to follow a known distribution: Weibull, exponential (which is a special case of the Weibull), log-logistic, log-normal, and generalized gamma.

Exponential Model
The exponential model is the simplest, the hazard $h_0(t)$ is constant over time: the rate at which failures are occurring is constant, $h(t) = \lambda$. We use again the pgtrial dataset:

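temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
# assumed file name inside the archive (datasets are stored as ver2_data_R/<name>.rdata)
load(unz(temp, "ver2_data_R/pgtrial.rdata"))
unlink(temp)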

library(Hmisc)
pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id',
tx = 'Treatment', lact = 'Lactation number',
thin = 'Body condition', dar = 'Days at risk',
preg = 'Pregnant or censored'),
levels = list(thin = list('normal' = 0, 'thin' = 1),
preg = list('censored' = 0, 'pregnant' = 1)))
pgtrial$herd <- as.factor(pgtrial$herd)

library(survival)
exp.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + (lact - 1) +
thin, data = pgtrial, dist = "exponential")
summary(exp.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
(lact - 1) + thin, data = pgtrial, dist = "exponential")
Value Std. Error     z         p
herd1     4.3629     0.1827 23.88 4.66e-126
herd2     4.6776     0.1711 27.34 1.41e-164
herd3     4.3253     0.1617 26.75 1.12e-157
tx       -0.2178     0.1255 -1.74  8.26e-02
lact      0.0416     0.0413  1.01  3.14e-01
thinthin  0.1574     0.1383  1.14  2.55e-01

Scale fixed at 1

Exponential distribution
Loglik(model)= -1459.9   Loglik(intercept only)= -1465.6
Chisq= 11.42 on 5 degrees of freedom, p= 0.044
Number of Newton-Raphson Iterations: 5
n= 319


R outputs the parameter estimates in the AFT (accelerated failure time) form of the exponential model. If you multiply the estimated coefficients by minus one, you get estimates consistent with the proportional hazards parameterization of the model; the interpretation is then the same as for a Cox model, and the exponentiated coefficients are hazard ratios. So for tx, the estimated hazard ratio is exp(0.2178) = 1.24 (at any given point in time, a treated cow is 1.24 times more likely to conceive than a non-treated cow). The corresponding acceleration factor for an exponential model is the reciprocal of the hazard ratio, exp(-0.2178) = 0.80: treating a cow multiplies the expected time to conception by 0.80. Note that, in an R formula, (lact - 1) removes the intercept rather than centring lact, which is why the output shows one coefficient per herd (herd1 to herd3) and no intercept.
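The conversion between the two parameterizations is a one-liner. A minimal sketch in base R, using the tx coefficient reported by survreg above (the object names are chosen here for illustration):

```r
# AFT coefficient for tx as reported by survreg (exponential model)
aft_tx <- -0.2178

# Proportional hazards scale: reverse the sign, then exponentiate
hr_tx <- exp(-aft_tx)

# Acceleration factor: exponentiate the AFT coefficient directly
af_tx <- exp(aft_tx)

round(c(hazard_ratio = hr_tx, acceleration_factor = af_tx), 2)
```

This sign reversal alone is only valid for the exponential model, where the scale is fixed at 1.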

Weibull Model

In a Weibull model, the hazard function is $h(t) = \lambda p t^{p-1}$ where $p$ and $\lambda$ are > 0. $p$ is the shape parameter and determines the shape of the hazard function. If it’s $> 1$, the hazard increases with time. If $p = 1$, the hazard is constant and the model reduces to an exponential model. If $p < 1$, the hazard decreases over time.

library(car)
pgtrial$parity <- recode(pgtrial$lact, "1 = 1; 2:hi = '2+'")
weib.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + parity +
thin, data = pgtrial, dist = "weibull")
summary(weib.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
parity + thin, data = pgtrial, dist = "weibull")
Value Std. Error       z         p
(Intercept)  4.23053     0.1937 21.8412 9.42e-106
herd2        0.36117     0.1947  1.8548  6.36e-02
herd3       -0.00822     0.1980 -0.0415  9.67e-01
tx          -0.23386     0.1438 -1.6262  1.04e-01
parity2+     0.33819     0.1490  2.2698  2.32e-02
thinthin     0.11222     0.1576  0.7119  4.77e-01
Log(scale)   0.13959     0.0509  2.7407  6.13e-03

Scale= 1.15

Weibull distribution
Loglik(model)= -1453.7   Loglik(intercept only)= -1460.7
Chisq= 14 on 5 degrees of freedom, p= 0.016
Number of Newton-Raphson Iterations: 5
n= 319


The shape parameter $p$ is the reciprocal of what R calls the scale parameter: 1/1.15 = 0.87. As $p < 1$, the estimated hazard decreases over time.
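For a Weibull model fitted with survreg, moving from the AFT coefficients to hazard ratios also involves the scale: the log hazard ratio is minus the AFT coefficient divided by the scale. A short sketch in base R with the values printed above (object names are illustrative):

```r
log_scale <- 0.13959   # Log(scale) from the survreg output
aft_tx    <- -0.23386  # AFT coefficient for tx

scale <- exp(log_scale)  # what R calls the scale
shape <- 1 / scale       # Weibull shape parameter p

# Hazard ratio for tx in the proportional hazards parameterization
hr_tx <- exp(-aft_tx / scale)

round(c(scale = scale, shape = shape, hazard_ratio_tx = hr_tx), 3)
```

With shape equal to 1 (scale fixed at 1) this reduces to the exponential case.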

We can also use a piecewise constant exponential regression model, which allows the baseline hazard to vary between time periods but forces it to remain constant within each period. To run such a model, we need data in a counting process format, with a start and stop time for each interval. However, survreg does not accept data in that format. The trick is to use glm and fit a Poisson model that includes the time intervals. See this post by Stephanie Kovalchik, which explains how to construct the data and the model. The example below uses the same approach, with a time interval of 40 days:

interval.width <- 40
# function to compute time breaks given the exit time = dar
cow.breaks <- function(dar) unique(c(seq(0, dar, by = interval.width),
dar))
# list of each subject's time periods
the.breaks <- lapply(unique(pgtrial$cow), function(id){
cow.breaks(max(pgtrial$dar[pgtrial$cow == id]))
})
# the expanded period of observation:
start <- lapply(the.breaks, function(x) x[-length(x)]) # for left time points
stop <- lapply(the.breaks, function(x) x[-1]) # for right time points
count.per.cow <- sapply(start, length)
index <- tapply(pgtrial$cow, pgtrial$cow, length)
index <- cumsum(index) # index of last observation for each cow
event <- rep(0, sum(count.per.cow))
# note: preg is a factor, so its integer codes are stored here
# (1 = censored, 2 = pregnant), as the listing further down shows;
# as.numeric(pgtrial$preg[index] == 'pregnant') would give a 0/1 indicator
event[cumsum(count.per.cow)] <- pgtrial$preg[index]

# creating the expanded dataset
pw.pgtrial <- data.frame(
cow = rep(pgtrial$cow[index], count.per.cow), dar = rep(pgtrial$dar[index], count.per.cow),
herd = rep(pgtrial$herd[index], count.per.cow), tx = rep(pgtrial$tx[index], count.per.cow),
lact = rep(pgtrial$lact[index], count.per.cow), thin = rep(pgtrial$thin[index], count.per.cow),
start = unlist(start),
stop = unlist(stop),
event = event
)

# create time variable which indicates the period of observation (offset in Poisson model)
pw.pgtrial$time <- pw.pgtrial$stop - pw.pgtrial$start # length of observation
# create a factor for each interval, allowing to have a different rate for each period
pw.pgtrial$interval <- factor(pw.pgtrial$start)
pw.pgtrial[100:110, ]
    cow dar herd tx lact   thin start stop event time interval
100  61 113    1  1    4   thin     0   40     0   40        0
101  61 113    1  1    4   thin    40   80     0   40       40
102  61 113    1  1    4   thin    80  113     1   33       80
103  62 117    1  0    7 normal     0   40     0   40        0
104  62 117    1  0    7 normal    40   80     0   40       40
105  62 117    1  0    7 normal    80  117     2   37       80
106  63 121    1  1    1   thin     0   40     0   40        0
107  63 121    1  1    1   thin    40   80     0   40       40
108  63 121    1  1    1   thin    80  120     0   40       80
109  63 121    1  1    1   thin   120  121     2    1      120
110  64 122    1  1    3 normal     0   40     0   40        0

# Poisson model
pw.model <- glm(event ~ offset(log(time)) + interval + herd + tx + lact +
thin, data = pw.pgtrial, family = "poisson")
summary(pw.model)

Call:
glm(formula = event ~ offset(log(time)) + interval + herd + tx +
lact + thin, family = "poisson", data = pw.pgtrial)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.858  -1.373  -1.227   1.357   3.904

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.602545   0.132436 -27.202  < 2e-16 ***
interval40  -0.112838   0.106807  -1.056  0.29076
interval80  -0.064105   0.125396  -0.511  0.60920
interval120 -0.007682   0.147919  -0.052  0.95858
interval160 -0.005743   0.191778  -0.030  0.97611
interval200 -0.427775   0.309143  -1.384  0.16644
interval240  0.199904   0.297331   0.672  0.50137
interval280  0.737508   0.385648   1.912  0.05583 .
interval320  0.622366   1.006559   0.618  0.53637
herd2       -0.254389   0.114467  -2.222  0.02626 *
herd3        0.026851   0.119416   0.225  0.82209
tx           0.219584   0.084824   2.589  0.00963 **
lact        -0.023528   0.027511  -0.855  0.39241
thinthin    -0.139915   0.093632  -1.494  0.13509
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2155.6  on 798  degrees of freedom
Residual deviance: 2131.1  on 785  degrees of freedom
AIC: 2959.1

Number of Fisher Scoring iterations: 7

Log-logistic Model

loglog.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + lact +
thin, data = pgtrial, dist = "loglogistic")
summary(loglog.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
lact + thin, data = pgtrial, dist = "loglogistic")
Value Std. Error      z        p
(Intercept)  3.9544     0.2531 15.625 4.91e-55
herd2        0.2537     0.2355  1.077 2.81e-01
herd3       -0.1019     0.2437 -0.418 6.76e-01
tx          -0.3869     0.1768 -2.189 2.86e-02
lact         0.0612     0.0550  1.112 2.66e-01
thinthin     0.0400     0.1894  0.211 8.33e-01
Log(scale)  -0.1260     0.0515 -2.447 1.44e-02

Scale= 0.882

Log logistic distribution
Loglik(model)= -1467.2   Loglik(intercept only)= -1472.2
Chisq= 9.85 on 5 degrees of freedom, p= 0.08
Number of Newton-Raphson Iterations: 4
n= 319

Individual Frailty Model

In an individual frailty model, we add variance unique to individuals in order to account for additional variability in the hazard (like negative binomial model vs. Poisson model). For example, let’s fit a Weibull model with gamma individual frailty to the prostaglandin dataset:

library(parfm)
pgtrial$preg.bin <- as.numeric(pgtrial$preg) - 1
indfr.mod <- parfm(Surv(dar, preg.bin) ~ herd + tx + lact + thin,
cluster = "cow", data = pgtrial, dist = "weibull",
frailty = "gamma")
Execution time: 17.872 second(s)
indfr.mod

Frailty distribution: gamma
Baseline hazard distribution: Weibull
Loglikelihood: -1455.679

         ESTIMATE    SE p-val
theta       0.000 0.003
rho         0.867 0.044
lambda      0.024 0.006
herd2      -0.289 0.169 0.088 .
herd3       0.039 0.175 0.824
tx          0.204 0.125 0.103
lact       -0.041 0.041 0.314
thinthin   -0.136 0.138 0.323
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Shared Frailty

Shared frailty is a way to deal with clustered data. We will use the “culling” dataset and fit a shared frailty model with a Weibull distribution and a gamma distributed frailty common to all cows in a herd:

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/culling.rdata"))
unlink(temp)

library(frailtypack)
shfrw.mod <- frailtyPenal(Surv(dar, culled) ~ as.factor(lact_c3) + johnes +
cluster(herd),
hazard = 'Weibull', data = culling, Frailty = TRUE)
shfrw.sum <- cbind(shfrw.mod$coef, sqrt(diag(shfrw.mod$varH)),
shfrw.mod$coef / sqrt(diag(shfrw.mod$varH)),
signif(1 - pchisq((shfrw.mod$coef/sqrt(diag(shfrw.mod$varH)))^2, 1)),
exp(shfrw.mod$coef),
exp(shfrw.mod$coef - abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))),
exp(shfrw.mod$coef + abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))))
row.names(shfrw.sum) <- c("Lactation 2", "Lactation 3+", "Johnes")
colnames(shfrw.sum) <- c("Coef.", "Std. Err.", "z", "p-value", "Hazard Ratio",
"Lower CI", "Upper CI")
shfrw.sum
Coef. Std. Err.        z     p-value Hazard Ratio  Lower CI
Lactation 2  0.2518627 0.1450806 1.736019 8.25605e-02     1.286419 0.9680321
Lactation 3+ 0.7636558 0.1227840 6.219508 4.98717e-10     2.146108 1.6870874
Johnes       0.5914741 0.3045475 1.942141 5.21200e-02     1.806650 0.9945867
Upper CI
Lactation 2  1.709525
Lactation 3+ 2.730017
Johnes       3.281748
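The summary table above is a standard Wald summary, which can be wrapped in a small reusable function. The sketch below uses base R only; wald_summary is a hypothetical helper written for this post (not part of frailtypack), fed with the coefficients and standard errors printed above so that it runs without the dataset.

```r
# Hypothetical helper: Wald z, p-value, hazard ratio and confidence interval
wald_summary <- function(coef, se, level = 0.95) {
  z    <- coef / se
  crit <- qnorm(1 - (1 - level) / 2)
  cbind("Coef."        = coef,
        "Std. Err."    = se,
        "z"            = z,
        "p-value"      = 2 * pnorm(-abs(z)),  # same as 1 - pchisq(z^2, 1)
        "Hazard Ratio" = exp(coef),
        "Lower CI"     = exp(coef - crit * se),
        "Upper CI"     = exp(coef + crit * se))
}

# Coefficients and standard errors copied from the table above
coefs <- c("Lactation 2" = 0.2518627, "Lactation 3+" = 0.7636558,
           "Johnes" = 0.5914741)
ses   <- c(0.1450806, 0.1227840, 0.3045475)

res <- wald_summary(coefs, ses)
round(res, 4)
```

With a fitted model, the same function could be called as wald_summary(shfrw.mod$coef, sqrt(diag(shfrw.mod$varH))).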


That’s it for reproducing the examples from Dohoo’s book, chapter on modelling survival data. Next time I’ll look at mixed models.

# Veterinary Epidemiologic Research: Modelling Survival Data – Semi-Parametric Analyses

Next on modelling survival data from Veterinary Epidemiologic Research: semi-parametric analyses. With non-parametric analyses, we could only evaluate the effect of one or a small number of variables. To evaluate multiple explanatory variables, we analyze data with a proportional hazards model, the Cox regression. The functional form of the baseline hazard is not specified, which makes the Cox model a semi-parametric model.
A Cox proportional hazards model is fit hereafter, on data from a clinical trial of the effect of prostaglandin administration on the start of the breeding period of dairy cows:

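temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
# assumed file name inside the archive (datasets are stored as ver2_data_R/<name>.rdata)
load(unz(temp, "ver2_data_R/pgtrial.rdata"))
unlink(temp)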

library(Hmisc)
pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id',
tx = 'Treatment', lact = 'Lactation number',
thin = 'Body condition', dar = 'Days at risk',
preg = 'Pregnant or censored'),
levels = list(thin = list('normal' = 0, 'thin' = 1),
preg = list('censored' = 0, 'pregnant' = 1)))
pgtrial$herd <- as.factor(pgtrial$herd)

library(survival)
coxph.mod <- coxph(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin,
data = pgtrial, ties = 'breslow')
(coxph.sum <- summary(coxph.mod))
Call:
coxph(formula = Surv(dar, preg == "pregnant") ~ herd + tx + lact +
thin, data = pgtrial, ties = "breslow")

n= 319, number of events= 264

coef exp(coef) se(coef)      z Pr(>|z|)
herd2    -0.28445   0.75243  0.16981 -1.675   0.0939 .
herd3     0.03676   1.03744  0.17426  0.211   0.8329
tx        0.18359   1.20152  0.12543  1.464   0.1433
lact     -0.04283   0.95807  0.04109 -1.042   0.2972
thinthin -0.14557   0.86453  0.13794 -1.055   0.2913
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
herd2       0.7524     1.3290    0.5394     1.050
herd3       1.0374     0.9639    0.7373     1.460
tx          1.2015     0.8323    0.9396     1.536
lact        0.9581     1.0438    0.8839     1.038
thinthin    0.8645     1.1567    0.6597     1.133

Concordance= 0.564  (se = 0.021 )
Rsquare= 0.029   (max possible= 1 )
Likelihood ratio test= 9.5  on 5 df,   p=0.09084
Wald test            = 9.32  on 5 df,   p=0.09685
Score (logrank) test = 9.34  on 5 df,   p=0.09611


R gives several options to handle ties, when several events occur at the same time: the Efron method (default in R), the Breslow method (default in software like SAS or Stata), and the exact method. Breslow is the simplest and is adequate when there are not too many ties in the dataset. Efron’s approximation is closer to the exact method.

Stratified Cox Proportional Hazards Model

In a stratified Cox model, different baseline hazards are assumed across groups of subjects. The Cox model is modified to allow the control of a predictor which does not satisfy the proportional hazards (PH) assumption. We refit the above model by stratifying by herd and including a treatment by herd interaction:

scoxph.mod <- coxph(Surv(dar, preg == 'pregnant') ~ tx + tx*herd + lact + thin +
strata(herd), data = pgtrial, method = 'breslow')
summary(scoxph.mod)
Call:
coxph(formula = Surv(dar, preg == "pregnant") ~ tx + tx * herd +
lact + thin + strata(herd), data = pgtrial, method = "breslow")

n= 319, number of events= 264

coef exp(coef) se(coef)      z Pr(>|z|)
tx       -0.02160   0.97863  0.25528 -0.085   0.9326
herd2          NA        NA  0.00000     NA       NA
herd3          NA        NA  0.00000     NA       NA
lact     -0.04600   0.95504  0.04065 -1.132   0.2578
thinthin -0.13593   0.87291  0.13833 -0.983   0.3258
tx:herd2 -0.05659   0.94498  0.33570 -0.169   0.8661
tx:herd3  0.54494   1.72451  0.31823  1.712   0.0868 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
tx          0.9786     1.0218    0.5934     1.614
herd2           NA         NA        NA        NA
herd3           NA         NA        NA        NA
lact        0.9550     1.0471    0.8819     1.034
thinthin    0.8729     1.1456    0.6656     1.145
tx:herd2    0.9450     1.0582    0.4894     1.825
tx:herd3    1.7245     0.5799    0.9242     3.218

Concordance= 0.56  (se = 0.035 )
Rsquare= 0.032   (max possible= 0.998 )
Likelihood ratio test= 10.32  on 5 df,   p=0.06658
Wald test            = 10.5  on 5 df,   p=0.0623
Score (logrank) test = 10.66  on 5 df,   p=0.05851


Evaluating the Assumption of Proportional Hazards

We can evaluate it graphically, by examining the log-cumulative hazard plot vs. ln(time) and check if the curves are parallel:

coxph.mod2 <- coxph(Surv(dar, preg == 'pregnant') ~ tx, data = pgtrial,
ties = 'breslow')
pgtrial2 <- with(pgtrial, data.frame(tx = c(0, 1)))
tfit.add <- survfit(coxph.mod2, newdata = pgtrial2)
df1 <- data.frame(
time    = tfit.add[1, ]$time, n.risk = tfit.add[1, ]$n.risk,
n.event = tfit.add[1, ]$n.event, surv = tfit.add[1, ]$surv,
strata  = "0",
upper   = tfit.add[1, ]$upper, lower = tfit.add[1, ]$lower,
log.surv = log(-log(tfit.add[1, ]$surv))
)
df2 <- data.frame(
time    = tfit.add[2, ]$time,
n.risk  = tfit.add[2, ]$n.risk, n.event = tfit.add[2, ]$n.event,
surv    = tfit.add[2, ]$surv, strata = "1", upper = tfit.add[2, ]$upper,
lower   = tfit.add[2, ]$lower, log.surv = log(-log(tfit.add[2, ]$surv))
)
zeros <- data.frame(time = 0, surv = 1, strata = c(1, 2),
upper = 1, lower = 1)
# dfpar.add is not defined in the code as published; stacking the two
# prediction data frames is assumed here
dfpar.add <- rbind(df1, df2)
dfpar.add$strata <- factor(dfpar.add$strata, labels = c("No tx", "Tx"))
library(ggplot2)
ggplot(dfpar.add, aes(log(time), log.surv, colour = strata)) +
geom_step(size = 0.6) +
scale_color_manual("Tx", values = c('blue4', 'darkorange')) +
xlab("ln(time)") + ylab("Log-log survival")
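The reason parallel curves indicate proportional hazards follows directly from the model. Under PH, with $S_0(t)$ the baseline survivor function and $\beta$ the coefficient of a covariate $x$ (notation introduced here for this short derivation):

```latex
$$
S(t \mid x) = S_0(t)^{\exp(\beta x)}
\quad\Longrightarrow\quad
\ln\left[-\ln S(t \mid x)\right] = \beta x + \ln\left[-\ln S_0(t)\right]
$$
```

For a binary covariate such as tx, the two log-cumulative hazard curves therefore differ by the constant $\beta$ at every time point, whatever the time scale used on the x-axis.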


Another graphical approach is to compare plots of predicted survival from a Cox model (assuming PH) to the Kaplan-Meier survivor function (which does not assume PH):

tfit.km <- survfit(Surv(dar, preg == 'pregnant') ~ tx, data = pgtrial)
df3.km <- data.frame(
time    = tfit.km$time, n.risk = tfit.km$n.risk,
n.event = tfit.km$n.event, surv = tfit.km$surv,
strata  = gsub("tx=", "", summary(tfit.km, censored = T)$strata), upper = tfit.km$upper,
lower   = tfit.km$lower
)
zeros <- data.frame(time = 0, surv = 1,
strata = gsub("tx=", "", levels(summary(tfit.km)$strata)),
upper = 1, lower = 1)
df3.km <- rbind.fill(df3.km, zeros)
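df3.km$cat <- with(df3.km, ifelse(strata == "0", "No tx, observed", "Tx, observed"))
dfpar.add$cat <- with(dfpar.add, ifelse(strata == "No tx", "No tx, expected",
"Tx, expected"))
## combine expected (Cox) and observed (Kaplan-Meier) curves; this step is
## assumed, as dfpar.obs is not defined anywhere in the code shown
dfpar.obs <- rbind.fill(dfpar.add, df3.km)
ggplot(dfpar.obs, aes(time, surv, colour = cat)) +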
geom_step(size = 0.6) +
scale_color_manual("", values = c('blue1', 'blue4', 'darkorange1',
'darkorange4')) +
xlab("time") + ylab("survival probability")


You can also assess PH statistically with the Schoenfeld residuals, using the cox.zph function:

(schoen <- cox.zph(coxph.mod))
rho chisq      p
herd2    -0.0630 1.100 0.2942
herd3    -0.0443 0.569 0.4506
tx       -0.1078 3.141 0.0763
lact      0.0377 0.447 0.5035
thinthin -0.0844 2.012 0.1560
GLOBAL        NA 7.631 0.1778

plot(schoen, var = 4)


Evaluating the Overall Fit of the Model

First we can look at the Cox-Snell residuals, which are the estimated cumulative hazards for individuals at their failure (or censoring) times. The default residuals of coxph in R are the martingale residuals, not the Cox-Snell residuals, but the latter can be computed from the former:

cox.snell <- (as.numeric(pgtrial$preg) - 1) - resid(coxph.mod, type = "martingale")
coxph.res <- survfit(coxph(Surv(cox.snell, pgtrial$preg == 'pregnant') ~ 1,
method = 'breslow'), type = 'aalen')

plot(coxph.res$time, -log(coxph.res$surv), type = 's',
xlab = 'Modified Cox-Snell residuals', ylab = 'Cumulative hazard')
abline(0, 1, col = 'red', lty = 2)

## Alternatively:
coxph.res2 <- survfit(Surv(cox.snell, pgtrial$preg == 'pregnant') ~ 1)
Htilde <- cumsum(coxph.res2$n.event / coxph.res2$n.risk)
plot(coxph.res2$time, Htilde, type = 's', col = 'blue')
abline(0, 1, col = 'red', lty = 2)
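The first line of the code above relies on the link between the two residual types. A sketch in notation assumed here: $\delta_i$ is the event indicator (1 = pregnant, 0 = censored), which is what `as.numeric(pgtrial$preg) - 1` yields provided `censored` is the first factor level.

```latex
\begin{aligned}
r_{M,i}  &= \delta_i - \hat H_0(t_i)\, e^{x_i^\top \hat\beta} \\
r_{CS,i} &= \hat H_0(t_i)\, e^{x_i^\top \hat\beta} = \delta_i - r_{M,i}
\end{aligned}
```

If the model fits, the $r_{CS,i}$ behave like a censored sample from a unit exponential distribution, whose cumulative hazard is the 45° line. This is why both plots are compared against `abline(0, 1)`.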


We can also use a goodness-of-fit test:

## GOF (Gronnesby and Borgan omnibus gof)
library(gof)
cumres(coxph.mod)

Kolmogorov-Smirnov-test: p-value=0.35
Cramer von Mises-test: p-value=0.506
Based on 1000 realizations. Cumulated residuals ordered by herd2-variable.
---
Kolmogorov-Smirnov-test: p-value=0.041
Cramer von Mises-test: p-value=0.589
Based on 1000 realizations. Cumulated residuals ordered by herd3-variable.
---
Kolmogorov-Smirnov-test: p-value=0
Cramer von Mises-test: p-value=0.071
Based on 1000 realizations. Cumulated residuals ordered by tx-variable.
---
Kolmogorov-Smirnov-test: p-value=0.728
Cramer von Mises-test: p-value=0.733
Based on 1000 realizations. Cumulated residuals ordered by lact-variable.
---
Kolmogorov-Smirnov-test: p-value=0.106
Cramer von Mises-test: p-value=0.091
Based on 1000 realizations. Cumulated residuals ordered by thinthin-variable.


We can evaluate the concordance between the predicted and observed sequence of pairs of events. Harrell’s c index computes the proportion of all pairs of subjects in which the model correctly predicts the sequence of events. It ranges from 0 to 1, with 0.5 for random predictions and 1 for a perfectly discriminating model. It is obtained from Somers’ Dxy rank correlation:

library(rms)
fit.cph <- cph(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin,
data = pgtrial, x = TRUE, y = TRUE, surv = TRUE)
v <- validate(fit.cph, dxy = TRUE, B = 100)
Dxy <- v[rownames(v) == "Dxy", colnames(v) == "index.corrected"]
(Dxy / 2) + 0.5 # c index
[1] 0.4538712
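To make the definition of the c index concrete, here is a minimal base R sketch that counts concordant pairs by brute force. The function name `harrell_c` and the toy values are hypothetical (they are not the pgtrial data), and this is not how rms computes the index internally.

```r
## Harrell's c by brute-force pair counting (illustrative sketch).
## 'risk' is a predicted risk score: higher means earlier expected event.
harrell_c <- function(time, event, risk) {
  conc <- 0
  usable <- 0
  n <- length(time)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      ## order the pair so that 'a' has the shorter follow-up
      a <- if (time[i] <= time[j]) i else j
      b <- if (a == i) j else i
      ## a pair is usable only if the two times differ and the
      ## shorter time is an observed event (not a censoring)
      if (time[a] == time[b] || event[a] == 0) next
      usable <- usable + 1
      if (risk[a] > risk[b]) {
        conc <- conc + 1
      } else if (risk[a] == risk[b]) {
        conc <- conc + 0.5
      }
    }
  }
  conc / usable
}

## made-up example data
time  <- c(5, 8, 12, 20, 25, 30)
event <- c(1, 1, 0, 1, 1, 0)
risk  <- c(2.1, 1.7, 1.9, 0.8, 1.0, 0.3)
c_index <- harrell_c(time, event, risk)
Dxy <- 2 * c_index - 1   # same relation as (Dxy / 2) + 0.5 above
```

With these values 10 of the 12 usable pairs are concordant, so the index is 10/12; a risk score that ranks subjects exactly by event time gives 1.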


Evaluating the Functional Form of Predictors

We can use martingale residuals to evaluate the functional form of the relationship between a continuous predictor and the survival expectation for individuals:

lact.mod <- coxph(Surv(dar, preg == 'pregnant') ~ lact, data = pgtrial,
ties = 'breslow')
lact.res <- resid(lact.mod, type = "martingale")
plot(pgtrial$lact, lact.res, xlab = 'lactation', ylab = 'Martingale residuals')
lines(lowess(pgtrial$lact, lact.res, iter = 0))

lact.mod <- update(lact.mod, . ~ . + I(lact^2))
lact.res <- resid(lact.mod, type = "martingale")
plot(pgtrial$lact, lact.res, xlab = 'lactation', ylab = 'Martingale residuals')
lines(lowess(pgtrial$lact, lact.res, iter = 0))


Checking for Outliers

Deviance residuals can be used to identify outliers:

## deviance residuals
dev.res <- resid(coxph.mod, type = "deviance")
plot(pgtrial$dar, dev.res, xlab = 'time (days)', ylab = 'deviance residuals')
cbind(dev.res, pgtrial)[abs(dev.res) > 2, ]
      dev.res herd cow tx lact   thin dar     preg
1    2.557832    1   1  0    1 normal   1 pregnant
2    2.592492    1   2  1    4   thin   1 pregnant
3    2.319351    1   3  1    1 normal   2 pregnant
73  -2.693731    1  76  1    1 normal 277 censored
74   2.734508    2  78  0    2   thin   1 pregnant
75   2.644885    2  79  1    4 normal   1 pregnant
76   2.436308    2  80  1    1 normal   2 pregnant
176 -2.015925    2 180  1    2 normal 201 censored
180 -2.196008    2 184  1    2 normal 250 censored
183 -2.081493    2 187  1    3   thin 288 censored
185 -2.238729    2 189  0    1 normal 346 censored
314 -2.274912    3 318  0    1   thin 262 censored
315 -2.226711    3 319  0    2   thin 262 censored
316 -2.182517    3 320  0    4   thin 287 censored
317 -2.278029    3 321  0    2   thin 288 censored
318 -2.341736    3 322  0    3   thin 308 censored
319 -2.392427    3 323  0    2   thin 320 censored


Score residuals and scaled score residuals can be used to identify influential observations:

### Detecting influential points
# score residuals
score.res <- resid(coxph.mod, type = "score")
# score residuals for tx
plot(pgtrial$dar, score.res[ , 3], xlab = 'time (days)',
ylab = 'score residuals')
text(pgtrial$dar, score.res[ , 3], rownames(pgtrial), cex = 0.6, pos = 4)
cbind(score.res[ , 3], pgtrial)[abs(score.res[ , 3]) > 2, ]
   score.res[, 3] herd cow tx lact   thin dar     preg
73      -2.025537    1  76  1    1 normal 277 censored
## influential observations
dfbeta <- resid(coxph.mod, type = "dfbeta")
# dfbeta residuals for tx
plot(pgtrial$dar, dfbeta[ , 3], xlab = 'time (days)',
ylab = 'scaled score residual')
text(pgtrial$dar, dfbeta[ , 3], rownames(pgtrial), cex = 0.6, pos = 4)
# with standardized dfbeta
dfbetas <- resid(coxph.mod, type = "dfbetas")
plot(pgtrial$dar, dfbetas[ , 3], xlab = 'time (days)',
ylab = 'standardized score residuals')
text(pgtrial$dar, dfbetas[ , 3], rownames(pgtrial), cex = 0.6, pos = 4)


# Veterinary Epidemiologic Research: Modelling Survival Data – Non-Parametric Analyses

Next topic from Veterinary Epidemiologic Research: chapter 19, modelling survival data. We start with non-parametric analyses where we make no assumptions about either the distribution of survival times or the functional form of the relationship between a predictor and survival. There are 3 non-parametric methods to describe time-to-event data: actuarial life tables, Kaplan-Meier method, and Nelson-Aalen method.
We use data on occurrence of calf pneumonia in calves raised in 2 different housing systems. Calves surviving to 150 days without pneumonia are considered censored at that time.

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/calf_pneu.rdata"))
unlink(temp)

library(Hmisc)
calf_pneu <- upData(calf_pneu, labels = c(calf = 'Calf id',
stock = 'Stocking method',
days = 'Time to onset of pneumonia or censoring',
pn = 'Pneumonia'),
levels = list(stock = list('batch' = 0, 'continuous' = 1)))


Actuarial Life Table

To create a life table, we use the function lifetab from package KMsurv, after calculating the number of censored and events at each time point and grouping them by time interval (with gsummary from package nlme).

library(KMsurv)
interval <- seq(from = 30, to = 165, by = 15)
interval <- floor(calf_pneu$days/15)
interval.censor <- data.frame(interval, calf_pneu$pn)
library(nlme)
pneumonia <- gsummary(interval.censor, sum, groups = interval)
total <- gsummary(interval.censor, length, groups = interval)
lt.data <- cbind(pneumonia[ , 1:2], total[ , 2])
length <- length(lt.data$interval)
lt.data[length + 1, ]$interval <- NA
nevent <- lt.data[ , 2]
nlost <- lt.data[ , 3] - lt.data[ , 2]
(life.table <- lifetab(lt.data$interval, 24, nlost, nevent))
nsubs nlost nrisk nevent      surv        pdf     hazard    se.surv
1-3      24     0  24.0      1 1.0000000 0.02083333 0.02127660 0.00000000
3-4      23     0  23.0      1 0.9583333 0.04166667 0.04444444 0.04078938
4-5      22     0  22.0      1 0.9166667 0.04166667 0.04651163 0.05641693
5-6      21     0  21.0      3 0.8750000 0.12500000 0.15384615 0.06750772
6-7      18     1  17.5      2 0.7500000 0.08571429 0.12121212 0.08838835
7-8      15     6  12.0      3 0.6642857 0.16607143 0.28571429 0.09686316
8-10      6     0   6.0      1 0.4982143 0.04151786 0.09090909 0.11032937
10-NA     5     5   2.5      0 0.4151786         NA         NA 0.11915934
NA-3      0    NA    NA     NA 0.4151786         NA         NA 0.11915934
se.pdf  se.hazard
1-3   0.02039469 0.02127178
3-4   0.04078938 0.04443347
4-5   0.04078938 0.04649905
5-6   0.06750772 0.08855994
6-7   0.05792828 0.08555236
7-8   0.08649471 0.16326531
8-10  0.03899969 0.09053265
10-NA         NA         NA
NA-3          NA         NA


Kaplan-Meier Method

To compute the Kaplan-Meier estimator we use the function survfit from package survival. It takes as argument a Surv object, which gives the time variable and the event of interest. You get the Kaplan-Meier estimate with the summary of the survfit object. We can then plot the estimates to show the Kaplan-Meier survivor function.

library(survival)
km.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu)
summary(km.sf)
Call: survfit(formula = Surv(days, pn == 1) ~ 1, data = calf_pneu)

time n.risk n.event survival std.err lower 95% CI upper 95% CI
27     24       1    0.958  0.0408        0.882        1.000
49     23       1    0.917  0.0564        0.813        1.000
72     22       1    0.875  0.0675        0.752        1.000
79     21       2    0.792  0.0829        0.645        0.972
89     19       1    0.750  0.0884        0.595        0.945
90     18       1    0.708  0.0928        0.548        0.916
101     17       1    0.667  0.0962        0.502        0.885
113     15       2    0.578  0.1019        0.409        0.816
117      9       1    0.514  0.1089        0.339        0.778
123      6       1    0.428  0.1198        0.247        0.741

plot(km.sf, xlab = "time (days)", ylab = "cumulative survival probability", conf.int = TRUE)
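The Kaplan-Meier estimate is simple enough to reproduce by hand, which helps in reading the table above. A minimal sketch using only the `n.risk` and `n.event` columns printed by `summary(km.sf)`; the object names `km.byhand` and `se.byhand` are made up for this illustration, and the standard error uses Greenwood's formula.

```r
## Risk sets and events copied from the summary(km.sf) output above
n.risk  <- c(24, 23, 22, 21, 19, 18, 17, 15, 9, 6)
n.event <- c(1, 1, 1, 2, 1, 1, 1, 2, 1, 1)

## Kaplan-Meier: running product over event times of (1 - d_i / n_i)
km.byhand <- cumprod(1 - n.event / n.risk)

## Greenwood standard error of the survival estimate
se.byhand <- km.byhand *
  sqrt(cumsum(n.event / (n.risk * (n.risk - n.event))))

round(cbind(n.risk, n.event, survival = km.byhand, std.err = se.byhand), 4)
```

The two computed columns should agree with the `survival` and `std.err` columns of the output above, up to rounding.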


Nelson-Aalen Method

A “hazard” is the probability of failure at a point in time, given that the calf had survived up to that point in time. A cumulative hazard, the Nelson-Aalen estimate, can be computed. It can be obtained by transforming the Fleming-Harrington estimate of survival, since the cumulative hazard is the negative log of that survival estimate.

fh.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu, type = "fleming")

plot(stepfun(fh.sf$time, c(0, -log(fh.sf$surv))), do.points = FALSE,
xlab = "time (days)", ylab = "cumulative hazard",
main = "", ylim = c(0, 1.5))
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$upper))), lty = 5, do.points = FALSE)
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$lower))), lty = 5, do.points = FALSE)
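The relation between the two estimators can be checked by hand. The sketch below reuses the risk sets and event counts printed in the Kaplan-Meier output earlier in this post; the object names are hypothetical, and ties are handled in the simplest way (all events at a tied time share the same risk set).

```r
## Risk sets and events copied from the Kaplan-Meier output of this post
n.risk  <- c(24, 23, 22, 21, 19, 18, 17, 15, 9, 6)
n.event <- c(1, 1, 1, 2, 1, 1, 1, 2, 1, 1)

na.byhand <- cumsum(n.event / n.risk)        # Nelson-Aalen cumulative hazard
fh.byhand <- exp(-na.byhand)                 # Fleming-Harrington survival
km.byhand <- cumprod(1 - n.event / n.risk)   # Kaplan-Meier, for comparison

round(cbind(cum.hazard = na.byhand, fh.surv = fh.byhand,
            km.surv = km.byhand), 3)
```

The Fleming-Harrington survival is never below the Kaplan-Meier estimate, because each step satisfies exp(-d/n) >= 1 - d/n; the two are close when risk sets are large.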


Tests of the Overall Survival Curve

Several tests are available to test whether the overall survivor functions in 2 or more groups are equal. We can use the log-rank test, the simplest test, assigning equal weight to each time point estimate and equivalent to a standard Mantel-Haenszel test. Also, there’s the Peto-Peto-Prentice test which weights the stratum-specific estimates by the overall survival experience and so reduces the influence of different censoring patterns between groups.
To do these tests, we apply the survdiff function to the Surv object. The argument rho gives the weights according to $S(t)^\rho$ and may be any numeric value. The default, rho = 0, gives the log-rank test; rho = 1 gives the “Peto & Peto modification of the Gehan-Wilcoxon test”. A rho larger than zero gives greater weight to the first part of the survival curves, and a rho smaller than zero gives greater weight to the later part.

survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 0) # rho is optional
Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu,
rho = 0)

N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12        4     6.89      1.21      2.99
stock=continuous 12        8     5.11      1.63      2.99

Chisq= 3  on 1 degrees of freedom, p= 0.084

survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 1) # rho=1 asks for Peto-Peto test
Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu,
rho = 1)

N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12     2.89     5.25      1.06      3.13
stock=continuous 12     6.41     4.05      1.38      3.13

Chisq= 3.1  on 1 degrees of freedom, p= 0.0766
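To show what the Observed, Expected and (O-E)^2/V columns are made of, here is a base R sketch of a two-group weighted log-rank statistic. The function `logrank_byhand` and the toy data are hypothetical. The weights are taken as the pooled Kaplan-Meier estimate just before each event time, raised to the power rho; this is an assumption of the sketch and may differ in detail from what survdiff does internally.

```r
## Two-group weighted log-rank statistic computed by hand (sketch)
logrank_byhand <- function(time, event, group, rho = 0) {
  g1 <- group == group[1]              # membership of the first group
  O <- 0; E <- 0; V <- 0
  S <- 1                               # pooled KM just before current time
  for (t in sort(unique(time[event == 1]))) {
    n  <- sum(time >= t)               # at risk, both groups
    n1 <- sum(time >= t & g1)          # at risk, first group
    d  <- sum(time == t & event == 1)  # events, both groups
    d1 <- sum(time == t & event == 1 & g1)
    w  <- S^rho                        # rho = 0 gives equal weights
    O <- O + w * d1
    E <- E + w * d * n1 / n
    ## hypergeometric variance; undefined when only one subject is at risk
    if (n > 1) {
      V <- V + w^2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
    S <- S * (1 - d / n)
  }
  chisq <- (O - E)^2 / V
  c(observed = O, expected = E, chisq = chisq,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

## hypothetical toy data (not the calf_pneu data)
time  <- c(6, 7, 10, 15, 19, 25, 4, 5, 8, 9, 12, 30)
event <- c(1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)
group <- rep(c("A", "B"), each = 6)
lr <- logrank_byhand(time, event, group, rho = 0)  # log-rank
pp <- logrank_byhand(time, event, group, rho = 1)  # early differences weighted more
```

With rho = 0 the observed count is simply the number of events in the first group, and the expected counts of the two groups add up to the total number of events.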


Finally, we can compare survivor functions with base R graphics or with ggplot2. With ggplot2, you extract the necessary data from the survfit object and build a new data frame from them. The baseline data (time = 0) are not included, so you create them yourself:

(km.stock <- survfit(Surv(days, pn == 1) ~ stock, data = calf_pneu))
Call: survfit(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu)

records n.max n.start events median 0.95LCL 0.95UCL
stock=batch           12    12      12      4     NA     123      NA
stock=continuous      12    12      12      8    113      79      NA

plot(km.stock, conf.int = FALSE, col = c("blue4", "darkorange"),
xlab = "time (days)", ylab = "cumulative survival probability")
legend("bottomleft", inset = .05, c("batch", "continuous"),
text.col = c("blue4", "darkorange"))

km.df <- data.frame(
time    = km.stock$time, n.risk = km.stock$n.risk,
n.event = km.stock$n.event, surv = km.stock$surv,
strata  = gsub("stock=", "", summary(km.stock, censored = T)$strata), upper = km.stock$upper,
lower   = km.stock$lower
)
zeros <- data.frame(time = 0, surv = 1,
strata = gsub("stock=", "", levels(summary(km.stock)$strata)),
upper = 1, lower = 1)
library(plyr)
km.df <- rbind.fill(zeros, km.df)
km.df$strata <- ordered(km.df$strata, levels = c("batch", "continuous"))
library(ggplot2)
ggplot(km.df, aes(time, surv, colour = strata)) +
geom_step(size = 0.6) + xlim(0, 150) + ylim(0, 1) +
xlab("time (days)") + ylab("cumulative survival probability") +
labs(colour = "stock")


# Veterinary Epidemiologic Research: Count and Rate Data – Poisson Regression and Risk Ratios

As noted in section 18.4.1 of the book Veterinary Epidemiologic Research, logistic regression is widely used for binary data, with the estimates reported as odds ratios (OR). While this is appropriate for case-control studies, risk ratios (RR) are preferred for cohort studies, as RR provides estimates of probabilities directly. Moreover, the rare-event assumption under which the OR approximates the RR is often forgotten; when the outcome is common, the OR overestimates the true effect.

One approach to get RR is to fit a generalised linear model (GLM) with a binomial distribution and a log link. But these models may sometimes fail to converge (Zou, 2004). Another option is to use a Poisson regression with no exposure or offset specified (McNutt, 2003). It gives estimates with very little bias but confidence intervals that are too wide. However, using robust standard errors gives correct confidence intervals (Greenland, 2004, Zou, 2004).
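The effect of the robust correction can be seen with base R alone. The sketch below rebuilds the culling data from the 2 × 2 counts tabulated further down in this post, fits the Poisson model, and computes a sandwich (HC0-type) variance by hand. Object names are made up for the illustration, and this shows the idea rather than the exact computation done by geeglm or the sandwich package.

```r
## Rebuild individual records from the 2 x 2 table shown later in this post
## (culled = 0/1 by lact = 1/2: 97, 102, 158, 364)
d <- data.frame(culled = rep(c(0, 1, 0, 1), times = c(97, 102, 158, 364)),
                lact   = rep(c(1, 1, 2, 2), times = c(97, 102, 158, 364)))

fit <- glm(culled ~ lact, family = poisson(link = "log"), data = d)
X  <- model.matrix(fit)
mu <- fitted(fit)

## sandwich variance: bread %*% meat %*% bread
bread <- solve(t(X) %*% (X * mu))            # (X' W X)^-1 with W = diag(mu)
meat  <- t(X) %*% (X * (d$culled - mu)^2)    # X' diag(residual^2) X
robust.se <- sqrt(diag(bread %*% meat %*% bread))
naive.se  <- sqrt(diag(vcov(fit)))           # model-based Poisson SE

rr <- exp(coef(fit)["lact"])
ci <- exp(coef(fit)["lact"] + c(-1, 1) * qnorm(0.975) * robust.se["lact"])
round(rbind(naive = naive.se, robust = robust.se), 4)
```

The model-based standard error for lact is noticeably larger than the robust one, which is the "confidence intervals that are too wide" problem mentioned above; the robust value should agree with the standard errors reported for the modified Poisson models in this post.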

We use data on culling of dairy cows to demonstrate this.

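temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
temp)
## file name assumed from the pattern used for the other datasets
load(unz(temp, "ver2_data_R/culling.rdata"))
unlink(temp)

table(culling$culled)

  0   1
255 466
### recoding number of lactation
culling$lact <- with(culling, ifelse(lact_c3 > 1, 2, lact_c3))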


The logistic regression:

log.reg <- glm(culled ~ lact, family = binomial("logit"), data = culling)
summary(log.reg)

Call:
glm(formula = culled ~ lact, family = binomial("logit"), data = culling)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.546  -1.199   0.849   0.849   1.156

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.734      0.299   -2.45    0.014 *
lact           0.784      0.171    4.59  4.4e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 4

cbind(exp(coef(log.reg)), exp(confint(log.reg)))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.48 0.267  0.864
lact        2.19 1.568  3.065


We can also compute the OR and RR by hand:

with(culling, table(culled, lact))
lact
culled   1   2
0  97 158
1 102 364
## OR is 2.19:
(364 / 158) / (102 / 97)
[1] 2.19
## or the ratio between the cross-products
(364 * 97) / (158 * 102)
[1] 2.19
## or with epicalc
library(epicalc)
with(culling, cc(culled, lact, graph = FALSE))

lact
culled    1   2 Total
0      97 158   255
1     102 364   466
Total 199 522   721

OR =  2.19
Exact 95% CI =  1.54, 3.1
Chi-squared = 21.51, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0

# RR is 1.36:
(364 / 522) / (102 / 199)
[1] 1.36
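The gap between 2.19 and 1.36 comes entirely from the high baseline risk. A small sketch, using the counts from the table above, of the algebraic identity RR = OR / (1 - p0 + p0 × OR), where p0 is the risk in the reference group; the variable names are made up for the illustration.

```r
## Counts from the table above
a  <- 364; b  <- 158    # lact = 2: culled, not culled
c0 <- 102; d0 <- 97     # lact = 1: culled, not culled

p0 <- c0 / (c0 + d0)                   # baseline risk (lact = 1)
or <- (a / b) / (c0 / d0)              # odds ratio
rr <- (a / (a + b)) / p0               # risk ratio
rr.from.or <- or / (1 - p0 + p0 * or)  # RR recovered from OR and p0

round(c(p0 = p0, OR = or, RR = rr, RR.from.OR = rr.from.or), 3)
```

When p0 is close to 0 the denominator is close to 1 and the OR approximates the RR. Here about half of the first-lactation cows were culled, so the OR is far from the RR.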


The GLM with binomial distribution and log link:

log.reg2 <- glm(culled ~ lact, family = binomial(link = "log"), data = culling)
summary(log.reg2)

Call:
glm(formula = culled ~ lact, family = binomial(link = "log"),
data = culling)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.546  -1.199   0.849   0.849   1.156

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 5

cbind(exp(coef(log.reg2)), exp(confint(log.reg2)))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.377  0.28  0.488
lact        1.360  1.18  1.588


The modified Poisson with robust estimator as described by Zou, 2004 (GEE with individual observations treated as a repeated measure):

## create id for each observation
culling$id <- 1:length(culling$herd)
library(geepack)
zou.mod <- geeglm(culled ~ lact, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling)
summary(zou.mod)

Call:
geeglm(formula = culled ~ lact, family = poisson(link = "log"),
data = culling, id = id, corstr = "exchangeable")

Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept)  -0.9762  0.1412 47.8  4.8e-12 ***
lact          0.3078  0.0749 16.9  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
Estimate Std.err
(Intercept)    0.354   0.021

Correlation: Structure = exchangeable  Link = identity

Estimated Correlation Parameters:
Estimate Std.err
alpha        0       0
Number of clusters:   721   Maximum cluster size: 1
## exponentiated coefficients
exp(coef(zou.mod))
(Intercept)        lact
0.377       1.360
## getting confidence intervals
library(doBy)
zou.mod.coefci <- esticon(zou.mod, diag(2))
zou.mod.expci <- exp(cbind(zou.mod.coefci$Estimate, zou.mod.coefci$Lower,
zou.mod.coefci$Upper))
rownames(zou.mod.expci) <- names(coef(zou.mod))
colnames(zou.mod.expci) <- c("RR", "Lower RR", "Upper RR")
zou.mod.expci
               RR Lower RR Upper RR
(Intercept) 0.377    0.286    0.497
lact        1.360    1.175    1.576


Or the same using glm() and then getting robust standard errors:

pois.reg <- glm(culled ~ lact, family = poisson(link = "log"), data = culling)
library(sandwich) # to get robust estimators
library(lmtest) # to test coefficients
coeftest(pois.reg, vcov = sandwich)

z test of coefficients:

            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## RR
exp(coef(pois.reg))
(Intercept)        lact
      0.377       1.360
## CI
0.3078 + qnorm(0.05 / 2) * 0.0749 # lower 95% CI
[1] 0.161
exp(0.3078 + qnorm(0.05 / 2) * 0.0749)
[1] 1.17
0.3078 + qnorm(1 - 0.05 / 2) * 0.0749 # upper 95% CI
[1] 0.455
exp(0.3078 + qnorm(1 - 0.05 / 2) * 0.0749)
[1] 1.58


So no excuse to use odds ratios when risk ratios are more appropriate!

Addition: Plotting the Risk Ratios

zou.mod2 <- geeglm(culled ~ lact + johnes, family = poisson(link = "log"),
id = id, corstr = "exchangeable", data = culling)
zou.mod2.coefci <- esticon(zou.mod2, diag(3))
zou.mod2.expci <- exp(cbind(zou.mod2.coefci$Estimate, zou.mod2.coefci$Lower,
zou.mod2.coefci$Upper))
rownames(zou.mod2.expci) <- names(coef(zou.mod2))
colnames(zou.mod2.expci) <- c("RR", "LowerCI", "UpperCI")
zou.df <- as.data.frame(zou.mod2.expci)
zou.df$var <- row.names(zou.df)
library(ggplot2)
ggplot(zou.df[2:3, ], aes(y = RR, x = reorder(var, RR))) +
geom_point() +
geom_pointrange(aes(ymin = LowerCI, ymax = UpperCI)) +
scale_x_discrete(labels = c("Lactation", "Johnes")) +
scale_y_log10(breaks = seq(1, 2, by = 0.1)) +
xlab("") +
geom_hline(yintercept = 1, linetype = "dotdash", colour = "blue") +
coord_flip()


Thanks to Tal Galili for suggesting this addition.

# Veterinary Epidemiologic Research: Count and Rate Data – Zero Counts

Continuing on the examples from the book Veterinary Epidemiologic Research, we look today at modelling counts when the number of zeros may be higher or lower than expected from a Poisson or negative binomial distribution. When there’s an excess of zero counts, you can fit either a zero-inflated model or a hurdle model. If zero counts are not possible, a zero-truncated model can be used.

Zero-inflated models

Zero-inflated models manage an excess of zero counts by simultaneously fitting a binary model and a Poisson (or negative binomial) model. In R, you can fit zero-inflated models (and hurdle models) with the library pscl. We use the fec dataset, which gives the fecal egg counts of gastro-intestinal nematodes from 313 cows in 38 dairy herds, where half of the observations have zero counts. The predictors in the model are lactation (primiparous vs. multiparous), access to pasture, manure spread on heifer pasture, and manure spread on cow pasture.

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/fec.rdata"))
unlink(temp)

library(pscl)
mod3 <- zeroinfl(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod3)

Call:
zeroinfl(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec,
dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-0.5055 -0.4537 -0.3624 -0.1426 27.3064

Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.36949    0.13117  18.064  < 2e-16 ***
lact        -1.15847    0.10664 -10.864  < 2e-16 ***
past_lact    0.53666    0.14233   3.771 0.000163 ***
man_heif    -0.99849    0.14216  -7.024 2.16e-12 ***
man_lact     1.07858    0.15789   6.831 8.43e-12 ***
Log(theta)  -1.35981    0.05425 -25.065  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9303     0.2932  -3.173  0.00151 **
lact          0.8701     0.3083   2.822  0.00477 **
past_lact    -1.8003     0.3989  -4.513  6.4e-06 ***
man_heif     -0.7702     0.4669  -1.650  0.09903 .
man_lact    -12.2380   167.9185  -0.073  0.94190
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta = 0.2567
Number of iterations in BFGS optimization: 47
Log-likelihood: -5239 on 11 Df


We can assess whether the zero-inflated model fits the data better than a Poisson or negative binomial model with a Vuong test. A value of the test statistic above 1.96 indicates superiority of model 1 over model 2, and a value below -1.96 the reverse. If the value lies between -1.96 and 1.96, neither model is preferred.

### fit same model with negative binomial
library(MASS)
mod3.nb <- glm.nb(fec ~ lact + past_lact + man_heif + man_lact, data = fec)
### Vuong test
vuong(mod3, mod3.nb)
Vuong Non-Nested Hypothesis Test-Statistic: 3.308663
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model1 > model2, with p-value 0.0004687128
### alpha
1/mod3$theta
[1] 3.895448


The Vuong statistic is 3.3 (p < 0.001) indicating the first model, the zero-inflated one, is superior to the regular negative binomial. Note that R does not estimate $\alpha$ but its inverse, $\theta$. $\alpha$ is 3.9, suggesting a negative binomial is preferable to a Poisson model.
The parameter modelled in the binary part is the probability of a zero count: having lactating cows on pasture reduced the probability of a zero count ($\beta$ = -1.8), and increased the expected count if it was non-zero ($\beta$ = 0.54).
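How the two parts combine into one distribution can be sketched in a few lines of base R. The function `dzinb` is hypothetical (it is not part of pscl). It is evaluated at the intercepts reported above, that is, for a cow with all predictors equal to 0, purely as an illustration.

```r
## Zero-inflated negative binomial probability mass function (sketch)
## pi0   = probability of a structural zero (binary part)
## mu    = mean of the count part
## theta = negative binomial size parameter
dzinb <- function(k, pi0, mu, theta) {
  pi0 * (k == 0) + (1 - pi0) * dnbinom(k, size = theta, mu = mu)
}

pi0   <- plogis(-0.9303)   # inverse logit of the zero-inflation intercept
mu    <- exp(2.36949)      # exponentiated count-model intercept
theta <- 0.2567

p.zero.zinb <- dzinb(0, pi0, mu, theta)                # zeros from both sources
p.zero.nb   <- dnbinom(0, size = theta, mu = mu)       # zeros from counts only
c(zero.inflated = p.zero.zinb, neg.binomial = p.zero.nb)
```

A zero can arise either as a structural zero (with probability pi0) or as an ordinary zero of the count distribution, so the zero-inflated probability of a zero is always at least as large as that of the plain negative binomial with the same mu and theta.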

Hurdle models

A hurdle model also has 2 components, but it is based on the assumption that zero counts arise from only one process and non-zero counts are determined by a different process. The odds of a non-zero count vs. a zero count are modelled by a binomial model, while the distribution of non-zero counts is modelled by a zero-truncated model. We refit the fec dataset above with a negative binomial hurdle model:

mod4 <- hurdle(fec ~ lact + past_lact + man_heif + man_lact, data = fec, dist = "negbin")
summary(mod4)

Call:
hurdle(formula = fec ~ lact + past_lact + man_heif + man_lact, data = fec,
dist = "negbin")

Pearson residuals:
Min      1Q  Median      3Q     Max
-0.4598 -0.3914 -0.3130 -0.1479 23.6774

Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.1790     0.4801   2.456  0.01407 *
lact         -1.1349     0.1386  -8.187 2.69e-16 ***
past_lact     0.5813     0.1782   3.261  0.00111 **
man_heif     -0.9854     0.1832  -5.379 7.50e-08 ***
man_lact      1.0139     0.1998   5.075 3.87e-07 ***
Log(theta)   -2.9111     0.5239  -5.556 2.76e-08 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.13078    0.10434  -1.253  0.21006
lact        -0.84485    0.09728  -8.684  < 2e-16 ***
past_lact    0.84113    0.11326   7.427 1.11e-13 ***
man_heif    -0.35576    0.13582  -2.619  0.00881 **
man_lact     0.85947    0.15337   5.604 2.10e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 0.0544
Number of iterations in BFGS optimization: 25
Log-likelihood: -5211 on 11 Df


We can compare zero-inflated and hurdle models by their log-likelihood. The hurdle model fits better:

logLik(mod4)
'log Lik.' -5211.18 (df=11)
logLik(mod3)
'log Lik.' -5238.622 (df=11)
### Null model for model 3:
mod3.null <- update(mod3, . ~ 1)
logLik(mod3.null)
'log Lik.' -5428.732 (df=3)


Zero-truncated model

Sometimes zero counts are not possible. In a zero-truncated model, the probability of a zero count is computed from a Poisson (or negative binomial) distribution and this value is subtracted from 1. The remaining probabilities are rescaled based on this difference so they total 1. We use the daisy2 dataset and look at the number of services required for conception (which cannot be zero…) with the predictors parity, days from calving to first service, and presence/absence of vaginal discharge.
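The rescaling described above can be sketched for the Poisson case in base R. The function `dztpois` and the value of lambda are hypothetical, chosen only for illustration; the model fitted below uses the negative binomial analogue.

```r
## Zero-truncated Poisson probabilities (sketch; lambda = 1.5 is arbitrary)
dztpois <- function(k, lambda) {
  ## P(Y = k | Y > 0) = P(Y = k) / (1 - P(Y = 0)) for k >= 1
  ifelse(k < 1, 0, dpois(k, lambda) / (1 - dpois(0, lambda)))
}

lambda <- 1.5
k <- 1:50
probs <- dztpois(k, lambda)

sum(probs)                   # rescaled probabilities total 1
mean.zt <- sum(k * probs)    # mean of the truncated distribution
c(truncated.mean = mean.zt, untruncated.mean = lambda)
```

The mean of the truncated distribution, lambda / (1 - exp(-lambda)), is larger than lambda, which is why fitting an ordinary count model to data that cannot be zero is biased.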

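temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
## file name assumed from the pattern used for the other datasets
load(unz(temp, "ver2_data_R/daisy2.rdata"))
unlink(temp)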

library(VGAM)
mod5 <- vglm(spc ~ parity + cf + vag_disch, family = posnegbinomial(), data = daisy2)
summary(mod5)

Call:
vglm(formula = spc ~ parity + cf + vag_disch, family = posnegbinomial(),
data = daisy2)

Pearson Residuals:
Min       1Q    Median      3Q    Max
log(munb)  -1.1055 -0.90954  0.071527 0.82039 3.5591
log(size) -17.4891 -0.39674 -0.260103 0.82155 1.4480

Coefficients:
Estimate Std. Error z value
(Intercept):1  0.1243178 0.08161432  1.5232
(Intercept):2 -0.4348170 0.10003096 -4.3468
parity         0.0497743 0.01213893  4.1004
cf            -0.0040649 0.00068602 -5.9254
vag_disch      0.4704433 0.10888570  4.3205

Number of linear predictors:  2

Names of linear predictors: log(munb), log(size)

Dispersion Parameter for posnegbinomial family:   1

Log-likelihood: -10217.68 on 13731 degrees of freedom

Number of iterations: 5


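As cows get older, the number of services required increases, and the longer the first service was delayed, the fewer services were required.
The first intercept is the usual intercept. The second intercept is on the log scale of the negative binomial size parameter (see “Names of linear predictors” in the output), so size = exp(-0.435) ≈ 0.65 and the overdispersion parameter $\alpha$ = 1/size ≈ 1.54.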