Veterinary Epidemiologic Research: Modelling Survival Data – Semi-Parametric Analyses

Next on modelling survival data from Veterinary Epidemiologic Research: semi-parametric analyses. With non-parametric analyses, we could only evaluate the effect one or a small number of variables. To evaluate multiple explanatory variables, we analyze data with a proportional hazards model, the Cox regression. The functional form of the baseline hazard is not specified, which make the Cox model a semi-parametric model.
A Cox proportional hazards model is fit hereafter, on data from a clinical trial of the effect of prostaglandin adminsitration on the start of breeding period of dairy cows:

temp <- tempfile()
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)

library(Hmisc)
pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id',
tx = 'Treatment', lact = 'Lactation number',
thin = 'Body condition', dar = 'Days at risk',
preg = 'Pregnant or censored'),
levels = list(thin = list('normal' = 0, 'thin' = 1),
preg = list('censored' = 0, 'pregnant' = 1)))
pgtrial$herd <- as.factor(pgtrial$herd)

library(survival)
coxph.mod <- coxph(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin,
data = pgtrial, ties = 'breslow')
(coxph.sum <- summary(coxph.mod))
Call:
coxph(formula = Surv(dar, preg == "pregnant") ~ herd + tx + lact +
thin, data = pgtrial, ties = "breslow")

n= 319, number of events= 264

coef exp(coef) se(coef)      z Pr(>|z|)
herd2    -0.28445   0.75243  0.16981 -1.675   0.0939 .
herd3     0.03676   1.03744  0.17426  0.211   0.8329
tx        0.18359   1.20152  0.12543  1.464   0.1433
lact     -0.04283   0.95807  0.04109 -1.042   0.2972
thinthin -0.14557   0.86453  0.13794 -1.055   0.2913
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
herd2       0.7524     1.3290    0.5394     1.050
herd3       1.0374     0.9639    0.7373     1.460
tx          1.2015     0.8323    0.9396     1.536
lact        0.9581     1.0438    0.8839     1.038
thinthin    0.8645     1.1567    0.6597     1.133

Concordance= 0.564  (se = 0.021 )
Rsquare= 0.029   (max possible= 1 )
Likelihood ratio test= 9.5  on 5 df,   p=0.09084
Wald test            = 9.32  on 5 df,   p=0.09685
Score (logrank) test = 9.34  on 5 df,   p=0.09611


R gives several options to control ties in case several events occurred at the same time: the Efron method (default in R), Breslow method (default in software like SAS or Stata), and the exact method. Breslow is the simplest and adequate if not too many ties in the dataset. Efron is closer to the exact approximation.

Stratified Cox Propotional Hazards Model

In a stratified Cox model, different baseline hazards are assumed across groups of subjects. The Cox model is modified to allow the control of a predictor which do not satisfy the proportional hazards (PH) assumption. We refit the above model by stratifying by herd and including a treatment by herd interaction:

scoxph.mod <- coxph(Surv(dar, preg == 'pregnant') ~ tx + tx*herd + lact + thin +
strata(herd), data = pgtrial, method = 'breslow')
summary(scoxph.mod)
Call:
coxph(formula = Surv(dar, preg == "pregnant") ~ tx + tx * herd +
lact + thin + strata(herd), data = pgtrial, method = "breslow")

n= 319, number of events= 264

coef exp(coef) se(coef)      z Pr(>|z|)
tx       -0.02160   0.97863  0.25528 -0.085   0.9326
herd2          NA        NA  0.00000     NA       NA
herd3          NA        NA  0.00000     NA       NA
lact     -0.04600   0.95504  0.04065 -1.132   0.2578
thinthin -0.13593   0.87291  0.13833 -0.983   0.3258
tx:herd2 -0.05659   0.94498  0.33570 -0.169   0.8661
tx:herd3  0.54494   1.72451  0.31823  1.712   0.0868 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
tx          0.9786     1.0218    0.5934     1.614
herd2           NA         NA        NA        NA
herd3           NA         NA        NA        NA
lact        0.9550     1.0471    0.8819     1.034
thinthin    0.8729     1.1456    0.6656     1.145
tx:herd2    0.9450     1.0582    0.4894     1.825
tx:herd3    1.7245     0.5799    0.9242     3.218

Concordance= 0.56  (se = 0.035 )
Rsquare= 0.032   (max possible= 0.998 )
Likelihood ratio test= 10.32  on 5 df,   p=0.06658
Wald test            = 10.5  on 5 df,   p=0.0623
Score (logrank) test = 10.66  on 5 df,   p=0.05851


Evaluating the Assumption of Proportional Hazards

We can evaluate it graphically, by examining the log-cumulative hazard plot vs. ln(time) and check if the curves are parallel:

coxph.mod2 <- coxph(Surv(dar, preg == 'pregnant') ~ tx, data = pgtrial,
ties = 'breslow')
pgtrial2 <- with(pgtrial, data.frame(tx = c(0, 1)))
tfit.add <- survfit(coxph.mod2, newdata = pgtrial2)
df1 <- data.frame(
time    = tfit.add[1, ]$time, n.risk = tfit.add[1, ]$n.risk,
n.event = tfit.add[1, ]$n.event, surv = tfit.add[1, ]$surv,
strata  = "0",
upper   = tfit.add[1, ]$upper, lower = tfit.add[1, ]$lower,
log.surv = log(-log(tfit.add[1, ]$surv)) ) df2 <- data.frame( time = tfit.add[2, ]$time,
n.risk  = tfit.add[2, ]$n.risk, n.event = tfit.add[2, ]$n.event,
surv    = tfit.add[2, ]$surv, strata = "1", upper = tfit.add[2, ]$upper,
lower   = tfit.add[2, ]$lower, log.surv = log(-log(tfit.add[2, ]$surv))
)
zeros <- data.frame(time = 0, surv = 1, strata = c(1, 2),
upper = 1, lower = 1)
dfpar.add$strata <- factor(dfpar.add$strata, labels = c("No tx", "Tx"))
ggplot(dfpar.add, aes(log(time), log.surv, colour = strata)) +
geom_step(size = 0.6) +
scale_color_manual("Tx", values = c('blue4', 'darkorange')) +
xlab("ln(time)") + ylab("Log-log survival")


Another graphical approach is to compare plots of predicted survival times from a Cox model (assuming PH) to Kaplan-Meier survivor function (which do not assume PH):

tfit.km <- survfit(Surv(dar, preg == 'pregnant') ~ tx, data = pgtrial)
df3.km <- data.frame(
time    = tfit.km$time, n.risk = tfit.km$n.risk,
n.event = tfit.km$n.event, surv = tfit.km$surv,
strata  = gsub("tx=", "", summary(tfit.km, censored = T)$strata), upper = tfit.km$upper,
lower   = tfit.km$lower ) zeros <- data.frame(time = 0, surv = 1, strata = gsub("tx=", "", levels(summary(tfit.km)$strata)),
upper = 1, lower = 1)
df3.km <- rbind.fill(df3.km, zeros)
df3.km$cat <- with(df3.km, ifelse(strata == "0", "No tx, observed", "Tx, observed")) dfpar.add$cat <- with(dfpar.add, ifelse(strata == "No tx", "No tx, expected",
"Tx, expected"))
ggplot(dfpar.obs, aes(time, surv, colour = cat)) +
geom_step(size = 0.6) +
scale_color_manual("", values = c('blue1', 'blue4', 'darkorange1',
'darkorange4')) +
xlab("time") + ylab("survival probability")


You can also assess PH statistically with the Schoenfeld residuals using cox.zph function:

(schoen <- cox.zph(coxph.mod))
rho chisq      p
herd2    -0.0630 1.100 0.2942
herd3    -0.0443 0.569 0.4506
tx       -0.1078 3.141 0.0763
lact      0.0377 0.447 0.5035
thinthin -0.0844 2.012 0.1560
GLOBAL        NA 7.631 0.1778

plot(schoen, var = 4)


Evaluating the Overall Fit of the Model

First we can look at the Cox-Snell residuals, which are the estimated cumulative hazards for individuals at their failure (or censoring) times. The default residuals of coxph in R are the martingale residuals, not the Cox-Snell. But it can be computed:

cox.snell <- (as.numeric(pgtrial$preg) - 1) - resid(coxph.mod, type = "martingale") coxph.res <- survfit(coxph(Surv(cox.snell, pgtrial$preg == 'pregnant') ~ 1,
method = 'breslow'), type = 'aalen')

plot(coxph.res$time, -log(coxph.res$surv), type = 's',
xlab = 'Modified Cox-Snell residuals', ylab = 'Cumulative hazard')
abline(0, 1, col = 'red', lty = 2)

## Alternatively:
coxph.res2 <- survfit(Surv(cox.snell, pgtrial$preg == 'pregnant') ~ 1) Htilde <- cumsum(coxph.res2$n.event / coxph.res$n.risk) plot(coxph.res2$time, Htilde, type = 's', col = 'blue')
abline(0, 1, col = 'red', lty = 2)


We can also use a goodness-of-fit test:

## GOF (Gronnesby and Borgan omnibus gof)
library(gof)
cumres(coxph.mod)

Kolmogorov-Smirnov-test: p-value=0.35
Cramer von Mises-test: p-value=0.506
Based on 1000 realizations. Cumulated residuals ordered by herd2-variable.
---
Kolmogorov-Smirnov-test: p-value=0.041
Cramer von Mises-test: p-value=0.589
Based on 1000 realizations. Cumulated residuals ordered by herd3-variable.
---
Kolmogorov-Smirnov-test: p-value=0
Cramer von Mises-test: p-value=0.071
Based on 1000 realizations. Cumulated residuals ordered by tx-variable.
---
Kolmogorov-Smirnov-test: p-value=0.728
Cramer von Mises-test: p-value=0.733
Based on 1000 realizations. Cumulated residuals ordered by lact-variable.
---
Kolmogorov-Smirnov-test: p-value=0.106
Cramer von Mises-test: p-value=0.091
Based on 1000 realizations. Cumulated residuals ordered by thinthin-variable.


We can evaluate the concordance between the predicted and observed sequence of pairs of events. Harrell’s c index computes the proportion of all pairs of subjects in which the model correctly predicts the sequence of events. It ranges from 0 to 1 with 0.5 for random predictions and 1 for a perfectly discriminating model. It is obtained from the Somer’s Dxy rank correlation:

library(rms)
fit.cph <- cph(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin,
data = pgtrial, x = TRUE, y = TRUE, surv = TRUE)
v <- validate(fit.cph, dxy = TRUE, B = 100)
Dxy <- v[rownames(v) == "Dxy", colnames(v) == "index.corrected"]
(Dxy / 2) + 0.5 # c index
[1] 0.4538712


Evaluating the Functional Form of Predictors

We can use martingale residuals to evaluate the functional form of the relationship between a continuous predictor and the survival expectation for individuals:

lact.mod <- coxph(Surv(dar, preg == 'pregnant') ~ lact, data = pgtrial,
ties = 'breslow')
lact.res <- resid(lact.mod, type = "martingale")
plot(pgtrial$lact, lact.res, xlab = 'lactation', ylab = 'Martingale residuals') lines(lowess(pgtrial$lact, lact.res, iter = 0))

lact.mod <- update(lact.mod, . ~ . + I(lact^2))
lact.res <- resid(lact.mod, type = "martingale")
plot(pgtrial$lact, lact.res, xlab = 'lactation', ylab = 'Martingale residuals') lines(lowess(pgtrial$lact, lact.res, iter = 0))


Checking for Outliers

Deviance residuals can be used to identify outliers:

## deviance residuals
dev.res <- resid(coxph.mod, type = "deviance")
plot(pgtrial$dar, dev.res, xlab = 'time (days)', ylab = 'deviance residuals') cbind(dev.res, pgtrial)[abs(dev.res) > 2, ] dev.res herd cow tx lact thin dar preg 1 2.557832 1 1 0 1 normal 1 pregnant 2 2.592492 1 2 1 4 thin 1 pregnant 3 2.319351 1 3 1 1 normal 2 pregnant 73 -2.693731 1 76 1 1 normal 277 censored 74 2.734508 2 78 0 2 thin 1 pregnant 75 2.644885 2 79 1 4 normal 1 pregnant 76 2.436308 2 80 1 1 normal 2 pregnant 176 -2.015925 2 180 1 2 normal 201 censored 180 -2.196008 2 184 1 2 normal 250 censored 183 -2.081493 2 187 1 3 thin 288 censored 185 -2.238729 2 189 0 1 normal 346 censored 314 -2.274912 3 318 0 1 thin 262 censored 315 -2.226711 3 319 0 2 thin 262 censored 316 -2.182517 3 320 0 4 thin 287 censored 317 -2.278029 3 321 0 2 thin 288 censored 318 -2.341736 3 322 0 3 thin 308 censored 319 -2.392427 3 323 0 2 thin 320 censored  Score residuals and scaled score residuals can be used to identify influential observations: ### Detecting influential points # score residuals score.res <- resid(coxph.mod, type = "score") # score residuals for tx plot(pgtrial$dar, score.res[ , 3], xlab = 'time (days)',
ylab = 'score residuals')
text(pgtrial$dar, score.res[ , 3], rownames(pgtrial), cex = 0.6, pos = 4) cbind(score.res[ , 3], pgtrial)[abs(score.res[ , 3]) > 2, ] score.res[, 3] herd cow tx lact thin dar preg 73 -2.025537 1 76 1 1 normal 277 censored ## influential observations dfbeta <- resid(coxph.mod, type = "dfbeta") # dfbeta residuals for tx plot(pgtrial$dar, dfbeta[ , 3], xlab = 'time (days)',
ylab = 'scaled score residual')
text(pgtrial$dar, dfbeta[ , 3], rownames(pgtrial), cex = 0.6, pos = 4) # with standardized dfbeta dfbetas <- resid(coxph.mod, type = "dfbetas") plot(pgtrial$dar, dfbetas[ , 3], xlab = 'time (days)',
ylab = 'standardized score residuals')
text(pgtrial$dar, dfbetas[ , 3], rownames(pgtrial), cex = 0.6, pos = 4)  Veterinary Epidemiologic Research: Modelling Survival Data – Non-Parametric Analyses Next topic from Veterinary Epidemiologic Research: chapter 19, modelling survival data. We start with non-parametric analyses where we make no assumptions about either the distribution of survival times or the functional form of the relationship between a predictor and survival. There are 3 non-parametric methods to describe time-to-event data: actuarial life tables, Kaplan-Meier method, and Nelson-Aalen method. We use data on occurrence of calf pneumonia in calves raised in 2 different housing systems. Calves surviving to 150 days without pneumonia are considered censored at that time. temp <- tempfile() download.file( "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp) load(unz(temp, "ver2_data_R/calf_pneu.rdata")) unlink(temp) library(Hmisc) calf_pneu <- upData(calf_pneu, labels = c(calf = 'Calf id', stock = 'Stocking method', days = 'Time to onset of pneumonia or censoring', pn = 'Pneumonia'), levels = list(stock = list('batch' = 0, 'continuous' = 1)))  Actuarial Life Table To create a life table, we use the function lifetab from package KMsurv, after calculating the number of censored and events at each time point and grouping them by time interval (with gsummary from package nlme). library(KMsurv) interval <- seq(from = 30, to = 165, by = 15) interval <- floor(calf_pneu$days/15)
interval.censor <- data.frame(interval, calf_pneu$pn) library(nlme) pneumonia <- gsummary(interval.censor, sum, groups = interval) total <- gsummary(interval.censor, length, groups = interval) lt.data <- cbind(pneumonia[ , 1:2], total[ , 2]) length <- length(lt.data$interval)
lt.data[length + 1, ]$interval <- NA nevent <- lt.data[ , 2] nlost <- lt.data[ , 3] - lt.data[ , 2] (life.table <- lifetab(lt.data$interval, 24, nlost, nevent))
nsubs nlost nrisk nevent      surv        pdf     hazard    se.surv
1-3      24     0  24.0      1 1.0000000 0.02083333 0.02127660 0.00000000
3-4      23     0  23.0      1 0.9583333 0.04166667 0.04444444 0.04078938
4-5      22     0  22.0      1 0.9166667 0.04166667 0.04651163 0.05641693
5-6      21     0  21.0      3 0.8750000 0.12500000 0.15384615 0.06750772
6-7      18     1  17.5      2 0.7500000 0.08571429 0.12121212 0.08838835
7-8      15     6  12.0      3 0.6642857 0.16607143 0.28571429 0.09686316
8-10      6     0   6.0      1 0.4982143 0.04151786 0.09090909 0.11032937
10-NA     5     5   2.5      0 0.4151786         NA         NA 0.11915934
NA-3      0    NA    NA     NA 0.4151786         NA         NA 0.11915934
se.pdf  se.hazard
1-3   0.02039469 0.02127178
3-4   0.04078938 0.04443347
4-5   0.04078938 0.04649905
5-6   0.06750772 0.08855994
6-7   0.05792828 0.08555236
7-8   0.08649471 0.16326531
8-10  0.03899969 0.09053265
10-NA         NA         NA
NA-3          NA         NA


Kaplan-Meier Method

To compute the Kaplan-Meier estimator we use the function survfit from package survival. It takes as argument a Surv object, which gives the time variable and the event of interest. You get the Kaplan-Meier estimate with the summary of the survfit object. We can then plot the estimates to show the Kaplan-Meier survivor function.

library(survival)
km.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu)
summary(km.sf)
Call: survfit(formula = Surv(days, pn == 1) ~ 1, data = calf_pneu)

time n.risk n.event survival std.err lower 95% CI upper 95% CI
27     24       1    0.958  0.0408        0.882        1.000
49     23       1    0.917  0.0564        0.813        1.000
72     22       1    0.875  0.0675        0.752        1.000
79     21       2    0.792  0.0829        0.645        0.972
89     19       1    0.750  0.0884        0.595        0.945
90     18       1    0.708  0.0928        0.548        0.916
101     17       1    0.667  0.0962        0.502        0.885
113     15       2    0.578  0.1019        0.409        0.816
117      9       1    0.514  0.1089        0.339        0.778
123      6       1    0.428  0.1198        0.247        0.741

plot(km.sf, xlab = "time (days)", ylab = "cumulative survival probability", conf.int = TRUE)


Nelson-Aalen Method

A “hazard” is the probability of failure at a point in time, given that the calf had survived up to that point in time. A cumulative hazard, the Nelson-Aaalen estimate, can be computed. The Nelson-Aalen estimate can be calculated by transforming the Fleming-Harrington estimate of survival.

fh.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu, type = "fleming")

plot(stepfun(fh.sf$time, c(0, -log(fh.sf$surv))), do.points = FALSE,
xlab = "time (days)", ylab = "cumulative hazard",
main = "", ylim = c(0, 1.5))
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$upper))), lty = 5, do.points = FALSE)
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$lower))), lty = 5, do.points = FALSE)


Tests of the Overall Survival Curve

Several tests are available to test whether the overall survivor functions in 2 or more groups are equal. We can use the log-rank test, the simplest test, assigning equal weight to each time point estimate and equivalent to a standard Mantel-Haenszel test. Also, there’s the Peto-Peto-Prentice test which weights the stratum-specific estimates by the overall survival experience and so reduces the influence of different censoring patterns between groups.
To do these tests, we apply the survdiff function to the Surv object. The argument rho gives the weights according to $S^{(t)}\rho$ and may be any numeric value. Default is rho = 0 which gives the log-rank test. Rho = 1 gives the “Peto & Peto modiﬁcation of the Gehan-Wilcoxon test”. Rho larger than zero gives greater weight to the ﬁrst part of the survival curves. Rho smaller than zero gives weight to the later part of the survival curves.

survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 0) # rho is optional
Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu,
rho = 0)

N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12        4     6.89      1.21      2.99
stock=continuous 12        8     5.11      1.63      2.99

Chisq= 3  on 1 degrees of freedom, p= 0.084

survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 1) # rho=1 asks for Peto-Peto test
Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu,
rho = 1)

N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12     2.89     5.25      1.06      3.13
stock=continuous 12     6.41     4.05      1.38      3.13

Chisq= 3.1  on 1 degrees of freedom, p= 0.0766


Finally we can compare survivor function with stock R plot or using ggplot2. With ggplot2, you get the necessary data from the survfit object and create a new data frame from it. The baseline data (time = 0) are not there so you create it yourself:

(km.stock <- survfit(Surv(days, pn == 1) ~ stock, data = calf_pneu))
Call: survfit(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu)

records n.max n.start events median 0.95LCL 0.95UCL
stock=batch           12    12      12      4     NA     123      NA
stock=continuous      12    12      12      8    113      79      NA

plot(km.stock, conf.int = FALSE, col = c("blue4", "darkorange"),
xlab = "time (days)", ylab = "cumulative survival probability")
legend("bottomleft", inset = .05, c("batch", "continuous"),
text.col = c("blue4", "darkorange"))

km.df <- data.frame(
time    = km.stock$time, n.risk = km.stock$n.risk,
n.event = km.stock$n.event, surv = km.stock$surv,
strata  = gsub("stock=", "", summary(km.stock, censored = T)$strata), upper = km.stock$upper,
lower   = km.stock$lower ) zeros <- data.frame(time = 0, surv = 1, strata = gsub("stock=", "", levels(summary(km.stock)$strata)),
upper = 1, lower = 1)
library(plyr)
km.df <- rbind.fill(zeros, km.df)
km.df$strata <- ordered(km.df$strata, levels = c("batch", "continuous"))
library(ggplot2)
ggplot(km.df, aes(time, surv, colour = strata)) +
geom_step(size = 0.6) + xlim(0, 150) + ylim(0, 1) +
xlab("time (days)") + ylab("cumulative survival probability") +
labs(colour = "stock")


Veterinary Epidemiologic Research: Count and Rate Data – Poisson Regression and Risk Ratios

As noted on paragraph 18.4.1 of the book Veterinary Epidemiologic Research, logistic regression is widely used for binary data, with the estimates reported as odds ratios (OR). If it’s appropriate for case-control studies, risk ratios (RR) are preferred for cohort studies as RR provides estimates of probabilities directly. Moreover, it is often forgotten the assumption of rare event rate, and the OR will overestimate the true effect.

One approach to get RR is to fit a generalised linear model (GLM) with a binomial distribution and a log link. But these models may sometimes fail to converge (Zou, 2004). Another option is to use a Poisson regression with no exposure or offset specified (McNutt, 2003). It gives estimates with very little bias but confidence intervals that are too wide. However, using robust standard errors gives correct confidence intervals (Greenland, 2004, Zou, 2004).

We use data on culling of dairy cows to demonstrate this.

temp <- tempfile()
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip",
temp)

table(culling$culled) 0 1 255 466 ### recoding number of lactation culling$lact <- with(culling, ifelse(lact_c3 > 1, 2, lact_c3))


The logistic regression:

log.reg <- glm(culled ~ lact, family = binomial("logit"), data = culling)
summary(log.reg)

Call:
glm(formula = culled ~ lact, family = binomial("logit"), data = culling)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.546  -1.199   0.849   0.849   1.156

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.734      0.299   -2.45    0.014 *
lact           0.784      0.171    4.59  4.4e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 4

cbind(exp(coef(log.reg)), exp(confint(log.reg)))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.48 0.267  0.864
lact        2.19 1.568  3.065


We could compute by hand the OR and RR:

with(culling, table(culled, lact))
lact
culled   1   2
0  97 158
1 102 364
## OR is 2.19:
(364 / 158) / (102 / 97)
[1] 2.19
## or the ratio between the cross-products
(364 * 97) / (158 * 102)
[1] 2.19
## or with epicalc
library(epicalc)
with(culling, cc(culled, lact, graph = FALSE))

lact
culled    1   2 Total
0      97 158   255
1     102 364   466
Total 199 522   721

OR =  2.19
Exact 95% CI =  1.54, 3.1
Chi-squared = 21.51, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0

# RR is 1.36:
(364 / 522) / (102 / 199)
[1] 1.36


The GLM with binomial distribution and log link:

log.reg2 <- glm(culled ~ lact, family = binomial(link = "log"), data = culling)
summary(log.reg2)

Call:
glm(formula = culled ~ lact, family = binomial(link = "log"),
data = culling)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.546  -1.199   0.849   0.849   1.156

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9762     0.1412   -6.91  4.8e-12 ***
lact          0.3078     0.0749    4.11  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 936.86  on 720  degrees of freedom
Residual deviance: 915.84  on 719  degrees of freedom
AIC: 919.8

Number of Fisher Scoring iterations: 5

cbind(exp(coef(log.reg2)), exp(confint(log.reg2)))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.377  0.28  0.488
lact        1.360  1.18  1.588


The modified Poisson with robust estimator as described by Zou, 2004 (GEE with individual observations treated as a repeated measure):

## create id for each observation
culling$id <- 1:length(culling$herd)
library(geepack)
zou.mod <- geeglm(culled ~ lact, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling)
summary(zou.mod)

Call:
geeglm(formula = culled ~ lact, family = poisson(link = "log"),
data = culling, id = id, corstr = "exchangeable")

Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept)  -0.9762  0.1412 47.8  4.8e-12 ***
lact          0.3078  0.0749 16.9  4.0e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
Estimate Std.err
(Intercept)    0.354   0.021

Correlation: Structure = exchangeable  Link = identity

Estimated Correlation Parameters:
Estimate Std.err
alpha        0       0
Number of clusters:   721   Maximum cluster size: 1
## exponentiated coefficients
exp(coef(zou.mod))
(Intercept)        lact
0.377       1.360
## getting confidence intervals
library(doBy)
zou.mod.coefci <- esticon(zou.mod, diag(2))
zou.mod.expci <- exp(cbind(zou.mod.coefci$Estimate, zou.mod.coefci$Lower, zou.mod.coefci$Upper)) rownames(zou.mod.expci) <- names(coef(zou.mod)) colnames(zou.mod.expci) <- c("RR", "Lower RR", "Upper RR") zou.mod.expci RR Lower RR Upper RR (Intercept) 0.377 0.286 0.497 lact 1.360 1.175 1.576  Or the same using glm() and then getting robust standard errors: pois.reg <- glm(culled ~ lact, family = poisson(link = "log"), data = culling) library(sandwich) # to get robust estimators library(lmtest) # to test coefficients coeftest(pois.reg, vcov = sandwich) z test of coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.9762 0.1412 -6.91 4.8e-12 *** lact 0.3078 0.0749 4.11 4.0e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ## RR exp(coef(pois.reg)) (Intercept) lact 0.377 1.360 ## CI 0.3078 + qnorm(0.05 / 2) * 0.0749 # upper 95% CI [1] 0.161 exp(0.3078 + qnorm(0.05 / 2) * 0.0749) [1] 1.17 0.3078 + qnorm(1 - 0.05 / 2) * 0.0749 # lower 95% CI [1] 0.455 exp(0.3078 + qnorm(1 - 0.05 / 2) * 0.0749) [1] 1.58  So no excuse to use odds ratios when risk ratios are more appropriate! Addition: Plotting the Risk Ratios zou.mod2 <- geeglm(culled ~ lact + johnes, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling) zou.mod2.coefci <- esticon(zou.mod2, diag(3)) zou.mod2.expci <- exp(cbind(zou.mod2.coefci$Estimate, zou.mod2.coefci$Lower, zou.mod2.coefci$Upper))
rownames(zou.mod2.expci) <- names(coef(zou.mod2))
colnames(zou.mod2.expci) <- c("RR", "LowerCI", "UpperCI")
zou.df <- as.data.frame(zou.mod2.expci)
zou.df$var <- row.names(zou.df) library(ggplot2) ggplot(zou.df[2:3, ], aes(y = RR, x = reorder(var, RR))) + geom_point() + geom_pointrange(aes(ymin = LowerCI, ymax = UpperCI)) + scale_x_discrete(labels = c("Lactation", "Johnes")) + scale_y_log10(breaks = seq(1, 2, by = 0.1)) + xlab("") + geom_hline(yintercept = 1, linetype = "dotdash", colour = "blue") + coord_flip()  Thanks to Tal Galili for suggesting this addition. Veterinary Epidemiologic Research: GLM – Evaluating Logistic Regression Models (part 3) Third part on logistic regression (first here, second here). Two steps in assessing the fit of the model: first is to determine if the model fits using summary measures of goodness of fit or by assessing the predictive ability of the model; second is to deterime if there’s any observations that do not fit the model or that have an influence on the model. Covariate pattern A covariate pattern is a unique combination of values of predictor variables. mod3 <- glm(casecont ~ dcpct + dneo + dclox + dneo*dclox, + family = binomial("logit"), data = nocardia) summary(mod3) Call: glm(formula = casecont ~ dcpct + dneo + dclox + dneo * dclox, family = binomial("logit"), data = nocardia) Deviance Residuals: Min 1Q Median 3Q Max -1.9191 -0.7682 0.1874 0.5876 2.6755 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.776896 0.993251 -3.803 0.000143 *** dcpct 0.022618 0.007723 2.928 0.003406 ** dneoYes 3.184002 0.837199 3.803 0.000143 *** dcloxYes 0.445705 1.026026 0.434 0.663999 dneoYes:dcloxYes -2.551997 1.205075 -2.118 0.034200 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 149.72 on 107 degrees of freedom Residual deviance: 103.42 on 103 degrees of freedom AIC: 113.42 Number of Fisher Scoring iterations: 5 library(epiR) Package epiR 0.9-45 is loaded Type help(epi.about) for summary information mod3.mf <- model.frame(mod3) (mod3.cp <- epi.cp(mod3.mf[-1]))$cov.pattern
id  n dcpct dneo dclox
1    1  7     0   No    No
2    2 38   100  Yes    No
3    3  1    25   No    No
4    4  1     1   No    No
5    5 11   100   No   Yes
8    6  1    25  Yes   Yes
10   7  1    14  Yes    No
12   8  4    75  Yes    No
13   9  1    90  Yes   Yes
14  10  1    30   No    No
15  11  3     5  Yes    No
17  12  9   100  Yes   Yes
22  13  2    20  Yes    No
23  14  8   100   No    No
25  15  2    50  Yes   Yes
26  16  1     7   No    No
27  17  4    50  Yes    No
28  18  1    50   No    No
31  19  1    30  Yes    No
34  20  1    99   No    No
35  21  1    99  Yes   Yes
40  22  1    80  Yes   Yes
48  23  1     3  Yes    No
59  24  1     1  Yes    No
77  25  1    10   No    No
84  26  1    83   No   Yes
85  27  1    95  Yes    No
88  28  1    99  Yes    No
89  29  1    25  Yes    No
105 30  1    40  Yes    No

$id [1] 1 2 3 4 5 1 1 6 5 7 5 8 9 10 11 11 12 1 12 1 5 13 14 2 15 [26] 16 17 18 1 2 19 2 14 20 21 12 14 5 8 22 14 5 5 5 1 14 14 23 2 12 [51] 14 12 11 5 15 2 8 2 24 2 2 8 2 17 2 2 2 2 12 12 12 2 2 2 5 [76] 2 25 2 17 2 2 2 2 26 27 13 17 28 29 14 2 12 2 2 2 2 2 2 2 2 [101] 2 2 2 2 30 2 2 5  There are 30 covariate patterns in the dataset. The pattern dcpct=100, dneo=Yes, dclox=No appears 38 times. Pearson and deviance residuals Residuals represent the difference between the data and the model. The Pearson residuals are comparable to standardized residuals used for linear regression models. Deviance residuals represent the contribution of each observation to the overall deviance. residuals(mod3) # deviance residuals residuals(mod3, "pearson") # pearson residuals  Goodness-of-fit test All goodness-of-fit tests are based on the premise that the data will be divided into subsets and within each subset the predicted number of outcomes will be computed and compared to the observed number of outcomes. The Pearson $\chi^2$ and the deviance $\chi^2$ are based on dividing the data up into the natural covariate patterns. The Hosmer-Lemeshow test is based on a more arbitrary division of the data. The Pearson $\chi^2$ is similar to the residual sum of squares used in linear models. It will be close in size to the deviance, but the model is fit to minimize the deviance and not the Pearson $\chi^2$. It is thus possible even if unlikely that the $\chi^2$ could increase as a predictor is added to the model. sum(residuals(mod3, type = "pearson")^2) [1] 123.9656 deviance(mod3) [1] 103.4168 1 - pchisq(deviance(mod3), df.residual(mod3)) [1] 0.4699251  The p-value is large indicating no evidence of lack of fit. However, when using the deviance statistic to assess the goodness-of-fit for a nonsaturated logistic model, the $\chi^2$ approximation for the likelihood ratio test is questionable. When the covariate pattern is almost as large as N, the deviance cannot be assumed to have a $\chi^2$ distribution. Now the Hosmer-Lemeshow test, usually dividing by 10 the data: hosmerlem <- function (y, yhat, g = 10) { + cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)), + include.lowest = TRUE) + obs <- xtabs(cbind(1 - y, y) ~ cutyhat) + expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat) + chisq <- sum((obs - expect)^2 / expect) + P <- 1 - pchisq(chisq, g - 2) + c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P) + } hosmerlem(y = nocardia$casecont, yhat = fitted(mod3))
Erreur dans cut.default(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),  (depuis #2) :
'breaks' are not unique


The model used has many ties in its predicted probabilities (too few covariate values?) resulting in an error when running the Hosmer-Lemeshow test. Using fewer cut-points (g = 5 or 7) does not solve the problem. This is a typical example when not to use this test. A better goodness-of-fit test than Hosmer-Lemeshow and Pearson / deviance $\chi^2$ tests is the le Cessie – van Houwelingen – Copas – Hosmer unweighted sum of squares test for global goodness of fit (also here) implemented in the rms package (but you have to implement your model with the lrm function of this package):

mod3b <- lrm(casecont ~ dcpct + dneo + dclox + dneo*dclox, nocardia,
+              method = "lrm.fit", model = TRUE, x = TRUE, y = TRUE,
+              linear.predictors = TRUE, se.fit = FALSE)
residuals(mod3b, type = "gof")
Sum of squared errors     Expected value|H0                    SD
16.4288056            16.8235055             0.2775694
Z                     P
-1.4219860             0.1550303


The p-value is 0.16 so there’s no evidence the model is incorrect. Even better than these tests would be to check for linearity of the predictors.

Overdispersion
Sometimes we can get a deviance that is much larger than expected if the model was correct. It can be due to the presence of outliers, sparse data or clustering of data. The approach to deal with overdispersion is to add a dispersion parameter $\sigma^2$. It can be estimated with: $\hat{\sigma}^2 = \frac{\chi^2}{n - p}$ (p = probability of success). A half-normal plot of the residuals can help checking for outliers:

library(faraway)
halfnorm(residuals(mod1))


The dispesion parameter of model 1 can be found as:

(sigma2 <- sum(residuals(mod1, type = "pearson")^2) / 104)
[1] 1.128778
drop1(mod1, scale = sigma2, test = "F")
Single term deletions

Model:
casecont ~ dcpct + dneo + dclox

scale:  1.128778

Df Deviance    AIC F value    Pr(>F)
<none>      107.99 115.99
dcpct   1   119.34 124.05 10.9350  0.001296 **
dneo    1   125.86 129.82 17.2166 6.834e-05 ***
dclox   1   114.73 119.96  6.4931  0.012291 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Message d'avis :
In drop1.glm(mod1, scale = sigma2, test = "F") :
le test F implique une famille 'quasibinomial'


The dispersion parameter is not very different than one (no dispersion). If dispersion was present, you could use it in the F-tests for the predictors, adding scale to drop1.

Predictive ability of the model
A ROC curve can be drawn:

predicted <- predict(mod3)
library(ROCR)
prob <- prediction(predicted, nocardia$casecont, + label.ordering = c('Control', 'Case')) tprfpr <- performance(prob, "tpr", "fpr") tpr <- unlist(slot(tprfpr, "y.values")) fpr <- unlist(slot(tprfpr, "x.values")) roc <- data.frame(tpr, fpr) ggplot(roc) + geom_line(aes(x = fpr, y = tpr)) + + geom_abline(intercept = 0, slope = 1, colour = "gray") + + ylab("Sensitivity") + + xlab("1 - Specificity")  Identifying important observations Like for linear regression, large positive or negative standardized residuals allow to identify points which are not well fit by the model. A plot of Pearson residuals as a function of the logit for model 1 is drawn here, with bubbles relative to size of the covariate pattern. The plot should be an horizontal band with observations between -3 and +3. Covariate patterns 25 and 26 are problematic. nocardia$casecont.num <- as.numeric(nocardia$casecont) - 1 mod1 <- glm(casecont.num ~ dcpct + dneo + dclox, family = binomial("logit"), + data = nocardia) # "logit" can be omitted as it is the default mod1.mf <- model.frame(mod1) mod1.cp <- epi.cp(mod1.mf[-1]) nocardia.cp <- as.data.frame(cbind(cpid = mod1.cp$id,
+                                    nocardia[ , c(1, 9:11, 13)],
+                                    fit = fitted(mod1)))
### Residuals and delta betas based on covariate pattern:
mod1.obs <- as.vector(by(as.numeric(nocardia.cp$casecont.num), + as.factor(nocardia.cp$cpid), FUN = sum))
mod1.fit <- as.vector(by(nocardia.cp$fit, as.factor(nocardia.cp$cpid),
+                          FUN = min))
mod1.res <- epi.cpresids(obs = mod1.obs, fit = mod1.fit,
+                          covpattern = mod1.cp)

mod1.lodds <- as.vector(by(predict(mod1), as.factor(nocardia.cp$cpid), + FUN = min)) plot(mod1.lodds, mod1.res$spearson,
+      type = "n", ylab = "Pearson Residuals", xlab = "Logit")
text(mod1.lodds, mod1.res$spearson, labels = mod1.res$cpid, cex = 0.8)
symbols(mod1.lodds, mod1.res$pearson, circles = mod1.res$n, add = TRUE)


The hat matrix is used to calculate leverage values and other diagnostic parameters. Leverage measures the potential impact of an observation. Points with high leverage have a potential impact. Covariate patterns 2, 14, 12 and 5 have the largest leverage values.

mod1.res[sort.list(mod1.res$leverage, decreasing = TRUE), ] cpid leverage 2 0.74708052 14 0.54693851 12 0.54017700 5 0.42682684 11 0.21749664 1 0.19129427 ...  Delta-betas provides an overall estimate of the effect of the $j^{th}$ covariate pattern on the regression coefficients. It is analogous to Cook’s distance in linear regression. Covariate pattern 2 has the largest delta-beta (and represents 38 observations). mod1.res[sort.list(mod1.res$sdeltabeta, decreasing = TRUE), ]
cpid sdeltabeta
2    7.890878470
14   3.983840529
...


Veterinary Epidemiologic Research: Linear Regression Part 2 – Checking assumptions

We continue on the linear regression chapter the book Veterinary Epidemiologic Research.

Using same data as last post and running example 14.12:

tmp <- tempfile()
load(unz(tmp, "ver2_data_R/daisy2.rdata")) # extract the target file from the temporary file
unlink(tmp) # remove the temporary file

### some data management
daisy2 <- daisy2[daisy2$h7 == 1, ] # we only use a subset of the data library(Hmisc) daisy2 <- upData(daisy2, labels = c(region = 'Region', herd = 'Herd number', cow = 'Cow number', study_lact = 'Study lactation number', herd_size = 'Herd size', mwp = "Minimum wait period for herd", parity = 'Lactation number', milk120 = 'Milk volume in first 120 days of lactation', calv_dt = 'Calving date', cf = 'Calving to first service interval', fs = 'Conception at first service', cc = 'Calving to conception interval', wpc = 'Interval from wait period to conception', spc = 'Services to conception', twin = 'Twins born', dyst = 'Dystocia at calving', rp = 'Retained placenta at calving', vag_disch = 'Vaginal discharge observed', h7 = 'Indicator for 7 herd subset'), levels = list(fs = list('No' = 0, 'Yes' = 1), twin = list('No' = 0, 'Yes' = 1), dyst = list('No' = 0, 'Yes' = 1), rp = list('No' = 0, 'Yes' = 1), vag_disch = list('No' = 0, 'Yes' = 1)), units = c(milk120 = "litres")) library(lubridate) daisy2$date <- as.character(daisy2$calv_dt) daisy2$date <- ymd(daisy2$date) daisy2$mth <- month(daisy2$date) # calving in automn or not: daisy2$aut_calv <- with(daisy2, ifelse(mth %in% c(9:12), "fall", "other"))
# centering variables:
daisy2$hs100 <- daisy2$herd_size / 100
daisy2$hs100_ct <- daisy2$hs100 - mean(daisy2$hs100) daisy2$hs100_ctsq <- daisy2$hs100_ct^2 daisy2$parity_sc <- daisy2$parity - mean(daisy2$parity)

daisy3 <- daisy2[complete.cases(daisy2), ] # df with only complete cases

lm.wpc <- lm(wpc ~ aut_calv + hs100_ct + hs100_ctsq + parity_sc + twin +
+              dyst + rp + vag_disch + rp*vag_disch, data = daisy3)
(lm.wpc.sum <- summary(lm.wpc))

Call:
lm(formula = wpc ~ aut_calv + hs100_ct + hs100_ctsq + parity_sc +
twin + dyst + rp + vag_disch + rp * vag_disch, data = daisy3)

Residuals:
Interval from wait period to conception
Min     1Q Median     3Q    Max
-76.40 -35.23 -15.22  22.64 210.63

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         66.9831     2.4970  26.825  < 2e-16 ***
aut_calvother       -6.1336     2.6420  -2.322 0.020385 *
hs100_ct            21.8485     2.2546   9.690  < 2e-16 ***
hs100_ctsq          11.7763     3.1952   3.686 0.000236 ***
parity_sc            1.1421     0.8727   1.309 0.190848
twinYes             19.7621     9.8677   2.003 0.045387 *
dystYes             11.0347     5.5386   1.992 0.046515 *
rpYes                7.1111     4.8654   1.462 0.144067
vag_dischYes         2.0977     7.2588   0.289 0.772629
rpYes:vag_dischYes  19.9543    12.7239   1.568 0.117031
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 50 on 1515 degrees of freedom
Multiple R-squared: 0.07355,	Adjusted R-squared: 0.06805
F-statistic: 13.36 on 9 and 1515 DF,  p-value: < 2.2e-16


Now we can create some plots to assess the major assumptions of linear regression. First, let’s have a look at homoscedasticity, or constant variance of residuals. You can run a statistical test, the Cook-Weisberg test (or Breusch-Pagan test), available in libraries car and lmtest:

library(car)
ncvTest(lm.wpc)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 23.08052    Df = 1     p = 1.553565e-06

library(lmtest)
bptest(lm.wpc)

studentized Breusch-Pagan test

data:  lm.wpc
BP = 24.3082, df = 9, p-value = 0.00384


Both tests are significant indicating the presence of heteroscedasticity but they are slightly different. Quoting John Fox from this post: “bp.test() performs the same score test as ncvTest(), except that the default alternative hypothesis is different — in bp.test() the error variance is a function of a linear combination of the regressors and in ncvTest() the error variance is a function of the fitted values (i.e., a *particular* linear combination of regressors). Testing against the fitted values with 1 df will have greater power if this is the real pattern of heteroscedasticity”.

But the evaluation of linear model assumptions relies more often on graphical methods. You can use the 6 diagnostic plots readily available after fitting your model, or try it with ggplot2 using the fortify() method which allow combining a model with its data (“the model fortifies the data, and the data fortifies the model”):

plot.lm(lm.wpc, which = 1) # diag plot in base stats package
library(ggplot2)
ggplot(lm.wpc, aes(.fitted, .resid)) +
geom_hline(yintercept = 0) +
geom_point() +
geom_smooth(se = FALSE)


We can check the normality of the residuals by a statistical test, the Shapiro-Wilk test, and a normal probability plot (Q-Q plot):

shapiro.test(resid(lm.wpc))

Shapiro-Wilk normality test

data:  resid(lm.wpc)
W = 0.8817, p-value < 2.2e-16

ggplot(lm.wpc, aes(sample = .stdresid)) +
stat_qq() +
geom_abline()


Other diagnostic plots: Scale-Location plot of sqrt(|residuals|) against fitted values, Cook’s distance plot, Residuals vs. leverages, Cook’s distances vs. leverages:

## Scale-Location plot of sqrt(|residuals|) against fitted values (which = 3)
ggplot(lm.wpc, aes(.fitted, abs(.stdresid))) +
geom_point() +
geom_smooth(se = FALSE) +
scale_y_sqrt()

# Cook's distance plot (which = 4)
ggplot(lm.wpc, aes(seq_along(.cooksd), .cooksd)) +
geom_bar(stat = "identity")

# Residuals vs. leverages (which = 5)
ggplot(lm.wpc, aes(.hat, .stdresid)) +
geom_vline(size = 2, colour = "white", xintercept = 0) +
geom_hline(size = 2, colour = "white", yintercept = 0) +
geom_point() +
geom_smooth(se = FALSE)

# Cook's distances  vs leverages/(1-leverages) (which = 6)
ggplot(lm.wpc, aes(.hat, .cooksd)) +
geom_vline(colour = NA) +
geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
geom_smooth(se = FALSE) +
geom_point()


You could also check the linearity of the predictor-outcome association by plotting the residuals against each of the continuous predictor variables, for example here centered herd size:

ggplot(lm.wpc, aes(hs100_ct, .resid)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE)


Log Transformation
Let’s do a log transformation (with a simpler model) and back-transform the coefficients:

daisy3$milk120.sq <- daisy3$milk120^2
lm.wpc2 <- lm(wpc ~ vag_disch + milk120 + milk120.sq, data = daisy3)
lm.lwpc <- lm(log(wpc) ~ vag_disch + milk120 + milk120.sq, data = daisy3)
(lm.lwpc.sum <- summary(lm.lwpc))

Call:
lm(formula = log(wpc) ~ vag_disch + milk120 + milk120.sq, data = daisy3)

Residuals:
Interval from wait period to conception
Min      1Q  Median      3Q     Max
-3.9229 -0.5554  0.0231  0.5794  1.7245

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.765e+00  3.140e-01  15.175   <2e-16 ***
vag_dischYes  1.477e-01  8.761e-02   1.686   0.0921 .
milk120      -4.664e-04  1.951e-04  -2.390   0.0170 *
milk120.sq    6.399e-08  2.965e-08   2.159   0.0310 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7615 on 1521 degrees of freedom
Multiple R-squared: 0.006997,	Adjusted R-squared: 0.005039
F-statistic: 3.573 on 3 and 1521 DF,  p-value: 0.01356

(ci.lwpc <- confint(lm.lwpc)) # confidence intervals
2.5 %        97.5 %
(Intercept)   4.149060e+00  5.380862e+00
vag_dischYes -2.416148e-02  3.195190e-01
milk120      -8.491117e-04 -8.367357e-05
milk120.sq    5.844004e-09  1.221453e-07
exp(lm.lwpc$coefficients[2]) vag_dischYes 1.15914 exp(ci.lwpc[2, 1]) [1] 0.9761281 exp(ci.lwpc[2, 2]) [1] 1.376465  The trick to get the confidence interval is to get it on the transformed scale and then going back to the original scale. The most difficult part is in interpreting the results. A proprty of the logarithm is that “the difference between logs is the log of the ratio”. So the mean waiting period length difference between cows with and without vaginal discharge is 0.15 on a log scale, or exp(0.15) = 1.15 times longer for cows with vaginal discharge. Better maybe, express as ratios: the ratio of waiting periods from cows with vaginal discharge or no discharge is 1.15 (95% CI: 0.98 to 1.38). Note that the back-transformation of the confidence interval is not symmetric anymore. So for your model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + ... + \hat{\beta}_n x_n$, or $\hat{y} = e^{\hat{\beta}_0}e^{\hat{\beta}_{1}x_i}...e^{\hat{\beta}_{n}x_n}$, an increase of 1 for $x_1$ multiply the predicted response (in the original scale) by $e^{\hat{\beta}_1}$ (the regression coefficients are interpreted in a multiplicative rather than additive manner). If $\hat{\beta}_1 >0$, when $x$ increases by 1, the median of $\hat{y}$ increases by $(e^{\hat{\beta}_1} - 1) \times 100\%$. If $\hat{\beta}_1 <0$, when $x$ increases by 1, the median of $\hat{y}$ decreases by $(1 - e^{\hat{\beta}_1}) \times 100\%$. Beeswarm Plot with ggplot2 A colleague showed me results of his study project with beeswarm plots made by GraphPad. I was wondering if it could be implemented in R and more specifically with ggplot2. There is a R package allowing to draw such graphs, the beeswarm package (beeswarm, cran). An implementation was shown on R-statistics blog but not with ggplot. First here’s the example from the beeswarm package: library(beeswarm) data(breast) breast2 <- breast[order(breast$event_survival, breast$ER),] beeswarm(time_survival ~ event_survival, data = breast2, pch = 16, pwcol = as.numeric(ER), xlab = '', ylab = 'Follow-up time (months)', labels = c('Censored', 'Metastasis')) legend('topright', legend = levels(breast$ER), title = 'ER',
pch = 16, col = 1:2)


Or even like in Tal Galili’s blog, with a boxplot:

beeswarm(time_survival ~ event_survival, data = breast2, pch = 16,
pwcol = as.numeric(ER), xlab = '',
ylab = 'Follow-up time (months)',
labels = c('Censored', 'Metastasis'))
boxplot(time_survival ~ event_survival, data = breast2, add = T,
names = c("",""), col="#0000ff22")
legend('topright', legend = levels(breast$ER), title = 'ER', pch = 16, col = 1:2)  The trick is to use the beeswarm call to get the x and y position. Beeswarm creates a dataframe from which we can get the necessary positionings. beeswarm <- beeswarm(time_survival ~ event_survival, data = breast, method = 'swarm', pwcol = ER)[, c(1, 2, 4, 6)] colnames(beeswarm) <- c("x", "y", "ER", "event_survival") library(ggplot2) library(plyr) beeswarm.plot <- ggplot(beeswarm, aes(x, y)) + xlab("") + scale_y_continuous(expression("Follow-up time (months)")) beeswarm.plot2 <- beeswarm.plot + geom_boxplot(aes(x, y, group = round_any(x, 1, round)), outlier.shape = NA) beeswarm.plot3 <- beeswarm.plot2 + geom_point(aes(colour = ER)) + scale_colour_manual(values = c("black", "red")) + scale_x_continuous(breaks = c(1:2), labels = c("Censored", "Metastasis"), expand = c(0, 0.5))  Do not forget to remove the outliers from your boxplot or they will superimpose with the points created by geom_point. I wonder if these plots are more useful in certain field. If anybody has references for beeswarm plots, I would be very grateful. R, JAGS and ggplot2 Last week a question was asked on the ggplot2 list about using ggplot2 and jags in R (). Here’s what was my answer (a bit updated): Using as an example the school dataset from R2WinBUGS package: library(rjags) library(coda) library(ggplot2) library(R2WinBUGS) data(schools) schoolmodel <- function(){ for (j in 1:J) { y[j] ~ dnorm (theta[j], tau.y[j]) theta[j] ~ dnorm (mu.theta, tau.theta) tau.y[j] <- pow(sigma.y[j], -2) } mu.theta ~ dnorm (0.0, 1.0E-6) tau.theta <- pow(sigma.theta, -2) sigma.theta ~ dunif (0, 1000) } write.model(schoolmodel, "~/schoolmodel.bug") J <- nrow(schools) y <- schools$estimate
sigma.y <- schools$sd forJags <- list(J = nrow(schools), y = schools$estimate, sigma.y = schools\$sd)
inits <- function(){
list(theta = rnorm(J, 0, 100), mu.theta = rnorm(1, 0, 100),
sigma.theta = runif(1, 0, 100))
}

school.sim <- jags.model(file = "schoolmodel.bug", data = forJags,
inits = inits, n.chains = 3, n.adapt = 500)
school.out <- coda.samples(school.sim, variable.names =
c("theta", "mu.theta", "sigma.theta"),
n.iter = 1000)
summary(school.out)


Than you can use the mcmcplots package which give a “feel” of ggplot2:

library(mcmcplots)
denplot(school.out, parms = c('mu.theta', collapse = FALSE, auto.layout = TRUE))
traplot(school.out, parms = c('mu.theta'))
autplot1(school.out[,"mu.theta", drop = FALSE])


If you really want to use ggplot2, you have to extract the information you need, like for example:

varnames(school.out)
str(school.out)
mu.theta.out <- cbind(school.out[[1]][,1], school.out[[2]][,1], school.out[[3]][,1])
attributes(mu.theta.out) <- NULL
dens.mu.theta <- density(mu.theta.out)
q25.mu.theta <- quantile(mu.theta.out, .025)
q975.mu.theta <- quantile(mu.theta.out, .975)
dd.mu.theta <- with(dens.mu.theta, data.frame(x,y))

qplot(x, y, data = dd.mu.theta, geom = "line", ylab = "", xlab = "") +
geom_ribbon(data = subset(dd.mu.theta, x>q25.mu.theta & x<q975.mu.theta),
aes(ymax = y), ymin = 0, fill = "red", colour = NA, alpha = 0.5)


This would give you the density plot with its 95% credible interval.