Veterinary Epidemiologic Research: Modelling Survival Data – Parametric and Frailty Models

Last post on modelling survival data from Veterinary Epidemiologic Research: parametric analyses. The Cox proportional hazards model described in the last post makes no assumption about the shape of the baseline hazard, which is an advantage if you have no idea what that shape might be. With a parametric survival model, the survival time is assumed to follow a known distribution: Weibull, exponential (a special case of the Weibull), log-logistic, log-normal, or generalized gamma.

Exponential Model
The exponential model is the simplest: the hazard h_0(t) is constant over time, i.e. the rate at which failures occur is constant, h(t) = \lambda . We again use the pgtrial dataset:

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/pgtrial.rdata"))
unlink(temp)

library(Hmisc)
pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id',
tx = 'Treatment', lact = 'Lactation number',
thin = 'Body condition', dar = 'Days at risk',
preg = 'Pregnant or censored'),
levels = list(thin = list('normal' = 0, 'thin' = 1),
preg = list('censored' = 0, 'pregnant' = 1)))
pgtrial$herd <- as.factor(pgtrial$herd)

library(survival)
exp.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + (lact - 1) +
                   thin, data = pgtrial, dist = "exponential")
summary(exp.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx + 
    (lact - 1) + thin, data = pgtrial, dist = "exponential")
           Value Std. Error     z         p
herd1     4.3629     0.1827 23.88 4.66e-126
herd2     4.6776     0.1711 27.34 1.41e-164
herd3     4.3253     0.1617 26.75 1.12e-157
tx       -0.2178     0.1255 -1.74  8.26e-02
lact      0.0416     0.0413  1.01  3.14e-01
thinthin  0.1574     0.1383  1.14  2.55e-01

Scale fixed at 1 

Exponential distribution
Loglik(model)= -1459.9   Loglik(intercept only)= -1465.6
	Chisq= 11.42 on 5 degrees of freedom, p= 0.044 
Number of Newton-Raphson Iterations: 5 
n= 319

Interpretation is similar to a Cox model: exponentiated coefficients (on the proportional hazards scale) are hazard ratios. R outputs the parameter estimates in the AFT (accelerated failure time) form of the exponential model; if you multiply the estimated coefficients by minus one, you get estimates that are consistent with the proportional hazards parameterization of the model. So for tx, the estimated hazard ratio is exp(0.2178) = 1.24: at any given point in time, a treated cow is 1.24 times more likely to conceive than a non-treated cow. The corresponding acceleration factor for an exponential model is the reciprocal of the hazard ratio, exp(-0.2178) = 0.80: treating a cow accelerates the time to conception by a factor of 0.80.
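
As a quick, minimal sketch (using the exp.mod object fitted above), both parameterizations can be obtained directly from the fitted model:

## hazard ratios (proportional hazards scale): flip the sign of the AFT coefficients
round(exp(-coef(exp.mod)), 2)
## acceleration factors (AFT scale): exponentiate the coefficients as reported
round(exp(coef(exp.mod)), 2)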

Weibull Model

In a Weibull model, the hazard function is h(t) = \lambda p t^{p-1} where \lambda and p are > 0. p is the shape parameter and determines the shape of the hazard function: if p > 1 , the hazard increases with time; if p = 1 , the hazard is constant and the model reduces to an exponential model; if p < 1 , the hazard decreases over time.

library(car)
pgtrial$parity <- recode(pgtrial$lact, "1 = 1; 2:hi = '2+'")
weib.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + parity +
                   thin, data = pgtrial, dist = "weibull")
summary(weib.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx + 
    parity + thin, data = pgtrial, dist = "weibull")
               Value Std. Error       z         p
(Intercept)  4.23053     0.1937 21.8412 9.42e-106
herd2        0.36117     0.1947  1.8548  6.36e-02
herd3       -0.00822     0.1980 -0.0415  9.67e-01
tx          -0.23386     0.1438 -1.6262  1.04e-01
parity2+     0.33819     0.1490  2.2698  2.32e-02
thinthin     0.11222     0.1576  0.7119  4.77e-01
Log(scale)   0.13959     0.0509  2.7407  6.13e-03

Scale= 1.15 

Weibull distribution
Loglik(model)= -1453.7   Loglik(intercept only)= -1460.7
	Chisq= 14 on 5 degrees of freedom, p= 0.016 
Number of Newton-Raphson Iterations: 5 
n= 319 

The shape parameter is the reciprocal of what R calls the scale parameter. The shape parameter here is therefore 1/1.15 = 0.87.
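
As a small sketch, the shape parameter can be recovered directly from the fitted survreg object (weib.mod above), whose scale component holds the estimated scale:

## Weibull shape parameter p = 1 / scale (about 0.87 here)
1 / weib.mod$scale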

We can also use a piecewise constant exponential regression model, which allows the baseline hazard to vary between time periods but forces it to remain constant within each period. To run such a model, we need data in a counting-process format, with a start and stop time for each interval. However, survreg does not accept data in that format. The trick is to fit a Poisson model with glm, including the time intervals as a factor. See this post by Stephanie Kovalchik, which explains how to construct the data and the model. The example below uses the same approach, with 40-day intervals:

interval.width <- 40
# function to compute time breaks given the exit time = dar
cow.breaks <- function(dar) unique(c(seq(0, dar, by = interval.width),
                                      dar))
# list of each subject's time periods
the.breaks <- lapply(unique(pgtrial$cow), function(id){
  cow.breaks(max(pgtrial$dar[pgtrial$cow == id]))
})
# the expanded period of observation:
start <- lapply(the.breaks, function(x) x[-length(x)]) # for left time points
stop <-  lapply(the.breaks, function(x) x[-1]) # for right time points

count.per.cow <- sapply(start, length)
index <- tapply(pgtrial$cow, pgtrial$cow, length)
index <- cumsum(index) # index of last observation for each cow

event <- rep(0, sum(count.per.cow))
event[cumsum(count.per.cow)] <- pgtrial$preg[index]

# creating the expanded dataset
pw.pgtrial <- data.frame(
    cow = rep(pgtrial$cow[index], count.per.cow),
    dar = rep(pgtrial$dar[index], count.per.cow),
    herd = rep(pgtrial$herd[index], count.per.cow),
    tx = rep(pgtrial$tx[index], count.per.cow),
    lact = rep(pgtrial$lact[index], count.per.cow),
    thin = rep(pgtrial$thin[index], count.per.cow),
    start = unlist(start),
    stop = unlist(stop),
    event = event
  )

# create time variable which indicates the period of observation (offset in Poisson model)
pw.pgtrial$time <- pw.pgtrial$stop - pw.pgtrial$start # length of observation

# create a factor for each interval, allowing to have a different rate for each period
pw.pgtrial$interval <- factor(pw.pgtrial$start)

pw.pgtrial[100:110, ]
    cow dar herd tx lact   thin start stop event time interval
100  61 113    1  1    4   thin     0   40     0   40        0
101  61 113    1  1    4   thin    40   80     0   40       40
102  61 113    1  1    4   thin    80  113     1   33       80
103  62 117    1  0    7 normal     0   40     0   40        0
104  62 117    1  0    7 normal    40   80     0   40       40
105  62 117    1  0    7 normal    80  117     2   37       80
106  63 121    1  1    1   thin     0   40     0   40        0
107  63 121    1  1    1   thin    40   80     0   40       40
108  63 121    1  1    1   thin    80  120     0   40       80
109  63 121    1  1    1   thin   120  121     2    1      120
110  64 122    1  1    3 normal     0   40     0   40        0

# Poisson model
pw.model <- glm(event ~ offset(log(time)) + interval + herd + tx + lact +
                thin, data = pw.pgtrial, family = "poisson")
summary(pw.model)

Call:
glm(formula = event ~ offset(log(time)) + interval + herd + tx + 
    lact + thin, family = "poisson", data = pw.pgtrial)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.858  -1.373  -1.227   1.357   3.904  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.602545   0.132436 -27.202  < 2e-16 ***
interval40  -0.112838   0.106807  -1.056  0.29076    
interval80  -0.064105   0.125396  -0.511  0.60920    
interval120 -0.007682   0.147919  -0.052  0.95858    
interval160 -0.005743   0.191778  -0.030  0.97611    
interval200 -0.427775   0.309143  -1.384  0.16644    
interval240  0.199904   0.297331   0.672  0.50137    
interval280  0.737508   0.385648   1.912  0.05583 .  
interval320  0.622366   1.006559   0.618  0.53637    
herd2       -0.254389   0.114467  -2.222  0.02626 *  
herd3        0.026851   0.119416   0.225  0.82209    
tx           0.219584   0.084824   2.589  0.00963 ** 
lact        -0.023528   0.027511  -0.855  0.39241    
thinthin    -0.139915   0.093632  -1.494  0.13509    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2155.6  on 798  degrees of freedom
Residual deviance: 2131.1  on 785  degrees of freedom
AIC: 2959.1

Number of Fisher Scoring iterations: 7
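
To help read this output, here is a minimal sketch (using pw.model above): exponentiating the Poisson coefficients gives rate ratios, and confint.default gives Wald confidence limits on the same scale:

## rate ratios and 95% Wald confidence intervals from the piecewise exponential model
round(exp(cbind(RR = coef(pw.model), confint.default(pw.model))), 2)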

Log-logistic Model

loglog.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + lact +
                   thin, data = pgtrial, dist = "loglogistic")
summary(loglog.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx + 
    lact + thin, data = pgtrial, dist = "loglogistic")
              Value Std. Error      z        p
(Intercept)  3.9544     0.2531 15.625 4.91e-55
herd2        0.2537     0.2355  1.077 2.81e-01
herd3       -0.1019     0.2437 -0.418 6.76e-01
tx          -0.3869     0.1768 -2.189 2.86e-02
lact         0.0612     0.0550  1.112 2.66e-01
thinthin     0.0400     0.1894  0.211 8.33e-01
Log(scale)  -0.1260     0.0515 -2.447 1.44e-02

Scale= 0.882 

Log logistic distribution
Loglik(model)= -1467.2   Loglik(intercept only)= -1472.2
	Chisq= 9.85 on 5 degrees of freedom, p= 0.08 
Number of Newton-Raphson Iterations: 4 
n= 319 
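
As with the other AFT fits, a quick sketch of how to read these estimates (using loglog.mod above): the exponentiated coefficients are time ratios (acceleration factors), not hazard ratios:

## time ratios from the log-logistic AFT model
round(exp(coef(loglog.mod)), 2)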

Individual Frailty Model
In an individual frailty model, we add variance unique to each individual in order to account for additional variability in the hazard (analogous to a negative binomial model vs. a Poisson model). For example, let’s fit a Weibull model with gamma individual frailty to the prostaglandin dataset:

library(parfm)
pgtrial$preg.bin <- as.numeric(pgtrial$preg) - 1
indfr.mod <- parfm(Surv(dar, preg.bin) ~ herd + tx + lact + thin,
             cluster = "cow", data = pgtrial, dist = "weibull",
             frailty = "gamma")
Execution time: 17.872 second(s) 
indfr.mod
Frailty distribution: gamma 
Baseline hazard distribution: Weibull 
Loglikelihood: -1455.679 

         ESTIMATE SE    p-val    
theta     0.000   0.003          
rho       0.867   0.044          
lambda    0.024   0.006          
herd2    -0.289   0.169 0.088 .  
herd3     0.039   0.175 0.824    
tx        0.204   0.125 0.103    
lact     -0.041   0.041 0.314    
thinthin -0.136   0.138 0.323    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Shared Frailty
Shared frailty is a way to deal with clustered data. We will use the “culling” dataset and fit a shared frailty model with a Weibull distribution and a gamma distributed frailty common to all cows in a herd:

temp <- tempfile()
download.file(
  "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/culling.rdata"))
unlink(temp)

library(frailtypack)
shfrw.mod <- frailtyPenal(Surv(dar, culled) ~ as.factor(lact_c3) + johnes +
                          cluster(herd),
                          hazard = 'Weibull', data = culling, Frailty = TRUE)
shfrw.sum <- cbind(shfrw.mod$coef, sqrt(diag(shfrw.mod$varH)), 
                 shfrw.mod$coef / sqrt(diag(shfrw.mod$varH)), 
        signif(1 - pchisq((shfrw.mod$coef/sqrt(diag(shfrw.mod$varH)))^2, 1)),
                   exp(shfrw.mod$coef),
 exp(shfrw.mod$coef - abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))), 
 exp(shfrw.mod$coef + abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))))
row.names(shfrw.sum) <- c("Lactation 2", "Lactation 3+", "Johnes")
colnames(shfrw.sum) <- c("Coef.", "Std. Err.", "z", "p-value", "Hazard Ratio", 
                       "Lower CI", "Upper CI")
shfrw.sum
                 Coef. Std. Err.        z     p-value Hazard Ratio  Lower CI
Lactation 2  0.2518627 0.1450806 1.736019 8.25605e-02     1.286419 0.9680321
Lactation 3+ 0.7636558 0.1227840 6.219508 4.98717e-10     2.146108 1.6870874
Johnes       0.5914741 0.3045475 1.942141 5.21200e-02     1.806650 0.9945867
             Upper CI
Lactation 2  1.709525
Lactation 3+ 2.730017
Johnes       3.281748

That’s it for reproducing the examples from Dohoo’s book, chapter on modelling survival data. Next time I’ll look at mixed models.

Veterinary Epidemiologic Research: Modelling Survival Data – Semi-Parametric Analyses

Next on modelling survival data from Veterinary Epidemiologic Research: semi-parametric analyses. With non-parametric analyses, we could only evaluate the effect of one or a small number of variables. To evaluate multiple explanatory variables, we analyze the data with a proportional hazards model, the Cox regression. The functional form of the baseline hazard is not specified, which makes the Cox model a semi-parametric model.
A Cox proportional hazards model is fit hereafter, on data from a clinical trial of the effect of prostaglandin administration on the start of the breeding period of dairy cows:

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/pgtrial.rdata"))
unlink(temp)

library(Hmisc)
pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id',
                             tx = 'Treatment', lact = 'Lactation number',
                             thin = 'Body condition', dar = 'Days at risk',
                             preg = 'Pregnant or censored'),
                  levels = list(thin = list('normal' = 0, 'thin' = 1),
                    preg = list('censored' = 0, 'pregnant' = 1)))
pgtrial$herd <- as.factor(pgtrial$herd)

library(survival)
coxph.mod <- coxph(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin,
                   data = pgtrial, ties = 'breslow')
(coxph.sum <- summary(coxph.mod))
Call:
coxph(formula = Surv(dar, preg == "pregnant") ~ herd + tx + lact + 
    thin, data = pgtrial, ties = "breslow")

  n= 319, number of events= 264 

             coef exp(coef) se(coef)      z Pr(>|z|)  
herd2    -0.28445   0.75243  0.16981 -1.675   0.0939 .
herd3     0.03676   1.03744  0.17426  0.211   0.8329  
tx        0.18359   1.20152  0.12543  1.464   0.1433  
lact     -0.04283   0.95807  0.04109 -1.042   0.2972  
thinthin -0.14557   0.86453  0.13794 -1.055   0.2913  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
herd2       0.7524     1.3290    0.5394     1.050
herd3       1.0374     0.9639    0.7373     1.460
tx          1.2015     0.8323    0.9396     1.536
lact        0.9581     1.0438    0.8839     1.038
thinthin    0.8645     1.1567    0.6597     1.133

Concordance= 0.564  (se = 0.021 )
Rsquare= 0.029   (max possible= 1 )
Likelihood ratio test= 9.5  on 5 df,   p=0.09084
Wald test            = 9.32  on 5 df,   p=0.09685
Score (logrank) test = 9.34  on 5 df,   p=0.09611

R gives several options to handle ties, when several events occurred at the same time: the Efron method (the default in R), the Breslow method (the default in software like SAS or Stata), and the exact method. Breslow is the simplest and is adequate if there are not too many ties in the dataset; Efron is closer to the exact approximation.
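
As a quick sketch, you can refit the model with the Efron approximation and compare the coefficients with the Breslow fit above (coxph.efron is just an illustrative name):

coxph.efron <- coxph(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin,
                     data = pgtrial, ties = 'efron')
## coefficients side by side; with few ties the two should be very close
round(cbind(breslow = coef(coxph.mod), efron = coef(coxph.efron)), 3)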

Stratified Cox Proportional Hazards Model

In a stratified Cox model, different baseline hazards are assumed across groups of subjects. The Cox model is modified to allow the control of a predictor which does not satisfy the proportional hazards (PH) assumption. We refit the above model, stratifying by herd and including a treatment by herd interaction:

scoxph.mod <- coxph(Surv(dar, preg == 'pregnant') ~ tx + tx*herd + lact + thin +
                    strata(herd), data = pgtrial, method = 'breslow')
summary(scoxph.mod)
Call:
coxph(formula = Surv(dar, preg == "pregnant") ~ tx + tx * herd + 
    lact + thin + strata(herd), data = pgtrial, method = "breslow")

  n= 319, number of events= 264 

             coef exp(coef) se(coef)      z Pr(>|z|)  
tx       -0.02160   0.97863  0.25528 -0.085   0.9326  
herd2          NA        NA  0.00000     NA       NA  
herd3          NA        NA  0.00000     NA       NA  
lact     -0.04600   0.95504  0.04065 -1.132   0.2578  
thinthin -0.13593   0.87291  0.13833 -0.983   0.3258  
tx:herd2 -0.05659   0.94498  0.33570 -0.169   0.8661  
tx:herd3  0.54494   1.72451  0.31823  1.712   0.0868 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
tx          0.9786     1.0218    0.5934     1.614
herd2           NA         NA        NA        NA
herd3           NA         NA        NA        NA
lact        0.9550     1.0471    0.8819     1.034
thinthin    0.8729     1.1456    0.6656     1.145
tx:herd2    0.9450     1.0582    0.4894     1.825
tx:herd3    1.7245     0.5799    0.9242     3.218

Concordance= 0.56  (se = 0.035 )
Rsquare= 0.032   (max possible= 0.998 )
Likelihood ratio test= 10.32  on 5 df,   p=0.06658
Wald test            = 10.5  on 5 df,   p=0.0623
Score (logrank) test = 10.66  on 5 df,   p=0.05851

Evaluating the Assumption of Proportional Hazards

We can evaluate it graphically, by examining the log-cumulative hazard plot vs. ln(time) and checking whether the curves are parallel:

coxph.mod2 <- coxph(Surv(dar, preg == 'pregnant') ~ tx, data = pgtrial,
                   ties = 'breslow')
pgtrial2 <- with(pgtrial, data.frame(tx = c(0, 1)))
tfit.add <- survfit(coxph.mod2, newdata = pgtrial2)
df1 <- data.frame(
    time    = tfit.add[1, ]$time,
    n.risk  = tfit.add[1, ]$n.risk,
    n.event = tfit.add[1, ]$n.event,
    surv    = tfit.add[1, ]$surv,
    strata  = "0",
    upper   = tfit.add[1, ]$upper,
    lower   = tfit.add[1, ]$lower,
    log.surv = log(-log(tfit.add[1, ]$surv))
 ) 
df2 <- data.frame(
    time    = tfit.add[2, ]$time,
    n.risk  = tfit.add[2, ]$n.risk,
    n.event = tfit.add[2, ]$n.event,
    surv    = tfit.add[2, ]$surv,
    strata  = "1",
    upper   = tfit.add[2, ]$upper,
    lower   = tfit.add[2, ]$lower,
    log.surv = log(-log(tfit.add[2, ]$surv))
 )
dfpar.add <- rbind(df1, df2)
library(plyr)    # for rbind.fill
library(ggplot2)
zeros <- data.frame(time = 0, surv = 1, strata = c(1, 2), 
                     upper = 1, lower = 1)
dfpar.add <- rbind.fill(zeros, dfpar.add)
dfpar.add$strata <- factor(dfpar.add$strata, labels = c("No tx", "Tx"))
ggplot(dfpar.add, aes(log(time), log.surv, colour = strata)) + 
  geom_step(size = 0.6) + 
  scale_color_manual("Tx", values = c('blue4', 'darkorange')) + 
  xlab("ln(time)") + ylab("Log-log survival")
Log cumulative hazard plot

Another graphical approach is to compare plots of the predicted survivor functions from a Cox model (which assumes PH) to the Kaplan-Meier survivor function (which does not assume PH):

tfit.km <- survfit(Surv(dar, preg == 'pregnant') ~ tx, data = pgtrial)
df3.km <- data.frame(
    time    = tfit.km$time,
    n.risk  = tfit.km$n.risk,
    n.event = tfit.km$n.event,
    surv    = tfit.km$surv,
    strata  = gsub("tx=", "", summary(tfit.km, censored = T)$strata),
    upper   = tfit.km$upper,
    lower   = tfit.km$lower
 ) 
zeros <- data.frame(time = 0, surv = 1, strata = gsub("tx=", "", 
                                          levels(summary(tfit.km)$strata)), 
                     upper = 1, lower = 1)
df3.km <- rbind.fill(df3.km, zeros)
df3.km$cat <- with(df3.km, ifelse(strata == "0", "No tx, observed",
                                  "Tx, observed"))
dfpar.add$cat <- with(dfpar.add, ifelse(strata == "No tx", "No tx, expected",
                                        "Tx, expected"))
dfpar.obs <- rbind.fill(dfpar.add, df3.km)
ggplot(dfpar.obs, aes(time, surv, colour = cat)) + 
   geom_step(size = 0.6) + 
   scale_color_manual("", values = c('blue1', 'blue4', 'darkorange1',
                            'darkorange4')) + 
   xlab("time") + ylab("survival probability")
Kaplan-Meier Cox plot

You can also assess PH statistically with the Schoenfeld residuals, using the cox.zph function:

(schoen <- cox.zph(coxph.mod))
             rho chisq      p
herd2    -0.0630 1.100 0.2942
herd3    -0.0443 0.569 0.4506
tx       -0.1078 3.141 0.0763
lact      0.0377 0.447 0.5035
thinthin -0.0844 2.012 0.1560
GLOBAL        NA 7.631 0.1778

plot(schoen, var = 4)
Schoenfeld residuals for lactation

Evaluating the Overall Fit of the Model

First we can look at the Cox-Snell residuals, which are the estimated cumulative hazards for individuals at their failure (or censoring) times. The default residuals of coxph in R are the martingale residuals, not the Cox-Snell, but they can be computed:

cox.snell <- (as.numeric(pgtrial$preg) - 1) - resid(coxph.mod,
                                                    type = "martingale")
coxph.res <- survfit(coxph(Surv(cox.snell, pgtrial$preg == 'pregnant') ~ 1,
                           method = 'breslow'), type = 'aalen')

plot(coxph.res$time, -log(coxph.res$surv), type = 's',
     xlab = 'Modified Cox-Snell residuals', ylab = 'Cumulative hazard')
abline(0, 1, col = 'red', lty = 2)

## Alternatively:
coxph.res2 <- survfit(Surv(cox.snell, pgtrial$preg == 'pregnant') ~ 1)
Htilde <- cumsum(coxph.res2$n.event / coxph.res2$n.risk)
plot(coxph.res2$time, Htilde, type = 's', col = 'blue')
abline(0, 1, col = 'red', lty = 2)
Plot of Cox-Snell residuals

We can also use a goodness-of-fit test:

## GOF (Gronnesby and Borgan omnibus gof)
library(gof)
cumres(coxph.mod)

Kolmogorov-Smirnov-test: p-value=0.35
Cramer von Mises-test: p-value=0.506
Based on 1000 realizations. Cumulated residuals ordered by herd2-variable.
---
Kolmogorov-Smirnov-test: p-value=0.041
Cramer von Mises-test: p-value=0.589
Based on 1000 realizations. Cumulated residuals ordered by herd3-variable.
---
Kolmogorov-Smirnov-test: p-value=0
Cramer von Mises-test: p-value=0.071
Based on 1000 realizations. Cumulated residuals ordered by tx-variable.
---
Kolmogorov-Smirnov-test: p-value=0.728
Cramer von Mises-test: p-value=0.733
Based on 1000 realizations. Cumulated residuals ordered by lact-variable.
---
Kolmogorov-Smirnov-test: p-value=0.106
Cramer von Mises-test: p-value=0.091
Based on 1000 realizations. Cumulated residuals ordered by thinthin-variable.

We can evaluate the concordance between the predicted and observed sequences of pairs of events. Harrell’s c index computes the proportion of all pairs of subjects in which the model correctly predicts the sequence of events. It ranges from 0 to 1, with 0.5 for random predictions and 1 for a perfectly discriminating model. It is obtained from Somers’ Dxy rank correlation:

library(rms)
fit.cph <- cph(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin,
               data = pgtrial, x = TRUE, y = TRUE, surv = TRUE)
v <- validate(fit.cph, dxy = TRUE, B = 100)
Dxy <- v[rownames(v) == "Dxy", colnames(v) == "index.corrected"]
(Dxy / 2) + 0.5 # c index
[1] 0.4538712
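
For comparison, a small sketch: the apparent (not optimism-corrected) concordance is already stored in the coxph summary computed above:

## apparent concordance and its standard error, before bootstrap correction
coxph.sum$concordance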

Evaluating the Functional Form of Predictors

We can use martingale residuals to evaluate the functional form of the relationship between a continuous predictor and the survival expectation for individuals:

lact.mod <- coxph(Surv(dar, preg == 'pregnant') ~ lact, data = pgtrial,
                  ties = 'breslow')
lact.res <- resid(lact.mod, type = "martingale")
plot(pgtrial$lact, lact.res, xlab = 'lactation', ylab = 'Martingale residuals')
lines(lowess(pgtrial$lact, lact.res, iter = 0))
 
# adding quadratic term
lact.mod <- update(lact.mod, . ~ . + I(lact^2))
lact.res <- resid(lact.mod, type = "martingale")
plot(pgtrial$lact, lact.res, xlab = 'lactation', ylab = 'Martingale residuals')
lines(lowess(pgtrial$lact, lact.res, iter = 0))
Plot of martingale residuals vs. lactation number
Plot of martingale residuals vs. lactation number (as quadratic term)

Checking for Outliers

Deviance residuals can be used to identify outliers:

## deviance residuals
dev.res <- resid(coxph.mod, type = "deviance")
plot(pgtrial$dar, dev.res, xlab = 'time (days)', ylab = 'deviance residuals')

cbind(dev.res, pgtrial)[abs(dev.res) > 2, ]
      dev.res herd cow tx lact   thin dar     preg
1    2.557832    1   1  0    1 normal   1 pregnant
2    2.592492    1   2  1    4   thin   1 pregnant
3    2.319351    1   3  1    1 normal   2 pregnant
73  -2.693731    1  76  1    1 normal 277 censored
74   2.734508    2  78  0    2   thin   1 pregnant
75   2.644885    2  79  1    4 normal   1 pregnant
76   2.436308    2  80  1    1 normal   2 pregnant
176 -2.015925    2 180  1    2 normal 201 censored
180 -2.196008    2 184  1    2 normal 250 censored
183 -2.081493    2 187  1    3   thin 288 censored
185 -2.238729    2 189  0    1 normal 346 censored
314 -2.274912    3 318  0    1   thin 262 censored
315 -2.226711    3 319  0    2   thin 262 censored
316 -2.182517    3 320  0    4   thin 287 censored
317 -2.278029    3 321  0    2   thin 288 censored
318 -2.341736    3 322  0    3   thin 308 censored
319 -2.392427    3 323  0    2   thin 320 censored
Deviance residuals

Score residuals and scaled score residuals can be used to identify influential observations:

### Detecting influential points
# score residuals
score.res <- resid(coxph.mod, type = "score")
# score residuals for tx
plot(pgtrial$dar, score.res[ , 3], xlab = 'time (days)',
     ylab = 'score residuals')
text(pgtrial$dar, score.res[ , 3], rownames(pgtrial), cex = 0.6, pos = 4)

cbind(score.res[ , 3], pgtrial)[abs(score.res[ , 3]) > 2, ]
   score.res[, 3] herd cow tx lact   thin dar     preg
73      -2.025537    1  76  1    1 normal 277 censored

## influential observations
dfbeta <- resid(coxph.mod, type = "dfbeta")
# dfbeta residuals for tx
plot(pgtrial$dar, dfbeta[ , 3], xlab = 'time (days)',
     ylab = 'scaled score residual')
text(pgtrial$dar, dfbeta[ , 3], rownames(pgtrial), cex = 0.6, pos = 4)

# with standardized dfbeta
dfbetas <- resid(coxph.mod, type = "dfbetas")
plot(pgtrial$dar, dfbetas[ , 3], xlab = 'time (days)',
     ylab = 'standardized score residuals')
text(pgtrial$dar, dfbetas[ , 3], rownames(pgtrial), cex = 0.6, pos = 4)
Score residuals
Scaled score residuals (delta-beta)
Standardized score residuals

Veterinary Epidemiologic Research: Modelling Survival Data – Non-Parametric Analyses

Next topic from Veterinary Epidemiologic Research: chapter 19, modelling survival data. We start with non-parametric analyses where we make no assumptions about either the distribution of survival times or the functional form of the relationship between a predictor and survival. There are 3 non-parametric methods to describe time-to-event data: actuarial life tables, Kaplan-Meier method, and Nelson-Aalen method.
We use data on occurrence of calf pneumonia in calves raised in 2 different housing systems. Calves surviving to 150 days without pneumonia are considered censored at that time.

temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/calf_pneu.rdata"))
unlink(temp)

library(Hmisc)
calf_pneu <- upData(calf_pneu, labels = c(calf = 'Calf id',
                                  stock = 'Stocking method',
                             days = 'Time to onset of pneumonia or censoring',
                                  pn = 'Pneumonia'),
                  levels = list(stock = list('batch' = 0, 'continuous' = 1)))

Actuarial Life Table

To create a life table, we use the lifetab function from the KMsurv package, after calculating the number of events and censored observations at each time point and grouping them by time interval (with gsummary from the nlme package).

library(KMsurv)
interval <- floor(calf_pneu$days / 15)
interval.censor <- data.frame(interval, calf_pneu$pn)
library(nlme)
pneumonia <- gsummary(interval.censor, sum, groups = interval)
total <- gsummary(interval.censor, length, groups = interval)
lt.data <- cbind(pneumonia[ , 1:2], total[ , 2])
length <- length(lt.data$interval)
lt.data[length + 1, ]$interval <- NA
nevent <- lt.data[ , 2]
nlost <- lt.data[ , 3] - lt.data[ , 2]
(life.table <- lifetab(lt.data$interval, 24, nlost, nevent))
      nsubs nlost nrisk nevent      surv        pdf     hazard    se.surv
1-3      24     0  24.0      1 1.0000000 0.02083333 0.02127660 0.00000000
3-4      23     0  23.0      1 0.9583333 0.04166667 0.04444444 0.04078938
4-5      22     0  22.0      1 0.9166667 0.04166667 0.04651163 0.05641693
5-6      21     0  21.0      3 0.8750000 0.12500000 0.15384615 0.06750772
6-7      18     1  17.5      2 0.7500000 0.08571429 0.12121212 0.08838835
7-8      15     6  12.0      3 0.6642857 0.16607143 0.28571429 0.09686316
8-10      6     0   6.0      1 0.4982143 0.04151786 0.09090909 0.11032937
10-NA     5     5   2.5      0 0.4151786         NA         NA 0.11915934
NA-3      0    NA    NA     NA 0.4151786         NA         NA 0.11915934
          se.pdf  se.hazard
1-3   0.02039469 0.02127178
3-4   0.04078938 0.04443347
4-5   0.04078938 0.04649905
5-6   0.06750772 0.08855994
6-7   0.05792828 0.08555236
7-8   0.08649471 0.16326531
8-10  0.03899969 0.09053265
10-NA         NA         NA
NA-3          NA         NA

Kaplan-Meier Method

To compute the Kaplan-Meier estimator we use the function survfit from package survival. It takes as argument a Surv object, which gives the time variable and the event of interest. You get the Kaplan-Meier estimate with the summary of the survfit object. We can then plot the estimates to show the Kaplan-Meier survivor function.

library(survival)
km.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu)
summary(km.sf)
Call: survfit(formula = Surv(days, pn == 1) ~ 1, data = calf_pneu)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   27     24       1    0.958  0.0408        0.882        1.000
   49     23       1    0.917  0.0564        0.813        1.000
   72     22       1    0.875  0.0675        0.752        1.000
   79     21       2    0.792  0.0829        0.645        0.972
   89     19       1    0.750  0.0884        0.595        0.945
   90     18       1    0.708  0.0928        0.548        0.916
  101     17       1    0.667  0.0962        0.502        0.885
  113     15       2    0.578  0.1019        0.409        0.816
  117      9       1    0.514  0.1089        0.339        0.778
  123      6       1    0.428  0.1198        0.247        0.741

plot(km.sf, xlab = "time (days)", ylab = "cumulative survival probability", conf.int = TRUE)
Kaplan-Meier survivor function (95% CI)

Nelson-Aalen Method

A “hazard” is the probability of failure at a point in time, given that the calf had survived up to that point in time. A cumulative hazard, the Nelson-Aalen estimate, can be computed; it can be calculated by transforming the Fleming-Harrington estimate of survival.

fh.sf <- survfit(Surv(days, pn == 1) ~ 1, data = calf_pneu, type = "fleming")

plot(stepfun(fh.sf$time, c(0, -log(fh.sf$surv))), do.points = FALSE, 
      xlab = "time (days)", ylab = "cumulative hazard",
      main = "", ylim = c(0, 1.5))
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$upper))), lty = 5, do.points = FALSE)
lines(stepfun(fh.sf$time, c(0, -log(fh.sf$lower))), lty = 5, do.points = FALSE)
Nelson-Aalen cumulative hazard function (95% CI)

Tests of the Overall Survival Curve

Several tests are available to test whether the overall survivor functions in 2 or more groups are equal. We can use the log-rank test, the simplest one, which assigns equal weight to each time point estimate and is equivalent to a standard Mantel-Haenszel test. There is also the Peto-Peto-Prentice test, which weights the stratum-specific estimates by the overall survival experience and so reduces the influence of different censoring patterns between groups.
To do these tests, we apply the survdiff function to the Surv object. The argument rho gives the weights, S(t)^\rho , and may be any numeric value. The default is rho = 0, which gives the log-rank test; rho = 1 gives the “Peto & Peto modification of the Gehan-Wilcoxon test”. Rho larger than zero gives greater weight to the early part of the survival curves; rho smaller than zero gives greater weight to the later part.

survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 0) # rho is optional
Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu, 
    rho = 0)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12        4     6.89      1.21      2.99
stock=continuous 12        8     5.11      1.63      2.99

 Chisq= 3  on 1 degrees of freedom, p= 0.084

survdiff(Surv(days, pn == 1) ~ stock, data = calf_pneu, rho = 1) # rho=1 asks for Peto-Peto test
Call:
survdiff(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu, 
    rho = 1)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
stock=batch      12     2.89     5.25      1.06      3.13
stock=continuous 12     6.41     4.05      1.38      3.13

 Chisq= 3.1  on 1 degrees of freedom, p= 0.0766

Finally, we can compare the survivor functions with a stock R plot or with ggplot2. With ggplot2, you get the necessary data from the survfit object and create a new data frame from it. The baseline data (time = 0) are not there, so you create them yourself:

(km.stock <- survfit(Surv(days, pn == 1) ~ stock, data = calf_pneu))
Call: survfit(formula = Surv(days, pn == 1) ~ stock, data = calf_pneu)

                 records n.max n.start events median 0.95LCL 0.95UCL
stock=batch           12    12      12      4     NA     123      NA
stock=continuous      12    12      12      8    113      79      NA

plot(km.stock, conf.int = FALSE, col = c("blue4", "darkorange"),
      xlab = "time (days)", ylab = "cumulative survival probability")
legend("bottomleft", inset = .05, c("batch", "continuous"),
        text.col = c("blue4", "darkorange"))

km.df <- data.frame(
    time    = km.stock$time,
    n.risk  = km.stock$n.risk,
    n.event = km.stock$n.event,
    surv    = km.stock$surv,
    strata  = gsub("stock=", "", summary(km.stock, censored = T)$strata),
    upper   = km.stock$upper,
    lower   = km.stock$lower
 ) 
zeros <- data.frame(time = 0, surv = 1, strata = gsub("stock=", "",
                                           levels(summary(km.stock)$strata)), 
                     upper = 1, lower = 1)
library(plyr)
km.df <- rbind.fill(zeros, km.df)
km.df$strata <- ordered(km.df$strata, levels = c("batch", "continuous"))
library(ggplot2)
ggplot(km.df, aes(time, surv, colour = strata)) + 
   geom_step(size = 0.6) + xlim(0, 150) + ylim(0, 1) + 
   xlab("time (days)") + ylab("cumulative survival probability") +
   labs(colour = "stock")
K-M survival curves, by stocking type


Veterinary Epidemiologic Research: Linear Regression Part 2 – Checking assumptions

We continue with the linear regression chapter of the book Veterinary Epidemiologic Research.

Using the same data as in the last post and running example 14.12:

tmp <- tempfile()
download.file("http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", tmp) # fetch the file into the temporary file
load(unz(tmp, "ver2_data_R/daisy2.rdata")) # extract the target file from the temporary file
unlink(tmp) # remove the temporary file

### some data management
daisy2 <- daisy2[daisy2$h7 == 1, ] # we only use a subset of the data
library(Hmisc)
daisy2 <- upData(daisy2, labels = c(region = 'Region', herd = 'Herd number',
                           cow = 'Cow number',
                           study_lact = 'Study lactation number',
                           herd_size = 'Herd size',
                           mwp = "Minimum wait period for herd",
                           parity = 'Lactation number',
                  milk120 = 'Milk volume in first 120 days of lactation',
                           calv_dt = 'Calving date',
                           cf = 'Calving to first service interval',
                           fs = 'Conception at first service',
                           cc = 'Calving to conception interval',
                           wpc = 'Interval from wait period to conception',
                           spc = 'Services to conception', twin = 'Twins born',
                           dyst = 'Dystocia at calving',
                           rp = 'Retained placenta at calving',
                           vag_disch = 'Vaginal discharge observed',
                           h7 = 'Indicator for 7 herd subset'),
                 levels = list(fs = list('No' = 0, 'Yes' = 1),
                   twin = list('No' = 0, 'Yes' = 1),
                   dyst = list('No' = 0, 'Yes' = 1),
                   rp = list('No' = 0, 'Yes' = 1),
                   vag_disch = list('No' = 0, 'Yes' = 1)),
                 units = c(milk120 = "litres"))

library(lubridate)
daisy2$date <- as.character(daisy2$calv_dt)
daisy2$date <- ymd(daisy2$date)
daisy2$mth <- month(daisy2$date)
# calving in automn or not:
daisy2$aut_calv <- with(daisy2, ifelse(mth %in% c(9:12), "fall", "other"))
# centering variables:
daisy2$hs100 <- daisy2$herd_size / 100
daisy2$hs100_ct <- daisy2$hs100 - mean(daisy2$hs100)
daisy2$hs100_ctsq <- daisy2$hs100_ct^2
daisy2$parity_sc <- daisy2$parity - mean(daisy2$parity)

daisy3 <- daisy2[complete.cases(daisy2), ] # df with only complete cases

lm.wpc <- lm(wpc ~ aut_calv + hs100_ct + hs100_ctsq + parity_sc + twin +
             dyst + rp + vag_disch + rp*vag_disch, data = daisy3)
(lm.wpc.sum <- summary(lm.wpc))

Call:
lm(formula = wpc ~ aut_calv + hs100_ct + hs100_ctsq + parity_sc + 
    twin + dyst + rp + vag_disch + rp * vag_disch, data = daisy3)

Residuals:
Interval from wait period to conception 
   Min     1Q Median     3Q    Max 
-76.40 -35.23 -15.22  22.64 210.63 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         66.9831     2.4970  26.825  < 2e-16 ***
aut_calvother       -6.1336     2.6420  -2.322 0.020385 *  
hs100_ct            21.8485     2.2546   9.690  < 2e-16 ***
hs100_ctsq          11.7763     3.1952   3.686 0.000236 ***
parity_sc            1.1421     0.8727   1.309 0.190848    
twinYes             19.7621     9.8677   2.003 0.045387 *  
dystYes             11.0347     5.5386   1.992 0.046515 *  
rpYes                7.1111     4.8654   1.462 0.144067    
vag_dischYes         2.0977     7.2588   0.289 0.772629    
rpYes:vag_dischYes  19.9543    12.7239   1.568 0.117031    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 50 on 1515 degrees of freedom
Multiple R-squared: 0.07355,	Adjusted R-squared: 0.06805 
F-statistic: 13.36 on 9 and 1515 DF,  p-value: < 2.2e-16 

Now we can create some plots to assess the major assumptions of linear regression. First, let’s have a look at homoscedasticity, or constant variance of the residuals. You can run a statistical test, the Cook-Weisberg test (or Breusch-Pagan test), available in the car and lmtest packages:

library(car)
ncvTest(lm.wpc)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 23.08052    Df = 1     p = 1.553565e-06 

library(lmtest)
bptest(lm.wpc)

	studentized Breusch-Pagan test

data:  lm.wpc 
BP = 24.3082, df = 9, p-value = 0.00384

Both tests are significant, indicating the presence of heteroscedasticity, but they are slightly different. Quoting John Fox from this post: “bp.test() performs the same score test as ncvTest(), except that the default alternative hypothesis is different — in bp.test() the error variance is a function of a linear combination of the regressors and in ncvTest() the error variance is a function of the fitted values (i.e., a *particular* linear combination of regressors). Testing against the fitted values with 1 df will have greater power if this is the real pattern of heteroscedasticity”.
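
As a sketch of that point, bptest() can also be made to test against the fitted values; with studentize = FALSE it should essentially reproduce the ncvTest() result (varformula and studentize are standard bptest arguments):

## score test of the error variance against the fitted values
bptest(lm.wpc, varformula = ~ fitted(lm.wpc), studentize = FALSE)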

But the evaluation of linear model assumptions relies more often on graphical methods. You can use the 6 diagnostic plots readily available after fitting your model, or try it with ggplot2 using the fortify() method, which allows combining a model with its data (“the model fortifies the data, and the data fortifies the model”):

plot(lm.wpc, which = 1) # diagnostic plot in base stats package
library(ggplot2)
ggplot(lm.wpc, aes(.fitted, .resid)) +
   geom_hline(yintercept = 0) +
   geom_point() +
   geom_smooth(se = FALSE)
Residuals vs. fitted values

We can check the normality of the residuals by a statistical test, the Shapiro-Wilk test, and a normal probability plot (Q-Q plot):

shapiro.test(resid(lm.wpc))

	Shapiro-Wilk normality test

data:  resid(lm.wpc) 
W = 0.8817, p-value < 2.2e-16

ggplot(lm.wpc, aes(sample = .stdresid)) +
   stat_qq() +
   geom_abline()
Q-Q plot

Other diagnostic plots: Scale-Location plot of sqrt(|residuals|) against fitted values, Cook’s distance plot, Residuals vs. leverages, Cook’s distances vs. leverages:

## Scale-Location plot of sqrt(|residuals|) against fitted values (which = 3)
ggplot(lm.wpc, aes(.fitted, abs(.stdresid))) +
       geom_point() +
       geom_smooth(se = FALSE) +
       scale_y_sqrt()

# Cook's distance plot (which = 4)
ggplot(lm.wpc, aes(seq_along(.cooksd), .cooksd)) +
  geom_bar(stat = "identity")

# Residuals vs. leverages (which = 5)
ggplot(lm.wpc, aes(.hat, .stdresid)) +
  geom_vline(size = 2, colour = "white", xintercept = 0) +
  geom_hline(size = 2, colour = "white", yintercept = 0) +
  geom_point() +
  geom_smooth(se = FALSE)

# Cook's distances  vs leverages/(1-leverages) (which = 6)
ggplot(lm.wpc, aes(.hat, .cooksd)) +
  geom_vline(colour = NA) +
  geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
  geom_smooth(se = FALSE) +
  geom_point()

Scale-Location plot

Cook’s distance plot

Residuals vs. Leverages

Cook’s distances vs. Leverages

You could also check the linearity of the predictor-outcome association by plotting the residuals against each of the continuous predictor variables, for example here centered herd size:

ggplot(lm.wpc, aes(hs100_ct, .resid)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)
Linearity of predictor-outcome association

Log Transformation
Let’s do a log transformation (with a simpler model) and back-transform the coefficients:

daisy3$milk120.sq <- daisy3$milk120^2
lm.wpc2 <- lm(wpc ~ vag_disch + milk120 + milk120.sq, data = daisy3)
lm.lwpc <- lm(log(wpc) ~ vag_disch + milk120 + milk120.sq, data = daisy3)
(lm.lwpc.sum <- summary(lm.lwpc))

Call:
lm(formula = log(wpc) ~ vag_disch + milk120 + milk120.sq, data = daisy3)

Residuals:
Interval from wait period to conception 
    Min      1Q  Median      3Q     Max 
-3.9229 -0.5554  0.0231  0.5794  1.7245 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.765e+00  3.140e-01  15.175   <2e-16 ***
vag_dischYes  1.477e-01  8.761e-02   1.686   0.0921 .  
milk120      -4.664e-04  1.951e-04  -2.390   0.0170 *  
milk120.sq    6.399e-08  2.965e-08   2.159   0.0310 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7615 on 1521 degrees of freedom
Multiple R-squared: 0.006997,	Adjusted R-squared: 0.005039 
F-statistic: 3.573 on 3 and 1521 DF,  p-value: 0.01356 

(ci.lwpc <- confint(lm.lwpc)) # confidence intervals
                     2.5 %        97.5 %
(Intercept)   4.149060e+00  5.380862e+00
vag_dischYes -2.416148e-02  3.195190e-01
milk120      -8.491117e-04 -8.367357e-05
milk120.sq    5.844004e-09  1.221453e-07
exp(lm.lwpc$coefficients[2])
vag_dischYes 
     1.15914 
exp(ci.lwpc[2, 1])
[1] 0.9761281
exp(ci.lwpc[2, 2])
[1] 1.376465

The trick for the confidence interval is to compute it on the transformed scale and then back-transform it to the original scale. The most difficult part is interpreting the results. A property of the logarithm is that “the difference between logs is the log of the ratio”. So the mean waiting period difference between cows with and without vaginal discharge is 0.15 on the log scale, i.e. exp(0.15) = 1.15 times longer for cows with vaginal discharge. It may be clearer to express it as a ratio: the ratio of waiting periods for cows with vaginal discharge vs. no discharge is 1.15 (95% CI: 0.98 to 1.38). Note that the back-transformed confidence interval is no longer symmetric.
So for your model \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + ... + \hat{\beta}_n x_n , or \hat{y} = e^{\hat{\beta}_0}e^{\hat{\beta}_1 x_1}...e^{\hat{\beta}_n x_n} , an increase of 1 in x_1 multiplies the predicted response (on the original scale) by e^{\hat{\beta}_1} : the regression coefficients are interpreted in a multiplicative rather than additive manner. If \hat{\beta}_1 > 0 , when x_1 increases by 1, the median of \hat{y} increases by (e^{\hat{\beta}_1} - 1) \times 100\% . If \hat{\beta}_1 < 0 , when x_1 increases by 1, the median of \hat{y} decreases by (1 - e^{\hat{\beta}_1}) \times 100\% .
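
A one-line sketch of that interpretation, using lm.lwpc fitted above:

## back-transformed effect of vaginal discharge, as a percent change in the median wpc
(exp(coef(lm.lwpc)["vag_dischYes"]) - 1) * 100   # about +15.9%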

Beeswarm Plot with ggplot2

A colleague showed me results of his study project with beeswarm plots made in GraphPad. I was wondering whether they could be made in R, and more specifically with ggplot2.

There is an R package for drawing such graphs, the beeswarm package (beeswarm, on CRAN). An implementation was shown on the R-statistics blog, but not with ggplot2.

First here’s the example from the beeswarm package:

library(beeswarm)
data(breast)
breast2 <- breast[order(breast$event_survival, breast$ER),]

beeswarm(time_survival ~ event_survival, data = breast2, pch = 16,
         pwcol = as.numeric(ER), xlab = '', 
         ylab = 'Follow-up time (months)', 
         labels = c('Censored', 'Metastasis'))
legend('topright', legend = levels(breast$ER), title = 'ER', 
       pch = 16, col = 1:2)


Or even, as on Tal Galili’s blog, with a boxplot:

beeswarm(time_survival ~ event_survival, data = breast2, pch = 16,
         pwcol = as.numeric(ER), xlab = '', 
         ylab = 'Follow-up time (months)', 
         labels = c('Censored', 'Metastasis'))
boxplot(time_survival ~ event_survival, data = breast2, add = T,
        names = c("",""), col="#0000ff22")  
legend('topright', legend = levels(breast$ER), title = 'ER', 
        pch = 16, col = 1:2)

The trick is to use the beeswarm call to get the x and y positions: beeswarm creates a data frame from which we can get the necessary positions.

beeswarm <- beeswarm(time_survival ~ event_survival, 
            data = breast, method = 'swarm', 
            pwcol = ER)[, c(1, 2, 4, 6)]
colnames(beeswarm) <- c("x", "y", "ER", "event_survival") 

library(ggplot2)
library(plyr)
beeswarm.plot <- ggplot(beeswarm, aes(x, y)) +
  xlab("") +
  scale_y_continuous(expression("Follow-up time (months)"))
beeswarm.plot2 <- beeswarm.plot + geom_boxplot(aes(x, y,
  group = round_any(x, 1, round)), outlier.shape = NA)
beeswarm.plot3 <- beeswarm.plot2 + geom_point(aes(colour = ER)) +
  scale_colour_manual(values = c("black", "red")) + 
  scale_x_continuous(breaks = c(1:2), 
         labels = c("Censored", "Metastasis"), expand = c(0, 0.5))

Do not forget to remove the outliers from your boxplot, or they will be superimposed on the points created by geom_point.

I wonder if these plots are more useful in certain fields. If anybody has references for beeswarm plots, I would be very grateful.

R, JAGS and ggplot2

Last week a question was asked on the ggplot2 mailing list about using ggplot2 and JAGS in R. Here’s my answer (a bit updated):

Using as an example the school dataset from R2WinBUGS package:

library(rjags)
library(coda)
library(ggplot2)
library(R2WinBUGS)

data(schools)
schoolmodel <- function(){
  for (j in 1:J)
    {
      y[j] ~ dnorm (theta[j], tau.y[j])
      theta[j] ~ dnorm (mu.theta, tau.theta)
      tau.y[j] <- pow(sigma.y[j], -2)
    }
  mu.theta ~ dnorm (0.0, 1.0E-6)
  tau.theta <- pow(sigma.theta, -2)
  sigma.theta ~ dunif (0, 1000)
}
write.model(schoolmodel, "schoolmodel.bug")
J <- nrow(schools)
y <- schools$estimate
sigma.y <- schools$sd
forJags <- list(J = nrow(schools), y = schools$estimate, sigma.y = schools$sd)
inits <- function(){
  list(theta = rnorm(J, 0, 100), mu.theta = rnorm(1, 0, 100),
       sigma.theta = runif(1, 0, 100))
}

school.sim <- jags.model(file = "schoolmodel.bug", data = forJags,
                         inits = inits, n.chains = 3, n.adapt = 500)
school.out <- coda.samples(school.sim, variable.names =
                           c("theta", "mu.theta", "sigma.theta"),
                           n.iter = 1000)
summary(school.out)

Then you can use the mcmcplots package, which gives a ggplot2 “feel”:

library(mcmcplots)
denplot(school.out, parms = c('mu.theta'), collapse = FALSE, auto.layout = TRUE)
traplot(school.out, parms = c('mu.theta'))
autplot1(school.out[,"mu.theta", drop = FALSE])

Density Plot - mcmcplots package

Trace Plot - mcmcplots package

Autoplot - mcmcplots package

If you really want to use ggplot2, you have to extract the information you need, like for example:

varnames(school.out)
str(school.out)
mu.theta.out <- cbind(school.out[[1]][,1], school.out[[2]][,1], school.out[[3]][,1])
attributes(mu.theta.out) <- NULL
dens.mu.theta <- density(mu.theta.out)
q25.mu.theta <- quantile(mu.theta.out, .025)
q975.mu.theta <- quantile(mu.theta.out, .975)
dd.mu.theta <- with(dens.mu.theta, data.frame(x,y))

qplot(x, y, data = dd.mu.theta, geom = "line", ylab = "", xlab = "") +
  geom_ribbon(data = subset(dd.mu.theta, x>q25.mu.theta & x<q975.mu.theta),
              aes(ymax = y), ymin = 0, fill = "red", colour = NA, alpha = 0.5)

This would give you the density plot with its 95% credible interval.
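
As a small alternative sketch, coda's as.matrix() stacks the chains, which is a shorter way of getting the draws used above before computing the density and the interval limits:

mu.theta.draws <- as.matrix(school.out)[, "mu.theta"]
## 95% credible interval limits for mu.theta
quantile(mu.theta.draws, c(0.025, 0.975))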

Density Plot - ggplot2 package