As noted on paragraph 18.4.1 of the book Veterinary Epidemiologic Research, logistic regression is widely used for binary data, with the estimates reported as odds ratios (OR). If it’s appropriate for case-control studies, risk ratios (RR) are preferred for cohort studies as RR provides estimates of probabilities directly. Moreover, it is often forgotten the assumption of rare event rate, and the OR will overestimate the true effect.

One approach to get RR is to fit a generalised linear model (GLM) with a binomial distribution and a log link. But these models may sometimes fail to converge (Zou, 2004). Another option is to use a Poisson regression with no exposure or offset specified (McNutt, 2003). It gives estimates with very little bias but confidence intervals that are too wide. However, using robust standard errors gives correct confidence intervals (Greenland, 2004, Zou, 2004).

We use data on culling of dairy cows to demonstrate this.

temp <- tempfile() download.file( "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp) load(unz(temp, "ver2_data_R/culling.rdata")) unlink(temp) table(culling$culled) 0 1 255 466 ### recoding number of lactation culling$lact <- with(culling, ifelse(lact_c3 > 1, 2, lact_c3))

The logistic regression:

log.reg <- glm(culled ~ lact, family = binomial("logit"), data = culling) summary(log.reg) Call: glm(formula = culled ~ lact, family = binomial("logit"), data = culling) Deviance Residuals: Min 1Q Median 3Q Max -1.546 -1.199 0.849 0.849 1.156 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.734 0.299 -2.45 0.014 * lact 0.784 0.171 4.59 4.4e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 936.86 on 720 degrees of freedom Residual deviance: 915.84 on 719 degrees of freedom AIC: 919.8 Number of Fisher Scoring iterations: 4 cbind(exp(coef(log.reg)), exp(confint(log.reg))) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 0.48 0.267 0.864 lact 2.19 1.568 3.065

We could compute by hand the OR and RR:

with(culling, table(culled, lact)) lact culled 1 2 0 97 158 1 102 364 ## OR is 2.19: (364 / 158) / (102 / 97) [1] 2.19 ## or the ratio between the cross-products (364 * 97) / (158 * 102) [1] 2.19 ## or with epicalc library(epicalc) with(culling, cc(culled, lact, graph = FALSE)) lact culled 1 2 Total 0 97 158 255 1 102 364 466 Total 199 522 721 OR = 2.19 Exact 95% CI = 1.54, 3.1 Chi-squared = 21.51, 1 d.f., P value = 0 Fisher's exact test (2-sided) P value = 0 # RR is 1.36: (364 / 522) / (102 / 199) [1] 1.36

The GLM with binomial distribution and log link:

log.reg2 <- glm(culled ~ lact, family = binomial(link = "log"), data = culling) summary(log.reg2) Call: glm(formula = culled ~ lact, family = binomial(link = "log"), data = culling) Deviance Residuals: Min 1Q Median 3Q Max -1.546 -1.199 0.849 0.849 1.156 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.9762 0.1412 -6.91 4.8e-12 *** lact 0.3078 0.0749 4.11 4.0e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 936.86 on 720 degrees of freedom Residual deviance: 915.84 on 719 degrees of freedom AIC: 919.8 Number of Fisher Scoring iterations: 5 cbind(exp(coef(log.reg2)), exp(confint(log.reg2))) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 0.377 0.28 0.488 lact 1.360 1.18 1.588

The modified Poisson with robust estimator as described by Zou, 2004 (GEE with individual observations treated as a repeated measure):

## create id for each observation culling$id <- 1:length(culling$herd) library(geepack) zou.mod <- geeglm(culled ~ lact, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling) summary(zou.mod) Call: geeglm(formula = culled ~ lact, family = poisson(link = "log"), data = culling, id = id, corstr = "exchangeable") Coefficients: Estimate Std.err Wald Pr(>|W|) (Intercept) -0.9762 0.1412 47.8 4.8e-12 *** lact 0.3078 0.0749 16.9 4.0e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Estimated Scale Parameters: Estimate Std.err (Intercept) 0.354 0.021 Correlation: Structure = exchangeable Link = identity Estimated Correlation Parameters: Estimate Std.err alpha 0 0 Number of clusters: 721 Maximum cluster size: 1 ## exponentiated coefficients exp(coef(zou.mod)) (Intercept) lact 0.377 1.360 ## getting confidence intervals library(doBy) zou.mod.coefci <- esticon(zou.mod, diag(2)) zou.mod.expci <- exp(cbind(zou.mod.coefci$Estimate, zou.mod.coefci$Lower, zou.mod.coefci$Upper)) rownames(zou.mod.expci) <- names(coef(zou.mod)) colnames(zou.mod.expci) <- c("RR", "Lower RR", "Upper RR") zou.mod.expci RR Lower RR Upper RR (Intercept) 0.377 0.286 0.497 lact 1.360 1.175 1.576

Or the same using glm() and then getting robust standard errors:

pois.reg <- glm(culled ~ lact, family = poisson(link = "log"), data = culling) library(sandwich) # to get robust estimators library(lmtest) # to test coefficients coeftest(pois.reg, vcov = sandwich) z test of coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.9762 0.1412 -6.91 4.8e-12 *** lact 0.3078 0.0749 4.11 4.0e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ## RR exp(coef(pois.reg)) (Intercept) lact 0.377 1.360 ## CI 0.3078 + qnorm(0.05 / 2) * 0.0749 # upper 95% CI [1] 0.161 exp(0.3078 + qnorm(0.05 / 2) * 0.0749) [1] 1.17 0.3078 + qnorm(1 - 0.05 / 2) * 0.0749 # lower 95% CI [1] 0.455 exp(0.3078 + qnorm(1 - 0.05 / 2) * 0.0749) [1] 1.58

So no excuse to use odds ratios when risk ratios are more appropriate!

**Addition: Plotting the Risk Ratios**

zou.mod2 <- geeglm(culled ~ lact + johnes, family = poisson(link = "log"), id = id, corstr = "exchangeable", data = culling) zou.mod2.coefci <- esticon(zou.mod2, diag(3)) zou.mod2.expci <- exp(cbind(zou.mod2.coefci$Estimate, zou.mod2.coefci$Lower, zou.mod2.coefci$Upper)) rownames(zou.mod2.expci) <- names(coef(zou.mod2)) colnames(zou.mod2.expci) <- c("RR", "LowerCI", "UpperCI") zou.df <- as.data.frame(zou.mod2.expci) zou.df$var <- row.names(zou.df) library(ggplot2) ggplot(zou.df[2:3, ], aes(y = RR, x = reorder(var, RR))) + geom_point() + geom_pointrange(aes(ymin = LowerCI, ymax = UpperCI)) + scale_x_discrete(labels = c("Lactation", "Johnes")) + scale_y_log10(breaks = seq(1, 2, by = 0.1)) + xlab("") + geom_hline(yintercept = 1, linetype = "dotdash", colour = "blue") + coord_flip()

Thanks to Tal Galili for suggesting this addition.

“Moreover, it is often forgotten the assumption of rare event rate, and the OR will overestimate the true effect.”

I agree with almost everything that you wrote! But this sentence above is trick. OR will overestimate the RR without this assumption if, and only if, you want to interpretate OR as RR. But in fact OR estimate ODDS ratio, by definition it is other thing, the problem is not about the math!

Hi Lucas,

What was meant is that OR should not be used for common outcome, where RR is preferred. See for example Altman, 1998.

Cheers,

D