When it’s time to interpret the results of your observational study, you have to assess whether the effect measure you obtained reflects the truth, is due to bias (systematic error, affecting the effect measure’s validity), or is due to chance (random error, affecting the effect measure’s precision) (Rothman and Greenland, 2008, pp. 115-134). Every study has some random error due to its limited sample size, and is susceptible to systematic errors as well, from selection bias to the presence of (un)known confounders or information bias (measurement error, including misclassification).

Bias analysis, or sensitivity analysis, tries to quantify the direction, magnitude, and uncertainty of the bias affecting an estimate of association (Greenland and Lash, 2008, pp345-380; Lash et al., 2009).

Very often bias is evaluated qualitatively, without any quantitative assessment of its magnitude. This might be due to the paucity of tools available to epidemiologists. Hence this R package, episensr, which is built following the book by Lash et al. I will illustrate its use with 3 studies from this book.

First you have to install and load the package.

```r
install.packages("episensr")
## or get it from the GitHub repo with:
## library(devtools)
## install_github("dhaine/episensr")
library(episensr)
```

### Selection Bias

We will use a case-control study by Stang et al. on the relationship between mobile phone use and uveal melanoma. The observed odds ratio for the association between regular mobile phone use vs. no mobile phone use and uveal melanoma incidence is 0.71 [95% CI 0.51-0.97]. But there was a substantial difference in participation rates between cases and controls (94% vs. 55%, respectively), so selection bias could have had an impact on the estimate of association. The 2X2 table for this study is the following:

|          | Regular Use | No Use |
|----------|-------------|--------|
| Cases    | 136         | 107    |
| Controls | 297         | 165    |

We use the function *selection* as shown below.

```r
selection(matrix(c(136, 107, 297, 165),
                 dimnames = list(c("UM+", "UM-"), c("Mobile+", "Mobile-")),
                 nrow = 2, byrow = TRUE),
          selprob = c(.94, .85, .64, .25))
```

```
Observed Data:
---------------------------------------------------
Outcome   : UM+
Comparing : Mobile+ vs. Mobile-

    Mobile+ Mobile-
UM+     136     107
UM-     297     165

Data Corrected for Selected Proportions:
---------------------------------------------------
     Mobile+  Mobile-
UM+ 144.6809 125.8824
UM- 464.0625 660.0000

                               95% conf. interval
Observed Relative Risk: 0.7984   0.6518   0.9780
   Observed Odds Ratio: 0.7061   0.5144   0.9693

                                          [,1]
Selection Bias Corrected Relative Risk: 1.4838
   Selection Bias Corrected Odds Ratio: 1.6346

                                                  [,1]
     Selection probability among cases exposed:   0.94
     Selection probability among cases unexposed: 0.85
  Selection probability among noncases exposed:   0.64
Selection probability among noncases unexposed:   0.25
```

The 2X2 table is provided as a matrix and selection probabilities given with the argument *selprob*, a vector with the 4 probabilities (guided by the participation rates in cases and controls) in the following order: among cases exposed, among cases unexposed, among noncases exposed, and among noncases unexposed. The output shows the observed 2X2 table, the same table corrected for the selection proportions, the observed odds ratio (and relative risk) followed by the corrected ones, and the input parameters.
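To see what the correction actually does, the corrected odds ratio can be reproduced by hand: each observed cell is divided by its selection probability, and the odds ratio is recomputed from the corrected cells. A quick sketch (not part of the package) using the counts and probabilities above:

```r
## Divide each observed cell by its selection probability
a <- 136 / 0.94   # cases, exposed
b <- 107 / 0.85   # cases, unexposed
c0 <- 297 / 0.64  # noncases, exposed
d <- 165 / 0.25   # noncases, unexposed

## Odds ratio from the corrected 2X2 table
corr.or <- (a * d) / (b * c0)
round(corr.or, 4)  # 1.6346, matching the episensr output
```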

### Uncontrolled Confounders

We will use data from a cross-sectional study by Tyndall et al. on the association between male circumcision and the risk of acquiring HIV, which might be confounded by religion. The code to account for unmeasured or unknown confounders is the following, where the 2X2 table is given as a matrix. We choose a risk ratio implementation, provide a vector defining the prevalence of the confounder, religion, among the exposed and the unexposed, and give the risk ratio associating the confounder with the disease.

```r
confounders(matrix(c(105, 85, 527, 93),
                   dimnames = list(c("HIV+", "HIV-"), c("Circ+", "Circ-")),
                   nrow = 2, byrow = TRUE),
            implement = "RR",
            p = c(.8, .05),
            RR.cd = .63)
```

```
Observed Data:
--------------
Outcome   : HIV+
Comparing : Circ+ vs. Circ-

     Circ+ Circ-
HIV+   105    85
HIV-   527    93

Data, Counfounder +:
--------------------
        Circ+ Circ-
HIV+  75.1705 2.728
HIV- 430.4295 6.172

Data, Counfounder -:
--------------------
       Circ+  Circ-
HIV+ 29.8295 82.272
HIV- 96.5705 86.828

Crude and Unmeasured Confounder Specific Measures of Exposure-Outcome Relationship:
-----------------------------------------------------------------------------------
                                    95% conf. interval
        Crude Relative Risk: 0.3479   0.2757    0.439
Relative Risk, Confounder +: 0.4851
Relative Risk, Confounder -: 0.4851

Exposure-Outcome Relationship Adjusted for Confounder:
------------------------------------------------------
Standardized Morbidity Ratio  SMRrr: 0.4851  RR adjusted using SMR estimate: 0.7173
             Mantel-Haenszel   MHrr: 0.4851  RR adjusted using MH estimate:  0.7173

Bias Parameters:
----------------
p(Confounder+|Exposure+): 0.8
p(Confounder+|Exposure-): 0.05
 RR(Confounder-Outcome): 0.63
```

The output gives the crude 2X2 table, the 2X2 tables by levels of the confounder, the crude relative risk and confounder specific measures of association between exposure and outcome, and the relationship adjusted for the unknown confounder, using a standardized morbidity ratio (SMR) or a Mantel-Haenszel (MH) estimate of the risk ratio. Finally, the bias parameters are shown.
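As a check on the mechanics, the Mantel-Haenszel risk ratio can be recomputed by hand from the two stratum-specific tables printed above. A sketch (not part of the package) using the cell counts from the output:

```r
## Cell counts by stratum of the confounder (C+, C-), taken from the output
a <- c(75.1705, 29.8295)    # exposed cases
b <- c(2.728, 82.272)       # unexposed cases
c0 <- c(430.4295, 96.5705)  # exposed noncases
d <- c(6.172, 86.828)       # unexposed noncases

n1 <- a + c0   # exposed totals per stratum
n0 <- b + d    # unexposed totals per stratum
tot <- n1 + n0 # stratum totals

## Mantel-Haenszel risk ratio
mh.rr <- sum(a * n0 / tot) / sum(b * n1 / tot)
round(mh.rr, 4)  # 0.4851, matching MHrr in the output
```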

### Probabilistic Sensitivity Analysis for Exposure Misclassification

We use a study on the effect of smoking during pregnancy on breast cancer risk (Fink and Lash), where we assume nondifferential misclassification of the exposure, smoking, with probability density functions for the sensitivities (Se) and specificities (Sp) among cases and noncases set to uniform distributions with a minimum of 0.7 and a maximum of 0.95 for the sensitivities (0.9 and 0.99, respectively, for the specificities). As usual, the 2X2 table is provided as a matrix. We choose to correct for exposure misclassification with the argument *type = "exposure"*. We ask for 10,000 replications (the default is 1,000). The Se and Sp for cases (*seca*, *spca*) are each given as a list whose first element is the type of distribution (choice between uniform, triangular, and trapezoidal) and whose second element gives the distribution parameters (min and max for the uniform distribution). By not providing Se and Sp for the noncases (*seexp*, *spexp*), we specify a nondifferential misclassification.

```r
smoke.nd <- probsens(matrix(c(215, 1449, 668, 4296),
                            dimnames = list(c("BC+", "BC-"), c("Smoke+", "Smoke-")),
                            nrow = 2, byrow = TRUE),
                     type = "exposure",
                     reps = 10000,
                     seca.parms = list("uniform", c(.7, .95)),
                     spca.parms = list("uniform", c(.9, .99)))
```

```
Observed Data:
--------------
Outcome   : BC+
Comparing : Smoke+ vs. Smoke-

    Smoke+ Smoke-
BC+    215   1449
BC-    668   4296

Observed Measures of Exposure-Outcome Relationship:
-----------------------------------------------------------------------------------
                                 95% conf. interval
Observed Relative Risk: 0.9654   0.8524    1.0934
   Observed Odds Ratio: 0.9542   0.8092    1.1252

                                               Median 2.5th percentile 97.5th percentile
    Relative Risk -- systematic error:         0.9432 0.8816           0.9612
       Odds Ratio -- systematic error:         0.9254 0.8477           0.9488
Relative Risk -- systematic and random error:  0.9372 0.8181           1.0662
   Odds Ratio -- systematic and random error:  0.9178 0.7671           1.0884

Bias Parameters:
----------------
   Se|Cases: uniform ( 0.7 0.95 )
   Sp|Cases: uniform ( 0.9 0.99 )
Se|No-cases: ( )
Sp|No-cases: ( )
```

The output gives the 2X2 table, the observed measures of association, the corrected measures of association, and the input bias parameters.
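Under nondifferential exposure misclassification, each replication draws Se and Sp from the priors and back-corrects the cell counts with the standard formula, corrected a = (a − (1 − Sp) × N) / (Se + Sp − 1), applied separately to cases and noncases. A hand-worked sketch (not part of the package) for one plausible draw:

```r
se <- 0.878; sp <- 0.958    # one draw from the uniform priors

## Observed cells
a <- 215; b <- 1449         # cases: exposed, unexposed
c0 <- 668; d <- 4296        # noncases: exposed, unexposed

## Back-corrected exposed counts among cases and noncases
A1 <- (a - (1 - sp) * (a + b)) / (se + sp - 1)
C1 <- (c0 - (1 - sp) * (c0 + d)) / (se + sp - 1)
B1 <- (a + b) - A1  # corrected unexposed cases
D1 <- (c0 + d) - C1 # corrected unexposed noncases

## Corrected relative risk for this draw
corr.rr <- (A1 / (A1 + C1)) / (B1 / (B1 + D1))
round(corr.rr, 3)  # ~0.951
```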

We saved the *probsens* analysis in a new object, *smoke.nd*. We can see the elements of this object with the function *str()*:

```r
str(smoke.nd)
```

```
List of 4
 $ obs.data    : num [1:2, 1:2] 215 668 1449 4296
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "BC+" "BC-"
  .. ..$ : chr [1:2] "Smoke+" "Smoke-"
 $ obs.measures: num [1:2, 1:3] 0.965 0.954 0.852 0.809 1.093 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "Observed Relative Risk:" "Observed Odds Ratio:"
  .. ..$ : chr [1:3] " " "95% conf." "interval"
 $ adj.measures: num [1:4, 1:3] 0.943 0.925 0.937 0.918 0.882 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Relative Risk -- systematic error:" "Odds Ratio -- systematic error:" "Relative Risk -- systematic and random error:" "Odds Ratio -- systematic and random error:"
  .. ..$ : chr [1:3] "Median" "2.5th percentile" "97.5th percentile"
 $ sim.df      :'data.frame':  10000 obs. of  12 variables:
  ..$ seca   : num [1:10000] 0.878 0.75 0.928 0.796 0.841 ...
  ..$ seexp  : num [1:10000] 0.878 0.75 0.928 0.796 0.841 ...
  ..$ spca   : num [1:10000] 0.958 0.948 0.971 0.906 0.969 ...
  ..$ spexp  : num [1:10000] 0.958 0.948 0.971 0.906 0.969 ...
  ..$ A1     : num [1:10000] 173.5 183.1 185.5 82.6 201.6 ...
  ..$ B1     : num [1:10000] 1490 1481 1479 1581 1462 ...
  ..$ C1     : num [1:10000] 550 584 583 284 634 ...
  ..$ D1     : num [1:10000] 4414 4380 4381 4680 4330 ...
  ..$ corr.RR: num [1:10000] 0.951 0.944 0.957 0.891 0.955 ...
  ..$ corr.OR: num [1:10000] 0.935 0.927 0.943 0.86 0.941 ...
  ..$ tot.RR : num [1:10000] 0.917 0.855 0.895 0.882 0.883 ...
  ..$ tot.OR : num [1:10000] 0.892 0.813 0.863 0.848 0.848 ...
```

*smoke.nd* is a list of 4 elements storing the various results of the analysis: *smoke.nd$obs.data*, the observed 2X2 table; *smoke.nd$obs.measures*, the observed measures of association; *smoke.nd$adj.measures*, the adjusted measures of association; and *smoke.nd$sim.df*, a data frame with the variables simulated at each replication, i.e. the Se and Sp, the 4 cells of the adjusted 2X2 table, and the adjusted measures. We can plot the Se prior distribution (not forgetting to discard the draws that led to negative adjustments).

```r
hist(smoke.nd$sim.df[!is.na(smoke.nd$sim.df$corr.RR), ]$seca,
     breaks = seq(0.65, 1, 0.01),
     col = "lightgreen",
     main = NULL,
     xlab = "Sensitivity for Cases")
```

Additional functions can be found in the reference manual, and more will be added in the coming months.

Denis,

nice that you made the effort for this package!

Regarding the blog post:

It would be even easier to follow (for me, anyways) if you would make explicit where you get all your probability estimates from, or if you simply make explicit which can be calculated from the data of the specific study, and which are educated guesses (I was and am still struggling with the first example in this regard).

Regarding your package:

I understand your choice of distributions for the Probabilistic Sensitivity Analysis is based on Lash et al. However, I still think the most plausible distribution would be a beta distribution. Is there a specific reason you didn’t implement it?

Minor: in your “confounders” example it reads “Data, Counfounder”, but should be “Data, Confounder”. It really doesn’t change anything, it just interrupts the reading flow ;-)

A general question:

Are you aware of approaches to bias analysis when the outcome is a continuous or count variable (i.e. not dichotomous)?

Hi Guido,

Thanks for your comments!

The example used is unusual in that the authors of the study had data on exposure prevalence among nonparticipants. More often you only have info on the number of nonparticipating cases and controls, and thus you have to guess what the selection proportions would be. This is better than nothing, and you can play with (and discuss) your guesses. So in this example all the selection probabilities were worked out from the available data, not guessed. I refer you to the book by Lash et al. for the detailed calculation.

For this first version of the package, I restricted the choice of probability distributions to only those 3, as they were the choices already available in the SAS macro developed by Lash et al. and in the episens Stata module. But the package is a work in progress and more functionality will be added, including additional distributions. I was planning an update before the summer but it looks like it will be more in July/August. To keep up-to-date, follow me on Twitter!

Regarding bias analysis and continuous/count variables, I’d go Bayesian. See for example this book by Gustafson.

Cheers!