Bias in Observational Studies – Sensitivity Analysis with R package episensr

When it’s time to interpret the study results from your observational study, you have to estimate if the effect measure you obtained is the truth, if it’s due to bias (systematic error, the effect measure’s precision), or if it’s due to chance (random error, the effect measure’s validity) (Rothman and Greenland, 2008, pp115-134). Every study has some random error due to its limited sample size, and is susceptible to systematic errors as well, from selection bias to the presence of (un)known confounders or information bias (measurement error, including misclassification).

Bias analysis, or sensitivity analysis, tries to quantify the direction, magnitude, and uncertainty of the bias affecting an estimate of association (Greenland and Lash, 2008, pp345-380; Lash et al., 2009).

Very often bias is evaluated qualitatively without any quantitative assessment of its magnitude. This might be due to the very few tools available to epidemiologists. Hence this R package, episensr, which is built following the book by Lash et al. I will illustrate its use with 3 studies from this book.

First you have to install and load the package.

# or get it from Github repo with
# library(devtools)
# install_github("dhaine/episensr")


Selection Bias

We will use a case-control study by Stang et al. on the relation between mobile phone use and uveal melanoma. The observed odds ratio for the association between regular mobile phone use vs. no mobile phone use with uveal melanoma incidence is 0.71 [95% CI 0.51-0.97]. But there was a substantial difference in participation rates between cases and controls (94% vs 55%, respectively) and so selection bias could have an impact on the association estimate. The 2X2 table for this study is the following:

Regular Use No Use
Cases 136 107
Controls 297 165

We use the function selection as shown below.

selection(matrix(c(136, 107, 297, 165),
                 dimnames = list(c("UM+", "UM-"), c("Mobile+", "Mobile-")),
                 nrow = 2, byrow = TRUE),
          selprob = c(.94, .85, .64, .25))

Observed Data:
Outcome   : UM+
Comparing : Mobile+ vs. Mobile- 

    Mobile+ Mobile-
UM+     136     107
UM-     297     165

Data Corrected for Selected Proportions:

     Mobile+  Mobile-
UM+ 144.6809 125.8824
UM- 464.0625 660.0000

                               95% conf. interval
Observed Relative Risk: 0.7984    0.6518   0.9780
   Observed Odds Ratio: 0.7061    0.5144   0.9693

Selection Bias Corrected Relative Risk: 1.4838
   Selection Bias Corrected Odds Ratio: 1.6346

     Selection probability among cases exposed: 0.94
   Selection probability among cases unexposed: 0.85
  Selection probability among noncases exposed: 0.64
Selection probability among noncases unexposed: 0.25

The 2X2 table is provided as a matrix and selection probabilities given with the argument selprob, a vector with the 4 probabilities (guided by the participation rates in cases and controls) in the following order: among cases exposed, among cases unexposed, among noncases exposed, and among noncases unexposed. The output shows the observed 2X2 table, the same table corrected for the selection proportions, the observed odds ratio (and relative risk) followed by the corrected ones, and the input parameters.

Uncontrolled Confounders

We will use date from a cross-sectional study by Tyndall et al. on the association between male circumcision and the risk of acquiring HIV, which might be confounded by religion. The code to account for unmeasured or unknown confounders is the following, where the 2X2 table is given as a matrix. We choose a risk ratio implementation, provide a vector defining the prevalence of the confounder, religion, among the exposed and the unexposed, and give the risk ratio associating the confounder with the disease.

confounders(matrix(c(105, 85, 527, 93),
                   dimnames = list(c("HIV+", "HIV-"), c("Circ+", "Circ-")),
                   nrow = 2, byrow = TRUE),
            implement = "RR",
            p = c(.8, .05),
   = .63)

Observed Data:
Outcome   : HIV+
Comparing : Circ+ vs. Circ- 

     Circ+ Circ-
HIV+   105    85
HIV-   527    93

Data, Counfounder +:

        Circ+ Circ-
HIV+  75.1705 2.728
HIV- 430.4295 6.172

Data, Counfounder -:

       Circ+  Circ-
HIV+ 29.8295 82.272
HIV- 96.5705 86.828

Crude and Unmeasured Confounder Specific Measures of Exposure-Outcome Relationship:

                                    95% conf. interval
        Crude Relative Risk: 0.3479    0.2757    0.439
Relative Risk, Confounder +: 0.4851
Relative Risk, Confounder -: 0.4851 

Exposure-Outcome Relationship Adjusted for Confounder:

Standardized Morbidity Ratio     SMRrr: 0.4851    RR adjusted using SMR estimate: 0.7173
Mantel-Haenszel                   MHrr: 0.4851     RR adjusted using MH estimate: 0.7173 

Bias Parameters:

p(Confounder+|Exposure+): 0.8
p(Confounder+|Exposure-): 0.05
  RR(Confounder-Outcome): 0.63

The output gives the crude 2X2 table, the 2X2 tables by levels of the confounder, the crude relative risk and confounder specific measures of association between exposure and outcome, and the relationship adjusted for the unknown confounder, using a standardized morbidity ratio (SMR) or a Mantel-Haenszel (MH) estimate of the risk ratio. Finally, the bias parameters are shown.

Probabilistic Sensitivity Analysis for Exposure Misclassification

We use a study on the effect of smoking during pregnancy on breast cancer risk (Fink and Lash), where we assume nondifferential misclassification of the exposure, smoking, with probability density functions for sensitivities (Se) and specificities (Sp) among cases and noncases equal to uniform distributions with a minimum of 0.7 and a maximum of 0.95 for sensitivities (0.9 and 0.99 respectively for specificities). As usual, the 2X2 table is provided as a matrix. We choose to correct for exposure misclassification with the argument type = exposure. We ask for 10000 replications (default is 1000). The Se and Sp for cases (seca, spca) are given as a list with its first element referring to the type of distribution (choice between uniform, triangular and trapezoidal) and the second element giving the distribution parameters (min and max for uniform distribution). By avoiding to provide information on the noncases (seexp, spexp), we are referring to a nondifferential misclassification.

smoke.nd <- probsens(matrix(c(215, 1449, 668, 4296),
                            dimnames = list(c("BC+", "BC-"), c("Smoke+", "Smoke-")),
                            nrow = 2, byrow = TRUE),
                     type = "exposure",
                     reps = 10000,
                     seca.parms = list("uniform", c(.7, .95)),
                     spca.parms = list("uniform", c(.9, .99)))

Observed Data:
Outcome   : BC+
Comparing : Smoke+ vs. Smoke- 

    Smoke+ Smoke-
BC+    215   1449
BC-    668   4296

Observed Measures of Exposure-Outcome Relationship:

                                95% conf. interval
 Observed Relative Risk: 0.9654    0.8524   1.0934
    Observed Odds Ratio: 0.9542    0.8092   1.1252

                                              Median 2.5th percentile
           Relative Risk -- systematic error: 0.9432           0.8816
              Odds Ratio -- systematic error: 0.9254           0.8477
Relative Risk -- systematic and random error: 0.9372           0.8181
   Odds Ratio -- systematic and random error: 0.9178           0.7671
                                              97.5th percentile
           Relative Risk -- systematic error:            0.9612
              Odds Ratio -- systematic error:            0.9488
Relative Risk -- systematic and random error:            1.0662
   Odds Ratio -- systematic and random error:            1.0884

Bias Parameters:

   Se|Cases: uniform ( 0.7 0.95 )
   Sp|Cases: uniform ( 0.9 0.99 )
Se|No-cases: ( )
Sp|No-cases: ( )

The output gives the 2X2 table, the observed measures of association, the corrected measures of association, and the input bias parameters.

We saved the probsens analysis in a new variable smoke.nd. We can see the element of the object probsens with the function str():


List of 4
 $    : num [1:2, 1:2] 215 668 1449 4296
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "BC+" "BC-"
  .. ..$ : chr [1:2] "Smoke+" "Smoke-"
 $ obs.measures: num [1:2, 1:3] 0.965 0.954 0.852 0.809 1.093 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] " Observed Relative Risk:" "    Observed Odds Ratio:"
  .. ..$ : chr [1:3] "     " "95% conf." "interval"
 $ adj.measures: num [1:4, 1:3] 0.943 0.925 0.937 0.918 0.882 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "           Relative Risk -- systematic error:" "              Odds Ratio -- systematic error:" "Relative Risk -- systematic and random error:" "   Odds Ratio -- systematic and random error:"
  .. ..$ : chr [1:3] "Median" "2.5th percentile" "97.5th percentile"
 $ sim.df      :'data.frame':	10000 obs. of  12 variables:
  ..$ seca   : num [1:10000] 0.878 0.75 0.928 0.796 0.841 ...
  ..$ seexp  : num [1:10000] 0.878 0.75 0.928 0.796 0.841 ...
  ..$ spca   : num [1:10000] 0.958 0.948 0.971 0.906 0.969 ...
  ..$ spexp  : num [1:10000] 0.958 0.948 0.971 0.906 0.969 ...
  ..$ A1     : num [1:10000] 173.5 183.1 185.5 82.6 201.6 ...
  ..$ B1     : num [1:10000] 1490 1481 1479 1581 1462 ...
  ..$ C1     : num [1:10000] 550 584 583 284 634 ...
  ..$ D1     : num [1:10000] 4414 4380 4381 4680 4330 ...
  ..$ corr.RR: num [1:10000] 0.951 0.944 0.957 0.891 0.955 ...
  ..$ corr.OR: num [1:10000] 0.935 0.927 0.943 0.86 0.941 ...
  ..$ tot.RR : num [1:10000] 0.917 0.855 0.895 0.882 0.883 ...
  ..$ tot.OR : num [1:10000] 0.892 0.813 0.863 0.848 0.848 ...

smoke.nd is a list of 4 elements where different information on the analysis done are saved. We have smoke.nd$ where we have the observed 2X2 table, smoke.nd$obs.measures (the observed measures of association), smoke.nd$adj.measures (the adjusted measures of association), and smoke.nd$sim.df, a data frame with the simulated variables from each replication, like the Se and Sp, the 4 cells of the adjusted 2X2 table, and the adjusted measures. We can plot the Se prior distribution (not forgetting to discard the draws that led to negative adjustments).

hist(smoke.nd$sim.df[!$sim.df$corr.RR), ]$seca,
     breaks = seq(0.65, 1, 0.01),
     col = "lightgreen",
     main = NULL,
     xlab = "Sensitivity for Cases")
Histogram of 10000 draws from uniform distribution for Se among cases
Histogram of 10000 draws from uniform distribution for Se among cases


Additional functions can be found on the reference manual and more will be added in the coming months.

2 thoughts on “Bias in Observational Studies – Sensitivity Analysis with R package episensr”

  1. Denis,
    nice that you made the effort for this package!

    Regarding the blog post:
    It would be even easier to follow (for me, anyways) if you would make explicit where you get all your probability estimates from, or if you simply make explicit which can be calculated from the data of the specific study, and which are educated guesses (I was and am still struggling with the first example in this regard).

    Regarding your package:
    I understand your choice of distributions for the Probabilistic Sensitivity Analysis is based on Lash et al. However, I still think the most plausible distribution would be a beta distribution. is there a specific reason you didn’t implement it?
    Minor: in your “confounders” example it reads “Data, Counfounder”, but should be “Data, Confounder” it really doesn’t change anything, it just interrupts the reading flow ;-)

    A general question:
    Are you aware of approaches to bias analysis when the outcome is a continuous or count variable (i.e. not dichotomous)

    1. Hi Guido,
      Thanks for your comments!
      The example used is unusual as the authors of the study had data on exposure prevalence among nonparticipants to the study. More often you only have info on the number of nonparticipating cases and controls, and thus you have to guess what would be the selection proportions. This is better than nothing and you can play with (and discuss) your guesses. So in this example all the selection probabilities were worked out based on the available data, and not guessed. I refer you to the book of Lash et al. for a detailed calculation.

      As a first version for this package, I restrained the choice of probability distribution to only those 3, as it was the choices already available in the SAS macro developed by Lash et al. and in the episens Stata module. But the package is a work in progress and more functionalities will be added, including additional distributions. I was planning an update before the summer but it looks it will be more in July/August. To keep up-to-date, follow me on Twitter!
      Regarding bias analysis and continuous/count variables, I’d go bayesian. See for example this book by Gustafson.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s