# Veterinary Epidemiologic Research: GLM (part 4) – Exact and Conditional Logistic Regressions

Next topic on logistic regression: the exact and the conditional logistic regressions.

Exact logistic regression
When the dataset is very small or severely unbalanced, maximum likelihood estimates of coefficients may be biased. An alternative is to use exact logistic regression, available in R with the elrm package. Its syntax is based on an events/trials formulation. The dataset has to be collapsed into a data frame with unique combinations of predictors.
Another possibility is to use robust standard errors, and get comparable p-values to those obtained with exact logistic regression.

### exact logistic regression
x <- xtabs(~ casecont + interaction(dneo, dclox), data = nocardia)
x
interaction(dneo, dclox)
casecont 0.0 1.0 0.1 1.1
0  20  15   9  10
1   2  44   3   5
> nocardia.coll <- data.frame(dneo = rep(1:0, 2), dclox = rep(1:0, each = 2),
+                    casecont = x[1, ], ntrials = colSums(x))
nocardia.coll
dneo dclox casecont ntrials
0.0    1     1       20      22
1.0    0     1       15      59
0.1    1     0        9      12
1.1    0     0       10      15

library(elrm)
Le chargement a nécessité le package : coda
Le chargement a nécessité le package : lattice
mod5 <- elrm(formula = casecont/ntrials ~ dneo,
interest = ~dneo,
iter = 100000, dataset = nocardia.coll, burnIn = 2000)

### robust SE
library(robust)
Le chargement a nécessité le package : fit.models
Le chargement a nécessité le package : MASS
Le chargement a nécessité le package : robustbase
Le chargement a nécessité le package : rrcov
Le chargement a nécessité le package : pcaPP
Le chargement a nécessité le package : mvtnorm
Scalable Robust Estimators with High Breakdown Point (version 1.3-02)

mod6 <- glmrob(casecont ~ dcpct + dneo + dclox + dneo*dclox,
+                family = binomial, data = nocardia, method= "Mqle",
+                control= glmrobMqle.control(tcc=1.2))
> summary(mod6)

Call:  glmrob(formula = casecont ~ dcpct + dneo + dclox + dneo * dclox,      family = binomial, data = nocardia, method = "Mqle", control = glmrobMqle.control(tcc = 1.2))

Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) -4.440253   1.239138  -3.583 0.000339 ***
dcpct        0.025947   0.008504   3.051 0.002279 **
dneo         3.604941   1.034714   3.484 0.000494 ***
dclox        0.713411   1.193426   0.598 0.549984
dneo:dclox  -2.935345   1.367212  -2.147 0.031797 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Robustness weights w.r * w.x:
89 weights are ~= 1. The remaining 19 ones are summarized as
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.1484  0.4979  0.6813  0.6558  0.8764  0.9525

Number of observations: 108
Fitted by method ‘Mqle’  (in 5 iterations)

(Dispersion parameter for binomial family taken to be 1)

No deviance values available
Algorithmic parameters:
acc    tcc
0.0001 1.2000
maxit
50
test.acc
"coef"


Conditional logistic regression

Matched case-control studies analyzed with unconditional logistic regression model produce estimates of the odds ratios that are the square of their true value. But we can use conditional logistic regression to analyze matched case-control studies and get correct estimates. Instead of estimating a parameter for each matched set, a conditional model conditions the fixed effects out of the estimation. It can be run in R with clogit from the survival package:

temp <- tempfile()
+   "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
essai de l'URL 'http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip'
Content type 'application/zip' length 1107873 bytes (1.1 Mb)
URL ouverte
==================================================

### Salmonella outbreak dataset

library(survival)
mod7 <- clogit(casecontrol ~ slt_a + strata(match_grp), data = sal_outbrk)
summary(mod7)

Call:
coxph(formula = Surv(rep(1, 112L), casecontrol) ~ slt_a + strata(match_grp),
data = sal_outbrk, method = "exact")

n= 112, number of events= 39

coef exp(coef) se(coef)     z Pr(>|z|)
slt_a 1.4852    4.4159   0.5181 2.867  0.00415 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95
slt_a     4.416     0.2265       1.6     12.19

Rsquare= 0.085   (max possible= 0.518 )
Likelihood ratio test= 10  on 1 df,   p=0.001568
Wald test            = 8.22  on 1 df,   p=0.004148
Score (logrank) test = 9.48  on 1 df,   p=0.002075


# Veterinary Epidemiologic Research: GLM – Evaluating Logistic Regression Models (part 3)

Third part on logistic regression (first here, second here).
Two steps in assessing the fit of the model: first is to determine if the model fits using summary measures of goodness of fit or by assessing the predictive ability of the model; second is to deterime if there’s any observations that do not fit the model or that have an influence on the model.

Covariate pattern
A covariate pattern is a unique combination of values of predictor variables.

mod3 <- glm(casecont ~ dcpct + dneo + dclox + dneo*dclox,
+             family = binomial("logit"), data = nocardia)
summary(mod3)

Call:
glm(formula = casecont ~ dcpct + dneo + dclox + dneo * dclox,
family = binomial("logit"), data = nocardia)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.9191  -0.7682   0.1874   0.5876   2.6755

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)      -3.776896   0.993251  -3.803 0.000143 ***
dcpct             0.022618   0.007723   2.928 0.003406 **
dneoYes           3.184002   0.837199   3.803 0.000143 ***
dcloxYes          0.445705   1.026026   0.434 0.663999
dneoYes:dcloxYes -2.551997   1.205075  -2.118 0.034200 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 149.72  on 107  degrees of freedom
Residual deviance: 103.42  on 103  degrees of freedom
AIC: 113.42

Number of Fisher Scoring iterations: 5

library(epiR)
Package epiR 0.9-45 is loaded
Type help(epi.about) for summary information

mod3.mf <- model.frame(mod3)
(mod3.cp <- epi.cp(mod3.mf[-1]))
$cov.pattern id n dcpct dneo dclox 1 1 7 0 No No 2 2 38 100 Yes No 3 3 1 25 No No 4 4 1 1 No No 5 5 11 100 No Yes 8 6 1 25 Yes Yes 10 7 1 14 Yes No 12 8 4 75 Yes No 13 9 1 90 Yes Yes 14 10 1 30 No No 15 11 3 5 Yes No 17 12 9 100 Yes Yes 22 13 2 20 Yes No 23 14 8 100 No No 25 15 2 50 Yes Yes 26 16 1 7 No No 27 17 4 50 Yes No 28 18 1 50 No No 31 19 1 30 Yes No 34 20 1 99 No No 35 21 1 99 Yes Yes 40 22 1 80 Yes Yes 48 23 1 3 Yes No 59 24 1 1 Yes No 77 25 1 10 No No 84 26 1 83 No Yes 85 27 1 95 Yes No 88 28 1 99 Yes No 89 29 1 25 Yes No 105 30 1 40 Yes No$id
[1]  1  2  3  4  5  1  1  6  5  7  5  8  9 10 11 11 12  1 12  1  5 13 14  2 15
[26] 16 17 18  1  2 19  2 14 20 21 12 14  5  8 22 14  5  5  5  1 14 14 23  2 12
[51] 14 12 11  5 15  2  8  2 24  2  2  8  2 17  2  2  2  2 12 12 12  2  2  2  5
[76]  2 25  2 17  2  2  2  2 26 27 13 17 28 29 14  2 12  2  2  2  2  2  2  2  2
[101]  2  2  2  2 30  2  2  5


There are 30 covariate patterns in the dataset. The pattern dcpct=100, dneo=Yes, dclox=No appears 38 times.

Pearson and deviance residuals
Residuals represent the difference between the data and the model. The Pearson residuals are comparable to standardized residuals used for linear regression models. Deviance residuals represent the contribution of each observation to the overall deviance.

residuals(mod3) # deviance residuals
residuals(mod3, "pearson") # pearson residuals


Goodness-of-fit test
All goodness-of-fit tests are based on the premise that the data will be divided into subsets and within each subset the predicted number of outcomes will be computed and compared to the observed number of outcomes. The Pearson $\chi^2$ and the deviance $\chi^2$ are based on dividing the data up into the natural covariate patterns. The Hosmer-Lemeshow test is based on a more arbitrary division of the data.

The Pearson $\chi^2$ is similar to the residual sum of squares used in linear models. It will be close in size to the deviance, but the model is fit to minimize the deviance and not the Pearson $\chi^2$. It is thus possible even if unlikely that the $\chi^2$ could increase as a predictor is added to the model.

sum(residuals(mod3, type = "pearson")^2)
[1] 123.9656
deviance(mod3)
[1] 103.4168
1 - pchisq(deviance(mod3), df.residual(mod3))
[1] 0.4699251


The p-value is large indicating no evidence of lack of fit. However, when using the deviance statistic to assess the goodness-of-fit for a nonsaturated logistic model, the $\chi^2$ approximation for the likelihood ratio test is questionable. When the covariate pattern is almost as large as N, the deviance cannot be assumed to have a $\chi^2$ distribution.
Now the Hosmer-Lemeshow test, usually dividing by 10 the data:

hosmerlem <- function (y, yhat, g = 10) {
+   cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
+                  include.lowest = TRUE)
+   obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
+   expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
+   chisq <- sum((obs - expect)^2 / expect)
+   P <- 1 - pchisq(chisq, g - 2)
+   c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
+ }
hosmerlem(y = nocardia$casecont, yhat = fitted(mod3)) Erreur dans cut.default(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)), (depuis #2) : 'breaks' are not unique  The model used has many ties in its predicted probabilities (too few covariate values?) resulting in an error when running the Hosmer-Lemeshow test. Using fewer cut-points (g = 5 or 7) does not solve the problem. This is a typical example when not to use this test. A better goodness-of-fit test than Hosmer-Lemeshow and Pearson / deviance $\chi^2$ tests is the le Cessie – van Houwelingen – Copas – Hosmer unweighted sum of squares test for global goodness of fit (also here) implemented in the rms package (but you have to implement your model with the lrm function of this package): mod3b <- lrm(casecont ~ dcpct + dneo + dclox + dneo*dclox, nocardia, + method = "lrm.fit", model = TRUE, x = TRUE, y = TRUE, + linear.predictors = TRUE, se.fit = FALSE) residuals(mod3b, type = "gof") Sum of squared errors Expected value|H0 SD 16.4288056 16.8235055 0.2775694 Z P -1.4219860 0.1550303  The p-value is 0.16 so there’s no evidence the model is incorrect. Even better than these tests would be to check for linearity of the predictors. Overdispersion Sometimes we can get a deviance that is much larger than expected if the model was correct. It can be due to the presence of outliers, sparse data or clustering of data. The approach to deal with overdispersion is to add a dispersion parameter $\sigma^2$. It can be estimated with: $\hat{\sigma}^2 = \frac{\chi^2}{n - p}$ (p = probability of success). A half-normal plot of the residuals can help checking for outliers: library(faraway) halfnorm(residuals(mod1))  The dispesion parameter of model 1 can be found as: (sigma2 <- sum(residuals(mod1, type = "pearson")^2) / 104) [1] 1.128778 drop1(mod1, scale = sigma2, test = "F") Single term deletions Model: casecont ~ dcpct + dneo + dclox scale: 1.128778 Df Deviance AIC F value Pr(>F) <none> 107.99 115.99 dcpct 1 119.34 124.05 10.9350 0.001296 ** dneo 1 125.86 129.82 17.2166 6.834e-05 *** dclox 1 114.73 119.96 6.4931 0.012291 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Message d'avis : In drop1.glm(mod1, scale = sigma2, test = "F") : le test F implique une famille 'quasibinomial'  The dispersion parameter is not very different than one (no dispersion). If dispersion was present, you could use it in the F-tests for the predictors, adding scale to drop1. Predictive ability of the model A ROC curve can be drawn: predicted <- predict(mod3) library(ROCR) prob <- prediction(predicted, nocardia$casecont,
+                    label.ordering = c('Control', 'Case'))
tprfpr <- performance(prob, "tpr", "fpr")
tpr <- unlist(slot(tprfpr, "y.values"))
fpr <- unlist(slot(tprfpr, "x.values"))
roc <- data.frame(tpr, fpr)
ggplot(roc) + geom_line(aes(x = fpr, y = tpr)) +
+   geom_abline(intercept = 0, slope = 1, colour = "gray") +
+   ylab("Sensitivity") +
+   xlab("1 - Specificity")


Identifying important observations
Like for linear regression, large positive or negative standardized residuals allow to identify points which are not well fit by the model. A plot of Pearson residuals as a function of the logit for model 1 is drawn here, with bubbles relative to size of the covariate pattern. The plot should be an horizontal band with observations between -3 and +3. Covariate patterns 25 and 26 are problematic.

nocardia$casecont.num <- as.numeric(nocardia$casecont) - 1
mod1 <- glm(casecont.num ~ dcpct + dneo + dclox, family = binomial("logit"),
+             data = nocardia) # "logit" can be omitted as it is the default
mod1.mf <- model.frame(mod1)
mod1.cp <- epi.cp(mod1.mf[-1])
nocardia.cp <- as.data.frame(cbind(cpid = mod1.cp$id, + nocardia[ , c(1, 9:11, 13)], + fit = fitted(mod1))) ### Residuals and delta betas based on covariate pattern: mod1.obs <- as.vector(by(as.numeric(nocardia.cp$casecont.num),
+                          as.factor(nocardia.cp$cpid), FUN = sum)) mod1.fit <- as.vector(by(nocardia.cp$fit, as.factor(nocardia.cp$cpid), + FUN = min)) mod1.res <- epi.cpresids(obs = mod1.obs, fit = mod1.fit, + covpattern = mod1.cp) mod1.lodds <- as.vector(by(predict(mod1), as.factor(nocardia.cp$cpid),
+                            FUN = min))

plot(mod1.lodds, mod1.res$spearson, + type = "n", ylab = "Pearson Residuals", xlab = "Logit") text(mod1.lodds, mod1.res$spearson, labels = mod1.res$cpid, cex = 0.8) symbols(mod1.lodds, mod1.res$pearson, circles = mod1.res$n, add = TRUE)  The hat matrix is used to calculate leverage values and other diagnostic parameters. Leverage measures the potential impact of an observation. Points with high leverage have a potential impact. Covariate patterns 2, 14, 12 and 5 have the largest leverage values. mod1.res[sort.list(mod1.res$leverage, decreasing = TRUE), ]
cpid  leverage
2   0.74708052
14  0.54693851
12  0.54017700
5   0.42682684
11  0.21749664
1   0.19129427
...


Delta-betas provides an overall estimate of the effect of the $j^{th}$ covariate pattern on the regression coefficients. It is analogous to Cook’s distance in linear regression. Covariate pattern 2 has the largest delta-beta (and represents 38 observations).

mod1.res[sort.list(mod1.res$sdeltabeta, decreasing = TRUE), ] cpid sdeltabeta 2 7.890878470 14 3.983840529 ...  # Veterinary Epidemiologic Research: GLM – Logistic Regression (part 2) Second part on logistic regression (first one here). We used in the previous post a likelihood ratio test to compare a full and null model. The same can be done to compare a full and nested model to test the contribution of any subset of parameters: mod1.nest <- glm(casecont ~ dcpct, family = binomial("logit"), + data = nocardia) lr.mod1.nest <- -(deviance(mod1.nest) / 2) (lr <- 2 * (lr.mod1 - lr.mod1.nest)) [1] 30.16203 1 - pchisq(lr, 2) [1] 2.820974e-07 ### or, more straightforward, using anova to compare the 2 models: anova(mod1, mod1.nest, test = "Chisq") Analysis of Deviance Table Model 1: casecont ~ dcpct + dneo + dclox Model 2: casecont ~ dcpct Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 104 107.99 2 106 138.15 -2 -30.162 2.821e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  Interpretation of coefficients ### example 16.2 nocardia$dbarn <- as.factor(nocardia$dbarn) mod2 <- glm(casecont ~ dcpct + dneo + dclox + dbarn, + family = binomial("logit"), data = nocardia) (mod2.sum <- summary(mod2)) Call: glm(formula = casecont ~ dcpct + dneo + dclox + dbarn, family = binomial("logit"), data = nocardia) Deviance Residuals: Min 1Q Median 3Q Max -1.6949 -0.7853 0.1021 0.7692 2.6801 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.445696 0.854328 -2.863 0.00420 ** dcpct 0.021604 0.007657 2.821 0.00478 ** dneoYes 2.685280 0.677273 3.965 7.34e-05 *** dcloxYes -1.235266 0.580976 -2.126 0.03349 * dbarntiestall -1.333732 0.631925 -2.111 0.03481 * dbarnother -0.218350 1.154293 -0.189 0.84996 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 149.72 on 107 degrees of freedom Residual deviance: 102.32 on 102 degrees of freedom AIC: 114.32 Number of Fisher Scoring iterations: 5 cbind(exp(coef(mod2)), exp(confint(mod2))) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 0.08666577 0.01410982 0.4105607 dcpct 1.02183934 1.00731552 1.0383941 dneoYes 14.66230075 4.33039752 64.5869271 dcloxYes 0.29075729 0.08934565 0.8889877 dbarntiestall 0.26349219 0.06729031 0.8468235 dbarnother 0.80384385 0.08168466 8.2851189  Note: Dohoo do not report the profile likelihood-based confidence interval on page 404. As said in the previous post, the profile likelihood-based CI is preferable due to the Hauck-Donner effect (overestimation of the SE). Using neomycin in the herd increases the log odds of Nocardia mastitis by 2.7 units (or using neomycin increases the odds 14.7 times). Since Nocardia mastitis is a rare condition, we can interpret the odds ratio as a risk ratio (RR) and say neomycin increases the risk of Nocardia mastitis by 15 times. Changing the percentage of dry cows treated from 50 to 75% increases the log odds of disease by $(75 - 50) \times 0.022 = 0.55$ units (or $1.022^{(75-50)} = 1.72$). An increase of 25% in the percentage of cows dry-treated increases the risk of disease by about 72% (i.e. 1.72 times). Tiestall barns and other barn types both have lower risks of Nocardia mastitis than freestall barns. The significance of the main effects can be tested with: drop1(mod2, test = "Chisq") Single term deletions Model: casecont ~ dcpct + dneo + dclox + as.factor(dbarn) Df Deviance AIC LRT Pr(>Chi) <none> 102.32 114.32 dcpct 1 111.36 121.36 9.0388 0.002643 ** dneo 1 123.91 133.91 21.5939 3.369e-06 *** dclox 1 107.02 117.02 4.7021 0.030126 * as.factor(dbarn) 2 107.99 115.99 5.6707 0.058697 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  The drop1 function tests each predictor relative to the full model. Presenting effects of factors on the probability scale Usually we think about the probability of disease rather than the odds, and the probability of disease is not linearly related to the factor of interest. A unit increase in the factor does not increase the probability of disease by a fixed amount. It depends on the level of the factor and the levels of other factors in the model. mod1 <- glm(casecont ~ dcpct + dneo + dclox, family = binomial("logit"), + data = nocardia) nocardia$neo.pred <- predict(mod1, type = "response", se.fit = FALSE)
library(ggplot2)
ggplot(nocardia, aes(x = dcpct, y = neo.pred, group = dneo,
+                      colour = factor(dneo))) +
+   stat_smooth(method = "glm", family = "binomial", se = FALSE) +
+   labs(colour = "Neomycin", x = "Percent of cows dry treated",
+        y = "Probability of Nocardia")


Next: evaluating logistic regression models.