This post will describe linear regression as from the book Veterinary Epidemiologic Research, describing the examples provided with R.

Regression analysis is used for modeling the relationship between a single variable Y (the outcome, or dependent variable) measured on a continuous or near-continuous scale and one or more predictor (independent or explanatory variable), X. If only one predictor is used, we have a simple regression model, and if we have more than one predictor, we have a multiple regression model. Let’s download the data coming together with the book. There’s a choice of various format, including R:

temp <- tempfile() download.file("http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", tmp) # fetch the file into the temporary file load(unz(tmp, "ver2_data_R/daisy2.rdata")) # extract the target file from the temporary file unlink(tmp) # remove the temporary file ### some data management daisy2 <- daisy2[daisy2$h7 == 1, ] # we only use a subset of the data daisy2[, c(4, 7, 11, 15:18)] <- lapply(daisy2[, c(4, 7, 11, 15:18)], factor) library(Hmisc) daisy2 <- upData(daisy2, labels = c(region = 'Region', herd = 'Herd number', cow = 'Cow number', study_lact = 'Study lactation number', herd_size = 'Herd size', mwp = "Minimum wait period for herd", parity = 'Lactation number', milk120 = 'Milk volume in first 120 days of lactation', calv_dt = 'Calving date', cf = 'Calving to first service interval', fs = 'Conception at first service', cc = 'Calving to conception interval', wpc = 'Interval from wait period to conception', spc = 'Services to conception', twin = 'Twins born', dyst = 'Dystocia at calving', rp = 'Retained placenta at calving', vag_disch = 'Vaginal discharge observed', h7 = 'Indicator for 7 herd subset'), levels = list(fs = list('No' = 0, 'Yes' = 1), twin = list('No' = 0, 'Yes' = 1), dyst = list('No' = 0, 'Yes' = 1), rp = list('No' = 0, 'Yes' = 1), vag_disch = list('No' = 0, 'Yes' = 1)), units = c(milk120 = "litres")) summary(daisy2$milk120, na.rm = TRUE) Min. 1st Qu. Median Mean 3rd Qu. Max. 1110 2742 3226 3225 3693 5630 sd(daisy2$milk120, na.rm = TRUE) [1] 703.364 daisy2 <- daisy2[!is.na(daisy2$milk120), ] # get rid of missing observations for milk production

To use R to download zipped file, see this answer on Stackoverflow by Dirk Eddelbuettel.

The regression model could be written: . A first simple model, with a categorical independent variable would be in R:

lm.milk <- lm(milk120 ~ parity, data = daisy2) (lm.milk.sum <- summary(lm.milk)) Call: lm(formula = milk120 ~ parity, data = daisy2) Residuals: Milk volume in first 120 days of lactation [litres] Min 1Q Median 3Q Max -2234.73 -384.50 -0.63 376.52 2188.48 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2629.63 29.31 89.712 < 2e-16 *** parity2 715.30 42.45 16.849 < 2e-16 *** parity3 794.67 44.05 18.039 < 2e-16 *** parity4 834.51 48.50 17.205 < 2e-16 *** parity5 812.19 50.77 15.998 < 2e-16 *** parity6 929.63 66.92 13.893 < 2e-16 *** parity7 795.26 218.86 3.634 0.000287 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 613.5 on 1761 degrees of freedom Multiple R-squared: 0.2419, Adjusted R-squared: 0.2393 F-statistic: 93.65 on 6 and 1761 DF, p-value: < 2.2e-16

The output shows in order: the model specified, the distribution of residuals (which should be normal with mean = 0 and so median close to 0), the coefficients and their standard errors, the standard error of the residuals, the and adjusted and then a F test for the null hypothesis (all the coefficients equal zero except the intercept, i.e. is our model better to explain variance than a model predicting the dependent variable mean for all data points). But where’s the ANOVA table?

anova(lm.milk) Analysis of Variance Table Response: milk120 Df Sum Sq Mean Sq F value Pr(>F) parity 6 211463774 35243962 93.653 < 2.2e-16 *** Residuals 1761 662708166 376325 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

How would you calculate this F-test yourself? The F value is given by , with SST the total sum of square, SSE the sum of square for the residuals, n and p the degrees of freedom for total and residuals. We thus have to get a p-value.

(src.var.total <- sum((daisy2$milk120 - mean(daisy2$milk120))^2)) #source of variation: total [1] 874171940 (src.var.res <- sum(lm.milk$res^2)) #source of variation: error (or residual) [1] 662708166 (f.stat <- ((src.var.total - src.var.res) / (lm.milk.sum$fstatistic[2])) / + (src.var.res / lm.milk.sum$fstatistic[3])) #F-statistic 93.65301 1 - pf(f.stat, lm.milk.sum$fstatistic[2], lm.milk.sum$fstatistic[3]) # p-value 0

We could also compute ourself :

1 - sum(lm.milk$res^2) / sum((daisy2$milk120 - mean(daisy2$milk120))^2) [1] 0.2419018

We can also show some actual values, predicted ones and residuals:

head(data.frame(daisy2[, c(7:8)], fitted.value = fitted(lm.milk), + residual = resid(lm.milk)), n = 10) parity milk120 fitted.value residual 1 5 3505.8 3441.824 63.97631 2 5 3691.3 3441.824 249.47631 3 5 4173.0 3441.824 731.17626 4 5 3727.3 3441.824 285.47631 5 5 3090.8 3441.824 -351.02369 6 4 5041.2 3464.141 1577.05893 7 5 3861.2 3441.824 419.37621 8 5 4228.4 3441.824 786.57616 9 6 3431.1 3559.258 -128.15761 10 5 4445.5 3441.824 1003.67626

Next we could plot actual values and intervals for prediction both for the mean and new observations. We first create a new data frame to hold the values of parity for which we want the predictions, then compute the prediction and confidence bands and plot them with ggplo2.

confidence <- data.frame(parity = as.factor(1:7)) # Confidence bands confidence.band <- predict(lm.milk, int = "c" , newdata = confidence) confidence.band <- as.data.frame(cbind(confidence.band, parity = c(1:7))) # Prediction bands prediction.band <- predict(lm.milk, int = "p" , newdata = confidence) prediction.band <- as.data.frame(cbind(prediction.band, parity = c(1:7))) library(ggplot2) ggplot(data = daisy2, aes(x = as.numeric(parity), y = milk120)) + geom_point() + geom_smooth(data = confidence.band, aes(x = parity, y = lwr), method = lm, se = FALSE, colour = "blue") + geom_smooth(data = confidence.band, aes(x = parity, y = upr), method = lm, se = FALSE, colour = "blue") + geom_smooth(data = prediction.band, aes(x = parity, y = lwr), method = lm, se = FALSE, colour = "green") + geom_smooth(data = prediction.band, aes(x = parity, y = upr), method = lm, se = FALSE, colour = "green") + xlab("Parity") + ylab("Milk volume in first 120 days of lactation")

Finally, let’s have a look at a multiple regression and see how we can assess the significance of predictor variables. We will look at the effect of milk production, parity, retained placenta and vaginal discharge on the waiting period of the cows.

daisy2$milk120.sq <- daisy2$milk120^2 daisy2$parity.cat[as.numeric(daisy2$parity) >= 3] <- "3+" daisy2$parity.cat[daisy2$parity == 2] <- "2" daisy2$parity.cat[as.numeric(daisy2$parity) < 2] <- "1" lm.wpc <- lm(wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch, data = daisy2) summary(lm.wpc) Call: lm(formula = wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch, data = daisy2) Residuals: Interval from wait period to conception Min 1Q Median 3Q Max -69.21 -37.98 -14.66 24.46 219.46 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.232e+02 2.134e+01 5.775 9.29e-09 *** milk120 -3.426e-02 1.339e-02 -2.558 0.0106 * milk120.sq 4.507e-06 2.011e-06 2.241 0.0251 * as.factor(parity.cat)2 5.083e+00 4.056e+00 1.253 0.2103 as.factor(parity.cat)3+ 8.887e+00 3.665e+00 2.425 0.0154 * rpYes 9.771e+00 4.572e+00 2.137 0.0327 * vag_dischYes 9.838e+00 6.086e+00 1.616 0.1062 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 51.47 on 1529 degrees of freedom (232 observations deleted due to missingness) Multiple R-squared: 0.01306, Adjusted R-squared: 0.009191 F-statistic: 3.373 on 6 and 1529 DF, p-value: 0.002622

The F-test tells us if any of the predictors is useful in predicting the length of the waiting period. The F-statistic is significant and we can reject the null hypothesis that no predictors are useful. If we had failed to reject the null hypothesis, we would still be interested in the possibility of non-linear transformations of variables (already done here) or the presence of outliers, masking the true effect. Also we could have not enough data to demonstrate a real effect. Now that we rejected this null hypothesis, maybe not all predictors are necessary to predict the response.

We can see from the summary table which predictors are significant. However we can test a single predictor, with an F-test:

lm.wpc2 <- lm(wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp, data = daisy2) anova(lm.wpc, lm.wpc2) Analysis of Variance Table Model 1: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch Model 2: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp Res.Df RSS Df Sum of Sq F Pr(>F) 1 1529 4050745 2 1530 4057666 -1 -6921.1 2.6125 0.1062

or with a t-test, as with degrees of freedom. Note that equals the F-statistic.

The same approach can be used to test a group of predictors. For example, if we want to see if we have to keep both milk production and its squared value in the model:

lm.wpc3 <- lm(wpc ~ as.factor(parity.cat) + rp + vag_disch, data = daisy2) anova(lm.wpc, lm.wpc3) Analysis of Variance Table Model 1: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch Model 2: wpc ~ as.factor(parity.cat) + rp + vag_disch Res.Df RSS Df Sum of Sq F Pr(>F) 1 1529 4050745 2 1531 4076138 -2 -25394 4.7925 0.008416 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Hi Denis, please I want to find out whether it was necessary including the square of the milk.120 in the multiple regressing analysis. I thought we could just used only the milk.120 as a dependent variable without squaring it. Perhaps the squaring would have been necessary if the residuals had shown abnormal distribution. Please what do you think about this??

Hi Michael,

It’s not necessary to include it at first. But you might be interested to check if there’s a non-linear relationship between milk120 and waiting period. That was done in “one shot” to save place on the post. You can also see that the squared variable is significant and should be kept in the model.

Cheers