Factor Analysis in Epidemiology

This is a short (15-minute!) presentation given for the Canadian Veterinary Epidemiology Club (CVEC) on February 26, 2016, about new epi tools.

What is Factor Analysis?

Factor analysis is a group of statistical methods used to identify the structure of data with the help of latent (unobserved) variables. These latent variables, called factors, are identified by looking at clusters of correlated variables (the correlation between two variables arises from the similarity of their relationships with the latent variables).

In brief, we try to explain the correlation (covariance) among a large set of observed variables with a smaller set of unobserved latent variables, a.k.a. the factors. A factor therefore corresponds to a cluster of related observed variables and represents an underlying dimension of a construct that is as distinct as possible from the other factors.

There are two main types of factor analysis: exploratory (EFA) and confirmatory (CFA).

Exploratory Factor Analysis (EFA)

EFA describes each variable in a group as a linear combination of the underlying common factors.
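As a compact illustration of the "linear combination" idea, the common factor model is usually written as follows. The notation is a textbook convention and does not come from the presentation itself: p observed variables x_j, m common factors f_k, loadings \lambda_{jk}, and unique terms \varepsilon_j. The covariance identity assumes uncorrelated, unit-variance factors that are independent of the unique terms.

```latex
x_j = \lambda_{j1} f_1 + \lambda_{j2} f_2 + \dots + \lambda_{jm} f_m + \varepsilon_j,
\qquad j = 1, \dots, p, \quad m < p

% Matrix form, and the covariance structure it implies
\mathbf{x} = \boldsymbol{\Lambda}\,\mathbf{f} + \boldsymbol{\varepsilon},
\qquad
\operatorname{Cov}(\mathbf{x}) = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi},
\quad
\boldsymbol{\Psi} = \operatorname{diag}(\psi_1, \dots, \psi_p)
```

The off-diagonal entries of \Lambda\Lambda^T carry all of the covariance between variables, which is why two variables that load on the same factor end up correlated.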

Each observed variable is potentially a measure of every factor, and the goal is to determine which relationships between observed variables and factors are the strongest. We don’t know a priori how many factors are needed to explain the relationships among the variables.
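Since the number of factors is not known in advance, some rule is needed to choose it. One widely used heuristic, which the talk does not name and which is used here only as an assumption, is the Kaiser criterion: keep as many factors as there are correlation-matrix eigenvalues greater than 1. A minimal sketch on simulated data (the function names and the simulated loadings are hypothetical):

```python
import numpy as np


def simulate_two_factor_data(n=500, seed=0):
    """Simulate 6 observed variables driven by 2 latent factors (toy data)."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n, 2))
    # Variables 0-2 load on the first factor, variables 3-5 on the second.
    loadings = np.array([
        [0.9, 0.0],
        [0.8, 0.0],
        [0.7, 0.0],
        [0.0, 0.9],
        [0.0, 0.8],
        [0.0, 0.7],
    ])
    noise = rng.normal(scale=0.5, size=(n, 6))
    return factors @ loadings.T + noise


def kaiser_n_factors(data):
    """Count the eigenvalues of the correlation matrix that exceed 1."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted largest first
    return int(np.sum(eigenvalues > 1.0)), eigenvalues


data = simulate_two_factor_data()
n_factors, eigenvalues = kaiser_n_factors(data)
print(n_factors, np.round(eigenvalues, 2))
```

In practice this rule is usually combined with a scree plot or parallel analysis, because a single cut-off at 1 can over- or under-extract.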

Confirmatory Factor Analysis (CFA)

In CFA, we try to verify the ability of our theoretical model to explain the common variance between several variables with latent variables that we determined a priori. We have prior knowledge about the underlying structure of the construct under investigation. See Structural Equation Modelling (SEM).

Why would you want to use it?

  • When you are more interested in identifying the structure underlying a particular phenomenon than testing a specific hypothesis.
  • When you want to reduce the number of variables to a few factors explaining a large proportion of the original variables' variance (data reduction).
  • When you want to identify groups of highly correlated variables and choose one among them to be used in subsequent analyses.

What are the 2 main methods?

Principal Component Analysis (PCA)

“The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set” (Jolliffe 2002).

Objective: data reduction.
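To make the data-reduction objective concrete, here is a minimal PCA sketch using only NumPy. The simulated data (four strongly correlated variables plus one independent variable) and the 80% variance threshold are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np


def pca(data):
    """PCA on standardized data via singular value decomposition.

    Returns the component scores, the component directions (one column
    per component) and the proportion of variance each one explains.
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u * s
    return scores, vt.T, explained


rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
# Four noisy copies of the same signal, plus one unrelated variable.
correlated = [base + rng.normal(scale=0.3, size=(200, 1)) for _ in range(4)]
data = np.hstack(correlated + [rng.normal(size=(200, 1))])

scores, directions, explained = pca(data)
cumulative = np.cumsum(explained)
# Smallest number of components whose cumulative share reaches 80%.
n_keep = int(np.searchsorted(cumulative, 0.80) + 1)
print(np.round(explained, 3), n_keep)
```

Here five variables collapse to two components: one for the correlated block and one for the independent variable.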

Factor Analysis

In factor analysis, the goal is to explain the covariance among variables, i.e. the variance that a variable shares with at least one other variable and that is attributed to a common factor.

Objective: identifying data structure or causal modelling.
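The difference from PCA can be sketched with principal axis factoring, one of several extraction methods (its use here is an assumption, the talk does not say which method was applied). The diagonal of the correlation matrix is replaced by communality estimates, so only shared variance is decomposed. The function name and the one-factor toy correlation matrix are hypothetical.

```python
import numpy as np


def principal_axis_factoring(corr, n_factors, n_iter=100):
    """Iterated principal axis factoring on a correlation matrix."""
    # Initial communalities: squared multiple correlations.
    communalities = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    for _ in range(n_iter):
        reduced = corr.copy()
        # Keep only the shared variance on the diagonal.
        np.fill_diagonal(reduced, communalities)
        eigenvalues, eigenvectors = np.linalg.eigh(reduced)
        top = np.argsort(eigenvalues)[::-1][:n_factors]
        loadings = eigenvectors[:, top] * np.sqrt(
            np.clip(eigenvalues[top], 0.0, None)
        )
        communalities = np.sum(loadings**2, axis=1)
    return loadings, communalities


# Toy correlation matrix generated by a single factor with known loadings.
true_loadings = np.array([0.8, 0.7, 0.6, 0.5])
corr = np.outer(true_loadings, true_loadings)
np.fill_diagonal(corr, 1.0)

loadings, communalities = principal_axis_factoring(corr, n_factors=1)
first_pca_eigenvalue = np.linalg.eigvalsh(corr)[-1]
print(np.round(np.abs(loadings.ravel()), 3), np.round(communalities, 3))
```

The first PCA eigenvalue is larger than the sum of the communalities because PCA also absorbs variance that is unique to each variable, whereas factor analysis sets it aside.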


What are the assumptions?

  • Within a set of observed variables there is a set of underlying factors (latent variables) that can explain the relationships among variables. The observed variables are dependent upon these latent variables but are also subject to random variation,
  • No outliers,
  • From the Pearson correlation coefficient:
    • Large sample size (15-20 observations per variable),
    • Continuous distribution and sufficient variation within the variables,
    • Linear relationship among the correlated variables.
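Some of the points above can be screened quickly before running an analysis. The sketch below is only a rough check under stated assumptions: the 15-observations-per-variable threshold comes from the list above, while the univariate z-score cut-off of 4 for flagging outliers and the function name are arbitrary illustrative choices.

```python
import numpy as np


def check_basic_assumptions(data, min_ratio=15, z_cutoff=4.0):
    """Crude screening of sample size, variation and univariate outliers."""
    n_obs, n_vars = data.shape
    std = data.std(axis=0, ddof=1)
    # Avoid dividing by zero for variables that do not vary at all.
    safe_std = np.where(std > 0, std, 1.0)
    z = np.abs((data - data.mean(axis=0)) / safe_std)
    return {
        "obs_per_variable": n_obs / n_vars,
        "enough_observations": n_obs / n_vars >= min_ratio,
        "constant_variables": np.flatnonzero(std == 0).tolist(),
        "outlier_rows": np.flatnonzero((z > z_cutoff).any(axis=1)).tolist(),
    }


rng = np.random.default_rng(3)
data = rng.normal(size=(100, 5))
data[3, 2] = 50.0   # plant an obvious outlier
data[:, 4] = 1.0    # plant a variable with no variation

report = check_basic_assumptions(data)
print(report)
```

Linearity is not covered here. It is usually judged from scatter plots of pairs of variables.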


PCA for data reduction

See Prediction of bulk tank somatic cell count violations based on monthly individual cow somatic cell count data.

In this study, we started with ~30 variables that were potentially highly correlated. We first had to identify clusters of correlated variables to reduce the likelihood of overfitting the model. PCA was used to discard redundant variables.
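One possible way to turn PCA output into clusters of redundant variables is sketched below. It is not the procedure used in the study, only an illustration under simple assumptions: each variable is assigned to the retained component on which it loads most strongly, and the variable with the highest loading is kept as the representative of its cluster. The function name and the simulated data are hypothetical.

```python
import numpy as np


def cluster_by_component(data, n_components):
    """Group variables by their dominant principal component."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Loadings: correlations between variables and components.
    loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])
    assigned = np.argmax(np.abs(loadings), axis=1)
    clusters, representatives = {}, {}
    for comp in range(n_components):
        members = np.flatnonzero(assigned == comp)
        if members.size == 0:
            continue
        clusters[comp] = members.tolist()
        best = np.argmax(np.abs(loadings[members, comp]))
        representatives[comp] = int(members[best])
    return clusters, representatives


rng = np.random.default_rng(1)
n = 1000
base_a = rng.normal(size=(n, 1))
base_b = rng.normal(size=(n, 1))
# Two blocks of redundant variables: 4 noisy copies of A, 3 of B.
block_a = base_a + rng.normal(scale=0.3, size=(n, 4))
block_b = base_b + rng.normal(scale=0.3, size=(n, 3))
data = np.hstack([block_a, block_b])

clusters, representatives = cluster_by_component(data, n_components=2)
print(clusters, representatives)
```

With real data the components are rarely this clean, so a rotation or a correlation threshold is often needed before variables can be assigned unambiguously.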



We further used Lasso (least absolute shrinkage and selection operator) regression on the clusters identified by PCA.

Multiple Factor Analysis (MFA)

See Herd-Level Management Factors and Culling Rates in Québec Dairy Herds.

MFA is dedicated to datasets in which variables are structured into groups, so that several sets of variables can be studied simultaneously. Some variables in a dataset may be related to one another, and you might want to study the relationships between the groups of variables, not only between the individual variables.

In the above study, we wanted to see if we could identify groups of herds according to certain factors. Cow-level data were aggregated to provide herd-level information. These aggregated data formed groups of variables related to management, reproduction, milk production, and health.
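The core of MFA can be sketched in a few lines: each group of variables is standardized and divided by the first singular value of its own PCA, so that no single group dominates, and a global PCA is then run on the concatenated table. This is a minimal sketch for continuous variables only. The group names and the random data are hypothetical and unrelated to the study's data.

```python
import numpy as np


def standardize(x):
    """Center each column and scale it to unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)


def mfa(groups, n_components=2):
    """Minimal multiple factor analysis for groups of continuous variables."""
    n = groups[0].shape[0]
    weighted = []
    for g in groups:
        z = standardize(g) / np.sqrt(n)  # z.T @ z is the correlation matrix
        first_singular = np.linalg.svd(z, compute_uv=False)[0]
        # After this step the first eigenvalue of every group equals 1.
        weighted.append(z / first_singular)
    table = np.hstack(weighted)
    u, s, vt = np.linalg.svd(table, full_matrices=False)
    # Row (herd) scores, scaled so their variance equals the eigenvalue.
    scores = u[:, :n_components] * s[:n_components] * np.sqrt(n)
    return scores, s**2


rng = np.random.default_rng(7)
management = rng.normal(size=(60, 3))   # hypothetical herd-level variables
production = rng.normal(size=(60, 5))
scores, eigenvalues = mfa([management, production])
print(scores.shape, np.round(eigenvalues[:2], 3))
```

Because every group is rescaled to a first eigenvalue of 1, the first global eigenvalue lies between 1 and the number of groups. A value close to the number of groups suggests that the groups share a common main dimension.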


Unfortunately, there were no relationships between herds. However, you can continue the analysis and run a PCA on specific groups of variables.


Factor analysis

Here’s an example, on Q-methodology.



Factor analyses are additional tools that allow epidemiologists to carry out a data-driven exploration of their dataset, addressing the impact of exposures and their context.