SVA confounding analysis framework

library(limma)
library(svconfound)

svconfound: Searching for confounders the easy way

This package is aimed to help users to interactively study and determine the presence of confounders in their data. The idea is to make this process as easy as possible, and is based on the approach that Andrew Teschendorff uses in his book chapter [@Yang2015].

The idea of this vignette is to present a simple use case and show how the analyst can use the functions in the package to interrogate her data in order to learn more about possible confounders.

Load the example dataset

First of all, we are going to load a small dataset provided with the package:

data(smallmeth)

class(smallmeth)

dim(smallmeth)

This dataset represents DNA Methylation values from a set of microarrays. Not all the values are there, for the data has been previously pruned in order to keep the package size low. What we have is a 3000x25 matrix containing M-values in the range (-Inf, Inf), representing the methylation profiles of 25 samples over a subset of 3000 probes of the arrays.

A subset of the smallmeth dataset (5 probes x 3 samples) looks like this:

knitr::kable(smallmeth[1:5, 1:3])

We could not perform a proper confounder analysis without more information about our samples, or without knowing the research question we are trying to answer. Thus, we are going to need another dataset:

data(phenopart)

class(phenopart)

dim(phenopart)

This 25x6 data.frame contains some of the phenotype information available for the samples represented in our methylation values matrix. It does not contain all the information, but it can help us to demonstrate a very simple use case.

Let's take a look to the first 10 rows in phenopart:

knitr::kable(head(phenopart, 10))

SVD-based confounder analysis

The simplest method for confounder analysis in the svconfound package is based on Singular Value Decomposition (SVD) and Principal Components Analysis (PCA), and it is implemented in the svd_analysis() function. We can use the function on our data like this:

svd_result = svd_analysis(smallmeth, phenopart)

The svd_analysis() function estimates the principal components based on our data, and then tests each of the components against each of the phenotype variables, thus producing an association score which can help the researcher to determine which variables are most correlated with the greatest variations in the dataset.

We can generate a screeplot to see how much variance explains each component using the correspondent overloaded function:

screeplot(svd_result)

The darker vertical line in the plot shows the estimated number of significant components able to explain most of the variability present in the dataset. If the researcher does not like the default plot theme, she can access the data used for its generation just by accessing the variance_explained and limit_significant_PC fields of the resulting object.

knitr::kable(head(svd_result$variance_explained, 10))

svd_result$limit_significant_PC

We can generate the main confounder plot by using the overloaded plot() function:

plot(svd_result)

For each intersection of a principal component (PC-X) and a phenotype variable, a square is shown if there is a statistically significant relationship between them. The color of the square determines the significance level for their relationship.

In this case, we can see that the main source of variability (the first principal component) is highly correlated with our phenotype of interest (Sample_Group), which is something good and expected in this case, as this data is extracted from a real dataset comparing normal and tumor tissues, where very high differences are expected.

Problem is, there are other variables associated with the principal components. A very interesting one, from a confounder analysis point of view, is Slide. This is a technical variable which usually affects those samples processed in batches in DNA Methylation analysis pipelines. This type of confounder can introduce noise in our analysis methods and should be taken into account in order to obtain more accurate results.

As in the screeplot example, the researcher can access the data used to generate the plot in the significance field.

knitr::kable(head(svd_result$significance, 10))

SVA-based confounder analysis

Another approach provided by the svconfound package is the one implemented in the sva_analysis() function. It uses Surrogate Variable Analysis (SVA) in order to obtain the main sources of variability in the residuals of a previously fitted model. Then, a similar approach to the svd_analysis() function is used to determine the degree of association between Surrogate Variables and phenotype variables.

Let's use the sva_analysis() function on our data:

sva_result = sva_analysis(smallmeth, phenopart, ~ Sample_Group)

We have used a formula to describe our hypothesis. We are using a simple model containing only the Sample_Group variable. Our intention is to model the possible sources of confounding by including the Surrogate Variables in our model at a later step. We can generate a graph similar to the previous one:

plot(sva_result)

The function has estimated that the number of Surrogate Variables is 7, and the association graph shows that, after fitting the model, there is still a lot of confounding associated to the Slide, Sex and Age variables. If there were no sources of confounding, we could go along with our simple model, but the presence of these confounders indicates that we could have problems if we went that way in this case.

As in previous functions, the researcher can access the underlying graph information using the significance field.

knitr::kable(head(sva_result$significance, 10))

Fitting a model

With all the information we have gathered, we are ready now to fit a model to each of the probes, adjusting for the possible confounders we have identified using the previous plots. First, we should create a design matrix:

design = model.matrix(~ Sample_Group, data = phenopart)

In order to correct for confounding, we are going to add the Surrogate Variables to our design matrix, including them as predictors in our model.

design_sv = cbind(design, sva_result$surrogates)

Now we are ready for model fitting. We are going to use limma for this, as it is a very common method in the context of microarray analyses.

fit = lmFit(smallmeth, design_sv)
efit = eBayes(fit)

Once the model is fitted, we can obtain the p-values and effect sizes for all the 3000 probes.

sig_table = topTable(
  efit, 
  coef = 'Sample_GroupTumor', 
  p.value = 1, 
  number = Inf
)

With this information, we can apply whatever criteria in order to select a subset of significant probes. For example, we are going to select those probes with an adjusted p-value below 0.01 and an absolute effect size greater than 1.4.

sig_probes = sig_table[
  sig_table$adj.P.Val < 0.01 & abs(sig_table$logFC) > 1.4, 
  ]

This gives us a total of r nrow(sig_probes) significant probes after adjusting for possible confounders. We can take a look at the 10 top rows of the list: