knitr::opts_chunk$set(echo = TRUE)
In this vignette we illustrate the use of selbal
, an R package for selection of balances in microbiome compositional data. As described in Rivera-Pinto et al. 2018 Balances: a new perspective for microbiome analysis https://doi.org/10.1101/219386, selbal
implements a forward-selection method for the identification of two groups of taxa whose relative abundance, or balance, is associated with the response variable of interest.
\
selbal
\
selbal
package is available on GitHub and can be installed on your computer
through the following instructions:
# Load "devtools" library (install it if needed) library(devtools) # Install "selbal"" install_github(repo = "UVic-omics/selbal") # Load library library(selbal)
Once the package is installed, the user can access to all the functions and
data sets included in selbal
.
\
\
The package includes three datasets: HIV
, sCD14
and Crohn
.
The first two correspond to an HIV study (Noguera-Julian et al. 2016 http://dx.doi.org/10.1016/j.ebiom.2016.01.032). The third dataset corresponds to a Crohn's disease study (Gevers et al. 2014 http://dx.doi.org/10.1016/j.chom.2014.02.005).
These are their main characteristics:
HIV
is a data.frame
with 155 rows (samples) and 62 columns
(variables). The first sixty variables provide the number of counts for microbial taxa
at genus taxonomy rank. The last two are:
MSM
: an HIV risk factor, Men who has sex with men (MSM)
or not (nonMSM).HIV_Status
: a factor indicating if the individual is HIV1 positive (Pos)
or not (Neg).sCD14
is a data.frame
with 151 rows (a subset of individuals from HIV
) and 61 columns. The first sixty columns again correspond to microbiome information while the last one is a numeric value for the amount of the inflammation marker sCD14
.
Crohn
is a data.frame
with 975 rows (samples) and 49 columns (variables). 662 samples are cases diagnosed with Crohn's disease and 313 are controls. The first forty-eight columns conform the microbiome information at genus level and the last one is disease status: Crohn's disease (CD) or control (no).
Calling by their name, the user can get acces to each data set.
selbal.cv
functionselbal.cv()
is the function of selbal
package that is intended to answer the following questions:
To run the function we need to specify two input objects:
x
is a matrix with the microbiome information. It represents the number
of counts or reads for each sample (row) and each taxon (column).y
is a vector with the response variable. It should be specified as a factor
if the
response variable is dichotomous and numeric
if it is continuous.Additional parameters of selbal.cv
function are:
n.fold
: a numeric value indicating the number of folds in the
cross - validation procedure. When the response variable is dichotomous the cross-validation is performed so that the total proportion of cases and controls is preserved in each fold. Default n.fold = 5
.n.iter
: a numeric value indicating the number of iterations to implement; that is, the number of times that the cross - validation process is repeated. Default n.iter = 10
.logit.acc
: for dichotomous responses, the method to compute the association parameter. Default it is the AUC- ROC value (logit.acc = "auc"
), but other alternatives are also included (see the article to obtain more information); logit.acc = "rsq"
or logit.acc = "tjur"
.maxV
: the maximum number of variables to be included in a balance. Default maxV = 20
.zero.rep
: it defines the method used for the zero-replacement. If it is "bayes"
the Bayesian - Multiplicative treatment implemented in {zCompositions}
is applied. If it is "one"
, a pseudocount of 1 is added to the whole matrix.opt.cri
parameter for selecting the method to determine the optimal number of variables. "max"
to define this number as the number of variables which maximizes the association value or "1se"
to take also the standard error into account. Default opt.cri = "1se"
.There are also additional sequndary parameters. To get more information run ?selbal.cv
.
In this case we explore a possible association between HIV infection and microbiome composition. More precisely, we apply selbal.cv()
with the goal of identifying a microbiome balance between two groups of taxa associated to the HIV Status. According to Noguera-Julian et al. (2016) MSM factor is a possible confounder in HIV microbiome studies. Thus, we will add MSM in the analysis through the covar
parameter.
################################################################################ # Microbiomne and HIV: using a dichotomous response ################################################################################ # Define x, y and z x <- HIV[,1:60] y <- HIV[,62] z <- data.frame(MSM = HIV[,61]) # Run selbal.cv function (with the default values for zero.rep and opt.cri) CV.BAL.dic <- selbal.cv(x = x, y = y, n.fold = 5, n.iter = 10, covar = z, logit.acc = "AUC")
selbal.cv
function returns a list with six elements:
maxV
where the red dot represents mean accuracy and the branches the standard errors.CV.BAL.dic$accuracy.nvar
In this example, the optimal number of variables is two: the smallest number of variables with an accuracy within the minimum accuray plus one standard error.
CV.BAL.dic$var.barplot
The color represents if the variables have been included in the numerator (red) for the denominator(blue) of the balance. In this case f_Ruminococcaceae_g_Incertae_Sedis is the most frequent taxon in the CV-process, included in about 70% of cv balances.
grid.draw(CV.BAL.dic$global.plot)
The balance identified as the most associated with HIV-status is given by $X_+$, a taxon of the family Erysipelotrichaceae and unknown genus and $X_-$,a taxon of the family Ruminococcaceae and unknown genus (top-left of the figure). Higher balance scores are associated to HIV-positive samples, that is, larger relative abundances of Erysipelotrichaceae with respect to Ruminococcaceae (botton-left of the figure). The score associated to HIV-negative samples is lower compared to HIV-positive individuals, whose balance is more heterogeneous. The apparent discrimination accuracy of this balance (AUC of 0.786, top-right of the figure) is moderate and it is reduced when it is estimated in the CV-process (cv-AUC=0.674). See the fifth element of the output for more details.
plot.tab(CV.BAL.dic$cv.tab)
The Global balance is the one most frequently selected in the CV (44%), so it seems to be a robust selection for this dataset. The three most repeated balances include f_Ruminococcaceae_g_Incertae_Sedis (the most frequent one) as the variable in the denominator.
n.fold*n.iter
). CV.BAL.dic$cv.accuracy; summary(CV.BAL.dic$cv.accuracy)
The mean of these values provides the cross-validation accuracy. In this case, the cross-validation AUC is 0.674, which, as expected, is lower than the appparent AUC value measured on the whole dataset (0.786).
data.frame
with the variables included in the Global Balance.
The first column, Taxa
, provides the names of the selected taxa while variable Group
specifies wether the taxon is included in the numerator (NUM) or the denominator (DEN) of the balance.This table is useful in order to compute the balance score for another different dataset through bal.value
function. For example, given this table (tab
) and a matrix corresponding of microbiome information for other samples (mat
, log-transformed count matrix including at least all the variables presented in tab
), we can run bal.value(bal.tab = tab, x = mat)
and we will obtain for each sample a value corresponding to the balance defined in tab
.
CV.BAL.dic$global.balance
CV.BAL.dic$glm
CV.BAL.dic$opt.nvar
In this second example we deal with a continuous response. Previous studies have proved association between inflammatory diseases and the gut microbiome composition; so, it is expected such a relationship between the gut microbiome and an inflammatory disease like the HIV infection. We will explore the association between the gut microbiome and the inflammation parameter sCD14.
For this second example, the input is going to be defined as:
x
: the matrix with microbiome information, that is, the first sixty columns
of sCD14
.y
: the response variable, in this case the sCD14 value.We are going to use the default values for the rest of the parameters:
# Define x, y and z x2 <- sCD14[,1:60] y2 <- sCD14[,61] # Run selbal.cv function BAL.sCD14 <- selbal.cv(x = x2, y = y2, n.fold = 5, n.iter = 10, covar = NULL)
With the same format as the results of the previous example, now for sCD14 we have that:
BAL.sCD14$accruacy.nvar
BAL.sCD14$var.barplot
$X_+$ = { g_Subdoligranulum, f_Lachnospiraceae_g_Incertae_Sedis}
$X_-$ = { f_Lachnospiraceae_g_unclassified, g_Collinsella}
It is important to highlight that the four variables defining the balance are the most presented ones in the CV (see the previous Figure). Additionally, higher scores for this balance are linked with higher values of sCD14. An R-squared value of 0.281 is obtained for the regression model.
grid.draw(BAL.sCD14$global.plot)
plot.tab(BAL.sCD14$cv.tab)
BAL.sCD14$cv.accuracy; summary(BAL.sCD14$cv.accuracy)
data.frame
with the four variables selected in the balance.BAL.sCD14$global.balance
BAL.sCD14$glm
BAL.sCD14$opt.nvar
The third example is implemented for a Crohn's disease microbiome data set. It contains almost thousand patients and will allow to evaluate the robustness of the cv process with quite large folds. The aim of this analysis is to find two groups of taxa whose abundance balance is associated to disease status.
So, the fucntion will be run with the following objects as input:
x
: the matrix with microbiome information, that is, the first 48 columns of Crohn
table.y
: the response variable is a case-control indicator.To run selbal.cv
function for this example:
# Load data sets x3 <- Crohn[,1:48] y3 <- Crohn[,49] # Run selbal.cv function BAL.Crohn <- selbal.cv(x = x3, y = y3, n.fold = 5, n.iter = 10, covar = NULL, logit.acc = "AUC")
The results for this data set reveal:
BAL.Crohn$accuracy.nvar
BAL.Crohn$var.barplot
grid.draw(BAL.Crohn$global.plot)
plot.tab(BAL.Crohn$cv.tab)
BAL.Crohn$cv.accuracy ; summary(BAL.Crohn$cv.accuracy)
BAL.Crohn$global.balance
CD
or no
.BAL.Crohn$glm
BAL.Crohn$opt.nvar
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.