knitr::opts_chunk$set(echo = TRUE)

In this vignette we illustrate the use of selbal, an R package for selection of balances in microbiome compositional data. As described in Rivera-Pinto et al. 2018 Balances: a new perspective for microbiome analysis https://doi.org/10.1101/219386, selbal implements a forward-selection method for the identification of two groups of taxa whose relative abundance, or balance, is associated with the response variable of interest.

\

Installing and loading selbal

\ selbal package is available on GitHub and can be installed on your computer through the following instructions:

# Load "devtools" library (install it if needed)
  library(devtools)

# Install "selbal""
  install_github(repo = "UVic-omics/selbal")

# Load library
  library(selbal)

Once the package is installed, the user can access to all the functions and data sets included in selbal.

\

Data sets

\

The package includes three datasets: HIV, sCD14 and Crohn. The first two correspond to an HIV study (Noguera-Julian et al. 2016 http://dx.doi.org/10.1016/j.ebiom.2016.01.032). The third dataset corresponds to a Crohn's disease study (Gevers et al. 2014 http://dx.doi.org/10.1016/j.chom.2014.02.005).

These are their main characteristics:

 

Calling by their name, the user can get acces to each data set.

selbal.cv function

selbal.cv() is the function of selbal package that is intended to answer the following questions:

To run the function we need to specify two input objects:

Additional parameters of selbal.cv function are:

There are also additional sequndary parameters. To get more information run ?selbal.cv.    

HIV INFECTION AND MICROBIOME

In this case we explore a possible association between HIV infection and microbiome composition. More precisely, we apply selbal.cv() with the goal of identifying a microbiome balance between two groups of taxa associated to the HIV Status. According to Noguera-Julian et al. (2016) MSM factor is a possible confounder in HIV microbiome studies. Thus, we will add MSM in the analysis through the covar parameter.

################################################################################
# Microbiomne and HIV: using a dichotomous response
################################################################################

# Define x, y and z
  x <- HIV[,1:60]
  y <- HIV[,62]
  z <- data.frame(MSM = HIV[,61])

# Run selbal.cv function (with the default values for zero.rep and opt.cri)
  CV.BAL.dic <- selbal.cv(x = x, y = y, n.fold = 5, n.iter = 10,
                          covar = z, logit.acc = "AUC")

 

selbal.cv function returns a list with six elements:

CV.BAL.dic$accuracy.nvar

 

In this example, the optimal number of variables is two: the smallest number of variables with an accuracy within the minimum accuray plus one standard error.

 

CV.BAL.dic$var.barplot

 

The color represents if the variables have been included in the numerator (red) for the denominator(blue) of the balance. In this case f_Ruminococcaceae_g_Incertae_Sedis is the most frequent taxon in the CV-process, included in about 70% of cv balances.

 

grid.draw(CV.BAL.dic$global.plot)

 

The balance identified as the most associated with HIV-status is given by $X_+$, a taxon of the family Erysipelotrichaceae and unknown genus and $X_-$,a taxon of the family Ruminococcaceae and unknown genus (top-left of the figure). Higher balance scores are associated to HIV-positive samples, that is, larger relative abundances of Erysipelotrichaceae with respect to Ruminococcaceae (botton-left of the figure). The score associated to HIV-negative samples is lower compared to HIV-positive individuals, whose balance is more heterogeneous. The apparent discrimination accuracy of this balance (AUC of 0.786, top-right of the figure) is moderate and it is reduced when it is estimated in the CV-process (cv-AUC=0.674). See the fifth element of the output for more details.

 

plot.tab(CV.BAL.dic$cv.tab)

The Global balance is the one most frequently selected in the CV (44%), so it seems to be a robust selection for this dataset. The three most repeated balances include f_Ruminococcaceae_g_Incertae_Sedis (the most frequent one) as the variable in the denominator.

 

CV.BAL.dic$cv.accuracy; summary(CV.BAL.dic$cv.accuracy)

The mean of these values provides the cross-validation accuracy. In this case, the cross-validation AUC is 0.674, which, as expected, is lower than the appparent AUC value measured on the whole dataset (0.786).

 

This table is useful in order to compute the balance score for another different dataset through bal.value function. For example, given this table (tab) and a matrix corresponding of microbiome information for other samples (mat, log-transformed count matrix including at least all the variables presented in tab), we can run bal.value(bal.tab = tab, x = mat) and we will obtain for each sample a value corresponding to the balance defined in tab.

CV.BAL.dic$global.balance
CV.BAL.dic$glm
CV.BAL.dic$opt.nvar

 

HIV INFECTION AS AN INFLAMMATORY DISEASE

In this second example we deal with a continuous response. Previous studies have proved association between inflammatory diseases and the gut microbiome composition; so, it is expected such a relationship between the gut microbiome and an inflammatory disease like the HIV infection. We will explore the association between the gut microbiome and the inflammation parameter sCD14.

For this second example, the input is going to be defined as:

We are going to use the default values for the rest of the parameters:

# Define x, y and z
  x2 <- sCD14[,1:60]
  y2 <- sCD14[,61]

# Run selbal.cv function
  BAL.sCD14 <- selbal.cv(x = x2, y = y2, n.fold = 5, n.iter = 10,
                          covar = NULL)

 

With the same format as the results of the previous example, now for sCD14 we have that:

 

 

BAL.sCD14$accruacy.nvar

 

 

BAL.sCD14$var.barplot

 

$X_+$ = { g_Subdoligranulum, f_Lachnospiraceae_g_Incertae_Sedis}

$X_-$ = { f_Lachnospiraceae_g_unclassified, g_Collinsella}

It is important to highlight that the four variables defining the balance are the most presented ones in the CV (see the previous Figure). Additionally, higher scores for this balance are linked with higher values of sCD14. An R-squared value of 0.281 is obtained for the regression model.

grid.draw(BAL.sCD14$global.plot)

 

plot.tab(BAL.sCD14$cv.tab)

 

BAL.sCD14$cv.accuracy; summary(BAL.sCD14$cv.accuracy)

 

BAL.sCD14$global.balance
BAL.sCD14$glm
BAL.sCD14$opt.nvar

   

Crohn's disease

 

The third example is implemented for a Crohn's disease microbiome data set. It contains almost thousand patients and will allow to evaluate the robustness of the cv process with quite large folds. The aim of this analysis is to find two groups of taxa whose abundance balance is associated to disease status.

So, the fucntion will be run with the following objects as input:

To run selbal.cv function for this example:

  # Load data sets
    x3 <- Crohn[,1:48]
    y3 <- Crohn[,49]

  # Run selbal.cv function
    BAL.Crohn <- selbal.cv(x = x3, y = y3, n.fold = 5, n.iter = 10,
                           covar = NULL, logit.acc = "AUC")

 

The results for this data set reveal:

 

BAL.Crohn$accuracy.nvar
BAL.Crohn$var.barplot
grid.draw(BAL.Crohn$global.plot)
plot.tab(BAL.Crohn$cv.tab)
BAL.Crohn$cv.accuracy ; summary(BAL.Crohn$cv.accuracy)
BAL.Crohn$global.balance
BAL.Crohn$glm
BAL.Crohn$opt.nvar


malucalle/selbal documentation built on May 31, 2024, 2:36 p.m.