In malucalle/selbal: Finding highly - associated balances with the response variable.

knitr::opts_chunk$set(echo = TRUE)

In this vignette we illustrate the use of selbal, an R package for selection of balances in microbiome compositional data. As described in Rivera-Pinto et al. 2018 Balances: a new perspective for microbiome analysis https://doi.org/10.1101/219386, selbal implements a forward-selection method for the identification of two groups of taxa whose relative abundance, or balance, is associated with the response variable of interest.

Installing and loading `selbal`

\ selbal package is available on GitHub and can be installed on your computer through the following instructions:

# Load "devtools" library (install it if needed)
  library(devtools)

# Install "selbal""
  install_github(repo = "UVic-omics/selbal")

# Load library
  library(selbal)

Once the package is installed, the user can access to all the functions and data sets included in selbal.

Data sets

The package includes three datasets: HIV, sCD14 and Crohn. The first two correspond to an HIV study (Noguera-Julian et al. 2016 http://dx.doi.org/10.1016/j.ebiom.2016.01.032). The third dataset corresponds to a Crohn's disease study (Gevers et al. 2014 http://dx.doi.org/10.1016/j.chom.2014.02.005).

These are their main characteristics:

HIV is a data.frame with 155 rows (samples) and 62 columns (variables). The first sixty variables provide the number of counts for microbial taxa at genus taxonomy rank. The last two are:
- MSM: an HIV risk factor, Men who has sex with men (MSM) or not (nonMSM).
- HIV_Status: a factor indicating if the individual is HIV1 positive (Pos) or not (Neg).
sCD14 is a data.frame with 151 rows (a subset of individuals from HIV) and 61 columns. The first sixty columns again correspond to microbiome information while the last one is a numeric value for the amount of the inflammation marker sCD14.
Crohn is a data.frame with 975 rows (samples) and 49 columns (variables). 662 samples are cases diagnosed with Crohn's disease and 313 are controls. The first forty-eight columns conform the microbiome information at genus level and the last one is disease status: Crohn's disease (CD) or control (no).

Calling by their name, the user can get acces to each data set.

`selbal.cv` function

selbal.cv() is the function of selbal package that is intended to answer the following questions:

Which is the optimal number of variables to include in the balance?
Which are the taxa whose balance is more associated to the response variable?
Is the proposed balance robust?

To run the function we need to specify two input objects:

x is a matrix with the microbiome information. It represents the number of counts or reads for each sample (row) and each taxon (column).
y is a vector with the response variable. It should be specified as a factor if the response variable is dichotomous and numeric if it is continuous.

Additional parameters of selbal.cv function are:

n.fold : a numeric value indicating the number of folds in the cross - validation procedure. When the response variable is dichotomous the cross-validation is performed so that the total proportion of cases and controls is preserved in each fold. Default n.fold = 5.
n.iter: a numeric value indicating the number of iterations to implement; that is, the number of times that the cross - validation process is repeated. Default n.iter = 10.
logit.acc: for dichotomous responses, the method to compute the association parameter. Default it is the AUC- ROC value (logit.acc = "auc"), but other alternatives are also included (see the article to obtain more information); logit.acc = "rsq" or logit.acc = "tjur".
maxV: the maximum number of variables to be included in a balance. Default maxV = 20.
zero.rep : it defines the method used for the zero-replacement. If it is "bayes" the Bayesian - Multiplicative treatment implemented in {zCompositions} is applied. If it is "one", a pseudocount of 1 is added to the whole matrix.
opt.cri parameter for selecting the method to determine the optimal number of variables. "max" to define this number as the number of variables which maximizes the association value or "1se" to take also the standard error into account. Default opt.cri = "1se".

There are also additional sequndary parameters. To get more information run ?selbal.cv.

HIV INFECTION AND MICROBIOME

In this case we explore a possible association between HIV infection and microbiome composition. More precisely, we apply selbal.cv() with the goal of identifying a microbiome balance between two groups of taxa associated to the HIV Status. According to Noguera-Julian et al. (2016) MSM factor is a possible confounder in HIV microbiome studies. Thus, we will add MSM in the analysis through the covar parameter.

################################################################################
# Microbiomne and HIV: using a dichotomous response
################################################################################

# Define x, y and z
  x <- HIV[,1:60]
  y <- HIV[,62]
  z <- data.frame(MSM = HIV[,61])

# Run selbal.cv function (with the default values for zero.rep and opt.cri)
  CV.BAL.dic <- selbal.cv(x = x, y = y, n.fold = 5, n.iter = 10,
                          covar = z, logit.acc = "AUC")

selbal.cv function returns a list with six elements:

$accuracy.nvar: the first element of the list is an image representing the distribution of the accuracy measure as a function of the number of variables included in the balance. This values are from 2 to maxV where the red dot represents mean accuracy and the branches the standard errors.

CV.BAL.dic$accuracy.nvar

In this example, the optimal number of variables is two: the smallest number of variables with an accuracy within the minimum accuray plus one standard error.

$var.barplot: the second object in the list is a barplot representing the frequency of the variables selected in some step of the CV process (considering only the balances with at most the optimal number of variables defined in the previous step).

CV.BAL.dic$var.barplot

The color represents if the variables have been included in the numerator (red) for the denominator(blue) of the balance. In this case f_Ruminococcaceae_g_Incertae_Sedis is the most frequent taxon in the CV-process, included in about 70% of cv balances.

$global.plot: the third element includes a graphical representation of the Global balance; the balance obtained with the whole dataset.

grid.draw(CV.BAL.dic$global.plot)

The balance identified as the most associated with HIV-status is given by $X_+$, a taxon of the family Erysipelotrichaceae and unknown genus and $X_-$,a taxon of the family Ruminococcaceae and unknown genus (top-left of the figure). Higher balance scores are associated to HIV-positive samples, that is, larger relative abundances of Erysipelotrichaceae with respect to Ruminococcaceae (botton-left of the figure). The score associated to HIV-negative samples is lower compared to HIV-positive individuals, whose balance is more heterogeneous. The apparent discrimination accuracy of this balance (AUC of 0.786, top-right of the figure) is moderate and it is reduced when it is estimated in the CV-process (cv-AUC=0.674). See the fifth element of the output for more details.

$cv.tab: the fourth element of the list is a table with the summary of the CV-procedure. Rows represent the variables included either in the Global balance (second column) or the three most frequent balances in the CV process (last three columns). The first column provides the percentage of times each variable has appeared in CV and, the last row points the proportion of times the most repeated balances have appeared.

plot.tab(CV.BAL.dic$cv.tab)

The Global balance is the one most frequently selected in the CV (44%), so it seems to be a robust selection for this dataset. The three most repeated balances include f_Ruminococcaceae_g_Incertae_Sedis (the most frequent one) as the variable in the denominator.

$cv.accuracy: The fifth element in the output list is a vector with the test prediction or classification accuracy (AUC in this case) for each cv iteration (length equal to n.fold*n.iter).

CV.BAL.dic$cv.accuracy; summary(CV.BAL.dic$cv.accuracy)

The mean of these values provides the cross-validation accuracy. In this case, the cross-validation AUC is 0.674, which, as expected, is lower than the appparent AUC value measured on the whole dataset (0.786).

$global.balance: the sixth element of the output is a data.frame with the variables included in the Global Balance. The first column, Taxa, provides the names of the selected taxa while variable Group specifies wether the taxon is included in the numerator (NUM) or the denominator (DEN) of the balance.

This table is useful in order to compute the balance score for another different dataset through bal.value function. For example, given this table (tab) and a matrix corresponding of microbiome information for other samples (mat, log-transformed count matrix including at least all the variables presented in tab), we can run bal.value(bal.tab = tab, x = mat) and we will obtain for each sample a value corresponding to the balance defined in tab.

CV.BAL.dic$global.balance

$glm: the seventh element of the list is the result of a regression model. Balance selection is implemented through a regression model where the balance itself and additional covariates are the explanatory variables, and the variable of interest the response. The user has access to the regression model in order to obtain and analyze any of its characteristics.

CV.BAL.dic$glm

$opt.nvar: the last element represents the optinal number of variables considered in the balance as the optimal.

CV.BAL.dic$opt.nvar

HIV INFECTION AS AN INFLAMMATORY DISEASE

In this second example we deal with a continuous response. Previous studies have proved association between inflammatory diseases and the gut microbiome composition; so, it is expected such a relationship between the gut microbiome and an inflammatory disease like the HIV infection. We will explore the association between the gut microbiome and the inflammation parameter sCD14.

For this second example, the input is going to be defined as:

x: the matrix with microbiome information, that is, the first sixty columns of sCD14.
y: the response variable, in this case the sCD14 value.

We are going to use the default values for the rest of the parameters:

# Define x, y and z
  x2 <- sCD14[,1:60]
  y2 <- sCD14[,61]

# Run selbal.cv function
  BAL.sCD14 <- selbal.cv(x = x2, y = y2, n.fold = 5, n.iter = 10,
                          covar = NULL)

With the same format as the results of the previous example, now for sCD14 we have that:

$accuracy.nvar: even though the minimum mean MSE is obtained for eight variables, the lowest value under its mean plus the SE is four; so four is the optimal number of variables for this dataset.

BAL.sCD14$accruacy.nvar

$var.barplot: f_Lachinospiraceae_g_unclassified is the genus appearing most times in the cross - validation procedure. There are also three additional genera with a presence above fifty percent.

BAL.sCD14$var.barplot

$global.plot: the four genera defining the balance are divided into:

$X_+$ = { g_Subdoligranulum, f_Lachnospiraceae_g_Incertae_Sedis}

$X_-$ = { f_Lachnospiraceae_g_unclassified, g_Collinsella}

It is important to highlight that the four variables defining the balance are the most presented ones in the CV (see the previous Figure). Additionally, higher scores for this balance are linked with higher values of sCD14. An R-squared value of 0.281 is obtained for the regression model.

grid.draw(BAL.sCD14$global.plot)

$cv.tab: the table indicates that the Global Balance appears a 34% of times in the CV-procedure being the most presented one. The second and the third balances are also defined by three out of four of the variables included in the Global Balance, which reveals robustness for the proposal.

plot.tab(BAL.sCD14$cv.tab)

$cv.accuracy: the MSE values for cross - validation procedure are highly variable.

BAL.sCD14$cv.accuracy; summary(BAL.sCD14$cv.accuracy)

$global.balance: the sixth element of the output is the data.frame with the four variables selected in the balance.

BAL.sCD14$global.balance

$glm: it shows the result of the regression model defined for the balance selection.

BAL.sCD14$glm

$opt.nvar: the last element indicates the optimal number of variables in the balance: four in this case.

BAL.sCD14$opt.nvar

Crohn's disease

The third example is implemented for a Crohn's disease microbiome data set. It contains almost thousand patients and will allow to evaluate the robustness of the cv process with quite large folds. The aim of this analysis is to find two groups of taxa whose abundance balance is associated to disease status.

So, the fucntion will be run with the following objects as input:

x: the matrix with microbiome information, that is, the first 48 columns of Crohn table.
y: the response variable is a case-control indicator.

To run selbal.cv function for this example:

  # Load data sets
    x3 <- Crohn[,1:48]
    y3 <- Crohn[,49]

  # Run selbal.cv function
    BAL.Crohn <- selbal.cv(x = x3, y = y3, n.fold = 5, n.iter = 10,
                           covar = NULL, logit.acc = "AUC")

The results for this data set reveal:

$accuracy.nvar: the number of taxa with the minimum mean MSE value is fourteen. Nevertheless, under its MSE + SE value, twelve is the number of taxa presenting the least MSE, so the considered balance will have twelve microbial taxa.

BAL.Crohn$accuracy.nvar

$var.barplot: the barplot of the most selected taxa in the cross - validation shows a core of twelve genera appearing at least in 70% of the cv balances. As it is going to be shown later, they are the taxa defining the Global Balance.

BAL.Crohn$var.barplot

$global.plot: the Global Balance plot shows the twelve taxa whose balance most discriminates between cases and controls. Both, the boxplot and the density curve reveal that cases with Crohn's disease have lower balance scores than controls, meaning lower relative abundances of those taxa in the numerator with respect to those in the denominator. The discrimination value (AUC-ROC = 0.838) is very good as well as the cross - validation accuracy (cv-AUC=0.819).

grid.draw(BAL.Crohn$global.plot)

$cv.tab: the most repeated balance in the cross - validation is the Global Balance. This fact together with the core of taxa appearing in the three most repeated balances, reveal a robust structure linked with Crohn's disease in this data set.

plot.tab(BAL.Crohn$cv.tab)

$cv.accuracy: the mean AUC for the test datasets is very close to the AUC for the global balance; so , this is another fact that emphasized the robustness of the proposed balance.

BAL.Crohn$cv.accuracy ; summary(BAL.Crohn$cv.accuracy)

$global.balance: the sixth element of the output is a data.frame with the selected taxa.

BAL.Crohn$global.balance

$glm: the seventh element corresponds to the regression model where the balance and the covariates are the explanatory variables modelling the factor CD or no.

BAL.Crohn$glm

$opt.nvar: the last element indicates that in this case, twelve taxa define the proposed balance.

BAL.Crohn$opt.nvar

malucalle/selbal documentation built on Feb. 28, 2025, 4:53 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

malucalle/selbal
Finding highly - associated balances with the response variable.

In malucalle/selbal: Finding highly - associated balances with the response variable.

Installing and loading `selbal`

Data sets

`selbal.cv` function

HIV INFECTION AND MICROBIOME

HIV INFECTION AS AN INFLAMMATORY DISEASE

Crohn's disease

R Package Documentation

Browse R Packages

We want your feedback!

malucalle/selbal Finding highly - associated balances with the response variable.

In malucalle/selbal: Finding highly - associated balances with the response variable.

Installing and loading selbal

Data sets

selbal.cv function

HIV INFECTION AND MICROBIOME

HIV INFECTION AS AN INFLAMMATORY DISEASE

Crohn's disease

R Package Documentation

Browse R Packages

We want your feedback!

malucalle/selbal
Finding highly - associated balances with the response variable.

Installing and loading `selbal`

`selbal.cv` function