# VSOLassoBag

```{R, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  echo = TRUE,
  comment = "#>"
)
```

## 1. Introduction

VSOLassoBag is a variable-selection oriented Lasso bagging algorithm for biomarker development in omics-based translational research. It provides a bagging Lasso framework for selecting significant variables from multiple models. A main application of this package is to screen for a limited number of variables that are less dependent on the training data sets. The package was initially developed to adjust Lasso selection results obtained from bootstrapped sample sets: the variables selected with the highest frequency across those results are considered the stable variables for discriminating different sample sets. However, it is usually difficult to determine the frequency cut-off point on real data sets. This package therefore introduces several methods to help determine the cut-off point for variable selection, namely (1) curve elbow point detection and (2) a parametric statistical test. The source code and documentation are freely available on GitHub (https://github.com/likelet/VSOLassoBag).
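The idea of counting how often each variable is selected across bootstrapped Lasso fits can be sketched in base R. This is a minimal illustration, not the package's implementation: the coordinate-descent solver `lasso_cd`, the fixed penalty `lambda = 0.3`, and the simulated data are all assumptions made for this example.

```r
# Soft-thresholding operator used by the Lasso coordinate update.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Hypothetical minimal Lasso solver (cyclic coordinate descent) minimising
# (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1 on standardised columns.
lasso_cd <- function(x, y, lambda, n_iter = 50) {
  x <- scale(x)
  y <- y - mean(y)
  n <- nrow(x)
  beta <- numeric(ncol(x))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(x))) {
      partial_resid <- y - x[, -j, drop = FALSE] %*% beta[-j]
      z <- sum(x[, j] * partial_resid) / n
      beta[j] <- soft_threshold(z, lambda) / (sum(x[, j]^2) / n)
    }
  }
  beta
}

# Simulated data (made up): only V1 and V2 truly drive the outcome.
set.seed(1)
n <- 80
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)

# Bagging: refit on bootstrap resamples and count non-zero coefficients.
boot_n <- 50
counts <- setNames(numeric(p), colnames(x))
for (b in seq_len(boot_n)) {
  idx <- sample(n, replace = TRUE)
  beta <- lasso_cd(x[idx, ], y[idx], lambda = 0.3)
  counts <- counts + (beta != 0)
}

# Selection frequency per variable, highest first.
freq <- sort(counts / boot_n, decreasing = TRUE)
print(freq)
```

Variables that truly drive the outcome should sit at the top of `freq`, while noise variables are selected only occasionally. Where to draw the line between the two groups is the cut-off problem that the methods in this package address.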

### 1.1 Citation

If you find this tool useful, please cite:

Liang J, Wang C, Zhang D, et al. VSOLassoBag: A variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research [published online ahead of print, 2023 Jan 3]. J Genet Genomics. 2023;S1673-8527(22)00285-5. doi:10.1016/j.jgg.2022.12.005

## 2. Installation

VSOLassoBag runs on both Windows and most POSIX systems. To install the latest version of this package, start R and enter:

```{R, eval=FALSE}
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("likelet/VSOLassoBag")
```

## 3. Prepare input data
To perform an analysis with the `VSOLassoBag` function, you need to provide:

* **ExpressionData**: an object constructed by SummarizedExperiment that includes the independent variables (e.g., an expression matrix) and the dependent/outcome variable(s). Please see the [SummarizedExperiment documentation](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html) for how to construct SummarizedExperiment objects.

* **character**: a string indicating the type of the dependent variable(s), passed to the `a.family` parameter.

**Note**: 

* `a.family` is determined entirely by the type of `dependent/outcome variable(s)`.

* You can also tune other parameters to better balance performance against the resources required for the analysis. Key parameters include `bootN` for the number of bagging iterations and `bagFreq.sigMethod` for the cut-off point decision method.
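The inputs described above can be assembled along the following lines. This is a sketch under assumptions: the toy matrix, the sample names, and the outcome column name `y` are made up for illustration, and the construction step only runs when the SummarizedExperiment package is installed.

```r
# Toy data (made up for illustration): 5 variables (rows) x 4 samples (columns).
set.seed(42)
expr <- matrix(rnorm(20), nrow = 5, ncol = 4,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))
y <- rnorm(4)  # one outcome value per sample, i.e. per column of `expr`

# Basic input checks: numeric entries and one outcome per sample.
stopifnot(is.numeric(expr), length(y) == ncol(expr))

if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(expr = expr),
    colData = S4Vectors::DataFrame(y = y, row.names = colnames(expr))
  )
  # Variables stay in rows and samples in columns after construction.
  print(dim(SummarizedExperiment::assay(se)))
}
```

An object built this way would presumably be passed as `ExpressionData`, with `outcomevariable = "y"` naming the outcome column, mirroring the calls in the usage examples.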


### 3.1 Independent variables
`Independent variables` is a matrix-like object with variables as rows and observations as columns; all entries should be numeric. It can be obtained with the `SummarizedExperiment::assay` function.

**Example `Independent variables` matrix**

```{R, echo=FALSE, comment=''}
  library(VSOLassoBag)
  data(ExpressionData)
  SummarizedExperiment::assay(ExpressionData)[1:6, 1:6]
```

### 3.2 Dependent/outcome variable(s)

`dependent/outcome variable(s)` store sample information, with one row per sample in the same order as the samples in the independent variables matrix, and can be retrieved with the `SummarizedExperiment::colData` function.

**Example `out.mat` vector**

```{R, echo=FALSE, comment=''}
  library(VSOLassoBag)
  data(ExpressionData)
  SummarizedExperiment::colData(ExpressionData)$y[1:6]
```

### 3.3 a.family
`a.family` is a character string determined by the data type of `out.mat`. Supported values include **binomial**, **gaussian**, **cox**, **multinomial**, **mgaussian**, and **poisson**.
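As a rough guide to matching an outcome to one of these values, the helper below encodes some common rules of thumb. `suggest_family` is hypothetical and not part of VSOLassoBag; its heuristics (for example, treating a matrix with `time` and `status` columns as a survival outcome) are assumptions, so always confirm the choice against your study design.

```r
# Hypothetical helper (not part of VSOLassoBag): suggest an `a.family` value
# from the shape and type of the outcome. Heuristic only.
suggest_family <- function(y) {
  if (is.matrix(y) || is.data.frame(y)) {
    # Assumption: survival outcomes carry "time" and "status" columns.
    if (all(c("time", "status") %in% colnames(y))) return("cox")
    return("mgaussian")  # several continuous responses
  }
  # Exactly two distinct values: two-class outcome.
  if (length(unique(y)) == 2) return("binomial")
  # Categorical outcome with more than two classes.
  if (is.factor(y) || is.character(y)) return("multinomial")
  # Non-negative whole numbers are treated as counts.
  if (is.numeric(y) && all(y >= 0) && all(y == round(y))) return("poisson")
  "gaussian"  # continuous outcome
}

suggest_family(c(1.2, -0.5, 3.1))
suggest_family(factor(c("a", "b", "c")))
suggest_family(cbind(time = c(5, 8), status = c(1, 0)))
```

The heuristic cannot tell a small-integer continuous outcome from a count, which is one reason the final choice should remain a deliberate decision rather than an automatic one.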


## 4. Results
`VSOLassoBag` returns a list with:

1. `results`: a data frame containing **variables** with selection frequency >= 1. If `bagFreq.sigMethod = "PST"`, the **P.value** and **P.adjust** of each variable are provided. If `bagFreq.sigMethod = "CEP"` or the elbow point indicators are used, elbow point(s) are marked with <strong>*</strong> and <strong>thresholds</strong> are assigned to each variable.

2. `permutations`: the permutation test results. 

3. `model`: the regression model built on LassoBag results.


## 5. Usage Example
For tutorial purposes, we present two examples that use different cut-off point decision methods to show how to interpret the results. The data are simulated from a Gaussian model and can be loaded with `data(ExpressionData)`.

### 5.1 Using "CEP" cut-off point decision method

"CEP" (i.e. **"Curve Elbow Point Detection"**) is the default and recommended method for cut-off point decision. This method assumes that a sharp decrease in the observed frequency separates important features from unimportant ones, and on that basis detects the elbow point(s) on the observed frequency curve. Finally, the features with an observed frequency higher than the elbow point are inferred to be important features.

**Note:**
There may be **more than one elbow point** detected on the curve when using a loose threshold. It is therefore recommended to start with a stricter threshold (a larger `kneedle.S`); the algorithm will automatically loosen the S parameter if no elbow point can be found.
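To make the elbow idea concrete, here is a simplified stand-in in base R. It is not the package's detection routine and has no sensitivity parameter such as `kneedle.S`: it simply picks the point on the normalised, sorted frequency curve that lies furthest below the straight line joining the curve's two ends. The function `find_elbow` and the frequencies are made up for illustration.

```r
# Hypothetical, simplified elbow finder: index of the largest gap between
# the chord joining the curve's end points and the curve itself.
find_elbow <- function(freq) {
  f <- sort(freq, decreasing = TRUE)
  n <- length(f)
  x <- (seq_len(n) - 1) / (n - 1)         # rank rescaled to [0, 1]
  y <- (f - min(f)) / (max(f) - min(f))   # frequency rescaled to [0, 1]
  gap <- (1 - x) - y                      # chord height minus curve height
  unname(which.max(gap))
}

# Made-up observed frequencies: three stable variables, then a sharp drop.
freq <- c(V1 = 1.00, V2 = 0.98, V3 = 0.95, V4 = 0.20, V5 = 0.15,
          V6 = 0.12, V7 = 0.10, V8 = 0.08, V9 = 0.05, V10 = 0.05)

sorted_freq <- sort(freq, decreasing = TRUE)
elbow <- find_elbow(freq)

# Keep the variables whose frequency is higher than at the elbow point.
selected <- names(sorted_freq)[sorted_freq > sorted_freq[elbow]]
print(elbow)
print(selected)
```

On this toy curve the elbow falls on the first variable after the sharp drop, so only the variables before it are retained.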



```{R, eval=FALSE}
  library(VSOLassoBag)
  data(ExpressionData)
  set.seed(19084)
  # This section demonstrates the "CEP" cut-off method
  VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y",
              bootN = 100, a.family = "gaussian",
              bagFreq.sigMethod = "CEP", do.plot = FALSE)
```

Results of important variables:

```{R, echo=FALSE, comment=''}
  library(VSOLassoBag)
  data(ExpressionData)
  set.seed(19084)
  # Run with the "CEP" method to match this section
  CEPResults <- VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y",
                            bootN = 100, a.family = "gaussian",
                            bagFreq.sigMethod = "CEP", do.plot = FALSE)
  print(CEPResults$results[1:14, ])
```

**CEP Observed Frequency Curve (generated by `do.plot = TRUE`):**

<center>
![](CEP.png){width=90%}
</center>

### 5.2 Using "PST" cut-off point decision method

"PST" (i.e. **"Parametric Statistical Test"**) is one of the alternative methods for cut-off point decision, and it is as fast and memory-efficient to compute as "CEP". It assumes that the expected selection frequency of all variables follows a binomial distribution, so we can first model this theoretical background distribution and then obtain the statistical significance (p-value) of each variable.
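The binomial reasoning can be sketched as follows. This is an illustration under assumptions, not the package's exact procedure: the selection counts are made up, the background selection probability `p0` is estimated here from the median count, and Benjamini-Hochberg is used for the adjustment. The package may estimate the background and adjust p-values differently.

```r
# Made-up selection counts out of boot_n bootstrap rounds.
boot_n <- 100
counts <- c(V1 = 100, V2 = 97, V3 = 12, V4 = 9, V5 = 7, V6 = 5, V7 = 4, V8 = 3)

# Assumption: the background per-round selection probability is estimated
# from the median count, which is robust to a few strongly selected variables.
p0 <- median(counts) / boot_n

# One-sided binomial test: P(X >= observed count) with X ~ Binomial(boot_n, p0).
p_value <- setNames(
  pbinom(counts - 1, size = boot_n, prob = p0, lower.tail = FALSE),
  names(counts)
)

# Multiple-testing adjustment (Benjamini-Hochberg chosen for illustration).
p_adjust <- p.adjust(p_value, method = "BH")

significant <- names(p_adjust)[p_adjust < 0.05]
print(round(p_adjust, 4))
print(significant)
```

Variables selected far more often than the background rate predicts receive very small p-values, whereas those near the background rate do not.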

```{R, eval=FALSE}
  library(VSOLassoBag)
  data(ExpressionData)
  set.seed(19084)
  # This section demonstrates the "PST" cut-off method
  VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y",
              bootN = 100, a.family = "gaussian",
              bagFreq.sigMethod = "PST", do.plot = TRUE)
```

Results of the PST method:

```{R, echo=FALSE, comment=''}
  library(VSOLassoBag)
  data(ExpressionData)
  set.seed(19084)
  # Run with the "PST" method to match this section
  PSTResults <- VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y",
                            bootN = 100, a.family = "gaussian",
                            bagFreq.sigMethod = "PST", do.plot = FALSE)
  print(PSTResults$results[1:14, ])
```

**PST Observed Frequency Distribution (generated by `do.plot = TRUE`):**

<center>
![](PST.png){width=90%}
</center>

## 6. Session Info
```{R, echo=FALSE, comment=''}
  sessionInfo()
```

© Copyright 2018-2022, Center of Bioinformatics, Sun Yat-sen University Cancer Center. Revision 322baf5b.



VSOLassoBag documentation built on March 31, 2023, 10:25 p.m.