knitr::opts_chunk$set(tidy = FALSE,message = FALSE)
library("BiocStyle")
BiocStyle::markdown()
suppressPackageStartupMessages(library("OmicsEV"))
suppressPackageStartupMessages(library("R.utils"))
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("kableExtra"))
suppressPackageStartupMessages(library("formattable"))

Introduction

High-throughput technologies such as RNA-Seq and mass spectrometry-based proteomics are increasingly being applied to large sample cohorts, which creates vast amounts of quantitative data for genes and proteins. Many algorithms, software tools, and pipelines have been developed to analyze these data. However, selecting the optimal algorithms, software, and parameters for a specific omics dataset remains a significant challenge. To address this challenge, we have developed an R package named OmicsEV, which is dedicated to comparing and evaluating different data matrices generated from the same omics dataset using different tools, algorithms, or parameter settings. OmicsEV implements more than 15 evaluation metrics, and all evaluation results are included in an HTML report for intuitive browsing. OmicsEV is easy to install and use: only one function is needed to perform the whole evaluation process. A GUI based on R shiny is also implemented.

Example data

A few examples can be downloaded at https://github.com/bzhanglab/OmicsEV. One of the examples contains six data matrices generated from the same RNA dataset using different normalization methods, together with a proteomics data matrix and a sample list. The steps for running this example are shown below.

Running OmicsEV

Preparing inputs

The two major input files are the omics data tables and a sample annotation file. More details can be found below.

Running evaluation process

In OmicsEV, only one function (run_omics_evaluation) is needed to perform the whole evaluation process. An example is shown below:

library(OmicsEV)
run_omics_evaluation(data_dir = "datasets/",
                     sample_list = "sample_list.tsv",
                     x2 = "protein.tsv",
                     cpu = 6,
                     data_type = "gene",
                     class_for_ml = "sample_ml.tsv")

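In general, only a few parameters have to be set. The two chunks below display an example input dataset and an example sample list shipped with the package: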

# Example input dataset shipped with the package
example_data <- read.delim(system.file("extdata/example_input_datasets.tsv",
                                       package = "OmicsEV"),
                           stringsAsFactors = FALSE)
kable(example_data, digits = 3, caption = "An example of input dataset") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

# Example sample list shipped with the package
example_data <- read.delim(system.file("extdata/example_sample_list.tsv",
                                       package = "OmicsEV"),
                           stringsAsFactors = FALSE)
kable(example_data, digits = 3, caption = "An example of sample list") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
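
As a rough illustration of the two inputs, the sketch below writes a toy data table and a toy sample list to a temporary folder. The column names (`ID`, `sample`, `class`, `batch`, `order`), the file names, and the values are assumptions made for this example and are not taken from the files shipped with the package; the example tables displayed above remain the reference for the exact layout.

```r
# Toy inputs for illustration only; column and file names are assumptions.
set.seed(1)
work_dir <- file.path(tempdir(), "omicsev_demo")
data_dir <- file.path(work_dir, "datasets")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

samples <- paste0("S", 1:6)

# Data table: one row per gene, one column per sample, IDs in the first column.
expr <- matrix(round(rexp(5 * 6, rate = 0.01), 2), nrow = 5,
               dimnames = list(NULL, samples))
data_table <- data.frame(ID = paste0("GENE", 1:5), expr,
                         stringsAsFactors = FALSE)
write.table(data_table, file.path(data_dir, "method_A.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Sample list: class, batch, and run order for every sample.
sample_list <- data.frame(sample = samples,
                          class  = rep(c("T", "C"), each = 3),
                          batch  = rep(1:2, times = 3),
                          order  = seq_along(samples),
                          stringsAsFactors = FALSE)
write.table(sample_list, file.path(work_dir, "sample_list.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Every sample column in the data table should appear in the sample list.
all(colnames(data_table)[-1] %in% sample_list$sample)
```

Writing each candidate data matrix as its own file in one folder mirrors the idea of comparing several matrices derived from the same dataset.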

All other parameters are optional. When the input data tables for parameter data_dir are protein expression data and users also have gene expression data for the same samples, they can set parameter x2 to a file containing the gene expression data in tsv format, and vice versa. If parameter x2 is not NULL, sample-wise and gene-wise correlation analyses will be performed. See ?run_omics_evaluation for a more in-depth description of all its arguments.
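
A correlation between two omics layers can only be computed on genes and samples that occur in both tables, so a quick overlap check before the run may be useful. The sketch below is not part of OmicsEV; the two toy tables and the `ID` column name are hypothetical stand-ins for a table in data_dir and the table given to x2.

```r
# Toy tables standing in for a data_dir table and the x2 table (hypothetical).
rna <- data.frame(ID = c("G1", "G2", "G3"), S1 = c(10, 20, 30),
                  S2 = c(12, 18, 33), S3 = c(9, 25, 28))
protein <- data.frame(ID = c("G2", "G3", "G4"), S1 = c(5, 7, 1),
                      S2 = c(4, 8, 2), S4 = c(6, 9, 3))

# Genes and samples present in both tables.
shared_genes   <- intersect(rna$ID, protein$ID)
shared_samples <- intersect(colnames(rna)[-1], colnames(protein)[-1])

shared_genes
shared_samples
```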

The parameter class_for_ml is also set in the example above. It specifies the class information used for sample class prediction and accepts either a sample list file or a character vector such as class_for_ml=c("T","C"). A sample list file must have the same format as the file given to the parameter "sample_list"; this option is useful when the class users want to predict differs from the one in that file. Because OmicsEV uses an R S3 data class object to store the data table and sample annotation data, the file must also contain batch and order information as a format requirement, even though batch and order are not used in class prediction. The file can be created from the file for parameter "sample_list" by updating only the class to what users want to predict. If users want to predict a class already present in the file for parameter "sample_list", a character vector specifying the class names is sufficient, such as class_for_ml=c("T","C"). If sample class prediction is not needed, leave the parameter class_for_ml unset.
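
Deriving a class file from an existing sample list can be sketched as follows. The column names (`sample`, `class`, `batch`, `order`) and the alternative labels are assumptions for illustration; only the idea of keeping everything except the class unchanged comes from the description above.

```r
# Hypothetical sample list; the column names are assumptions for illustration.
sample_list <- data.frame(sample = paste0("S", 1:6),
                          class  = rep(c("T", "C"), each = 3),
                          batch  = rep(1:2, times = 3),
                          order  = 1:6,
                          stringsAsFactors = FALSE)

# Hypothetical alternative labels to predict instead of T/C.
new_class <- c(S1 = "F", S2 = "M", S3 = "F", S4 = "M", S5 = "F", S6 = "M")

# Keep sample, batch, and order untouched; replace only the class column.
sample_ml <- sample_list
sample_ml$class <- unname(new_class[sample_ml$sample])

ml_file <- file.path(tempdir(), "sample_ml.tsv")
write.table(sample_ml, ml_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The path stored in `ml_file` could then be supplied as the class_for_ml argument.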

When the function finishes successfully, an HTML-based report that contains the different evaluation metrics is generated. Example reports are available at https://github.com/bzhanglab/OmicsEV.

Evaluation metrics implemented in OmicsEV

So far, more than 15 evaluation metrics have been implemented in OmicsEV, and the evaluation results are organized in the following structure:

  1. Introduction
  2. Overview
  3. Data depth
     a. Study-wise (#identified features, #quantifiable features)
     b. Sample-wise
     c. Missing value distribution (Non-missing value percentage in the data table)
  4. Data normalization
     a. Boxplot (Data distribution similarity)
     b. Density plot
  5. Batch effect
     a. Silhouette width (silhouette width)
     b. PCA with batch annotation (pcRegscale)
     c. Correlation heatmap
  6. Biological signal
     a. Correlation among protein complex members (complex_ks)
     b. Gene function prediction (func_auc)
     c. Sample class prediction (class_auc)
     d. PCA with sample class annotation
     e. Unsupervised clustering
  7. Platform reproducibility (optional with QC sample)
     a. Coefficient of variation distribution (median CV)
  8. Multi-omics concordance (optional with two omics)
     a. Gene-wise mRNA-protein correlation (gene wise cor)
     b. Sample-wise mRNA-protein correlation (sample wise cor)
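
To make a few of these metrics concrete, the sketch below computes simple base-R versions of the non-missing value percentage, the median coefficient of variation, and the gene-wise and sample-wise correlations on toy data. This is not OmicsEV's internal implementation: the toy matrices, the choice of S1 to S3 as QC replicates, and the use of Spearman correlation are all assumptions made for illustration.

```r
set.seed(42)
# Toy matrices: 4 genes x 6 samples (hypothetical data).
genes   <- paste0("G", 1:4)
samples <- paste0("S", 1:6)
rna <- matrix(rlnorm(24, meanlog = 5), nrow = 4,
              dimnames = list(genes, samples))
protein <- rna * matrix(rlnorm(24, sdlog = 0.3), nrow = 4)
rna[2, 3] <- NA  # introduce one missing value

# Non-missing value percentage in the data table.
non_missing_pct <- 100 * mean(!is.na(rna))

# Median coefficient of variation, treating S1-S3 as QC replicates (assumption).
qc <- rna[, c("S1", "S2", "S3")]
cv <- apply(qc, 1, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))
median_cv <- median(cv, na.rm = TRUE)

# Gene-wise correlation: one value per gene, computed across samples.
gene_wise_cor <- sapply(genes, function(g)
  cor(rna[g, ], protein[g, ], method = "spearman",
      use = "pairwise.complete.obs"))

# Sample-wise correlation: one value per sample, computed across genes.
sample_wise_cor <- sapply(samples, function(s)
  cor(rna[, s], protein[, s], method = "spearman",
      use = "pairwise.complete.obs"))
```

A lower median CV among QC replicates suggests better reproducibility, while higher gene-wise and sample-wise correlations suggest better agreement between the two omics layers.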

OmicsEV evaluation report

A few example evaluation reports are available at https://github.com/bzhanglab/OmicsEV.

Session information

All software and respective versions used to produce this document are listed below.

sessionInfo()

bzhanglab/OmicsEV documentation built on July 5, 2023, 5:29 a.m.