In kstawiski/OmicSelector: OmicSelector - a package for biomarker selection based on high-throughput experiments.

knitr::opts_chunk$set(
  collapse = TRUE, message=FALSE, warning=FALSE,
  comment = "#>"
)
knitr::opts_chunk$set(fig.width=12, fig.height=8)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=150),tidy=TRUE)
options(rgl.useNULL = TRUE)
options(warn=-1)
suppressMessages(library(dplyr))
set.seed(1)
options(knitr.table.format = "html")
library(OmicSelector)

Setup script

The package OmicSelector has a lot of requirements that are necessary to run all the experiments. The script below will allow installing most of them. It is highly recommended to install those packages using the code below.

If you are using our docker environment to learn this package, you should not run this again. It would be a waste of time. [However, if you wish, you can play with the code in the notebook. Click here.](If you are using our environment to learn this package, you can tweak this code in the interactive notebook. Click here.

Setup script:

readLines("https://raw.githubusercontent.com/kstawiski/OmicSelector/master/vignettes/setup.R") %>% paste0(collapse="\n") %>% cat

As the OmicSelector package is installed, we can load it easily.

library(OmicSelector)

Getting the data

To present the package functionality, we will use the pan-cancer data from TCGA (https://portal.gdc.cancer.gov/repository). These two commands below will download and store all the data required in your working directory. Please note that this process may take some time (depending on your network connection), but has to be performed only once.

# DO NOT RUN!!! THIS WILL WASTE TIME, just use data("orginal_TCGA_data")
# OmicSelector_download_tissue_miRNA_data_from_TCGA() # do not run it if you do not need to... You can use already downloaded files via data("orginal_TCGA_data") provided with the package.
# OmicSelector_process_tissue_miRNA_TCGA(remove_miRNAs_with_null_var = T)

Both of those functions produce two files: tissue_miRNA_counts.csv and tissue_miRNA_logtpm.csv. The first file contains metadata and raw counts as declared in TCGA. The second has log-transformed transcripts-per-million (TPM) counts. Let's load counts files and see its summary.

suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(knitr)))
data("orginal_TCGA_data")
OmicSelector_table(table(orginal_TCGA_data$primary_site, orginal_TCGA_data$sample_type))

Let's consider the following exemplary problem.

We want to find the set of miRNAs most specific to pancreatic cancer. We see 178 cases of pancreatic cancer miRNA-seq results and only four solid tissue normal cases in ur TCGA dataset. However, we have multiple normal tissue miRNA-seq results from other projects that we could incorporate in the analysis. Let's filter and label the samples of interest.

suppressWarnings(suppressMessages(library(dplyr)))

cancer_cases = filter(orginal_TCGA_data, primary_site == "Pancreas" & sample_type == "PrimaryTumor")
control_cases = filter(orginal_TCGA_data, sample_type == "SolidTissueNormal")

The pipeline requires the variable Class with two levels (Case and Control) to be present in the dataset.

cancer_cases$Class = "Case"
control_cases$Class = "Control"

dataset = rbind(cancer_cases, control_cases)

OmicSelector_table(table(dataset$Class), col.names = c("Class","Number of cases"))

boxplot(dataset$age_at_diagnosis ~ dataset$Class)
t.test(dataset$age_at_diagnosis ~ dataset$Class)
OmicSelector_table(table(dataset$gender.x, dataset$Class))
chisq.test(dataset$gender.x, dataset$Class)

There is a statistically significant difference in age between classes. Gender was not associated with the class. To increase feature selection performance, we will try to balance the sets by performing propensity score matching.

old_dataset = dataset # backup
dataset = dataset[grepl("Adenocarcinomas", dataset$disease_type),]
match_by = c("age_at_diagnosis","gender.x")
tempdane = dplyr::select(dataset, all_of(match_by))
tempdane$Class = ifelse(dataset$Class == "Case", TRUE, FALSE)
suppressMessages(library(mice))
suppressMessages(library(MatchIt))
temp1 = mice(tempdane, m=1)
temp2 = temp1$data
temp3 = mice::complete(temp1)
temp3 = temp3[complete.cases(temp3),]
tempform = OmicSelector_create_formula(match_by)
mod_match <- matchit(tempform, data = temp3)
newdata = match.data(mod_match)
dataset = dataset[as.numeric(rownames(newdata)),]

Let's check if there is an imbalance in the new, transformed dataset.

boxplot(dataset$age_at_diagnosis ~ dataset$Class)
t.test(dataset$age_at_diagnosis ~ dataset$Class)
OmicSelector_table(table(dataset$gender.x, dataset$Class))
chisq.test(dataset$gender.x, dataset$Class)
fwrite(dataset, "balanced_dataset.csv.gz")
OmicSelector_tutorial_balanced_dataset = dataset # can be used by data("OmicSelector_tutorial_balanced_dataset")

There are no significant differences in age or gender composition between classes now. We can proceed to standard filtering, log-transformation, and TPM-normalization.

However, first, to stay consistent between different datasets, we need to standardize microRNA names. OmicSelector_correct_miRNA_names() unify the miRNA names between different versions of miRbase.

# In real file use: (we won't run this in this tutorial)
# dataset = OmicSelector_correct_miRNA_names(dataset) 
# This will correct miRNA names based on the aliases. Useful when analyzing old datasets - to keep the results coherent with current knowledge.

danex = dplyr::select(dataset, starts_with("hsa")) # Create data.frame or matrix with miRNA counts with miRNAs in columns and cases in rows.
metadane = dplyr::select(dataset, -starts_with("hsa")) # Metadata with 'Class' variables.
OmicSelector_table(table(metadane$Class)) # Let's be sure that 'Class' variable is correct and contains only 'Case' and 'Control' cases.
ttpm = OmicSelector_counts_to_log10tpm(danex, metadane, ids = metadane$sample, filtr = T, filtr_minimalcounts = 100, filtr_howmany = 1/3) # We will leave only the miRNAs which apeared with at least 100 counts in 1/3 of cases.

You might have noticed that we have filtered out microRNAs not having at least 100 counts in 1/3 of the samples. We did ensure that microRNAs selected as features for the classifier (possibly useful in the clinic) would be detectable in other, cheaper methods, such as qPCR. After filtering, there are 166 miRNAs left.

In the next step, we will divide the dataset into training, testing, and validation datasets. We strongly believe that hold-out validation is the most redundant validation method, and although OmicSelector supports cross-validation, the hold-out confirmation is set by default in most cases. Thus, the rest of the analysis is dependent on the existence of 3 separate datasets:

Training dataset (mixed_train.csv): By default 60%, used for differential expression, feature selection and model training.
Testing dataset (mixed_test.csv): By default 20%, used for hyperparameter selection (in holdout=T mode) and for performance assessment.
Validation dataset (mixed_valid.csv): By default 20%, used only for performance assessment.

The best signature (best set of miRNAs for the diagnostic test) can be selected based on all three datasets, two datasets, or only a validation set. We will discuss the process of the best signature selection below.

The split can be prepared manually by the user (the pipeline expects to find mixed_*.csv files in the working directory) or in a convenient way using OmicSelector_prepare_split(). Let's do it now.

mixed = OmicSelector_prepare_split(metadane = metadane, ttpm = ttpm, train_proc = 0.6)
OmicSelector_tutorial_balanced_mixed = mixed # can be used by data("OmicSelector_tutorial_balanced_mixed")

Let's see a split summary.

mixed = fread("mixed.csv")
OmicSelector_table(table(mixed$Class, mixed$mix))
OmicSelector_table(cbind(mixed[1:10,c(100:105)], Class = mixed[1:10,"Class"]))

We can see that the dataset was divided in a balanced way. Now we are ready to move to the analysis...

Basic exploratory analysis

In biomarker studies, we rely on validation. We perform hold-out verification, so the signature selection has to be based on the training dataset only. Including testing and validation dataset in the exploratory analysis could lead to bias ('data leakage'). In the following section, we show how to use our package to perform quick exploratory analysis of miRNA-seq data.

dane = OmicSelector_load_datamix() # load mixed_*.csv files
train = dane[[1]]; test = dane[[2]]; valid = dane[[3]]; train_smoted = dane[[4]]; trainx = dane[[5]]; trainx_smoted = dane[[6]] # get the objects from list to make the code more readable.

OmicSelector_load_datamix() function loads the data created in preparation phase. It requires the output constructed by OmicSelector_prepare_split() function to be placed in working directory ('wd'), thus files 'mixed_train.csv', 'mixed_test.csv' and 'mixed_valid.csv' have to exist in the directory. If you have split the data manually, there may be some imbalance of classes in train, test, or validation datasets. If so, OmicSelector_prepare_split() perform balancing using:

ROSE: https://journal.r-project.org/archive/2014/RJ-2014-008/RJ-2014-008.pdf - by default, we generate 10 * number of cases in the original dataset.
SMOTE (default): https://arxiv.org/abs/1106.1813 - by default, we use our modification of SMOTE, i.e. we add SMOTEd cases only to the minority class (we do not down-/sample majority class). The minimal amount of added rows is calculated to compensate for the imbalance problem. See details in the code of OmicSelector_load_datamix().

At the beginning of the analysis, we usually perform principal component analysis (PCA) to assess any batch effect and possible outliers; hence understand the miRNA profile. The package can construct a 2-dimensional biplot and a 3-dimensional interactive scatterplot based on the computed components.

pca = OmicSelector_PCA(trainx, train$Class)
pca

3D PCA plot may not be shown correctly in Jupyter notebook, but you can check it in, e.g., R Studio.

if(is.null(sessionInfo()$loadedOnly$IRdisplay)) { # if not in the Jupyter, if you run OmicSelector_PCA_3D in learning/editing Jupyter enviorment it may cause: *** caught segfault *** address 0x1, cause 'memory not mapped'
pca3d = OmicSelector_PCA_3D(trainx, train$Class)
pca3d }

Now we can also correct the batch effect if there is any. For example, you can use OmicSelector_combat() to do so. However, the detailed demonstration is out of the scope of this tutorial.

Usually, the next step in the exploratory analysis is to perform the differential expression analysis. Differential expression in our package is biomarker-discovery oriented. Thus it uses a t-test with the correction for multiple comparisons. The following table shows significantly differently expressed miRNAs after Benjamini-Hochberg correction.

de = OmicSelector_differential_expression_ttest(trainx, train$Class)
sig_de = de %>% dplyr::filter(`p-value BH` <= 0.05) %>% dplyr::arrange(`p-value BH`) # leave only significant after Benjamini-Hochberg procedure and sort by ascending p-value
OmicSelector_table(sig_de)

Let's visualize the results of differential expression using heatmap and volcano plot.

OmicSelector_heatmap(x = dplyr::select(trainx, sig_de$miR),
           rlab = data.frame(Class = train$Class),
           zscore = F, margins = c(10,10))

Z-scoring the values before clustering and plotting will help us to gain more insight.

OmicSelector_heatmap(x = dplyr::select(trainx, sig_de$miR),
           rlab = data.frame(Class = train$Class),
           zscore = T, margins = c(10,10))

We will also create a vulcano plot and label top 10 most significant miRNAs:

OmicSelector_vulcano_plot(selected_miRNAs = de$miR, DE = de, only_label = sig_de$miR[1:10])

We may also what to check the consistency of differential expression between datasets:

de_test = OmicSelector_differential_expression_ttest(dplyr::select(test, starts_with("hsa")), test$Class)
de_valid = OmicSelector_differential_expression_ttest(dplyr::select(valid, starts_with("hsa")), valid$Class)
OmicSelector_correlation_plot(de$log2FC, de_test$log2FC, "log2FC on training set", "log2FC on test set", "", yx = T)
OmicSelector_correlation_plot(de$log2FC, de_valid$log2FC, "log2FC on training set", "log2FC on validation set", "", yx = T)
OmicSelector_correlation_plot(de_test$log2FC, de_valid$log2FC, "log2FC on test set", "log2FC on validation set", "", yx = T)

miRNA selection

This package's main feature is the shotgun-like feature selection evaluation of possible miRNA signatures of biological processes. We can straightforwardly apply the function, e.g.:

library(OmicSelector)
selected_features = OmicSelector_OmicSelector(wd = getwd(), m = c(1:4), max_iterations = 1, stamp = "tutorial") # For the sake of this tutorial and vignette building we will use only few fastest methods. The m parameter defines what methods will be tested. See more details below.

But, for larger projects, we suggest using the following wrapper, which will perform the feature selection in parallel, significantly reducing computational time. We do not recommend using more than five threads, because some of the methods inherently use multicore processing:

readLines("https://raw.githubusercontent.com/kstawiski/OmicSelector/master/vignettes/Tutorial_OmicSelector.R") %>% paste0(collapse="\n") %>% cat

If you are using our environment to learn this package, you can tweak and run this code in the interactive notebook. Click here.

A few notes about what it does:

This function iterates through available methods, which are described by m parameter. This function aims to perform feature selection using multiple approaches and create formulas for benchmarking.
It loads the data from the working directory. The output is mainly put in files in the working directory. Log and temporary files are placed in a created temp subfolder.

Files created for each method (e.g. for stamp=tutorial and m=1):

formulastutorial-1.RDS - main result file containing the final formula (final set of miRNAs selected by this method).
time1-formula.RDS - time taken to compute the results
tutorial1featureselection.log - log file of the process
(optional) all1-tutorial.rdata - all variables created during feature selection (created if debug=T).

Pearls about the methods:

Sig = miRNAs with p-value <0.05 after BH correction (DE using t-test)
Fcsig = sig + absolute log2FC filter (included if abs. log2FC>1)
Cfs = Correlation-based Feature Selection for Machine Learning (more: https://www.cs.waikato.ac.nz/~mhall/thesis.pdf)
Classloop = Classification using different classification algorithms (classifiers) with the embedded feature selection and using the various schemes for the performance validation (more: https://rdrr.io/cran/Biocomb/man/classifier.loop.html)
Fcfs = CFS algorithm with forward search (https://rdrr.io/cran/Biocomb/man/select.forward.Corr.html)
MDL methods = minimal description length (MDL) discretization algorithm with different a method of feature ranking or feature selection (AUC, SU, CorrSF) (more: https://rdrr.io/cran/Biocomb/man/select.process.html)
bounceR = genetic algorithm with componentwise boosting (more: https://www.statworx.com/ch/blog/automated-feature-selection-using-bouncer/)
RandomForestRFE = recursive feature elimination using random forest with resampling to assess the performance. (more: https://topepo.github.io/caret/recursive-feature-elimination.html#resampling-and-external-validation)
GeneticAlgorithmRF (more: https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html)
SimulatedAnnealing = makes small random changes (i.e. perturbations) to an initial candidate solution (more: https://topepo.github.io/caret/feature-selection-using-simulated-annealing.html)
Boruta (more: https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf)
spFSR = simultaneous perturbation stochastic approximation (SPSA-FSR) (more: https://arxiv.org/abs/1804.05589)
varSelRF = using the out-of-bag error as minimization criterion, carry out variable elimination from the random forest by successfully eliminating the least important variables (with importance as returned from the random forest). (more: https://www.ncbi.nlm.nih.gov/pubmed/16398926)
WxNet = a neural network-based feature selection algorithm for transcriptomic data (more: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642261/)
Step = backward stepwise method of feature selection based on logistic regression (GLM, family = binomial) using AIC criteria (stepAIC) and functions from My.stepwise package (https://cran.r-project.org/web/packages/My.stepwise/index.html)

Notes about methods:

shiny::includeHTML("methods.html")

The OmicSelector functions saves all output files to temp/ directory. As users may want to run multiple selection methods in different configurations, we do not recommend using the return of this function in the following steps. Instead, we provide OmicSelector_merge_formulas() which conviniently summerizes the results of feature selection. We can do:

selected_sets_of_miRNAs = OmicSelector_merge_formulas(max_miRNAs = 11) # we filter out sets with more than 11 miRNAs.
selected_sets_of_miRNAs_with_own = OmicSelector_merge_formulas(max_miRNAs = 11, 
                                                     add = list("my_own_signature" = c("hsa.miR.192.5p","hsa.let.7g.5p","hsa.let.7a.5p","hsa.let.7d.5p","hsa.miR.194.5p", "hsa.miR.98.5p", "hsa.let.7f.5p", "hsa.miR.26b.5p"))) # you can also add your own signature (for example selected from literature)

Note that:

Two files are created in the working directory: (1) featureselection_formulas_all.RDS - contains the formulas for all selection methods, (2) featureselection_formulas_final.RDS - contains methods that selected a smaller or equal number of microRNAs than specified in max_miRNA parameter, as well as fcsig and cfs_sig.
The results of fcsig and cfs_sig methods are always retained in the final formulas set (ignoring max_miRNA parameter) - they are commonly used as benchmark comparator for the final set of miRNAs. Those sets can be manually removed (if needed) from the final selection.
Readable tables are written in corresponding *.csv files.

Let's analyze the process of feature selection:

all_sets = readRDS("featureselection_formulas_all.RDS")
length(all_sets) # How many feature selection methods completed in time?
final_sets = readRDS("featureselection_formulas_final.RDS")
length(final_sets) # How many feature selection methods completed in time and fulfilled max_miRNA criteria? (remember about fcsig and cfs_sig)
featureselection_formulas_final = fread("featureselection_formulas_final.csv")
OmicSelector_table(featureselection_formulas_final) # show information about selected formulas

Note that my_own_signture has 0 miRNAs according to the table. This trick ensures that we will not filter out signatures added manually based on the max_miRNA parameter.

We will summarize the number of microRNAs selected by methods implemented in OmicSelector_OmicSelector by creating a histogram and calculating some descriptive statistics.

hist(featureselection_formulas_final$ile_miRNA[-which(featureselection_formulas_final$ile_miRNA == 0)], 
     breaks = ncol(train),
     main = "Number of selected microRNAs distribution",
     xlab = "Number of selected microRNAs"
     ) # Histogram showing how many miRNAs were selected in final set.
psych::describe(featureselection_formulas_final$ile_miRNA[-which(featureselection_formulas_final$ile_miRNA == 0)]) # Descriptive statistics of how many features where selected in the final set.

Benchmarking

In the next step of looking for the best microRNA signature, we perform benchmarking. This tests all the signatures using different classifier architectures. Here is the example of a benchmark with default parameters:

readLines("Tutorial_benchmark.R") %>% paste0(collapse="\n") %>% cat

If you are using our environment to learn this package, you can tweak and run this code in the interactive notebook. Click here.

Just for rendering this tutorial, we will use a straightforward benchmark using only logistic regression and conditional trees.

library(OmicSelector)
OmicSelector_tutorial_balanced_benchmark = OmicSelector_benchmark(search_iters = 5, # 5 random hyperparameter sets will be checked; 5 is set here for speed purposes.. for real projects use more, like 5000...
            algorithms = c("ctree"), # just add ctree, note that logistic regression (glm) is always included
            output_file = paste0("benchmark.csv")) # the main output
# exemplary benchmark data can be loaded using data('OmicSelector_tutorial_balanced_benchmark')

As benchmarking is done, the main result file is saved in the file specified in output_file parameter. It contains the performance metrics of signatures satisfying initial criteria (e.g. max_miRNA) across different methods of modeling. Let's take a quick look:

OmicSelector_table(fread("benchmark.csv"))

Description of columns:

method - accronym for method
SMOTE - if balancing using SMOTE or ROSE was used in the training dataset.
miRy - formula used (miRNAs selected).
*_modelname - the name of .RDS file placed in models/ directory, containing the caret final model that can be used for the prediction of new cases. This allows the reproducibility of the results. For example glm_ prefix is set according to the method, glm = logistic regression.
*_train_ROCAUC - area under the ROC curve (AUC ROC) on the training dataset, indicating the general potential of the model.
*_train_ROCAUC_lower95CI - lower boundery of 95% confidence interval for AUC ROC.
*_train_ROCAUC_upper95CI - upper boundery of 95% confidence interval for AUC ROC.
*_train_Accuracy - accuracy on the training set.
*_train_Sensitivity - sensitivity on training set.
*_train_Specificity - sensitivity on the training set.

By this logic, every parameter is also calculated from the testing (_test_) and validation (_valid_) set. If the method generated a probability, a default cutoff is used for all of the predictions.

Let's see the general performance (accuracy) of methods in the benchmark:

metody = OmicSelector_get_benchmark_methods("benchmark.csv") # gets the methods used in benchmark
par(mfrow = c(2,2))
for(i in 1:length(metody)){
    temp = OmicSelector_get_benchmark("benchmark.csv") # loads benchmark
    temp2 = dplyr::select(temp, starts_with(paste0(metody[i],"_")))
    boxplot(temp[,paste0(metody[i],"_train_Accuracy")], temp[,paste0(metody[i],"_test_Accuracy")], temp[,paste0(metody[i],"_valid_Accuracy")],
            main = paste0("Method: ", metody[i]), names = c("Training","Testing","Validation"), ylab = "Accuracy", ylim = c(0.5,1))
    tempids = c(match(paste0(metody[i],"_train_Accuracy"), colnames(temp)), match(paste0(metody[i],"_test_Accuracy"), colnames(temp)), match(paste0(metody[i],"_valid_Accuracy"), colnames(temp)))
  }
par(mfrow = c(1,1))

With OmicSelector package, the final optimal feature signature can be selected in 3 ways:

1. Picking the signature which achieved the best accuracy in training, testing and validation: (metaindex = mean of all 3 accuracy metrics)

acc1 = OmicSelector_best_signature_proposals(benchmark_csv = "benchmark.csv", without_train = F) # generates the benchmark sorted by metaindex
best_signatures = acc1[1:3,] # get top 3 methods
OmicSelector_table(best_signatures[,c("metaindex","method","miRy")])

2. Picking the signature which achieved the best accuracy in testing and validation: (metaindex = mean of 2 accuracy metrics)

acc1 = OmicSelector_best_signature_proposals(benchmark_csv = "benchmark.csv", without_train = T) # generates the benchmark sorted by metaindex
best_signatures = acc1[1:3,] # get top 3 methods
OmicSelector_table(best_signatures[,c("metaindex","method","miRy")])

3. The signature which achieved the best sensitivity and specificity in validation: (metaindex = mean of sensitivity and specificity in validation dataset)

acc = OmicSelector_best_signature_proposals_meta11(benchmark_csv = "benchmark.csv") # generates the benchmark sorted by metaindex
best_signatures = acc[1:3,] # get top 3 methods
OmicSelector_table(best_signatures[,c("metaindex","method","miRy")])

It is a good practice to assess learning performance for considered signatures across different classification methods. Here, we visualize the over/underfitting of selected methods by comparing the accuracy score between testing and validation sets for 3 top scoring signatures across a selection of model architectures.

for(i in 1:length(metody))
  {
suppressMessages(library(PairedData))
suppressMessages(library(profileR))
pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_train_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_test_Accuracy")]))
colnames(pd) = c("Train Accuracy","Test Accuracy")
plot2 = OmicSelector_profileplot(pd, Method.id = acc$method[1:3], standardize = F)
pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_train_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_valid_Accuracy")]))
colnames(pd) = c("Train Accuracy","Valid Accuracy")
plot3 = OmicSelector_profileplot(pd, Method.id = acc$method[1:3], standardize = F)
pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_test_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_valid_Accuracy")]))
colnames(pd) = c("Test Accuracy","Valid Accuracy")
plot4 = OmicSelector_profileplot(pd, Method.id = acc$method[1:3], standardize = F)



require(gridExtra)
grid.arrange(arrangeGrob(plot2, plot3, ncol=2, nrow = 1, top=metody[i]))
grid.arrange(arrangeGrob(plot4, ncol=1, nrow = 1, top=metody[i]))
}

The relationship betweend accuracy on testing and validation sets can also be pictured as scatterplot:

acc2 = acc[1:6,] # get top 6 methods
accmelt = melt(acc2, id.vars = "method") %>% filter(variable != "metaindex") %>% filter(variable != "miRy")
accmelt = cbind(accmelt, strsplit2(accmelt$variable, "_"))
acctest = accmelt$value[accmelt$`2` == "test"]
accvalid = accmelt$value[accmelt$`2` == "valid"]
accmeth = accmelt$method[accmelt$`2` == "test"]
unique(accmeth)
plot5 = ggplot(, aes(x = as.numeric(acctest), y = as.numeric(accvalid), shape = accmeth)) +
  geom_point() + scale_x_continuous(name="Accuracy on test set", limits=c(0.5, 1)) +
  scale_y_continuous(name="Accuracy on validation set", limits=c(0.5, 1)) +
  theme_bw()
grid.arrange(arrangeGrob(plot5, ncol=1, nrow = 1))

Best signature analysis

Suppose we decide to use sensitivity and specificity in the validation set (e.g., in the model architectures we plan to use in our research) as a decisive scoring metric for feature signature. The three best signatures may be:

OmicSelector_table(best_signatures[1:3,2:4])

To get the miRNAs from the formula you can use: OmicSelector_get_features_from_benchmark.

selected_miRNAs = OmicSelector_get_features_from_benchmark(benchmark_csv = "benchmark.csv", best_signatures$method[1]) # for the best performing signature
gsub("\\.", "-", selected_miRNAs) # R doesn't like hyphens, but we can introduce them easly

As a double check, we can inspect the differential expression of miRNAs constituating selected signature:

best_de = OmicSelector_best_signature_de(selected_miRNAs)
OmicSelector_table(best_de)

Let's visualize the performance of those methods using barplots:

for(i in 1:3){
  cat(paste0("\n\n## ", acc$method[i],"\n\n"))
  par(mfrow = c(1,2))
  acc = OmicSelector_best_signature_proposals_meta11("benchmark.csv")
  metody = OmicSelector_get_benchmark_methods("benchmark.csv")
  ktory_set = match(acc$method[i], OmicSelector_get_benchmark("benchmark.csv")$method)
  #do_ktorej_kolumny = which(colnames(acc) == "metaindex")
  #barplot(as.numeric(acc[i,1:do_ktorej_kolumny]))
  for(ii in 1:length(metody)) {

    temp = OmicSelector_get_benchmark("benchmark.csv") %>% 
      dplyr::select(starts_with(paste0(metody[ii],"_t")),starts_with(paste0(metody[ii],"_v")))

    ROCtext = paste0("Training AUC ROC: ", round(temp[ktory_set,1],2), " (95%CI: ", round(temp[ktory_set,2],2), "-", round(temp[ktory_set,3],2), ")")

    temp = temp[,-c(1:3)]
    temp2 = as.numeric(temp[ktory_set,])
    temp3 = matrix(temp2, nrow = 3, byrow = T)
    colnames(temp3) = c("Accuracy","Sensitivity","Specificity")
    rownames(temp3) = c("Training","Testing","Validation")
    temp3 = t(temp3)

    plot1 = barplot(temp3, beside=T, ylim = c(0,1), xlab = paste0(ROCtext,"\nBlack - accuracy, blue - sensitivity, green - specificity"), width = 0.85, col=c("black", "blue", "green"), legend = F,  args.legend = list(x="topright", bty = "n", inset=c(0, -0.25)), cex.lab=0.7, main = paste0(acc$method[i], " - ", metody[ii]), font.lab=2)
    ## Add text at top of bars
    text(x = plot1, y = as.numeric(temp3), label = paste0(round(as.numeric(temp[ktory_set,])*100,1),"%"), pos = 3, cex = 0.6, col = "red")
  }
  par(mfrow = c(1,1))

}

Finally, we can assess the overlap of top 3 feature selection methods:

overlap = OmicSelector_signature_overlap(acc$method[1:3], "benchmark.csv")

Which miRNAs are common for between feature selection methods?

attr(overlap,"intersections")

Let's draw vulcano plot and mark the miRNAs selected in best signature:

OmicSelector_vulcano_plot(selected_miRNAs = de$miR, DE = de, only_label = selected_miRNAs)

Let's draw a heatmap for selected miRNAs in the whole dataset (training, testing and validation set).

OmicSelector_heatmap(x = dplyr::select(mixed, selected_miRNAs),
           rlab = data.frame(Class = mixed$Class, Mix = mixed$mix),
           zscore = F, margins = c(10,10))

OmicSelector_heatmap(x = dplyr::select(mixed, selected_miRNAs),
           rlab = data.frame(Class = mixed$Class, Mix = mixed$mix),
           zscore = T, margins = c(10,10))

Based on everything we have done so far, we suggest using the following signature to validate the biomarker study further.

cat(paste0(gsub("\\.", "-", selected_miRNAs), collapse = ", "))

Sesssion

sessionInfo()

packageDescription("OmicSelector")

To render this tutorial we used:

render("Tutorial.Rmd", output_file = "Tutorial.html", output_dir = "../inst/doc/")

Packages installed in our docker enviorment:

OmicSelector_table(as.data.frame(installed.packages()))

Clean the temporary and model files (as the tutorial results are simplified, and we do not need them).

unlink("temp", recursive=TRUE)
unlink("models", recursive=TRUE)
unlink("task.log")
unlink("mixed*.csv")

kstawiski/OmicSelector documentation built on April 10, 2024, 11:11 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

kstawiski/OmicSelector
OmicSelector - a package for biomarker selection based on high-throughput experiments.

In kstawiski/OmicSelector: OmicSelector - a package for biomarker selection based on high-throughput experiments.

Setup script

Getting the data

Basic exploratory analysis

miRNA selection

Benchmarking

Best signature analysis

Sesssion

R Package Documentation

Browse R Packages

We want your feedback!

kstawiski/OmicSelector OmicSelector - a package for biomarker selection based on high-throughput experiments.

In kstawiski/OmicSelector: OmicSelector - a package for biomarker selection based on high-throughput experiments.

Setup script

Getting the data

Basic exploratory analysis

miRNA selection

Benchmarking

Best signature analysis

Sesssion

R Package Documentation

Browse R Packages

We want your feedback!

kstawiski/OmicSelector
OmicSelector - a package for biomarker selection based on high-throughput experiments.