knitr::opts_chunk$set( collapse = TRUE, message=FALSE, warning=FALSE, comment = "#>" ) knitr::opts_chunk$set(fig.width=12, fig.height=8) knitr::opts_chunk$set(tidy.opts=list(width.cutoff=150),tidy=TRUE) options(rgl.useNULL = TRUE) options(warn=-1) suppressMessages(library(dplyr)) set.seed(1) options(knitr.table.format = "html") library(miRNAselector)
The package miRNAselector has a lot of requirements that are nessesary to run all the experiments. The script below will allow to install most of them. It is highly recommended to install those packages using the code below.
If you are using our docker enviorment to learn this package you should not run this again. It would be a waste of time. [However, if you wish, you can play with the code in the notebook. Click here.](If you are using our enviorment to learn this package you can tweak this code in the interactive notebook. Click here.
Setup script:
readLines("https://raw.githubusercontent.com/kstawiski/miRNAselector/master/vignettes/setup.R") %>% paste0(collapse="\n") %>% cat
This code does not cover the installation of mxnet, which can be used for benchmarking of the selected miRNA sets. The mxnet is however built and installed in our docker enviorment.
As the miRNAselector
package is installed we can load it easly.
library(miRNAselector)
To present the package functionality, we will use the pan-cancer data from TCGA (https://portal.gdc.cancer.gov/repository). These two commands below will download and store all the necessary data in your working directory. Please note, that this process may take some time (depending on your network connection), but has to be performed only once.
ks.download_tissue_miRNA_data_from_TCGA() ks.process_tissue_miRNA_TCGA(remove_miRNAs_with_null_var = T)
Both of those function produce 2 files: tissue_miRNA_counts.csv
and tissue_miRNA_logtpm.csv
.
First of those files contains metadata and raw counts as declared in TCGA. The second is are log-transformed transcripts-per-million (TPM) counts.
Let's load counts files and see its summary.
suppressWarnings(suppressMessages(library(data.table))) suppressWarnings(suppressMessages(library(knitr))) data("orginal_TCGA_data") ks.table(table(orginal_TCGA_data$primary_site, orginal_TCGA_data$sample_type))
Let's consider a following exemplary problem..
We want to find the set of miRNAs the most specific to pancreatic cancer. We see that there are 178 cases of pancreatic cancer miRNA-seq results and only 4 solid tissue normal cases. However, we have multiple normal tissue miRNA-seq results from other projects that could be incorporated in the analysis. Let's filter and label the samples of interest.
suppressWarnings(suppressMessages(library(dplyr))) cancer_cases = filter(orginal_TCGA_data, primary_site == "Pancreas" & sample_type == "PrimaryTumor") control_cases = filter(orginal_TCGA_data, sample_type == "SolidTissueNormal")
The pipeline requires the variable Class
with two levels (Cancer
and Control
) to be present in the dataset.
cancer_cases$Class = "Cancer" control_cases$Class = "Control" dataset = rbind(cancer_cases, control_cases) ks.table(table(dataset$Class), col.names = c("Class","Number of cases"))
boxplot(dataset$age_at_diagnosis ~ dataset$Class) t.test(dataset$age_at_diagnosis ~ dataset$Class) ks.table(table(dataset$gender.x, dataset$Class)) chisq.test(dataset$gender.x, dataset$Class)
There is a stistically significant difference in age between classess. The gender was not associated with class. In order to increase feature selection performance, we will try to balance the sets by performing propensity score matching.
old_dataset = dataset # backup dataset = dataset[grepl("Adenocarcinomas", dataset$disease_type),] match_by = c("age_at_diagnosis","gender.x") tempdane = dplyr::select(dataset, match_by) tempdane$Class = ifelse(dataset$Class == "Cancer", TRUE, FALSE) suppressMessages(library(mice)) suppressMessages(library(MatchIt)) temp1 = mice(tempdane, m=1) temp2 = temp1$data temp3 = mice::complete(temp1) temp3 = temp3[complete.cases(temp3),] tempform = ks.create_miRNA_formula(match_by) mod_match <- matchit(tempform, data = temp3) newdata = match.data(mod_match) dataset = dataset[as.numeric(rownames(newdata)),]
Let's check if there is any imbalance in the new, transformed dataset.
boxplot(dataset$age_at_diagnosis ~ dataset$Class) t.test(dataset$age_at_diagnosis ~ dataset$Class) ks.table(table(dataset$gender.x, dataset$Class)) chisq.test(dataset$gender.x, dataset$Class) fwrite(dataset, "balanced_dataset.csv.gz") miRNAselector_tutorial_balanced_dataset = dataset # can be used by data("miRNAselector_tutorial_balanced_dataset")
There are no significant differences in age or gender composition between classes now. We can proceed to standard filtering, log-transormation and TPM-normalization.
However, first, in order to stay consistent between different datasets, we need to standardise microRNA names.
ks.correct_miRNA_names()
unify the miRNA names between different versions of miRbase.
dataset = ks.correct_miRNA_names(dataset) # Correct miRNA names based on the aliases. Useful when analyzing old datasets - to keep the results coherent with current knowledge. danex = dplyr::select(dataset, starts_with("hsa")) # Create data.frame or matrix with miRNA counts with miRNAs in columns and cases in rows. metadane = dplyr::select(dataset, -starts_with("hsa")) # Metadata with 'Class' variables. ks.table(table(metadane$Class)) # Let's be sure that 'Class' variable is correct and contains only 'Cancer' and 'Control' cases. ttpm = ks.counts_to_log10tpm(danex, metadane, ids = metadane$sample, filtr = T, filtr_minimalcounts = 100, filtr_howmany = 1/3) # We will leave only the miRNAs which apeared with at least 100 counts in 1/3 of cases.
You might have noticed, that we have filtered out microRNAs not having at least 100 counts in 1/3 of samples. It was done to ensure that microRNAs selected as features for classifier (possibly used in clinic) would be detectable in other, cheaper methods, such as qPCR. After filtering there are 166 miRNAs left.
In the next step we will devide the dataset into training, testing and validation datasets. We strongly belive that hold-out validation is the most redundant validation method and although miRNAselector supports cross-validation, the hold-out validation is set by default in most cases. Thus, the rest of the analysis is dependent on existance of 3 seperate datasets:
mixed_train.csv
): By default 60%, used for differential expression, feature selection and model training.mixed_test.csv
): By default 20%, used for hyperparameter selection (in holdout=T
mode)and for performance assessment.mixed_valid.csv
): By default 20%, used only for performance assessment.The best signiture (best set of miRNAs for diagnostic test) can be selected based on all 3 datasets, 2 datasets or only validation set. The process of best signiture selection will be discussed below.
The split can be prepared manually by user (the pipeline expects to find mixed_*.csv
files in working directory) or in a convinient way using ks.prepare_split()
. Let's do it now.
mixed = ks.prepare_split(metadane = metadane, ttpm = ttpm, train_proc = 0.6) miRNAselector_tutorial_balanced_mixed = mixed # can be used by data("miRNAselector_tutorial_balanced_mixed")
Let's see a split summary.
mixed = fread("mixed.csv") ks.table(table(mixed$Class, mixed$mix)) ks.table(cbind(mixed[1:10,c(100:105)], Class = mixed[1:10,"Class"]))
We can see that the dataset was devided in balanced way. Now we are ready to move to the analysis...
In biomarker studies we relay on validation. We perform hold-out validation, so the signature selection has to be based on training dataset only. Including testing and validation dataset in the exploratory analysis could lead to bias ('data leakage'). In the following section we show how to use our package to perform quick exploratory analysis of miRNA-seq data.
dane = ks.load_datamix(use_smote_not_rose = T) # load mixed_*.csv files train = dane[[1]]; test = dane[[2]]; valid = dane[[3]]; train_smoted = dane[[4]]; trainx = dane[[5]]; trainx_smoted = dane[[6]] # get the objects from list to make the code more readable.
ks_load_datamix()
function loads the data created in preparation phase. It requires the output constructed by ks.prepare_split()
function to be placed in working directory ('wd'), thus files 'mixed_train.csv', 'mixed_test.csv' and 'mixed_valid.csv' have to exist in the directory.
If you have split the data manually, there may be some imbalance of classes in train, test or validation datasets. If so, ks.prepare_split()
perform balancing using:
At the beging of the analysis we usually perform principal component analysis (PCA) to assess for any batch effect, possible outliers and get a general understanding of miRNA profile. The package can construct 2-dimentional biplot and 3-dimentional interactive scatterplot based on the computed components.
pca = ks.PCA(trainx, train$Class) pca
3D PCA plot may not be shown correctly in Jupyter notebook, but you can check in e.g. R Studio that it works.
if(is.null(sessionInfo()$loadedOnly$IRdisplay)) { # if not in the Jupyter, if you run ks.PCA_3D in learning/editing Jupyter enviorment it may cause: *** caught segfault *** address 0x1, cause 'memory not mapped' pca3d = ks.PCA_3D(trainx, train$Class) pca3d }
Now we can also correct the batch effect, if there is any. For example, you can use ks.combat()
to do so. However, detailed demonstration is out of scope of this tutorial.
Usually, the next step in the exploratory analysis is to perform the differential expression analysis. Differential expression in our package is biomarker-discovery oriented, thus it uses t-test with the correction for multiple comparisons. The following table shows signifiantly differently expressed miRNAs after Benjamini-Hochberg correction.
de = ks.miRNA_differential_expression(trainx, train$Class) sig_de = de %>% dplyr::filter(`p-value BH` <= 0.05) %>% dplyr::arrange(`p-value BH`) # leave only significant after Benjamini-Hochberg procedure and sort by ascending p-value ks.table(sig_de)
Let's visualize the results of differential expression using heatmap and vulcano plot.
ks.heatmap(x = dplyr::select(trainx, sig_de$miR), rlab = data.frame(Class = train$Class), zscore = F, margins = c(10,10))
Z-scoring the values before clustering and plotting will help us to gain more insight.
ks.heatmap(x = dplyr::select(trainx, sig_de$miR), rlab = data.frame(Class = train$Class), zscore = T, margins = c(10,10))
We will also create a vulcano plot and label top 10 most significant miRNAs:
ks.vulcano_plot(selected_miRNAs = de$miR, DE = de, only_label = sig_de$miR[1:10])
We may also what to check the consistency of differential expression between datasets:
de_test = ks.miRNA_differential_expression(dplyr::select(test, starts_with("hsa")), test$Class) de_valid = ks.miRNA_differential_expression(dplyr::select(valid, starts_with("hsa")), valid$Class) ks.correlation_plot(de$log2FC, de_test$log2FC, "log2FC on training set", "log2FC on test set", "", yx = T) ks.correlation_plot(de$log2FC, de_valid$log2FC, "log2FC on training set", "log2FC on validation set", "", yx = T) ks.correlation_plot(de_test$log2FC, de_valid$log2FC, "log2FC on test set", "log2FC on validation set", "", yx = T)
The main feature of this package is the shotgun-like feature selection evaluation of possible miRNA signatures of biological processes. The function can be applied in a straightforward way, e.g.:
library(miRNAselector) selected_features = ks.miRNAselector(wd = getwd(), m = c(1:3,51), max_iterations = 1, stamp = "tutorial") # For the sake of this tutorial and vignette building we will use only few fastest methods. The m parameter defines what methods will be tested. See more details below.
But, for largers projects we suggest using the following wrapper wich will perform the feature selection in parallel, significantly reducing computational time. We do not recommend using more than 5 threads, beacuse some of the methods inhereditly use multicore processing:
readLines("https://raw.githubusercontent.com/kstawiski/miRNAselector/master/vignettes/Tutorial_miRNAselector.R") %>% paste0(collapse="\n") %>% cat
A few notes about what is does:
m
parameter. The aim of this function is to perform feature selection using multiple methods and to create formulas for benchmarking.temp
subfolder.Files created for each method (e.g. for stamp=tutorial
and m=1
):
formulastutorial-1.RDS
- main result file containing the final formula (final set of miRNAs selected by this method).time1-formula.RDS
- time taken to compute the resultstutorial1featureselection.log
- log file of the processall1-tutorial.rdata
- all variables created during feature selection (created if debug=T
).Pearls about the methods:
Sig
= miRNAs with p-value <0.05 after BH correction (DE using t-test)Fcsig
= sig
+ absolute log2FC filter (included if abs. log2FC>1)Cfs
= Correlation-based Feature Selection for Machine Learning (more: https://www.cs.waikato.ac.nz/~mhall/thesis.pdf)Classloop
= Classification using different classification algorithms (classifiers) with the embedded feature selection and using the different schemes for the performance validation (more: https://rdrr.io/cran/Biocomb/man/classifier.loop.html)Fcfs
= CFS algorithm with forward search (https://rdrr.io/cran/Biocomb/man/select.forward.Corr.html)MDL
methods = minimal description length (MDL) discretization algorithm with different a method of feature ranking or feature selection (AUC, SU, CorrSF) (more: https://rdrr.io/cran/Biocomb/man/select.process.html)bounceR
= genetic algorithm with componentwise boosting (more: https://www.statworx.com/ch/blog/automated-feature-selection-using-bouncer/)RandomForestRFE
= recursive feature elimination using random forest with resampling to assess the performance. (more: https://topepo.github.io/caret/recursive-feature-elimination.html#resampling-and-external-validation)GeneticAlgorithmRF
(more: https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html)SimulatedAnnealing
= makes small random changes (i.e. perturbations) to an initial candidate solution (more: https://topepo.github.io/caret/feature-selection-using-simulated-annealing.html)Boruta
(more: https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf)spFSR
= simultaneous perturbation stochastic approximation (SPSA-FSR) (more: https://arxiv.org/abs/1804.05589)varSelRF
= using the out-of-bag error as minimization criterion, carry out variable elimination from random forest, by successively eliminating the least important variables (with importance as returned from random forest). (more: https://www.ncbi.nlm.nih.gov/pubmed/16398926)WxNet
= a neural network-based feature selection algorithm for transcriptomic data (more: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642261/)Step
= backward stepwise method of feature selection based on logistic regression (GLM, family = binomial) using AIC criteria (stepAIC) and functions from My.stepwise package (https://cran.r-project.org/web/packages/My.stepwise/index.html)Notes about methods:
shiny::includeHTML("methods.html")
The miRNAselector
functions saves all output files to temp/
directory. As users may want to run multiple selection methods in different configurations, we do not recommend using the return of this function in the following steps. Instead, we provide ks.merge_formulas()
which conviniently summerizes the results of feature selection. We can do:
selected_sets_of_miRNAs = ks.merge_formulas(max_miRNAs = 11) # we filter out sets with more than 11 miRNAs. selected_sets_of_miRNAs_with_own = ks.merge_formulas(max_miRNAs = 11, add = list("my_own_signature" = c("hsa.miR.192.5p","hsa.let.7g.5p","hsa.let.7a.5p","hsa.let.7d.5p","hsa.miR.194.5p", "hsa.miR.98.5p", "hsa.let.7f.5p", "hsa.miR.26b.5p"))) # you can also add your own signature (for example selected from literature)
Note that:
featureselection_formulas_all.RDS
- contains the formulas for all selection methods, (2) featureselection_formulas_final.RDS
- contains methods that selected smaller or equal number of microRNAs than specified in max_miRNA
parameter, as well as fcsig
and cfs_sig
.fcsig
and cfs_sig
methods are always retained in the final formulas
set (ignoring max_miRNA
parameter) - they are commonly used as benchmark comparator for the final set of miRNAs. Those sets can be manually removed (if needed) from final selection.*.csv
files.Let's analyze the process of feature selection:
all_sets = readRDS("featureselection_formulas_all.RDS") length(all_sets) # How many feature selection methods completed in time? final_sets = readRDS("featureselection_formulas_final.RDS") length(final_sets) # How many feature selection methods completed in time and fulfilled max_miRNA criteria? (remember about fcsig and cfs_sig) featureselection_formulas_final = fread("featureselection_formulas_final.csv") ks.table(featureselection_formulas_final) # show information about selected formulas
Note that my_own_signture
has 0
miRNAs according to the table. This trick is done to make sure that signatures added manually will not be filtered out based on max_miRNA
parameter.
Now, we will summarize number of microRNAs selected by methods implemented in ks.miRNAselector
by creating histogram and calculating some descriptive statistics.
hist(featureselection_formulas_final$ile_miRNA[-which(featureselection_formulas_final$ile_miRNA == 0)], breaks = ncol(train), main = "Number of selected microRNAs distribution", xlab = "Number of selected microRNAs" ) # Histogram showing how many miRNAs were selected in final set. psych::describe(featureselection_formulas_final$ile_miRNA[-which(featureselection_formulas_final$ile_miRNA == 0)]) # Descriptive statistics of how many features where selected in the final set.
In the next step of looking for the best microRNA signature, we perform benchmarking. This tests all the signatures using different classifier architectures. Here is the example of benchmark with default parameters:
readLines("Tutorial_benchmark.R") %>% paste0(collapse="\n") %>% cat
Just for rendering this tutorial we will use a very simple benchmark using only logistic regression and conditional trees.
library(miRNAselector) miRNAselector_tutorial_balanced_benchmark = ks.benchmark(search_iters = 5, # 5 random hyperparameter sets will be checked; 5 is set here for speed purposes.. for real projects use more, like 5000... algorithms = c("ctree"), # just add ctree, note that logistic regression (glm) is always included output_file = paste0("benchmark.csv")) # the main output # exemplary benchmark data can be loaded using data('miRNAselector_tutorial_balanced_benchmark')
As benchmarking is done, the main result file is saved in file specified in output_file
parameter. It contains the performance metrics of signatures satisfying initial criteria (e.g. max_miRNA
) across different methods of modelling. Let's take a quick look:
ks.table(fread("benchmark.csv"))
Description of columns:
method
- accronym for methodSMOTE
- if balancing using SMOTE or ROSE was used in training dataset.miRy
- formula used (miRNAs selected).*_modelname
- the name of .RDS
file placed in models/
directory, containing the caret
final model that can be used for prediction of new cases. This allows reproducibility of the results. For example glm_
prefix is set according to the method, glm
= logistic regression.*_train_ROCAUC
- area under the ROC curve (AUC ROC) on training dataset, indicating general potential of the model.*_train_ROCAUC_lower95CI
- lower boundery of 95% confidence interval for AUC ROC.*_train_ROCAUC_upper95CI
- upper boundery of 95% confidence interval for AUC ROC.*_train_Accuracy
- accuracy on training set.*_train_Sensitivity
- sensitivity on training set.*_train_Specificity
- sensitivity on training set.By this logic every parameter is also calculated from testing (_test_
) and validation (_valid_
) set. If the method generated a probability, a default cutoff is used for all of the predictions.
Let's see the general performance (accuracy) of methods in the benchmark:
metody = ks.get_benchmark_methods("benchmark.csv") # gets the methods used in benchmark par(mfrow = c(2,2)) for(i in 1:length(metody)){ temp = ks.get_benchmark("benchmark.csv") # loads benchmark temp2 = dplyr::select(temp, starts_with(paste0(metody[i],"_"))) boxplot(temp[,paste0(metody[i],"_train_Accuracy")], temp[,paste0(metody[i],"_test_Accuracy")], temp[,paste0(metody[i],"_valid_Accuracy")], main = paste0("Method: ", metody[i]), names = c("Training","Testing","Validation"), ylab = "Accuracy", ylim = c(0.5,1)) tempids = c(match(paste0(metody[i],"_train_Accuracy"), colnames(temp)), match(paste0(metody[i],"_test_Accuracy"), colnames(temp)), match(paste0(metody[i],"_valid_Accuracy"), colnames(temp))) } par(mfrow = c(1,1))
With miRNAselector
package, the final optimal feature signature can be selected in 3 ways:
1. Picking the signture which achived the best accuracy in training, testing and validation: (metaindex = mean of all 3 accuracy metrics)
acc1 = ks.best_signiture_proposals(benchmark_csv = "benchmark.csv", without_train = F) # generates the benchmark sorted by metaindex best_signatures = acc1[1:3,] # get top 3 methods ks.table(best_signatures[,c("metaindex","method","miRy")])
2. Picking the signture which achived the best accuracy in testing and validation: (metaindex = mean of 2 accuracy metrics)
acc1 = ks.best_signiture_proposals(benchmark_csv = "benchmark.csv", without_train = T) # generates the benchmark sorted by metaindex best_signatures = acc1[1:3,] # get top 3 methods ks.table(best_signatures[,c("metaindex","method","miRy")])
3. The signture which achived the best sensitivity and specificity in validation: (metaindex = mean of sensivitiy and specificity in validation dataset)
acc = ks.best_signiture_proposals_meta11(benchmark_csv = "benchmark.csv") # generates the benchmark sorted by metaindex best_signatures = acc[1:3,] # get top 3 methods ks.table(best_signatures[,c("metaindex","method","miRy")])
It is a good practice to assess learning performance for considered signatures across different classification methods. Here, we visualize the over/underfitting of selected methods by comparing accuracy score between testing and validation sets for 3 top scoring signatures across selection of model architectures.
for(i in 1:length(metody)) { suppressMessages(library(PairedData)) suppressMessages(library(profileR)) pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_train_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_test_Accuracy")])) colnames(pd) = c("Train Accuracy","Test Accuracy") plot2 = ks.profileplot(pd, Method.id = acc$method[1:3], standardize = F) pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_train_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_valid_Accuracy")])) colnames(pd) = c("Train Accuracy","Valid Accuracy") plot3 = ks.profileplot(pd, Method.id = acc$method[1:3], standardize = F) pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_test_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_valid_Accuracy")])) colnames(pd) = c("Test Accuracy","Valid Accuracy") plot4 = ks.profileplot(pd, Method.id = acc$method[1:3], standardize = F) require(gridExtra) grid.arrange(arrangeGrob(plot2, plot3, ncol=2, nrow = 1, top=metody[i])) grid.arrange(arrangeGrob(plot4, ncol=1, nrow = 1, top=metody[i])) }
The relationship betweend accuracy on testing and validation sets can also be pictured as scatterplot:
acc2 = acc[1:6,] # get top 6 methods accmelt = melt(acc2, id.vars = "method") %>% filter(variable != "metaindex") %>% filter(variable != "miRy") accmelt = cbind(accmelt, strsplit2(accmelt$variable, "_")) acctest = accmelt$value[accmelt$`2` == "test"] accvalid = accmelt$value[accmelt$`2` == "valid"] accmeth = accmelt$method[accmelt$`2` == "test"] unique(accmeth) plot5 = ggplot(, aes(x = as.numeric(acctest), y = as.numeric(accvalid), shape = accmeth)) + geom_point() + scale_x_continuous(name="Accuracy on test set", limits=c(0.5, 1)) + scale_y_continuous(name="Accuracy on validation set", limits=c(0.5, 1)) + theme_bw() grid.arrange(arrangeGrob(plot5, ncol=1, nrow = 1))
Suppose we decide to use sensitivity and specificity in validation set (e.g. in the model architectures we plan to use in our research) as decisive scoring metric for feature signature. The three best signatures may be:
ks.table(best_signatures[1:3,2:4])
To get the miRNAs from formula you can use ks.get_miRNAs_from_benchmark
.
selected_miRNAs = ks.get_miRNAs_from_benchmark(benchmark_csv = "benchmark.csv", best_signatures$method[1]) # for the best performing signiture gsub("\\.", "-", selected_miRNAs) # R doesn't like hyphens, but we can introduce them easly
As a double check, we can inspect the differential expression of miRNAs constituating selected signature:
best_de = ks.best_signiture_de(selected_miRNAs) ks.table(best_de)
Let's visualize the performance of those methods using barplots:
for(i in 1:3){ cat(paste0("\n\n## ", acc$method[i],"\n\n")) par(mfrow = c(1,2)) acc = ks.best_signiture_proposals_meta11("benchmark.csv") metody = ks.get_benchmark_methods("benchmark.csv") ktory_set = match(acc$method[i], ks.get_benchmark("benchmark.csv")$method) #do_ktorej_kolumny = which(colnames(acc) == "metaindex") #barplot(as.numeric(acc[i,1:do_ktorej_kolumny])) for(ii in 1:length(metody)) { temp = ks.get_benchmark("benchmark.csv") %>% dplyr::select(starts_with(paste0(metody[ii],"_t")),starts_with(paste0(metody[ii],"_v"))) ROCtext = paste0("Training AUC ROC: ", round(temp[ktory_set,1],2), " (95%CI: ", round(temp[ktory_set,2],2), "-", round(temp[ktory_set,3],2), ")") temp = temp[,-c(1:3)] temp2 = as.numeric(temp[ktory_set,]) temp3 = matrix(temp2, nrow = 3, byrow = T) colnames(temp3) = c("Accuracy","Sensitivity","Specificity") rownames(temp3) = c("Training","Testing","Validation") temp3 = t(temp3) plot1 = barplot(temp3, beside=T, ylim = c(0,1), xlab = paste0(ROCtext,"\nBlack - accuracy, blue - sensitivity, green - specificity"), width = 0.85, col=c("black", "blue", "green"), legend = F, args.legend = list(x="topright", bty = "n", inset=c(0, -0.25)), cex.lab=0.7, main = paste0(acc$method[i], " - ", metody[ii]), font.lab=2) ## Add text at top of bars text(x = plot1, y = as.numeric(temp3), label = paste0(round(as.numeric(temp[ktory_set,])*100,1),"%"), pos = 3, cex = 0.6, col = "red") } par(mfrow = c(1,1)) }
Finally, we can assess the overlap of top 3 feature selection methods:
overlap = ks.miRNA_signiture_overlap(acc$method[1:3], "benchmark.csv")
Which miRNAs are common for between feature selection methods?
attr(overlap,"intersections")
Let's draw vulcano plot and mark the miRNAs selected in best signature:
ks.vulcano_plot(selected_miRNAs = de$miR, DE = de, only_label = selected_miRNAs)
Let's draw heatmap for selected miRNAs in whole dataset (training, testing and validation set).
ks.heatmap(x = dplyr::select(mixed, gsub("\\.", "-", selected_miRNAs)), rlab = data.frame(Class = mixed$Class, Mix = mixed$mix), zscore = F, margins = c(10,10))
ks.heatmap(x = dplyr::select(mixed, gsub("\\.", "-", selected_miRNAs)), rlab = data.frame(Class = mixed$Class, Mix = mixed$mix), zscore = T, margins = c(10,10))
Based on everything we have done so far, we suggest using the following signiture in further validation of biomarker study.
cat(paste0(gsub("\\.", "-", selected_miRNAs), collapse = ", "))
session_info()
packageDescription("miRNAselector")
To render this tutorial we used:
render("Tutorial.Rmd", output_file = "Tutorial.html", output_dir = "../inst/doc/")
Packages installed in our docker enviorment:
ks.table(as.data.frame(installed.packages()))
Clean the temporary and model files (as the results of this tutorial are simplified and we do not need them).
unlink("temp", recursive=TRUE) unlink("models", recursive=TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.