knitr::opts_chunk$set(
  collapse = TRUE, message=FALSE, warning=FALSE,
  comment = "#>"
)
knitr::opts_chunk$set(fig.width=12, fig.height=8)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=150),tidy=TRUE)
options(rgl.useNULL = TRUE)
options(warn=-1)
suppressMessages(library(dplyr))
set.seed(1)
options(knitr.table.format = "html")
library(miRNAselector)

Setup script

The miRNAselector package has many dependencies that are necessary to run all the experiments. The script below installs most of them, and we highly recommend installing these packages using this code.

If you are using our Docker environment to learn this package, you do not need to run this script again; it would be a waste of time. However, if you wish, you can tweak and play with this code in the interactive notebook. Click here.

Setup script:

readLines("https://raw.githubusercontent.com/kstawiski/miRNAselector/master/vignettes/setup.R") %>% paste0(collapse="\n") %>% cat

This code does not cover the installation of mxnet, which can be used for benchmarking the selected miRNA sets. mxnet is, however, already built and installed in our Docker environment.

Once the miRNAselector package is installed, we can load it easily.

library(miRNAselector)

Getting the data

To present the package functionality, we will use the pan-cancer data from TCGA (https://portal.gdc.cancer.gov/repository). The two commands below download and store all the necessary data in your working directory. Please note that this process may take some time (depending on your network connection), but it has to be performed only once.

ks.download_tissue_miRNA_data_from_TCGA()
ks.process_tissue_miRNA_TCGA(remove_miRNAs_with_null_var = T)

These functions produce two files: tissue_miRNA_counts.csv and tissue_miRNA_logtpm.csv. The first contains metadata and raw counts as declared in TCGA; the second contains log-transformed transcripts-per-million (TPM) counts. Let's load the counts data and see a summary.

suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(knitr)))
data("orginal_TCGA_data")
ks.table(table(orginal_TCGA_data$primary_site, orginal_TCGA_data$sample_type))
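
If you have run the two download and processing commands above yourself, the produced files can also be read directly from the working directory (a minimal sketch; the file names are those stated above):

suppressMessages(library(data.table))
counts = fread("tissue_miRNA_counts.csv")   # metadata and raw counts as declared in TCGA
logtpm = fread("tissue_miRNA_logtpm.csv")   # metadata and log-transformed TPM counts
dim(counts)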

Let's consider the following exemplary problem.

We want to find the set of miRNAs most specific to pancreatic cancer. We see that there are 178 pancreatic cancer miRNA-seq cases and only 4 solid tissue normal cases. However, we have multiple normal tissue miRNA-seq results from other projects that can be incorporated into the analysis. Let's filter and label the samples of interest.

suppressWarnings(suppressMessages(library(dplyr)))

cancer_cases = filter(orginal_TCGA_data, primary_site == "Pancreas" & sample_type == "PrimaryTumor")
control_cases = filter(orginal_TCGA_data, sample_type == "SolidTissueNormal")

The pipeline requires the variable Class with two levels (Cancer and Control) to be present in the dataset.

cancer_cases$Class = "Cancer"
control_cases$Class = "Control"

dataset = rbind(cancer_cases, control_cases)

ks.table(table(dataset$Class), col.names = c("Class","Number of cases"))
boxplot(dataset$age_at_diagnosis ~ dataset$Class)
t.test(dataset$age_at_diagnosis ~ dataset$Class)
ks.table(table(dataset$gender.x, dataset$Class))
chisq.test(dataset$gender.x, dataset$Class)

There is a statistically significant difference in age between the classes, whereas gender was not associated with class. In order to increase feature selection performance, we will try to balance the sets by performing propensity score matching.

old_dataset = dataset # backup
dataset = dataset[grepl("Adenocarcinomas", dataset$disease_type),]
match_by = c("age_at_diagnosis","gender.x")
tempdane = dplyr::select(dataset, match_by)
tempdane$Class = ifelse(dataset$Class == "Cancer", TRUE, FALSE)
suppressMessages(library(mice))
suppressMessages(library(MatchIt))
temp1 = mice(tempdane, m=1)
temp2 = temp1$data
temp3 = mice::complete(temp1)
temp3 = temp3[complete.cases(temp3),]
tempform = ks.create_miRNA_formula(match_by)
mod_match <- matchit(tempform, data = temp3)
newdata = match.data(mod_match)
dataset = dataset[as.numeric(rownames(newdata)),]

Let's check if there is any imbalance in the new, transformed dataset.

boxplot(dataset$age_at_diagnosis ~ dataset$Class)
t.test(dataset$age_at_diagnosis ~ dataset$Class)
ks.table(table(dataset$gender.x, dataset$Class))
chisq.test(dataset$gender.x, dataset$Class)
fwrite(dataset, "balanced_dataset.csv.gz")
miRNAselector_tutorial_balanced_dataset = dataset # can be used by data("miRNAselector_tutorial_balanced_dataset")

There are no significant differences in age or gender composition between the classes now. We can proceed to standard filtering, log-transformation and TPM normalization.

First, however, in order to stay consistent between different datasets, we need to standardize microRNA names. ks.correct_miRNA_names() unifies the miRNA names across different versions of miRBase.

dataset = ks.correct_miRNA_names(dataset) # Correct miRNA names based on the aliases. Useful when analyzing old datasets - to keep the results coherent with current knowledge.
danex = dplyr::select(dataset, starts_with("hsa")) # Create data.frame or matrix with miRNA counts with miRNAs in columns and cases in rows.
metadane = dplyr::select(dataset, -starts_with("hsa")) # Metadata with 'Class' variables.
ks.table(table(metadane$Class)) # Let's be sure that 'Class' variable is correct and contains only 'Cancer' and 'Control' cases.
ttpm = ks.counts_to_log10tpm(danex, metadane, ids = metadane$sample,
                                 filtr = T, filtr_minimalcounts = 100, filtr_howmany = 1/3) # We will keep only the miRNAs that appeared with at least 100 counts in 1/3 of cases.

You might have noticed that we filtered out microRNAs that did not have at least 100 counts in 1/3 of samples. This was done to ensure that microRNAs selected as classifier features (possibly used in the clinic) would also be detectable by other, cheaper methods, such as qPCR. After filtering, 166 miRNAs are left.
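
For clarity, the filtering criterion above corresponds roughly to the following sketch (assuming danex holds the raw counts with miRNAs in columns, as created above; the actual filtering and TPM transformation are handled by ks.counts_to_log10tpm):

keep = colSums(danex >= 100, na.rm = TRUE) >= nrow(danex) / 3 # at least 100 counts in at least 1/3 of cases
sum(keep) # number of miRNAs that would pass the filter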

In the next step we will divide the dataset into training, testing and validation datasets. We strongly believe that hold-out validation is the most robust validation method, and although miRNAselector supports cross-validation, hold-out validation is the default in most cases. Thus, the rest of the analysis depends on the existence of 3 separate datasets: training, testing and validation.

The best signature (the best set of miRNAs for a diagnostic test) can be selected based on all 3 datasets, 2 datasets or only the validation set. The process of best signature selection is discussed below.

The split can be prepared manually by the user (the pipeline expects to find mixed_*.csv files in the working directory; see the sketch after the next code chunk) or, more conveniently, using ks.prepare_split(). Let's do it now.

mixed = ks.prepare_split(metadane = metadane, ttpm = ttpm, train_proc = 0.6)
miRNAselector_tutorial_balanced_mixed = mixed # can be used by data("miRNAselector_tutorial_balanced_mixed")
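
If you prefer to prepare the split manually instead, a minimal sketch could look like the code below (an assumption: the mixed_*.csv files keep the same layout that ks.prepare_split() produces, i.e. metadata including Class plus the log-TPM miRNA columns):

whole = cbind(metadane, ttpm)
set.seed(1)
idx = sample(seq_len(nrow(whole)))                       # shuffle the cases
n_train = floor(0.6 * nrow(whole))                       # 60% training, 20% testing, 20% validation
n_test = floor(0.2 * nrow(whole))
fwrite(whole[idx[1:n_train], ], "mixed_train.csv")
fwrite(whole[idx[(n_train + 1):(n_train + n_test)], ], "mixed_test.csv")
fwrite(whole[idx[(n_train + n_test + 1):nrow(whole)], ], "mixed_valid.csv")

Note that, unlike ks.prepare_split(), this naive random split does not guarantee class balance across the partitions.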

Let's see a split summary.

mixed = fread("mixed.csv")
ks.table(table(mixed$Class, mixed$mix))
ks.table(cbind(mixed[1:10,c(100:105)], Class = mixed[1:10,"Class"]))

We can see that the dataset was divided in a balanced way. Now we are ready to move on to the analysis.

Basic exploratory analysis

In biomarker studies we rely on validation. We perform hold-out validation, so the signature selection has to be based on the training dataset only. Including the testing and validation datasets in the exploratory analysis could lead to bias ('data leakage'). In the following section we show how to use our package to perform a quick exploratory analysis of miRNA-seq data.

dane = ks.load_datamix(use_smote_not_rose = T) # load mixed_*.csv files
train = dane[[1]]; test = dane[[2]]; valid = dane[[3]]; train_smoted = dane[[4]]; trainx = dane[[5]]; trainx_smoted = dane[[6]] # get the objects from list to make the code more readable.

The ks.load_datamix() function loads the data created in the preparation phase. It requires the output constructed by ks.prepare_split() to be placed in the working directory ('wd'), i.e. the files 'mixed_train.csv', 'mixed_test.csv' and 'mixed_valid.csv' have to exist in that directory. If you have split the data manually, there may be some imbalance of classes in the training, testing or validation datasets. If so, ks.load_datamix() performs balancing using one of the methods below (a brief usage sketch follows the list):

  1. ROSE (default): https://journal.r-project.org/archive/2014/RJ-2014-008/RJ-2014-008.pdf - by default we generate 10 * the number of cases in the original dataset.
  2. SMOTE: https://arxiv.org/abs/1106.1813 - by default we use 'perc.under=100' and 'k=10'.
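
For example, switching between the two balancing methods is controlled by the use_smote_not_rose flag already used above (a usage sketch):

dane_rose = ks.load_datamix(use_smote_not_rose = F)  # balance with ROSE (the default method)
dane_smote = ks.load_datamix(use_smote_not_rose = T) # balance with SMOTE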

At the beginning of the analysis we usually perform principal component analysis (PCA) to assess any batch effect, spot possible outliers and get a general understanding of the miRNA profile. The package can construct a 2-dimensional biplot and a 3-dimensional interactive scatterplot based on the computed components.

pca = ks.PCA(trainx, train$Class)
pca

The 3D PCA plot may not be shown correctly in a Jupyter notebook, but you can check in e.g. RStudio that it works.

if(is.null(sessionInfo()$loadedOnly$IRdisplay)) { # if not in Jupyter; running ks.PCA_3D in the learning/editing Jupyter environment may cause: *** caught segfault *** address 0x1, cause 'memory not mapped'
pca3d = ks.PCA_3D(trainx, train$Class)
pca3d }

Now we can also correct the batch effect, if there is any. For example, you can use ks.combat() to do so; however, a detailed demonstration is outside the scope of this tutorial.

Usually, the next step in the exploratory analysis is differential expression analysis. Differential expression in our package is biomarker-discovery oriented, thus it uses a t-test with correction for multiple comparisons. The following table shows significantly differentially expressed miRNAs after Benjamini-Hochberg correction.

de = ks.miRNA_differential_expression(trainx, train$Class)
sig_de = de %>% dplyr::filter(`p-value BH` <= 0.05) %>% dplyr::arrange(`p-value BH`) # leave only significant after Benjamini-Hochberg procedure and sort by ascending p-value
ks.table(sig_de) 
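
Conceptually, this differential expression step boils down to something like the sketch below (an illustration only; the exact implementation inside ks.miRNA_differential_expression may differ), assuming trainx holds the log-TPM values with miRNAs in columns:

pvals = apply(trainx, 2, function(x) t.test(x[train$Class == "Cancer"], x[train$Class == "Control"])$p.value) # per-miRNA t-test
p_bh = p.adjust(pvals, method = "BH") # Benjamini-Hochberg correction
head(sort(p_bh)) # the most significant miRNAs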

Let's visualize the results of differential expression using a heatmap and a volcano plot.

ks.heatmap(x = dplyr::select(trainx, sig_de$miR),
           rlab = data.frame(Class = train$Class),
           zscore = F, margins = c(10,10))

Z-scoring the values before clustering and plotting will help us to gain more insight.

ks.heatmap(x = dplyr::select(trainx, sig_de$miR),
           rlab = data.frame(Class = train$Class),
           zscore = T, margins = c(10,10))
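
For reference, the z-scoring applied above is simply a per-miRNA (per-column) standardization to mean 0 and standard deviation 1, roughly equivalent to (a sketch):

z = scale(dplyr::select(trainx, sig_de$miR)) # center and scale each significant miRNA before clustering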

We will also create a volcano plot and label the 10 most significant miRNAs:

ks.vulcano_plot(selected_miRNAs = de$miR, DE = de, only_label = sig_de$miR[1:10])

We may also want to check the consistency of differential expression between the datasets:

de_test = ks.miRNA_differential_expression(dplyr::select(test, starts_with("hsa")), test$Class)
de_valid = ks.miRNA_differential_expression(dplyr::select(valid, starts_with("hsa")), valid$Class)
ks.correlation_plot(de$log2FC, de_test$log2FC, "log2FC on training set", "log2FC on test set", "", yx = T)
ks.correlation_plot(de$log2FC, de_valid$log2FC, "log2FC on training set", "log2FC on validation set", "", yx = T)
ks.correlation_plot(de_test$log2FC, de_valid$log2FC, "log2FC on test set", "log2FC on validation set", "", yx = T)

miRNA selection

The main feature of this package is the shotgun-like feature selection, i.e. the evaluation of possible miRNA signatures of biological processes. The function can be applied in a straightforward way, e.g.:

library(miRNAselector)
selected_features = ks.miRNAselector(wd = getwd(), m = c(1:3,51), max_iterations = 1, stamp = "tutorial") # For the sake of this tutorial and vignette building we will use only a few of the fastest methods. The m parameter defines which methods will be tested. See more details below.

For larger projects, however, we suggest using the following wrapper, which performs the feature selection in parallel and significantly reduces the computation time. We do not recommend using more than 5 threads, because some of the methods inherently use multicore processing:

readLines("https://raw.githubusercontent.com/kstawiski/miRNAselector/master/vignettes/Tutorial_miRNAselector.R") %>% paste0(collapse="\n") %>% cat

If you are using our environment to learn this package, you can tweak and run this code in the interactive notebook. Click here.

A few notes about what it does:

Files created for each method (e.g. for stamp=tutorial and m=1):

Pearls about the methods:

Notes about methods:

shiny::includeHTML("methods.html")

The ks.miRNAselector function saves all of its output files to the temp/ directory. As users may want to run multiple selection methods in different configurations, we do not recommend using the return value of this function in the following steps. Instead, we provide ks.merge_formulas(), which conveniently summarizes the results of feature selection. We can do:

selected_sets_of_miRNAs = ks.merge_formulas(max_miRNAs = 11) # we filter out sets with more than 11 miRNAs.
selected_sets_of_miRNAs_with_own = ks.merge_formulas(max_miRNAs = 11, 
                                                     add = list("my_own_signature" = c("hsa.miR.192.5p","hsa.let.7g.5p","hsa.let.7a.5p","hsa.let.7d.5p","hsa.miR.194.5p", "hsa.miR.98.5p", "hsa.let.7f.5p", "hsa.miR.26b.5p"))) # you can also add your own signature (for example selected from literature)

Note that:

Let's analyze the process of feature selection:

all_sets = readRDS("featureselection_formulas_all.RDS")
length(all_sets) # How many feature selection methods completed in time?
final_sets = readRDS("featureselection_formulas_final.RDS")
length(final_sets) # How many feature selection methods completed in time and fulfilled max_miRNA criteria? (remember about fcsig and cfs_sig)
featureselection_formulas_final = fread("featureselection_formulas_final.csv")
ks.table(featureselection_formulas_final) # show information about selected formulas

Note that my_own_signature has 0 miRNAs according to the table. This trick is done to make sure that manually added signatures will not be filtered out based on the max_miRNA parameter.

Now we will summarize the number of microRNAs selected by the methods implemented in ks.miRNAselector by creating a histogram and calculating some descriptive statistics.

hist(featureselection_formulas_final$ile_miRNA[-which(featureselection_formulas_final$ile_miRNA == 0)], 
     breaks = ncol(train),
     main = "Number of selected microRNAs distribution",
     xlab = "Number of selected microRNAs"
     ) # Histogram showing how many miRNAs were selected in final set.
psych::describe(featureselection_formulas_final$ile_miRNA[-which(featureselection_formulas_final$ile_miRNA == 0)]) # Descriptive statistics of how many features were selected in the final set.

Benchmarking

In the next step of looking for the best microRNA signature, we perform benchmarking. This tests all the signatures using different classifier architectures. Here is an example of a benchmark with default parameters:

readLines("Tutorial_benchmark.R") %>% paste0(collapse="\n") %>% cat

If you are using our environment to learn this package, you can tweak and run this code in the interactive notebook. Click here.

Just for rendering this tutorial we will use a very simple benchmark using only logistic regression and conditional trees.

library(miRNAselector)
miRNAselector_tutorial_balanced_benchmark = ks.benchmark(search_iters = 5, # 5 random hyperparameter sets will be checked; 5 is set here for speed purposes, for real projects use more, e.g. 5000
            algorithms = c("ctree"), # just add ctree, note that logistic regression (glm) is always included
            output_file = paste0("benchmark.csv")) # the main output
# exemplary benchmark data can be loaded using data('miRNAselector_tutorial_balanced_benchmark')

When benchmarking is done, the main result is saved in the file specified by the output_file parameter. It contains the performance metrics of the signatures satisfying the initial criteria (e.g. max_miRNA) across different modelling methods. Let's take a quick look:

ks.table(fread("benchmark.csv"))

Description of columns:

By the same logic, every metric is also calculated on the testing (_test_) and validation (_valid_) sets. If a method generates a probability, a default cutoff is used for all of the predictions.
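
To make that explicit: turning a predicted probability into a class label works along these lines (a sketch; the 0.5 cutoff is an assumption here, not necessarily the package default):

prob = c(0.12, 0.81, 0.47, 0.93) # hypothetical predicted probabilities of the Cancer class
pred = ifelse(prob >= 0.5, "Cancer", "Control") # apply the cutoff
pred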

Let's see the general performance (accuracy) of methods in the benchmark:

metody = ks.get_benchmark_methods("benchmark.csv") # gets the methods used in benchmark
par(mfrow = c(2,2))
for(i in 1:length(metody)){
    temp = ks.get_benchmark("benchmark.csv") # loads benchmark
    temp2 = dplyr::select(temp, starts_with(paste0(metody[i],"_")))
    boxplot(temp[,paste0(metody[i],"_train_Accuracy")], temp[,paste0(metody[i],"_test_Accuracy")], temp[,paste0(metody[i],"_valid_Accuracy")],
            main = paste0("Method: ", metody[i]), names = c("Training","Testing","Validation"), ylab = "Accuracy", ylim = c(0.5,1))
    tempids = c(match(paste0(metody[i],"_train_Accuracy"), colnames(temp)), match(paste0(metody[i],"_test_Accuracy"), colnames(temp)), match(paste0(metody[i],"_valid_Accuracy"), colnames(temp)))
  }
par(mfrow = c(1,1))

With miRNAselector package, the final optimal feature signature can be selected in 3 ways:

1. Picking the signature which achieved the best accuracy in training, testing and validation (metaindex = mean of all 3 accuracy metrics):

acc1 = ks.best_signiture_proposals(benchmark_csv = "benchmark.csv", without_train = F) # generates the benchmark sorted by metaindex
best_signatures = acc1[1:3,] # get top 3 methods
ks.table(best_signatures[,c("metaindex","method","miRy")])

2. Picking the signature which achieved the best accuracy in testing and validation (metaindex = mean of both accuracy metrics):

acc1 = ks.best_signiture_proposals(benchmark_csv = "benchmark.csv", without_train = T) # generates the benchmark sorted by metaindex
best_signatures = acc1[1:3,] # get top 3 methods
ks.table(best_signatures[,c("metaindex","method","miRy")])

3. Picking the signature which achieved the best sensitivity and specificity in validation (metaindex = mean of sensitivity and specificity in the validation dataset):

acc = ks.best_signiture_proposals_meta11(benchmark_csv = "benchmark.csv") # generates the benchmark sorted by metaindex
best_signatures = acc[1:3,] # get top 3 methods
ks.table(best_signatures[,c("metaindex","method","miRy")])

It is good practice to assess the learning performance of the considered signatures across different classification methods. Here, we visualize the over/underfitting of the selected methods by comparing accuracy scores between the training, testing and validation sets for the 3 top-scoring signatures across a selection of model architectures.

for(i in 1:length(metody))
  {
suppressMessages(library(PairedData))
suppressMessages(library(profileR))
pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_train_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_test_Accuracy")]))
colnames(pd) = c("Train Accuracy","Test Accuracy")
plot2 = ks.profileplot(pd, Method.id = acc$method[1:3], standardize = F)
pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_train_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_valid_Accuracy")]))
colnames(pd) = c("Train Accuracy","Valid Accuracy")
plot3 = ks.profileplot(pd, Method.id = acc$method[1:3], standardize = F)
pd = paired(as.numeric(acc[1:3,paste0(metody[i],"_test_Accuracy")]),as.numeric(acc[1:3,paste0(metody[i],"_valid_Accuracy")]))
colnames(pd) = c("Test Accuracy","Valid Accuracy")
plot4 = ks.profileplot(pd, Method.id = acc$method[1:3], standardize = F)



require(gridExtra)
grid.arrange(arrangeGrob(plot2, plot3, ncol=2, nrow = 1, top=metody[i]))
grid.arrange(arrangeGrob(plot4, ncol=1, nrow = 1, top=metody[i]))
}

The relationship between accuracy on the testing and validation sets can also be pictured as a scatterplot:

suppressMessages(library(ggplot2))
suppressMessages(library(limma)) # assumed source of strsplit2() used below
acc2 = acc[1:6,] # get top 6 methods
accmelt = melt(acc2, id.vars = "method") %>% filter(variable != "metaindex") %>% filter(variable != "miRy")
accmelt = cbind(accmelt, strsplit2(accmelt$variable, "_"))
acctest = accmelt$value[accmelt$`2` == "test"]
accvalid = accmelt$value[accmelt$`2` == "valid"]
accmeth = accmelt$method[accmelt$`2` == "test"]
unique(accmeth)
plot5 = ggplot(, aes(x = as.numeric(acctest), y = as.numeric(accvalid), shape = accmeth)) +
  geom_point() + scale_x_continuous(name="Accuracy on test set", limits=c(0.5, 1)) +
  scale_y_continuous(name="Accuracy on validation set", limits=c(0.5, 1)) +
  theme_bw()
grid.arrange(arrangeGrob(plot5, ncol=1, nrow = 1))

Best signature analysis

Suppose we decide to use sensitivity and specificity in the validation set (e.g. in the model architectures we plan to use in our research) as the decisive scoring metric for the feature signature. The three best signatures may be:

ks.table(best_signatures[1:3,2:4])

To get the miRNAs from the formula, you can use ks.get_miRNAs_from_benchmark().

selected_miRNAs = ks.get_miRNAs_from_benchmark(benchmark_csv = "benchmark.csv", best_signatures$method[1]) # for the best performing signature
gsub("\\.", "-", selected_miRNAs) # R doesn't like hyphens, but we can introduce them easily

As a double check, we can inspect the differential expression of the miRNAs constituting the selected signature:

best_de = ks.best_signiture_de(selected_miRNAs)
ks.table(best_de)

Let's visualize the performance of those methods using barplots:

for(i in 1:3){
  cat(paste0("\n\n## ", acc$method[i],"\n\n"))
  par(mfrow = c(1,2))
  acc = ks.best_signiture_proposals_meta11("benchmark.csv")
  metody = ks.get_benchmark_methods("benchmark.csv")
  ktory_set = match(acc$method[i], ks.get_benchmark("benchmark.csv")$method)
  #do_ktorej_kolumny = which(colnames(acc) == "metaindex")
  #barplot(as.numeric(acc[i,1:do_ktorej_kolumny]))
  for(ii in 1:length(metody)) {

    temp = ks.get_benchmark("benchmark.csv") %>% 
      dplyr::select(starts_with(paste0(metody[ii],"_t")),starts_with(paste0(metody[ii],"_v")))

    ROCtext = paste0("Training AUC ROC: ", round(temp[ktory_set,1],2), " (95%CI: ", round(temp[ktory_set,2],2), "-", round(temp[ktory_set,3],2), ")")

    temp = temp[,-c(1:3)]
    temp2 = as.numeric(temp[ktory_set,])
    temp3 = matrix(temp2, nrow = 3, byrow = T)
    colnames(temp3) = c("Accuracy","Sensitivity","Specificity")
    rownames(temp3) = c("Training","Testing","Validation")
    temp3 = t(temp3)

    plot1 = barplot(temp3, beside=T, ylim = c(0,1), xlab = paste0(ROCtext,"\nBlack - accuracy, blue - sensitivity, green - specificity"), width = 0.85, col=c("black", "blue", "green"), legend = F,  args.legend = list(x="topright", bty = "n", inset=c(0, -0.25)), cex.lab=0.7, main = paste0(acc$method[i], " - ", metody[ii]), font.lab=2)
    ## Add text at top of bars
    text(x = plot1, y = as.numeric(temp3), label = paste0(round(as.numeric(temp[ktory_set,])*100,1),"%"), pos = 3, cex = 0.6, col = "red")
  }
  par(mfrow = c(1,1))

}

Finally, we can assess the overlap between the top 3 feature selection methods:

overlap = ks.miRNA_signiture_overlap(acc$method[1:3], "benchmark.csv")

Which miRNAs are shared between the feature selection methods?

attr(overlap,"intersections")

Let's draw a volcano plot and mark the miRNAs selected in the best signature:

ks.vulcano_plot(selected_miRNAs = de$miR, DE = de, only_label = selected_miRNAs)

Let's draw heatmaps of the selected miRNAs in the whole dataset (training, testing and validation sets), with and without z-scoring.

ks.heatmap(x = dplyr::select(mixed, gsub("\\.", "-", selected_miRNAs)),
           rlab = data.frame(Class = mixed$Class, Mix = mixed$mix),
           zscore = F, margins = c(10,10))
ks.heatmap(x = dplyr::select(mixed, gsub("\\.", "-", selected_miRNAs)),
           rlab = data.frame(Class = mixed$Class, Mix = mixed$mix),
           zscore = T, margins = c(10,10))

Based on everything we have done so far, we suggest using the following signature in the further validation of the biomarker study.

cat(paste0(gsub("\\.", "-", selected_miRNAs), collapse = ", "))

Session

session_info()
packageDescription("miRNAselector")

To render this tutorial we used:

render("Tutorial.Rmd", output_file = "Tutorial.html", output_dir = "../inst/doc/")

Packages installed in our Docker environment:

ks.table(as.data.frame(installed.packages()))

Clean up the temporary and model files (the results of this tutorial are simplified, so we do not need them).

unlink("temp", recursive=TRUE)
unlink("models", recursive=TRUE)

