knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width=7, fig.height=6 )
iMIRAGE stands for imputed microRNA (miRNA) activity from gene expression. As the name suggests, the package imputes expression of miRNAs by constructing prediction models that only depend on the expression levels of protein-coding genes. In essence, iMIRAGE package can impute the miRNA profiles of samples where protein-coding expression data is available (for example, from microarray or RNA-seq), but do not contain reliable miRNA expression. By utilizing a training data set containing both protein-coding and miRNA expression profiles, iMIRAGE constructs prediction models using machine-learning to impute miRNA profiles of the independent (test) data set of interest.
The iMIRAGE package also provides tools to create an integrated workflow, to harmonize, clean-up, normalize and standardize the expression data sets. In addition, the package provides an option of using miRNA-target gene pair information to construct the prediction models
The iMIRAGE package for R can be downloaded from the GitHub repository:
#Download package from GitHub using devtools devtools::install_github("aritronath/iMIRAGE")
#Load the package in R library(iMIRAGE)
The iMIRAGE package includes necessary functions for pre-processing and harmonization of expression data sets, miRNA expression imputation and cross-validation analyses. In addition, TargetScan miRNA-target gene pairs are provided for constructing imputation models. The package also includes the example data sets that are used in this vignette.
Report bugs and issues here via GitHub
In this example, we will use the two independent miRNA data sets that were derived from the Cancer Genome Atlas (TCGA) Breast Cancer (BRCA) project. These data sets are automatically available when the iMIRAGE package library is loaded.
Training: GA.pcg, GA.mir
Test: HS.pcg, HS.mir
The prefix HS refers to TCGA BRCA samples for which miRNA mature strand expression were obtained using the Illumina Hiseq system.
The samples in the two data sets are mutually exclusive. The small subset included in these example data sets were randomly selected
Note: the complete TCGA BRCA HiSeq and GA data sets can be downloaded from the XENA portal
match.gex
# return a pair of matrices with matched columns (protein-coding genes that will be used as training features) temp <- match.gex(GA.pcg, HS.pcg) GA.pcg <- temp[[1]] HS.pcg <- temp[[2]]
imirage.cv
to determine the expected accuracy of imputing a miRNA of interest using the training datasets#Use **imirage.cv** with the default parameters of using K-nearest neighbors as the prediction algorithm and using 50 protein-coding genes as training features CV.miRNA <- imirage.cv(train_pcg=GA.pcg, train_mir=GA.mir, gene_index="hsa-let-7c", method="KNN", target="none") #Print the accuracy metrics from each fold of the cross-validation analysis print(CV.miRNA) #note: this example performs cross-validation analysis for 1 unique miRNA, hsa-let-7c. The name must match with one of the names in the train_mir object's column names
imirage.cv.loop
to determine the expected imputation accuracy of entire miRNA dataset with#Perform cross-validation analysis over the entire training dataset CV.full <- imirage.cv.loop(train_pcg = GA.pcg, train_mir = GA.mir, method = "KNN", target="none") #Plot some performance metrics from cross-validation analysis plot(CV.full[,1], -log10(CV.full[,2]), xlab="Spearman R", ylab="P-value (-log10)", pch=16, main="Cross-validation accuracy") abline(h=-log10(0.05), lty=3)
#Find out which miRNAs are imputed with good accuracy colnames(GA.mir)[which(CV.full[,1] > 0.5)] #arbritarily, Spearman correlation > 0.5 colnames(GA.mir)[which(CV.full[,2] < 0.05)] #P-value < 0.05
imirage
in the test dataset#Use **imirage** with default parameters Pred.miRNA <- imirage(train_pcg=GA.pcg, train_mir=GA.mir, my_pcg=HS.pcg , gene_index="hsa-let-7c", target="none") #Display the predicted miRNA expression values print(Pred.miRNA) #Compare the predicted miRNA expression values with measured expression plot(Pred.miRNA, HS.mir[,"hsa-let-7c"], xlab="Predicted", ylab="Measured", main="hsa-let-7c imputation performance", pch=16) abline(lm(HS.mir[,"hsa-let-7c"] ~ Pred.miRNA), lty=3)
imirage
to impute expression of all miRNAs available in the training dataset#Create an empty matrix to store imputed expression Pred.full <- matrix(data=NA, ncol=ncol(GA.mir), nrow=nrow(HS.pcg)) #Execute a loop to impute each miRNA using the test protein coding dataset for (i in 1:ncol(GA.mir)) { Pred.full[,i] <- imirage(train_pcg = GA.pcg, train_mir = GA.mir, my_pcg = HS.pcg, gene_index = i, method="KNN", target="none") }
Find out how well can we impute miRNA expression in the independent data set
#Obtain correlation coefficients between imputed and measured miRNA expression Pred.Cors <- array(dim=ncol(GA.mir)) for (i in 1:ncol(GA.mir)) { Pred.Cors[i] <- cor(HS.mir[,i], Pred.full[,i], method="spearman") } #Plot imputation correlations in comparison to cross-validation results plot(Pred.Cors, CV.full[,1], xlab="Imputation accuracy", ylab="Cross-validation accuracy", main="Imputation performance", pch=16)
The principle behind iMIRAGE can be explained in a simplified form by stating the problem in the form of a linear model. Let us assume the user wants to obtain miRNA expression using their protein-coding data set which we will call test. To impute the miRNA expression using the test protein-coding data set, we first determine the mathematical relation between the expression of a set of miRNAs Y with the set of protein-coding gene expression X of the training data set, where p indicates the number of protein-coding genes used in the model. This number can range anywhere from 1 to P, where, P = total number of protein-coding genes available in both the training and test data sets.
$Y^{Training} = \sum_{i=1}^{p} \beta X_{i}^{Training} + \epsilon$
The objective of iMIRAGE is to determine $\beta$ so that it can be harnessed to predict the expression of miRNA Y by applying the model to the test protein-coding data set:
$Y^{Predicted} = \sum_{i=1}^{p} \beta X_{i}^{Test}$
(Note that the machine-learning algorithms that are used iMIRAGE are not necessarily based on the assumption of linearity. This is only for simplified illustration purposes)
iMIRAGE assumes that not all protein-coding genes are meaningful for constructing imputation models. Therefore, it suggests using a subset of informative protein-coding genes for each miRNA. These informative features are automatically determined by calculating the correlation between the expression of all features with miRNA expression in the training data sets. Then, only a user-specified subset of the top correlated features are retained for constructing the models (by default 50).
Training data: protein-coding and miRNA expression data. In the package documentation, these data sets are referred by their alias train_pcg and train_mir respectively
Protein-coding gene expression data from the independent samples of interest. In the package documentation, this data set is referred by its alias my_pcg
Ideally, the training dataset should be of the same tissue-type as the independent test dataset. However, large datasets spanning multiple tissues can also be used. For example, the pan-cancer TCGA cohort covering more than 30 tissue-types can be used as training data to predict miRNA profiles of various individual cancers.
Organizing your data before use:
```
library(biomaRt) ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") old.ids <- colnames(my_pcg) #RefSeq IDs gene.annot <- getBM(attributes=c('refseq_mrna','ensembl_gene_id'), filters = 'refseq_mrna', values = as.character(old.ids), mart = ensembl)
``` See the biomaRt documentation for complete details and instructions
We will use the unprocessed TCGA BRCA data sets to illustrate a typical workflow. The data set used in the subsequent examples can be downloaded from the Open Science Framework repository by following the following URL: https://osf.io/s5rbn/download
After downloading the TCGA_BRCA_Datasets.RData file in the current working directory, load the contents using
load("TCGA_BRCA_Datasets.RData")
The following four matrices should appear after the object is loaded: ga.gex, ga.mirna, hiseq.gex, hiseq.mirna
filter.exp
to remove gene or miRNAs that are not expressed in most samples using (optional)We first remove gene or miRNAs that are not expressed in most samples, as these genes are likely not measured due to technical limitations and may not be reliable expression estimates. Ultimately, it is up to the user to decide whether they would like to include the sparse genes or miRNA in their analyses.
#Here, we filter the miRNA datasets to retain all miRNAs that are expressed above a level of 0 in atleast 75% of the samples ga.mirna <- filter.exp(ga.mirna, cutoff=75, threshold = 0) hiseq.mirna <- filter.exp(hiseq.mirna, cutoff=75, threshold = 0)
match.gex
to match training and test protein-coding expression matrices (required)Here, we match the two protein-coding expression matrices for subsequent processing
#We also will keep the unprocessed hiseq.gex.1 and ga.gex.1 datasets for subsquent comparisons temp <- match.gex(hiseq.gex, ga.gex) hiseq.gex.1 <- temp[[1]] ga.gex.1 <- temp[[2]] #For subsquent prediction analyses, we will also select miRNAs are common to both datasets temp <- match.gex(ga.mirna, hiseq.mirna) ga.mirna.1 <- temp[[1]] hiseq.mirna.1 <- temp[[2]]
pre.process
training and test datasets (optional, recommended)In this step, we process the training and test data sets using pre.process
to make sure they are transformed, normalized and standardized before proceeding with further analysis. These steps tend to have a significant impact on subsequent analyses.
Generally, the user should determine whether their data has been transformed ($\log2$ or $\log2(x+1)$) or normalized from the original source. In most cases, microarray data sets in public domain, such as NCBI GEO or ArrayExpress, are already transformed and normalized. However, several RNA-seq data sets may be available prior to transformation and normalization. In these instances, it is advisable to perform log transformation and upper-quantile normalization prior to use in the workflow.
In addition, the variance filter removes genes with zero variance that will offer no predictive power. Finally, the data is scaled to a mean = 0 and standard deviation = 1. These are bare-minimum standardization techniques which can be skipped if the user has already pre-processed their data using other approaches.
ga.gex.2 <- pre.process(ga.gex.1, log = TRUE, var.filter = TRUE, UQ = TRUE, std = TRUE) hiseq.gex.2 <- pre.process(hiseq.gex.1, log = TRUE, var.filter = TRUE, UQ = TRUE, std = TRUE) #This is how the data looks before and after pre.processing (showing the first 10 samples from each dataset) par(mfrow=c(2,2), cex=0.75) boxplot(t(ga.gex.1[1:10,]), main="GA - raw expression") boxplot(t(ga.gex.2[1:10,]), main="GA - processed expression") boxplot(t(hiseq.gex.1[1:10,]), main="Hiseq - raw expression") boxplot(t(hiseq.gex.2[1:10,]), main="Hiseq - processed expression")
Here, we will first find out which miRNAs can be imputed with good accuracy using the GA data set as our training data set. The imirage.cv
function performs a 10-fold cross-validation analysis for a single miRNA specified by the gene_index argument, as shown in the quick start guide above. In addition, the imirage.cv.loop
wrapper can be used to perform the cross-validation analysis on the entire training miRNA data set.
The cross-validation analysis can be performed using one of the three methods: K-nearest neighbor regression (KNN), Random Forests (RF) or Support Vector Machines (SVM). These can be specified using the method argument.
For further details on the arguments that can be passed on the machine-learning methods, please see the documentation for randomForest, e0171 and FNN packages.
By default, the number of training features that are used by each method are set at 50. This can adjusted by the user using the num argument to achieve a balance between desired imputation accuracy and computational costs.
Additionally, the "K"" or number of cross-validation iterations can be set using folds, which set at 10 by default.
Here, we perform a 10-fold cross-validation analysis using the entire GA training data sets using KNN method. We perform the analysis with both the processed and unprocessed GA protein-coding data for comparison
``` {r results="hide"}
CV.ga.gex1 <- imirage.cv.loop(train_pcg = ga.gex.1, train_mir = ga.mirna.1, method = "KNN", target="none")
CV.ga.gex2 <- imirage.cv.loop(train_pcg = ga.gex.2, train_mir = ga.mirna.1, method = "KNN", target="none")
CV.hs.gex1 <- imirage.cv.loop(train_pcg = hiseq.gex.1, train_mir = hiseq.mirna.1, method = "KNN", target="none")
CV.hs.gex2 <- imirage.cv.loop(train_pcg = hiseq.gex.2, train_mir = hiseq.mirna.1, method = "KNN", target="none")
boxplot(CV.ga.gex1[,1], CV.ga.gex2[,1], names=c("Raw", "Processed"), main="GA CV accuracy")
boxplot(CV.hs.gex1[,1], CV.hs.gex2[,1], names=c("Raw", "Processed"), main="Hiseq CV accuracy")
Subsequently, the user can either select the miRNAs with good performance metrics in cross-validation analysis or perform the imputation using the full training data set with `imirage`. Generally, poor cross-validation metrics are a good indicator that the miRNA can be safely excluded from the prediction analysis in the independent test data set. ---- ### Session Info ```r sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.