Introduction

miRDriver is a computational tool that infers copy number aberration-based miRNA-gene interactions in cancer utilizing multi-omics datasets. miRDriver has four computational steps. In the first step, it finds frequently aberrated copy number regions of cancer samples utilizing the GISTIC2.0 tool (https://cloud.genepattern.org/). In the second step, for each GISTIC region miRDriver computes differentially expressed (DE) genes between frequently aberrated and non-aberrated sample groups. In the third step, miRDriver retrieves DE genes and miRNAs that reside in the aberrated regions (i.e., cis genes and cis miRNAs) and retrieves DE genes that are outside of aberrated regions (i.e., trans genes). In the last step, miRDriver employs a LASSO-based regression model to compute potential miRNA regulators of the trans genes.

library(miRDriver)

Example Datasets

The example datasets are provided to run all four steps of miRDriver and can be accessed using data("miRDriverData") after the package installation. This includes genomic data (gene expression, copy number variation, transcription factor) and epigenomic data (methylation). The gene expression data can be RNASeq or Microarray with proper normalization, although the code has not been tested on Microarray data. The first step of miRDriver requires segmented copy-number data. The Lasso step of miRDriver requires gene-centric copy number data. The methylation data needs to be gene-centric beta values. For further details on how to preprocess the datasets, one should refer to the method paper [Bose, B. & Bozdag, S. miRDriver: A Tool to Infer Copy Number Derived miRNA-Gene Networks in Cancer. in Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 366–375 (Association for Computing Machinery, 2019). doi:10.1145/3307339.3342172.]. The example datasets included in this package are dummy datasets that are used to run the code in the vignette, but not to produce meaningful results. Although the example datasets in the vignette contain ENSEMBL IDs for the genes, this tool can be run with any gene id format, such as HGNC. We downloaded and pre-processed the original multi-omics datasets from TCGA (The Cancer Genome Atlas) database for eighteen different cancer types. Our pre-processed TCGA datasets can be accessed from Figshare via https://figshare.com/s/7400ad8445b2e78e4636.

mirna_bedfile

This is an R dataframe object where the data represents the genomic position of the miRNAs. The rows in this dataset are the chromosomal regions that start with "chr" followed by chromosome numbers as integers 1-22. The columns are “chromosome number”, “start position”, “end position”, and "miRNA ID".

data("miRDriverData")
mirna_bedfile[1:5, ]

RNA-Seq count data

This is an R dataframe object that contains RNA-Seq count data for each sample and each gene in the study. The rows are the genes. The columns are the sample IDs for the patients.

Dummy_RnaSeq_Count_Data[1:5,1:3]

Lesions file

This is a text file produced by the GISTIC2.0 tool. Here, the first 9 columns are “Unique.Name, Descriptor, Wide.Peak.Limits, Peak.Limits, Region.Limits, q.values". After the first 9 columns there are sample id's with their aberration status "0", "1", "2". "0" represents no copy number aberration, "1" represents a low level copy number aberration, and "2" represents a high level copy number aberration.

x <-
    read.table(
        system.file("extdata", "Dummy.gistic.lesions.txt", package = "miRDriver"),
        sep = "\t",
        header = TRUE
    )
x[1:5, c(1, 10:13)]

hg38 genes coordinate bedfile

This is an R dataframe object that contains the coordinates for all the genes in the human genome. There are four columns, such as the chromosome, the start position, the end position and the ENSEMBLE ID. The rows are labeled with the gene IDs.

Genes_coord[1:5,]

CNVData

This is an R dataframe object with gene-centric copy number values. The first column, labeled as "sample", contains the sample IDs. The rest of the columns are the copy number values of the genes.

dummy_CNVData10[1:5,1:3]

mirnaData

This is an R dataframe object that contains the miRNA expression for the samples. The first column labeled as "sample", contains the sample IDs. The rest of the columns are the miRNAs with expression values.

dummy_mirnaData10[1:5,1:3]

RSeqData

This is an R dataframe object that contains the gene expression data for the samples. The first column is labeled as "Sample" and contains the Sample IDs. The rest of the columns are the genes with expression values.

dummy_RSeqData10_200[1:5,1:3]

TFData

This is an R dataframe object which contains gene, transcription factor (TF) with a confidence score column for their interactions. The first column, labeled as "V1" contains the genes, the second column, labeled as "TF", contains the transcription factors, and the third column is labeled as "Confidence". The "Confidence" column can work with any numeric value and is supposed to filter the interactions based on a maximum threshold that we set as "5" as default. If the user uses our gene-TF interaction file which was curated from DoRothEA gene set resource (https://saezlab.github.io/dorothea/) for cancer-related TF-target interactions the "Confidence" score as "5" will select the highest confidence level for the interactions. If the user provides their TF-target file then they require to provide this "Confidence" column; in the case where the confidence scores are unknown, the user can set all values in the column as "5" to run the tool.

dummy_TFData[1:5,]

Dorothea_data

This is an R dataframe object named “Dorothea_data” containing cancer-related TF-target interactions from the DoRothEA gene set resource (https://saezlab.github.io/dorothea/). DoRothEA TF-targets were curated and collected from different types of evidence such as (1) literature curated resources, (2) ChIP-seq peaks (includes ENCODE ChIP-seq profiles), (3) TF binding site motifs, and (4) interactions inferred directly from gene expression. In Dorothea_data, the first column, labeled as “V1” contains the genes, the second column, labeled as “TF”, contains the TFs, and the third column labeled as “Confidence” contains a confidence level column for the interactions. This confidence level ranges from “5” (highest confidence) to “1” (lowest confidence). Interactions that are supported by all four lines of evidence from DoRothEA, manually curated by experts in specific reviews, or supported both in at least two curated resources are considered to be highly reliable and were assigned a “5” level. Level “4”-“2” are reserved for curated and/or ChIP-seq interactions with different levels of additional evidence. Finally, the “1” level is used for interactions that are uniquely supported by computational predictions. Users can apply any confidence level threshold to filter interactions. Users can also provide their TF-target file in the same format as described above. The default value has been set to "5" as we believe that using high confident interactions would be crucial to make LASSO step more stable.

Dorothea_data[1:5,]

methylationData

This is an R dataframe object that contains the gene-centric DNA methylation beta values. The first column labeled as "sample" contains the sample IDs. The rest of the columns are the genes with methylation values.

dummy_methylationData10[1:5,1:3]

Using the functions

The following sections describe all the functions of miRDriver. The functions are dependent on each other in the sense that the user needs to finish the first step of miRDriver to start the second step and so on.

Step 1 Finding Gistic Regions

Step one of the miRDriver pipeline finds frequently aberrated copy number regions among cancer patients. This step consists of one function, getRegionWiseGistic().

getRegionWiseGistic

This function uses the lesions file to find the aberrations for each peak. The lesions file, as stated in the example data sets section, can be created with the GISTIC2.0 module on genepattern.com

(https://cloud.genepattern.org/gp/pages/index.jsf?lsid=urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00125:6.15.28).

If the user utilizes the GISTIC2.0 to make the lesions file, then the lesions file will be labeled all_lesions.conf_90.txt. In the example code, this file is used in the function as the argument gisticfile. The gistic_bedfile argument should be either TRUE or FALSE, representing if the user would like the function to make a gistic_bedfile. The mirdirectory argument needs to be the file path that the user wants the results to be in. The mirdirectory needs to be the same for all functions in the same analysis.The argument Aber can be set as "Amp","Del" or "All" for getting genes from amplified, deleted or from both types of GISTIC regions; the default is Aber = "All".

Arguments:

gisticfile - a character string for the file path of the GISTIC output file

gistic_bedfile - a logical value for whether or not one wants to create a gistic_bedfile (TRUE/FALSE)

mirdirectory - a character string for the file path of the directory one wants the results in

aber - a character string that can be set as "Amp","Del" or "All" for getting genes from amplified, deleted or from both types of GISTIC regions; the default is Aber = "All"

mirdirectory <- tempdir()
getRegionWiseGistic(
    gisticfile = system.file("extdata", "Dummy.gistic.lesions.txt", package = "miRDriver"),
    gistic_bedfile = TRUE,
    mirdirectory = mirdirectory
)

Output: The output from this function can be found in a subdirectory of the mirdirectory entitled mirDriverFold/miRDriver_Step1. Inside mirDriverFold/miRDriver_Step1 there is a subdirectory entitled GisticResults and a subdirectory entitled Regionwise_Gistic_Files.

load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_Step1",
        "GisticResults",
        "gistic_bedfile.rda"
    )
)
region[1:5, ]
x <-
    data.frame(read.table(
        file.path(
            mirdirectory,
            "mirDriverFold",
            "miRDriver_Step1",
            "Regionwise_Gistic_Files",
            "Amplification Peak  5.txt"
        ),
        sep = "\t",
        header = TRUE
    ))

x[1:5, ]

Step 2 Finding Differentially Expressed Trans Genes

Step two of the miRDriver pipeline is to find the differentially expressed (DE) genes. This step uses two functions, getDifferentiallyExpressedGenes() and makingTransAndCisGenes().

getDifferentiallyExpressedGenes

This function finds the differentially expressed genes from the peak wise GISTIC region files generated by the getRegionWiseGistic function and RNASeq count data. This function uses parallel computation by setting the argument value of ncore; the default is ncore = 1. The argument RNAcount needs to be set to the file path for the RNASeq count data. The argument mirdirectory needs to be the same as in all previous functions. The argument DEgene can be set as "Up","Down" or "All" for getting up-regulated, down-regulated or both up and down genes, respectively from the GISTIC regions; the default is DEgene = "All"

Arguments:

ncore - an integer value for the number of cores user wants to use (default = 2)

RNAcount - R dataframe object containing RNASeq count data. The rows are the genes. The columns are the sample IDs for the patients

mirdirectory - a character string for the file path of the directory one wants the results in, must be the same as in all previous functions.

DEgene - a character string for specifying up regulated, down regulated or both type of genes by setting "Up","Down" or "All", respectively (default = "All")

getDifferentiallyExpressedGenes(ncore = 1,
                                RNACount = Dummy_RnaSeq_Count_Data,
                                mirdirectory = mirdirectory)

Output: The output for this function can be found in the mirdirectory in mirDriverFold/miRDriver_Step2/UpDownGeneLibrary. These files are the intermediary files for step two.

x <-
    read.table(
        file.path(
            mirdirectory,
            "mirDriverFold",
            "miRDriver_Step2",
            "UpDownGeneLibrary",
            "UPgene_Amplification Peak  5.txt"
        ),
        sep = "",
        header = TRUE
    )
x[1:3, ]

makingTransAndCisGenes

This function finds the trans and cis genes for each GISTIC region peak. The trans and cis genes will be stored in R dataframe objects located in the file path ~/mirdirectory/miRDriver_Step2/Trans_Cis_Files/peak/. In the file path the mirdirectory is the mirdirectory specified in the mirdirectory argument and should be the same as the mirdirectory used in the previous functions. The peak in the file path represents the subdirectories for each peak which will follow the GISTIC region name convention, such as "Amplification Peak 1". The gistic_bedfile argument needs to be set to an R dataframe object that follows the convention of having four columns, "chromosome", "start position", "end position", and "peak". The chromosome column will have the chromosome number in the form "chr1". The start and end positions will be integers representing the start and end positions on the chromosome. The peak column will have a string representing the peak following the convention of "Amplification Peak 1". The required gistic_bedfile can be made utilizing the getRegionWiseGistic() function.

Arguments:

Genes_coord_bedfile - A R dataframe object that fits the format of a bedfile for gene coordinates

gistic_bedfile - A R dataframe object that fits the format of a bedfile for the GISTIC results with amplification/deletion peaks with patients

mirdirectory - a character string for the file path of the directory one wants the results in, must be the same as all previous functions

load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_Step1",
        "GisticResults",
        "gistic_bedfile.rda"
    )
)
makingCisAndTransGenes(
    Genes_coord_bedfile = Genes_coord,
    gistic_bedfile = region,
    mirdirectory = mirdirectory
)

Output: The output for this function can be found in subdirectory of the mirdirectory entitled mirDriverFold/miRDriver_Step2/Trans_Cis_Files with trans_genes.Rda files under each peak/region directory.

load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_Step2",
        "Trans_Cis_Files",
        "Amplification Peak  1",
        "trans_genes.Rda"
    )
)
trans_genes[1:3, ]
message("These are the trans genes")

Step 3 Gathering Cis Genes and Cis miRNA For Each Trans Gene

Step three of the miRDriver pipeline is the gathering of the cis genes and cis miRNA for each trans gene. This step uses one function, gatherCisGenes().

gatherCisGenes

This function collects all the DE cis genes and all miRNAs for each trans gene. This function uses the output from the makingCisAndTransGene() function as well as a mirna_bedfile and a gistic_bedfile. The gistic_bedfile is described in the makingCisAndTransGene() function description and can be made using the getRegionWiseGistic() function. The mirna_bedfile is described in the example datasets section and needs to be provided by the user. The output from this function is in the folders gene_all_DEcis and gene_all_mirna. In these folders, there is a folder for each trans gene where all the miRNA's or DE cis genes are listed. As with all the functions the mirdirectory needs to be consistent with the other functions.

Arguments:

ncore - An integer value that represents the number of cores that the user wants to use (default = 2)

mirna_bedfile - A dataframe in the format of a bedfile for all of the miRNA

gistic_bedfile - A dataframe in the format of a bedfile for the GISTIC results

mirdirectory - a character string for the file path of the directory one wants the results , must be the same as in all previous functions

load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_Step1",
        "GisticResults",
        "gistic_bedfile.rda"
    )
)
gatherCisGenes(
    ncore = 1,
    mirna_bedfile = mirna_bedfile,
    gistic_bedfile = region,
    mirdirectory = mirdirectory
)

Output: The output for this function can be found in the mirdirectory in mirDriverFold/miRDriver_Step3. The miRNA files for each gene are in the folder gene_all_mirna and the DE cis genes for each gene are in the folder gene_all_DEcis. In this case there are no DECis genes for our output due to the dummy datasets, but the dataframes will follow the pattern of the miRNA dataframe.

load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_Step3",
        "gene_all_DEcis",
        "ENSG00000036828.Rda"
    )
)
TC
message("For this trans gene we do not have any cis gene")
load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_Step3",
        "gene_all_mirna",
        "ENSG00000124205.Rda"
    )
)
GM[1:3, ]

Step 4 Selecting Potential miRNA's Regulating DE Trans Genes

Step four of the miRDriver pipeline is to select potential miRNAs regulating DE trans genes. This step uses two functions, getTransGenePredictorFile() and lassoParallelForTransGene().

getTransGenePredictorFile

This function creates a TransGenePredictorFile for each trans gene based on the miRNA and DE cis files from the output of the gatherCisGenes() function and the user-provided dataframes methylationData, RSeqData, CNVData, TFData, mirnaData. The five dataframes needed for this function are described in the example datasets section. The TransGenePredictorFiles will contain information about the possible predictors for each trans gene based on the data provided. The TransGenePredictorFiles will be saved in the file path ~/mirdirectory/mirDriverFold/miRDriver_Step4/TransGenePredFile/. As always mirdirectory in the file path represents the mirdirectory chosen for the analysis and should be the same as the mirdirectory used in all previous functions. For TFData, the "Confidence" column can work with integers or numeric values and is supposed to filter the interactions based on a threshold that we set as "minConfidence = 5" as default. Here, users can provide their TF-target interaction data or they can use the Dorothea cancer-related dataset that we have provided with this tool. For the details, please check the descriptions of the datasets.

Arguments:

ncore - An integer value for the number of cores the user wants to use (default=2)

methylationData - a dataframe for the methylation data

RSeqData - a dataframe for the RNASeq data

CNVData - a dataframe for the copy number variation data

TFData - a dataframe for the transcription factor data

mirnaData - a dataframe for the miRNA data

mirdirectory - a character string for the file path of the directory one wants the results in, should be the same as in previous functions

minConfidence - an integer or numeric value for the TFData for filtering the "Confidence" column which is supposed to filter the interactions based on a threshold that we set as "minConfidence = 5" as default

getTransGenePredictorFile(
    ncore = 1,
    methylationData = dummy_methylationData10,
    RSeqData = dummy_RSeqData10_200,
    CNVData = dummy_CNVData10,
    TFData = dummy_TFData,
    mirnaData = dummy_mirnaData10,
    mirdirectory = mirdirectory
)

Output: The output for this function can be found in the mirdirectory in mirDriverFold/miRDriver_Step4/TransGenePredFile.

load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_Step4",
        "TransGenePredFile",
        "ENSG00000108231.Rda"
    )
)
GenePred[1:3, ]

lassoParallelForTransGene

This function runs a LASSO regression to select the miRNA regulators responsible for the trans genes’ expression variation. This function considered the gene-centric copy number, gene-centric DNA methylation, TFs, miRNA expression as independent variables and the trans gene's expression as the response variable coming from TransGenePredictorFiles created by the getTransPredictorFile function(). For each trans gene, out of all its candidate predictors (i.e., independent variables), LASSO selects a set of non-zero coefficient predictors. To find accurate results, each LASSO regression is set to run multiple times (the user specifies with the argument numCounter), with a certain percentage of times the LASSO coefficient to be non-zero (the user specifies this with the argument nonZeroPercent) for the predictors to be selected. The user also needs to provide the number of cross-validations (argument Nfolds) for each LASSO run. Users can leverage the parallel computation option in their high-performance computing environment by setting the argument ncore for all the available cores. Users are recommended to choose miRNA predictors with improved confidence levels by increasing the hyperparameters (numCounter and nonZeroPercent ) that control the number of LASSO runs and the minimum number of times LASSO must pick a miRNA predictor in the lassoParallelForTransGene() function. The output from this function is found in the file path ~/mirdirectory/mirDriverFold/miRDriver_RESULTS/. The mirdirectory in the file path represents the directory chosen by the user and specified in the argument, mirdirectory. The argument mirdirectory needs to be set to the same directory as the mirdirectory in the previous steps. Furthermore, considering gene expression is controlled at multiple levels, including transcriptional regulation and post-transcriptional regulation, users can run the LASSO step in two phases. An experienced R user could run lassoParallelForTransGene() function twice. In the first run, only the transcriptional predictors could be utilized to explain the expression variation and can be accessed from the *LassoMinCoeff” directory in the ~/mirdirectory/mirDriverFold/miRDriver_RESULTS/. In the second run, post-transcriptional predictors and the residual of the first LASSO run can be utilized as the independent and dependent variables, respectively. Alternatively, if the user has the transcriptional and post-transcriptional expression change data, both LASSO runs can be performed in any order.

Arguments:

ncore - An integer value for the number of cores the user wants to use (default=2)

Nfolds - an integer for the number of times the LASSO regression is run

numCounter - an integer for the number of folds one wants for their LASSO regression

nonZeroPercent - an integer for the percentage of appearance as a non-zero LASSO coefficient for a miRNA to be selected

mirdirectory - a character string for the file path of the directory one wants the results in

lassoParallelForTransGene(
    ncore = 1,
    numCounter = 6,
    Nfolds = 3,
    nonZeroPercent = 20,
    mirdirectory =  mirdirectory
)

Output: The output for this function can be found in the mirdirectory in mirDriverFold/miRDriver_RESULTS. There is a field for each gene and its predictors in the LassoMinCoeff folder and a dataframe with the miRNA that are most significant.

load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_RESULTS",
        "LassoMinCoeff",
        "ENSG00000088387.Rda"
    )
)
#Showing gene-miRNA interaction network with non-zero coefficients
min_Coeff[min_Coeff$coefficients_min!=0, ][1:5,]
load(
    file.path(
        mirdirectory,
        "mirDriverFold",
        "miRDriver_RESULTS",
        "Final_gene_mirna_network.Rda"
    )
)
Final_gene_mirna_network[1:5, ]

Summary

This package is designed to Infer Copy Number Derived miRNA-Gene Networks in Cancer. The four step process breaks up the different parts of the process into chunks for easy use and produces accurate results.

Acknowledgements

This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM133657.



bozdaglab/miRDriver documentation built on Feb. 1, 2022, 1:11 p.m.