View source: R/Download_Preprocess.R
This function wraps the functions for downloading and pre-processing DNA methylation and gene expression data, as well as for clustering CpG probes.
GetData(cancerSite, targetDirectory)
cancerSite
: character of length 1 with the TCGA cancer code.
targetDirectory
: character with the directory where a folder for the downloaded files will be created.
Pre-processing of DNA methylation data includes eliminating samples and genes with too many NAs, imputing NAs, and performing batch correction. If both 27k and 450k data are available and each data set has more than 50 samples, the data sets are combined by reducing the 450k data to the probes present in the 27k data, and batch correction is performed again on the combined data set. Before this step, for samples that have both 27k and 450k data, the 450k data is used and the 27k data is discarded. If either the 27k or the 450k data does not have more than 50 samples, only the data set with the greater number of samples is used and the data sets are not combined.
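The platform-selection rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: `combine_platforms`, `make_toy`, and the `minSamples` argument are hypothetical names, the matrices are assumed to be probes by samples, batch correction is left out, and the behaviour on a tie in sample counts is a guess because the text does not specify it.

```r
# Hypothetical helper illustrating the 27k/450k rule; not part of the package.
combine_platforms <- function(met27, met450, minSamples = 50) {
  # Samples measured on both arrays: keep the 450k version, drop the 27k one.
  duplicated27 <- colnames(met27) %in% colnames(met450)
  met27 <- met27[, !duplicated27, drop = FALSE]

  if (ncol(met27) > minSamples && ncol(met450) > minSamples) {
    # Restrict to probes found on both arrays, then join the samples.
    # (The text says "probes present in the 27k data"; using the
    # intersection is an assumption that avoids missing rows.)
    shared <- intersect(rownames(met27), rownames(met450))
    return(cbind(met27[shared, , drop = FALSE],
                 met450[shared, , drop = FALSE]))
  }

  # Otherwise keep only the larger data set (tie-breaking is assumed).
  if (ncol(met450) >= ncol(met27)) met450 else met27
}

# Toy data: random values with made-up probe and sample names.
make_toy <- function(probes, samples) {
  matrix(runif(length(probes) * length(samples)),
         nrow = length(probes),
         dimnames = list(probes, samples))
}

set.seed(1)
met27  <- make_toy(paste0("cg", 1:5), paste0("S", 1:60))
met450 <- make_toy(paste0("cg", 1:8), paste0("S", 56:120))

combined <- combine_platforms(met27, met450)
dim(combined)
```

In the toy data, five samples appear on both arrays, so their 27k columns are dropped before the sample counts are compared.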
For gene expression, this function downloads RNAseq data (file tag "mRNAseq_Preprocess.Level_3"), except for OV and GBM, for which microarray data is downloaded because there is not enough RNAseq data. Pre-processing of gene expression data includes eliminating samples and genes with too many NAs, imputing NAs, and performing batch correction.
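The NA filtering and imputation steps that apply to both data types might look roughly like the sketch below. The helper name `filter_and_impute`, the `maxNaFraction` threshold, the order of the two filters, and the use of row-mean imputation are all assumptions made for illustration; the text does not state which cutoff or imputation method the function actually uses.

```r
# Hypothetical helper: drop samples and genes with too many NAs, then impute.
# Row-mean imputation is a simple stand-in, not necessarily the real method.
filter_and_impute <- function(x, maxNaFraction = 0.1) {
  # Remove samples (columns) with too many missing values.
  x <- x[, colMeans(is.na(x)) <= maxNaFraction, drop = FALSE]
  # Remove genes (rows) with too many missing values.
  x <- x[rowMeans(is.na(x)) <= maxNaFraction, , drop = FALSE]

  # Replace each remaining NA with the mean of the observed values in its row.
  observedMeans <- rowMeans(x, na.rm = TRUE)
  missing <- which(is.na(x), arr.ind = TRUE)
  x[missing] <- observedMeans[missing[, "row"]]
  x
}

# Toy genes-by-samples matrix with one empty sample and one isolated NA.
expr <- matrix(as.numeric(1:20), nrow = 4,
               dimnames = list(paste0("g", 1:4), paste0("s", 1:5)))
expr[, "s5"] <- NA
expr["g2", "s1"] <- NA

cleaned <- filter_and_impute(expr, maxNaFraction = 0.3)
cleaned
```

Here sample `s5` is removed because all of its values are missing, and the single NA in gene `g2` is filled with the mean of that gene's remaining values.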
For the clustering of the CpG probes, this function uses the annotation for Illumina methylation arrays to map each probe to a gene. Then, for each gene, it clusters all of the gene's CpG sites using hierarchical clustering with complete linkage and Pearson correlation as the distance measure. If data for normal samples is provided, only probes that overlap between cancer and normal samples are used. Probes with SNPs are removed.
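As a rough illustration of the per-gene clustering step, the sketch below clusters the probes of a single made-up gene with base R's `hclust`. It assumes the distance is one minus the Pearson correlation and cuts the tree at an arbitrary height of 0.3; both choices, and the probe names, are assumptions for the example rather than the function's documented settings.

```r
# Toy methylation profiles for four probes of one hypothetical gene
# (rows are probes, columns are 20 samples).
a <- seq_len(20)
b <- rep(c(1, -1), times = 10)
probes <- rbind(
  cg_p1 = a,
  cg_p2 = a + 0.1 * b,   # closely tracks cg_p1
  cg_p3 = b,
  cg_p4 = b + 0.01 * a   # closely tracks cg_p3
)

# Correlation-based distance between probes (assumed form: 1 - Pearson r).
probeDist <- as.dist(1 - cor(t(probes), method = "pearson"))

# Hierarchical clustering with complete linkage.
tree <- hclust(probeDist, method = "complete")

# Cut the tree at an illustrative height to obtain probe clusters.
clusters <- cutree(tree, h = 0.3)
clusters
```

Probes whose profiles move together across samples end up in the same cluster, so `cg_p1` and `cg_p2` are grouped separately from `cg_p3` and `cg_p4`.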
This function is prepared to run in parallel if the user registers a parallel structure, otherwise it runs sequentially.
This function also cleans up the sample names, converting them to the 12-character format.
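For example, a full TCGA barcode can be shortened to its first 12 characters, which identify the participant. The snippet below is a minimal sketch of that idea with made-up barcodes; the function's actual clean-up may involve additional steps.

```r
# Example full-length TCGA barcodes (illustrative values).
barcodes <- c("TCGA-02-0001-01C-01D-0182-01",
              "TCGA-A1-A0SB-01A-11R-A144-07")

# Keep the first 12 characters: project, tissue source site, participant.
shortNames <- substr(barcodes, 1, 12)
shortNames
```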
The following files will be created in the target directory:
gdac
: a folder with the raw data downloaded from TCGA.
MET_CancerSite_Processed.rds
: processed methylation data at the CpG sites level (not clustered).
GE_CancerSite_Processed.rds
: processed gene expression data.
data_CancerSite.rds
: list with both gene expression and methylation data. Methylation data is clustered and presented at the gene level. A matrix with the mapping from CpG sites to genes is included.
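Once the function has finished, the saved objects can be loaded with `readRDS`. The sketch below assumes that "CancerSite" in the file names above is replaced by the cancer code passed to the function (for example `data_OV.rds`); that substitution is an assumption, so check the actual file names in your target directory. `tempdir()` stands in for the real target directory.

```r
cancerSite <- "OV"
targetDirectory <- tempdir()  # placeholder; use the directory passed to GetData

# Expected output files, assuming "CancerSite" is replaced by the cancer code.
outputFiles <- file.path(
  targetDirectory,
  c(paste0("MET_", cancerSite, "_Processed.rds"),
    paste0("GE_", cancerSite, "_Processed.rds"),
    paste0("data_", cancerSite, ".rds"))
)

# Load only the files that are actually present.
existing <- outputFiles[file.exists(outputFiles)]
results <- lapply(existing, readRDS)
names(results) <- basename(existing)
names(results)
```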
## Not run:
# Get data for ovarian cancer
cancerSite <- "OV"
targetDirectory <- paste0(getwd(), "/")
GetData(cancerSite, targetDirectory)
# Optionally register a cluster to run in parallel
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
cancerSite <- "OV"
targetDirectory <- paste0(getwd(), "/")
GetData(cancerSite, targetDirectory)
stopCluster(cl)
## End(Not run)