library(knitr)
opts_chunk$set(comment="", message=FALSE, warning = FALSE, tidy.opts=list(keep.blank.line=TRUE, width.cutoff=150),options(width=150), eval = FALSE)
The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes [1].
TCGA data is available through the Broad GDAC Firehose portal [2]. One can select a cancer type (cohort) and a data type (e.g. clinical, RNA expression, methylation, ...) and download a tar.gz file with the compressed data.
When working with many cancer types we find this approach burdensome:
For these reasons we prepared selected datasets from the TCGA project in a form that is easy to handle and process, and embedded them in 4 separate R packages. All packages can be installed from Bioconductor by evaluating the following code:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("RTCGA.clinical")
BiocManager::install("RTCGA.rnaseq")
BiocManager::install("RTCGA.mutations")
BiocManager::install("RTCGA.cnv")
# or the development version
BiocManager::install("mi2-warsaw/RTCGA.cnv")
A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. An illustration of what each part of the patient's barcode means can be found at \newline https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode.
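As a minimal sketch of how such a barcode can be taken apart in base R, the helper below splits an aliquot barcode on dashes. The function name parse_barcode and the barcode value are illustrative and not part of any RTCGA package; the field meanings follow the TCGA barcode documentation linked above (sample type "01" denotes a primary solid tumor).

```r
# Split a TCGA aliquot barcode into its dash-separated identifiers.
# parse_barcode() is a hypothetical helper written for this illustration.
parse_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  list(
    project     = parts[1],                # always "TCGA"
    tss         = parts[2],                # tissue source site
    participant = parts[3],                # study participant
    sample_type = substr(parts[4], 1, 2),  # e.g. "01" = primary solid tumor
    vial        = substr(parts[4], 3, 3),  # vial letter
    patient_id  = substr(barcode, 1, 12)   # project-TSS-participant
  )
}

example_barcode <- "TCGA-02-0001-01C-01D-0182-01"  # illustrative barcode
parsed <- parse_barcode(example_barcode)
parsed$patient_id   # "TCGA-02-0001"
parsed$sample_type  # "01"
```

The first 12 characters identify the patient, which is why they are a convenient key for matching rows across the clinical, expression, mutation and CNV datasets.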
The \pkg{RTCGA.data} family contains 4 packages:
RTCGA.clinical
package containing clinical datasets from TCGA. Each cohort contains one dataset prepared in a tidy format. Each row, marked with the patient's barcode, corresponds to one patient. The clinical data format is explained at https://wiki.nci.nih.gov/display/TCGA/Clinical+Data+Overview
RTCGA.rnaseq
package containing gene expression datasets from TCGA. Each cohort contains one dataset with over 20 thousand columns corresponding to gene expression values. Rows correspond to patients, who can be matched by their barcode. The gene expression data format is explained at https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
RTCGA.mutations
package containing gene mutation datasets from TCGA. Each cohort contains one dataset with an extra column holding the patient's barcode, which identifies the rows that belong to each patient. The mutation data format is explained at https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification
RTCGA.cnv
package containing copy number variation datasets from TCGA (copy number is the number of copies of a given gene per cell).
More detailed information about the datasets included in the \pkg{RTCGA.data} family is shown in Table \ref{data_details}.
library(dplyr)
library(XML)
library(stringi)
readHTMLTable("http://gdac.broadinstitute.org/") -> df
info <- df[[39]][,1:3]
# this function produces the information corresponding to Table 1
show_dims_RTCGA <- function( package_name ){
  data(package = paste0("RTCGA.", package_name))$results[,3] %>%
    sapply( function(element){
      # load the dataset, record its dimensions, then remove it again
      data(list=element, package = paste0("RTCGA.", package_name), envir = .GlobalEnv)
      get(element, envir = .GlobalEnv) %>% dim() -> res
      rm(list = element, envir = .GlobalEnv)
      return(res)
    }) %>% t -> df
  df_dims <- data.frame(
    # drop the lower-case ".<package_name>" suffix, keeping the upper-case cohort code
    "Cohort" = stri_extract_all_regex(row.names(df), pattern = paste0("[^\\.",package_name,"]")) %>%
      lapply( stri_flatten) %>% unlist,
    package_name = paste0(df[,1]," x ",df[,2])
  )
  names(df_dims)[2] <- package_name
  return(df_dims)
}
library(RTCGA.clinical)
library(RTCGA.rnaseq)
library(RTCGA.mutations)
library(RTCGA.cnv)
left_join( x = info , y = show_dims_RTCGA("clinical"), by = "Cohort") %>%
  left_join( y = show_dims_RTCGA("cnv"), by="Cohort") %>%
  left_join( y = show_dims_RTCGA("mutations"), by="Cohort") %>%
  left_join( y = show_dims_RTCGA("rnaseq"), by="Cohort") %>%
  xtable::xtable()
\begin{widetable}[h]
\centering
\caption{\label{data_details}Dimensions of available datasets in the \pkg{RTCGA.data} family.}
\begin{tabular}{rlllllll}
\toprule
 & Disease Name & Cohort & Cases & clinical & cnv\footnote{The second dimension is always equal to 6.} & mutations & rnaseq\footnote{The second dimension is always equal to 20532.} \\
\toprule
1 & Adrenocortical carcinoma & ACC & 92 & 92 x 1115 & 21052 & 20255 x 53 & 79 \\
2 & Bladder urothelial carcinoma & BLCA & 412 & 401 x 2098 & 105795 & 39441 x 96 & 427 \\
3 & Breast invasive carcinoma & BRCA & 1098 & 1085 x 3668 & 284510 & 91471 x 68 & 1212 \\
4 & Cervical and endocervical cancers & CESC & 307 & 305 x 1674 & 59450 & 46740 x 58 & 309 \\
5 & Cholangiocarcinoma & CHOL & 36 & 36 x 846 & 7570 & 6789 x 49 & 45 \\
6 & Colon adenocarcinoma & COAD & 460 & 453 x 3149 & 91166 & 62683 x 40 & 328 \\
7 & Colorectal adenocarcinoma & COADREAD & 631 & 624 x 3488 & 126931 & & \\
8 & Lymphoid Neoplasm Diffuse Large... & DLBC & 58 & 47 x 760 & 9343 & & 28 \\
9 & Esophageal carcinoma & ESCA & 185 & 183 x 1197 & 60803 & & 196 \\
10 & FFPE Pilot Phase II & FPPP & 38 & 38 x 3277 & & & \\
11 & Glioblastoma multiforme & GBM & 613 & 593 x 5379 & 146852 & 22362 x 80 & 166 \\
12 & Glioma & GBMLGG & 1129 & 1085 x 5660 & 226643 & & \\
13 & Head and Neck squamous cell carcinoma & HNSC & 528 & 523 x 1754 & 110289 & 52077 x 90 & 566 \\
14 & Kidney Chromophobe & KICH & 113 & 111 x 907 & 10164 & 7624 x 37 & 91 \\
15 & Pan-kidney cohort (KICH+KIRC+KIRP) & KIPAN & 973 & 917 x 2766 & 142122 & 73527 x 36 & 1020 \\
16 & Kidney renal clear cell carcinoma & KIRC & 537 & 533 x 2682 & 85044 & 26785 x 36 & 606 \\
17 & Kidney renal papillary cell carcinoma & KIRP & 323 & 273 x 1890 & 46914 & 15745 x 53 & 323 \\
18 & Acute Myeloid Leukemia & LAML & 200 & 200 x 1148 & 28324 & 2781 x 65 & 173 \\
19 & Brain Lower Grade Glioma & LGG & 516 & 492 x 2127 & 79791 & 10170 x 39 & 530 \\
20 & Liver hepatocellular carcinoma & LIHC & 377 & 364 x 1583 & 93328 & 28089 x 49 & 423 \\
21 & Lung adenocarcinoma & LUAD & 585 & 521 x 3009 & 122927 & 72770 x 92 & 576 \\
22 & Lung squamous cell carcinoma & LUSC & 504 & 495 x 2692 & 134864 & 65482 x 87 & 552 \\
23 & Mesothelioma & MESO & 87 & 87 x 893 & 18335 & & 86 \\
24 & Ovarian serous cystadenocarcinoma & OV & 602 & 591 x 3626 & 261680 & 20534 x 44 & 265 \\
25 & Pancreatic adenocarcinoma & PAAD & 185 & 185 x 1248 & 34808 & 15779 x 85 & 183 \\
26 & Pheochromocytoma and Paraganglioma & PCPG & 179 & 179 x 1186 & 31256 & 4784 x 91 & 187 \\
27 & Prostate adenocarcinoma & PRAD & 499 & & 117345 & 12679 x 86 & 550 \\
28 & Rectum adenocarcinoma & READ & 171 & 171 x 2740 & 35765 & 22143 x 40 & 105 \\
29 & Sarcoma & SARC & 260 & & 106617 & 26753 x 78 & \\
30 & Skin Cutaneous Melanoma & SKCM & 470 & 469 x 1875 & 108084 & 276271 x 91 & 472 \\
31 & Stomach adenocarcinoma & STAD & 443 & 443 x 1690 & 118389 & 148808 x 80 & \\
32 & Stomach and Esophageal carcinoma & STES & 628 & 626 x 1828 & 179192 & 148808 x 80 & 196 \\
33 & Testicular Germ Cell Tumors & TGCT & 150 & 134 x 983 & 24952 & 14826 x 58 & 156 \\
34 & Thyroid carcinoma & THCA & 503 & 502 x 1662 & 55377 & 7862 x 91 & 568 \\
35 & Thymoma & THYM & 124 & 123 x 848 & 15571 & & 122 \\
36 & Uterine Corpus Endometrial Carcinoma & UCEC & 560 & 540 x 2180 & 127430 & 185108 x 50 & 201 \\
37 & Uterine Carcinosarcoma & UCS & 57 & 57 x 918 & 19298 & 11210 x 91 & 57 \\
38 & Uveal Melanoma & UVM & 80 & 80 x 594 & 12973 & 2607 x 91 & 80 \\
\bottomrule
\end{tabular}
\end{widetable}
After installation, one can load any package from the \pkg{RTCGA.data} family with the commands
library(RTCGA.clinical)
library(RTCGA.rnaseq)
library(RTCGA.mutations)
library(RTCGA.cnv)
and one can check which datasets are available (Table \ref{data_details}) with the commands
?clinical
?rnaseq
?mutations
?cnv
Data are loaded in the usual way. Simply type
data(cohort.package)
where cohort corresponds to a specific Cohort of patients and package corresponds to one of the four packages from the \pkg{RTCGA.data} family, e.g. data(LUAD.clinical).
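The naming scheme can also be used programmatically. The following sketch builds the dataset name from a cohort code and a package suffix and loads it only when the corresponding package is installed; the variable names are illustrative, and LUAD with clinical is just one valid combination.

```r
cohort  <- "LUAD"      # cohort code, as listed in the Cohort column of the table
package <- "clinical"  # one of "clinical", "rnaseq", "mutations", "cnv"

dataset_name <- paste0(cohort, ".", package)  # "LUAD.clinical"
package_name <- paste0("RTCGA.", package)     # "RTCGA.clinical"

# Load the dataset only if the package is installed, so the sketch runs anywhere.
if (requireNamespace(package_name, quietly = TRUE)) {
  data(list = dataset_name, package = package_name)
  print(dim(get(dataset_name)))
}
```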
#library(devtools);BiocManager::install("mi2-warsaw/RTCGA.tools")
\newpage
The \pkg{RTCGA.data} family is well suited to research in survival analysis and genomics. Survival times for patients are included in the clinical datasets. The following example plots Kaplan-Meier [5] estimates of the survival functions for patients with LUAD cancer, split by cancer stage.
library(RTCGA.clinical)
library(RTCGA.tools)
RTCGA.tools::clinicalStageSurvival(LUAD.clinical, xlims = c(0,2000), title = "Lung adenocarcinoma")

pdf(file = "km_plot_luad.pdf")
RTCGA.tools::clinicalStageSurvival(LUAD.clinical, xlims = c(0,2000), title = "Lung adenocarcinoma")
dev.off()
\begin{figure}[h!] \begin{centering} \includegraphics[width=12cm, height=8cm]{km_plot_luad.pdf} \caption{\label{km_plot}The Kaplan-Meier estimate of the survival curve for the LUAD cancer. } \end{centering} \end{figure}
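The Kaplan-Meier estimator behind such a plot is short enough to write out. The sketch below is not the code of clinicalStageSurvival; it computes the product-limit estimate by hand on a small made-up data set. The function km_estimate and the values in toy are invented for illustration, with event = 1 marking an observed death and event = 0 a censored observation.

```r
# Product-limit (Kaplan-Meier) estimate computed by hand.
# km_estimate() is a hypothetical helper; it is not part of RTCGA.tools.
km_estimate <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  # number of patients still under observation just before each event time
  at_risk <- sapply(event_times, function(tm) sum(time >= tm))
  # number of deaths observed at each event time
  deaths <- sapply(event_times, function(tm) sum(time == tm & event == 1))
  data.frame(
    time     = event_times,
    at_risk  = at_risk,
    deaths   = deaths,
    survival = cumprod(1 - deaths / at_risk)
  )
}

# Toy data, invented for illustration only.
toy <- data.frame(
  time  = c(5, 8, 8, 12, 15, 20),
  event = c(1, 1, 0, 1, 0, 1)
)
km <- km_estimate(toy$time, toy$event)
km
```

In practice one would normally obtain the same estimate from survfit() in the survival package, which also provides confidence intervals and plotting.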
\newpage
The \pkg{RTCGA.data} family and the \pkg{RTCGA.tools} package provide an easy-to-access set of tools for creating useful figures with a single R command. Below is an example of boxplots of the log-transformed expression of the \textbf{ETF1} gene, split by cancer type and by the 3 most frequent mutation types in the \textbf{TP53} gene.
library(RTCGA.mutations)
pdf( file = "mutationsBox.pdf", width = 10 )
mutationsBox(c("BRCA", "HNSC", "LUSC", "PRAD"), "TP53", "ETF1")
dev.off()
\begin{figure}[h!] \begin{centering} \includegraphics[width=8cm, height=6cm]{mutationsBox.pdf} \caption{\label{figRes}Boxplots of the log-transformed expression of the ETF1 gene, by cancer type and TP53 mutation type. } \end{centering} \end{figure}
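The idea behind such a figure can be sketched in base R: keep the most frequent mutation types, match them to expression values through the patient barcode, and draw boxplots on a log scale. This is only an illustration under stated assumptions and not the implementation of mutationsBox. The data frames, barcodes and expression values are invented, Variant_Classification is the standard MAF column name, and log1p is an assumed choice of transformation.

```r
# Toy data, invented for illustration only.
expression <- data.frame(
  barcode = sprintf("TCGA-XX-%04d", 1:10),
  ETF1    = c(1200, 950, 1430, 800, 1100, 990, 1500, 1020, 870, 1300)
)
mutations <- data.frame(
  barcode = sprintf("TCGA-XX-%04d", 1:10),
  Variant_Classification = c(
    rep("Missense_Mutation", 4), rep("Nonsense_Mutation", 3),
    rep("Silent", 2), "Frame_Shift_Del"
  )
)

# Keep the 3 most frequent mutation types.
counts <- sort(table(mutations$Variant_Classification), decreasing = TRUE)
top3 <- names(counts)[1:3]

# Match expression values to mutations by patient barcode.
merged <- merge(
  expression,
  mutations[mutations$Variant_Classification %in% top3, ],
  by = "barcode"
)
merged$Variant_Classification <- factor(merged$Variant_Classification, levels = top3)

boxplot(log1p(ETF1) ~ Variant_Classification, data = merged,
        xlab = "TP53 mutation type", ylab = "log1p(ETF1 expression)")
```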
\newpage
Copy-number variations (CNVs), a form of structural variation, are alterations of the DNA of a genome that result in the cell having an abnormal or, for certain genes, a normal variation in the number of copies of one or more sections of the DNA. CNVs correspond to relatively large regions of the genome that have been deleted (fewer than the normal number) or duplicated (more than the normal number) on certain chromosomes.
With access to CNV data one can compare the frequency of DNA alterations across cancers. The CNV data is stored in a non-standard format. For each patient and each chromosome, the sequence is split into a varying number of segments, and each segment has a score that corresponds to the multiplication factor. Such a dataset first has to be converted to a more standard format, with the multiplication factor for a given gene, and then it is possible to compare CNV changes across clinical cohorts or cancers.
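The conversion from segments to a gene-level score amounts to finding the segments that overlap the gene's region. The sketch below shows this on a made-up segment table; it is not the implementation of any RTCGA.cnv function. The column names are an assumption based on the usual segmented copy-number layout, the sample names and values are invented, and the region is a roughly MDM2-sized window on chromosome 12. The conversion 2 * 2^Segment_Mean mirrors the transformation used in the MDM2 plot of this document.

```r
# Toy segment table, invented for illustration; column names are assumed.
segments <- data.frame(
  Sample       = c("TCGA-XX-0001-01A", "TCGA-XX-0001-01A",
                   "TCGA-XX-0002-01A", "TCGA-XX-0002-01A"),
  Chromosome   = c("12", "12", "12", "7"),
  Start        = c(1, 69100000, 60000000, 1),
  End          = c(69099999, 69500000, 70000000, 50000000),
  Segment_Mean = c(0.02, 2.5, -0.1, 0.3)
)

# A segment overlaps [region_start, region_end] when it starts before the
# region ends and ends after the region starts.
overlaps_region <- function(seg, chr, region_start, region_end) {
  seg$Chromosome == chr & seg$Start <= region_end & seg$End >= region_start
}

hits <- segments[overlaps_region(segments, "12", 69200000, 69240000), ]

# Convert the log2 ratio to an estimated copy number (2 copies is the baseline).
hits$copies <- 2 * 2^hits$Segment_Mean
hits[, c("Sample", "Segment_Mean", "copies")]
```

A gene that spans a segment boundary yields several rows per sample, so a real analysis has to decide how to combine them, for example by keeping one row per patient.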
In the example below we compare the number of copies of the MDM2 gene, which is located on chromosome 12, positions 69240000-69200000. First we extract the data for this gene and then we plot the distribution of CNV across tumors (a single observation is a single patient).
As the figure below shows, some tumors include patients with a very high number of copies of this gene.
library(RTCGA.cnv)
# find all fragments that overlap with MDM2
MDM2.scores <- get.region.cnv.score(chr="12", start=69240000, stop=69200000)
# only samples from primary tumor
MDM2.scores <- MDM2.scores[grepl(MDM2.scores$Sample, pattern = "-01A-"),]
# remove duplicates: keep one row per patient (first 12 characters of the barcode)
MDM2.scores$Sample <- substr(MDM2.scores$Sample, 1, 12)
MDM2.scores <- MDM2.scores[!duplicated(MDM2.scores$Sample),]
# strip the literal ".cnv" suffix from the cohort names
MDM2.scores$cohort <- gsub(MDM2.scores$cohort, pattern=".cnv", replacement = "", fixed = TRUE)
# order cohorts by median Segment_Mean
MDM2.scores$cohort <- reorder(MDM2.scores$cohort, MDM2.scores$Segment_Mean, median, na.rm=TRUE)
# plot it
library(ggplot2)
ggplot() +
  geom_boxplot(data=MDM2.scores, aes(y=2*2^Segment_Mean, x=cohort, fill=cohort)) +
  coord_flip() +
  theme_bw() +
  theme(legend.position="none") +
  ylab("Average number of CNV copies for MDM2 gene") +
  scale_y_continuous(trans="log2") +
  xlab("")
pdf(file = "cnv.pdf")
ggplot() +
  geom_boxplot(data=MDM2.scores, aes(y=2*2^Segment_Mean, x=cohort, fill=cohort)) +
  coord_flip() +
  theme_bw() +
  theme(legend.position="none") +
  ylab("Average number of CNV copies for MDM2 gene") +
  scale_y_continuous(trans="log2") +
  xlab("")
dev.off()
\begin{figure}[h!] \includegraphics[width=12cm]{cnv.pdf} \caption{\label{cnv_boxplot}Boxplots of the number of copies of the MDM2 gene for 31 cancer types.} \end{figure}
\newpage
One can also perform a Principal Component Analysis after binding the rnaseq data for a few arbitrarily chosen cancer types, as below. It can be seen that gene expression differs among those cancers (Adrenocortical carcinoma, Cholangiocarcinoma, Glioblastoma multiforme, Pheochromocytoma and Paraganglioma, and Uveal Melanoma) and that samples cluster by cancer type.
library(RTCGA.rnaseq)
rnaseqBiplot(cohorts = c("ACC", "CHOL", "GBM", "PCPG", "UVM"))

pdf(file = "rnaseq_biplot.pdf")
rnaseqBiplot(cohorts = c("ACC", "CHOL", "GBM", "PCPG", "UVM"))
dev.off()
\begin{figure}[h!] \includegraphics[width=12cm]{rnaseq_biplot.pdf} \caption{\label{biplot2}Biplot of the first 2 principal components of the gene expression data for 5 cancer types.} \end{figure}
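To show what such an analysis involves, here is a minimal sketch with prcomp from base R on simulated data. It makes no claim about how rnaseqBiplot is implemented; the expression matrix, the two cohorts "A" and "B" and the size of the shift between them are all invented for illustration.

```r
set.seed(1)

# Toy expression matrix: 20 samples x 50 genes from two simulated cohorts
# whose first 10 genes differ in mean expression. Invented for illustration.
n_per_cohort <- 10
n_genes <- 50
cohort <- rep(c("A", "B"), each = n_per_cohort)
expr <- matrix(rnorm(2 * n_per_cohort * n_genes), ncol = n_genes)
expr[cohort == "B", 1:10] <- expr[cohort == "B", 1:10] + 3

# PCA on centered and scaled genes; rows are samples, columns are genes.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Samples in the plane of the first two components, colored by cohort.
plot(scores, col = ifelse(cohort == "A", "steelblue", "tomato"), pch = 19,
     xlab = "PC1", ylab = "PC2")

# Share of variance captured by each of the first two components.
summary(pca)$importance["Proportion of Variance", 1:2]
```

When cohorts differ in the expression of many genes, the samples separate along the leading components, which is the grouping by cancer type visible in the biplot.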
\newpage
[1] http://cancergenome.nih.gov/
[2] http://gdac.broadinstitute.org/
[3] http://cran.r-project.org/bin/windows/Rtools/
[4] https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
[5] Kaplan, E. L., Meier, P. (1958) \textit{Nonparametric estimation from incomplete observations}, Journal of the American Statistical Association 53(282):457-481. JSTOR 2281868.
[9] Cox, D. R. (1972) \textit{Regression models and life-tables (with discussion)}, Journal of the Royal Statistical Society Series B 34:187-220.
\bibliography{RJreferences}