knitr::opts_chunk$set(echo = TRUE)
The package was written to simplify the analysis of the TCGA / GDAC Datasets, as most of the availible packages don't work so far. The analysis is generally based on the explanation of TriS on Biostars.org: https://www.biostars.org/p/153013/
So what kind of Data needs to be downloaded?
Using the GDAC firehose and cBioportal one can get access to the curated level 3 data of the TCGA / GDAC consortium and download it, nevertheless, the data from www.cBioportal.org has apperently the more recent data for mutations, and that's why you can have to define in the parameters, which kind of data you are using.
source = "cBio"
for cBioportal Data, that was not zscored (will be done by the script with internally but not against normal samples as it is done by cBioportal) (http://www.cbioportal.org/data_sets.jsp),
source = "cBio_z_score"
for the data that was zscored by cBioportal
and source = "firehose"
for the firehose data (http://gdac.broadinstitute.org/).
For the clinical data please download:
clinical merged data
For the expression data please download:
"illuminahiseq rnaseqv Level 3 RSEM genes normalized" data
If you want to include mutational data, please download:
Mutation Packager Oncotated Calls
For the clinical data please use:
data_bcr_clinical_data_patient.txt
For the expression data please use either:
data_RNA_Seq_v2_mRNA_median_Zscores.txt
data_RNA_Seq_v2_expression_median.txt
If you want to include mutational data, please use:
mutations\
data_mutations_extended.txt
I will show the example for the cBioportal Data, but working with the data from Firehose is quite similar (I will explain the differences when needed)
Load the package
library(TCGADataFelix) # for the expressionset functions: library(Biobase) # for the modification of the output library(ggplot2)
Define filepath for TCGA patient data downloaded from:
filepath <- "~/R/Projects/TCGA-Elisa/tcga_KIRC_cBio/" # The filename of the clinical data clinical_data <- "data_bcr_clinical_data_patient.txt" # The filename of the expression data expression_data <- "data_RNA_Seq_v2_expression_median.txt"
For the mutation Data, cBioportal and Firehose Data differs, as Firehose gives one file for each patient in a folder and cBioportal one file with all mutations and patients in.
# The Folder of the mutation data (each patient has one file) mutation_data <- "KIRC.Mutation_Packager_Oncotated_Calls/"
# The File of the mutation data mutation_data <- "data_mutations_extended.txt"
values for source:
firehose
for Firehose data
cBio
for not zScored cBioportal data
* cBio_z_score
for zScored cBioportal data
values for skip:
So far you need to test which skip
is the right one for the clinical data. Check in the clinical data file at which row the header for the clinical data is and use this row number minus 1 as the value for skip
.
source <- "cBio" Eset <- build_TCGA_Eset(clinical_file = paste(filepath,clinical_data, sep = ""), expression_file = paste(filepath,expression_data, sep = ""), source = source, skip = 4 )
fetch_TCGA_mutations(path = paste(filepath,mutation_data,sep=""), source = source )
The Mutations are now loaded into a new variable: mutationdata
add_mutations()
takes care of that.Columns are added to: pData(Eset)
To get an overview over the type and number of mutations in the dataset. Attention: Some patients and mutations are duplicated.
head(sort(table(mutationdata$Hugo_Symbol), decreasing=T), n= 30L)
Define now the mutations, you want to analyze and load them into the ExpressionSet using add_mutation()
If more than one mutation was entered, a additional combined mutation column will be added
multiple = "all"
-> a patient that has all mutations presentmultiple = "one"
-> mutation is in at least one patientmutations <- c("VHL","PBRM1") mutset <- add_mutation(Eset, mutations, multiple="all")
Define now, what you want to analyze as a factor for the survival data:
gene <- "FOXD1"
Survival_adaptable(expression = c(gene), Eset = mutset, mutation = mutations, xlabel="Days", legend_position = "bottom", legend_rows = 4, average = "median", optimal=T, plot_cutpoint=F, plot_title = paste("Kaplan Meier Estimator of", gene, "Expression and", paste(mutations, collapse = " & "), "\nstatus in ccRCC (KIRC, TCGA)") )
We can analyze now various different things, e.g. expression of different genes or categories from the clinical data table, such as treatment schemes etc.
Variable | input
----------| ---------
gene | one gene using "gene"
gene | multiple genes using c("gene1","gene2","gene3",...)
clinical | a factor from the clinical dataset, e.g.: "GRADE"
clinical | multiple factors from the clinical dataset, e.g.: c("GRADE","AGE")
mutation | mutations added to the dataset by add_mutation(), e.g. c("TP53", "VHL")
You can exclude single factors from clinical by using the parameter exclude = "exclude1"
or multiple factors using exclude = c("exlude1","exclude2")
e.g.:
clinical = "GRADE"
exclude = c("G1","GX")
This would exclude the histologic grades g1 and gx from additional
clinical = c("GRADE","AGE")
exclude = c("G1","GX", "old")
This would exclude histological grades G1 and GX and old from the clinical data.
The data has to have at least two factors remaining for survival analysis
Example 1:
Survival_adaptable(clinical = c("GRADE","GENDER"), Eset = mutset, xlabel="Days", legend_position = "right", legend_rows = 4, exclude=c("[Not Available]","GX","G1","MALE"), plot_title = paste("Kaplan Meier Estimator of histologic grade and patient gender \nin ccRCC (KIRC, TCGA)") )
Example 2:
Survival_adaptable(clinical = c("GRADE"), Eset = mutset, xlabel="Days", legend_position = "bottom", legend_rows = 4, mutation = mutations, exclude=c("G1","GX","G2", "[Not Available]"), plot_title = paste("Kaplan Meier Estimator of all high grade tumors with ", paste(mutations, collapse = " & "), "\nstatus in ccRCC (KIRC, TCGA)") )
The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_fit = T
factor_list <- Survival_adaptable(expression = c(gene), Eset = mutset, mutation = mutations, clinical = "GRADE", average = "median", optimal=T, factor_list = T ) factor_list
It can seem quite complicated to use all the different exclude and show_only etc. Therefore you can also subset the ExpressionSet first by using
subset_eset <- mutset[,!mutset$AJCC_PATHOLOGIC_TUMOR_STAGE %in% "Stage IV"]
pData(mutset)[,"Age_category"] <- ifelse(pData(mutset)$AGE > 65,"old","young")
Example:
Survival_adaptable(clinical = c("Age_category"), Eset = mutset, p.val=TRUE, xlabel="Days", legend_position = "bottom", legend_rows = 4, mutation = mutations )
One can modify the output by putting the graph into a new variable and use it then as a r BiocStyle::CRANpkg("ggplot2")
object, using r BiocStyle::CRANpkg("ggpubr")
, see: ggpubr reference
For example:
caption = "A user defined caption"
font.caption = c(10, "plain", "orange"),
legend.title = "override the title of the legend"
The legend can be put in two or more rows using the legend_rows
parameter.
The risk table does not work with ggpar()
library(ggpubr) p <- Survival_adaptable(clinical = c("Age_category"), Eset = mutset, p.val=FALSE, xlabel="Days", legend_position = "bottom", legend_rows = 3, average = "median",, mutation = mutations, risk_table = FALSE, plot_title = "A dummy Title" ) p <- ggpar( p, caption = "A user defined caption", font.caption = c(10, "plain", "orange"), font.title = c(16, "bold", "darkblue"), font.subtitle = c(12, "bold.italic", "purple"), font.x = c(14, "bold.italic", "red"), font.y = c(10, "bold.italic", "darkred"), font.xtickslab = c(12, "plain", "darkgreen"), font.legend = c(8, "plain", "red"), ggtheme = theme(plot.margin=unit(c(1,1,2,1),"cm")), legend.title = "override the title of the legend" ) p
The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_df = T
survival_df <- Survival_adaptable(expression = c(gene), Eset = mutset, mutation = mutations, average = "median", optimal=T, return_df = T ) head(survival_df)
The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_fit = T
.
It is stored in a list:
survival_fit <- Survival_adaptable(expression = c(gene), Eset = mutset, mutation = mutations, average = "median", optimal=T, return_fit = T ) survival_fit[[1]]$strata
smoothCoxph_man(x = c("BAMBI", "CD44"), Eset=mutset, #exclude_values=c("pathologic_stage", "Stage II", "Stage I", "Stage IV"), logrisk=T, average="median")
How the exclude works:
c("column of clinical data to look for exclusion", "exclude group 1 from this column", "exclude group 2 from this column", ...)
hazard_list(x = c("FOXA2","FOXD1","TSPAN8","ICAM1"), Eset = Eset, additional = "GRADE", exclude = c("GX"))
To exclude data: Subset the ExpressionSet if anything should be excluded from analysis
x = c("FOXA2","EEF1A2","SPDEF","GENDER") Multivariate(x = x, Eset = Eset, factor_list= F, average = "mean", optimal=T, plot_cutpoint=F, return_df=F, coef=T, survival="overall", combined=F ) # Return List of Factors used for the analysis: Multivariate(x = x, Eset = Eset, factor_list= T, average = "mean", optimal=T, plot_cutpoint=F, return_df=F, coef=T, survival="overall", combined=F )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.