knitr::opts_chunk$set(echo = TRUE)

General Purpose

The package was written to simplify the analysis of the TCGA / GDAC Datasets, as most of the availible packages don't work so far. The analysis is generally based on the explanation of TriS on Biostars.org: https://www.biostars.org/p/153013/

So what kind of Data needs to be downloaded?

Using the GDAC firehose and cBioportal one can get access to the curated level 3 data of the TCGA / GDAC consortium and download it, nevertheless, the data from www.cBioportal.org has apperently the more recent data for mutations, and that's why you can have to define in the parameters, which kind of data you are using.

source = "cBio" for cBioportal Data, that was not zscored (will be done by the script with internally but not against normal samples as it is done by cBioportal) (http://www.cbioportal.org/data_sets.jsp),

source = "cBio_z_score" for the data that was zscored by cBioportal

and source = "firehose" for the firehose data (http://gdac.broadinstitute.org/).

Using Data from Firehose

For the clinical data please download:

clinical merged data

For the expression data please download:

"illuminahiseq rnaseqv Level 3 RSEM genes normalized" data

If you want to include mutational data, please download:

Mutation Packager Oncotated Calls

Using Data from cBioportal

For the clinical data please use:

data_bcr_clinical_data_patient.txt

For the expression data please use either:

If you want to include mutational data, please use:

How to use the package:

I will show the example for the cBioportal Data, but working with the data from Firehose is quite similar (I will explain the differences when needed)

Load the package

library(TCGADataFelix)

  # for the expressionset functions:
  library(Biobase)
  # for the modification of the output
  library(ggplot2)

Define filepath for TCGA patient data downloaded from:

filepath <- "~/R/Projects/TCGA-Elisa/tcga_KIRC_cBio/"

# The filename of the clinical data
clinical_data <- "data_bcr_clinical_data_patient.txt"  

# The filename of the expression data
expression_data <- "data_RNA_Seq_v2_expression_median.txt"

For the mutation Data, cBioportal and Firehose Data differs, as Firehose gives one file for each patient in a folder and cBioportal one file with all mutations and patients in.

Importing Firehose mutational data:

# The Folder of the mutation data (each patient has one file)
mutation_data <- "KIRC.Mutation_Packager_Oncotated_Calls/"

Importing cBioportal mutational data:

# The File of the mutation data
mutation_data <- "data_mutations_extended.txt"

Load the expression Data using build_TCGA_Eset():

values for source: firehose for Firehose data cBio for not zScored cBioportal data * cBio_z_score for zScored cBioportal data

values for skip:

So far you need to test which skip is the right one for the clinical data. Check in the clinical data file at which row the header for the clinical data is and use this row number minus 1 as the value for skip.

source <- "cBio"
Eset <- build_TCGA_Eset(clinical_file = paste(filepath,clinical_data, sep = ""),
                        expression_file = paste(filepath,expression_data, sep = ""),
                        source = source,
                        skip = 4
                        )

Load the mutation data using fetch_TCGA_mutations()

fetch_TCGA_mutations(path = paste(filepath,mutation_data,sep=""),
                     source = source
                     )

The Mutations are now loaded into a new variable: mutationdata

Newly added columns to the clinical data by the add_mutations() function

Columns are added to: pData(Eset)

To get an overview over the type and number of mutations in the dataset. Attention: Some patients and mutations are duplicated.

head(sort(table(mutationdata$Hugo_Symbol), decreasing=T), n= 30L)

Define now the mutations, you want to analyze and load them into the ExpressionSet using add_mutation()

If more than one mutation was entered, a additional combined mutation column will be added

mutations <- c("VHL","PBRM1")
mutset <- add_mutation(Eset,
                       mutations,
                       multiple="all")

Use and analyze the data

Define now, what you want to analyze as a factor for the survival data:

gene <- "FOXD1"

Do a survival analysis using Survival_adaptable()

Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 4,
                   average = "median",
                   optimal=T,
                   plot_cutpoint=F,
                   plot_title = paste("Kaplan Meier Estimator of", gene, "Expression and", paste(mutations, collapse = " & "), "\nstatus in ccRCC (KIRC, TCGA)")
                   )

How to continue

We can analyze now various different things, e.g. expression of different genes or categories from the clinical data table, such as treatment schemes etc.

Examples for clinical and expression and mutation:

Variable | input ----------| --------- gene | one gene using "gene" gene | multiple genes using c("gene1","gene2","gene3",...) clinical | a factor from the clinical dataset, e.g.: "GRADE" clinical | multiple factors from the clinical dataset, e.g.: c("GRADE","AGE") mutation | mutations added to the dataset by add_mutation(), e.g. c("TP53", "VHL")

Examples for exclude

You can exclude single factors from clinical by using the parameter exclude = "exclude1" or multiple factors using exclude = c("exlude1","exclude2")

e.g.:

clinical = "GRADE"
exclude = c("G1","GX")

This would exclude the histologic grades g1 and gx from additional

clinical = c("GRADE","AGE")
exclude = c("G1","GX", "old")

This would exclude histological grades G1 and GX and old from the clinical data.

The data has to have at least two factors remaining for survival analysis

Example 1:

Survival_adaptable(clinical = c("GRADE","GENDER"), 
                   Eset = mutset,
                   xlabel="Days",
                   legend_position = "right",
                   legend_rows = 4,
                   exclude=c("[Not Available]","GX","G1","MALE"),
                  plot_title = paste("Kaplan Meier Estimator of histologic grade and patient gender \nin ccRCC (KIRC, TCGA)")
)

Example 2:

Survival_adaptable(clinical = c("GRADE"), Eset = mutset,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 4,
                   mutation = mutations,
                   exclude=c("G1","GX","G2", "[Not Available]"),
                   plot_title = paste("Kaplan Meier Estimator of all high grade tumors with ", paste(mutations, collapse = " & "), "\nstatus in ccRCC (KIRC, TCGA)")
                   )

Possible Factors

The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_fit = T

factor_list <- Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   clinical = "GRADE",
                   average = "median",
                   optimal=T,
                   factor_list = T
                   )
factor_list

Subsetting the ExpressionSet

It can seem quite complicated to use all the different exclude and show_only etc. Therefore you can also subset the ExpressionSet first by using

subset_eset <- mutset[,!mutset$AJCC_PATHOLOGIC_TUMOR_STAGE %in% "Stage IV"]

Add additional categories, e.g. age:

pData(mutset)[,"Age_category"] <- ifelse(pData(mutset)$AGE > 65,"old","young")

Example:

Survival_adaptable(clinical = c("Age_category"), 
                   Eset = mutset,
                   p.val=TRUE,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 4,
                   mutation = mutations
                   )

Modify the output

One can modify the output by putting the graph into a new variable and use it then as a r BiocStyle::CRANpkg("ggplot2") object, using r BiocStyle::CRANpkg("ggpubr"), see: ggpubr reference

For example:

The legend can be put in two or more rows using the legend_rows parameter.

The risk table does not work with ggpar()

library(ggpubr)

p <- Survival_adaptable(clinical = c("Age_category"), 
                        Eset = mutset,
                   p.val=FALSE,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 3,
                   average = "median",,
                   mutation = mutations,
                   risk_table = FALSE,
                   plot_title = "A dummy Title"
)

p <- ggpar(
  p,
  caption       = "A user defined caption",
  font.caption  = c(10, "plain", "orange"),    
  font.title    = c(16, "bold", "darkblue"),         
  font.subtitle = c(12, "bold.italic", "purple"), 
  font.x        = c(14, "bold.italic", "red"),          
  font.y        = c(10, "bold.italic", "darkred"),      
  font.xtickslab = c(12, "plain", "darkgreen"),
  font.legend   = c(8, "plain", "red"),
  ggtheme       = theme(plot.margin=unit(c(1,1,2,1),"cm")),
  legend.title = "override the title of the legend"
)

p

Return Survival Dataframe

The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_df = T

survival_df <- Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   average = "median",
                   optimal=T,
                   return_df = T
                   )
head(survival_df)

Return Cox Regression Fit

The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_fit = T.

It is stored in a list:

survival_fit <- Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   average = "median",
                   optimal=T,
                   return_fit = T
                   )

survival_fit[[1]]$strata

Plot the log hazard ration using smoothCoxph_man

smoothCoxph_man(x = c("BAMBI", "CD44"), Eset=mutset,
                #exclude_values=c("pathologic_stage", "Stage II", "Stage I", "Stage IV"),
                logrisk=T, average="median")

How the exclude works:

c("column of clinical data to look for exclusion", "exclude group 1 from this column", "exclude group 2 from this column", ...)

Calculate the Hazard Ratio for many genes from the dataset:

hazard_list(x = c("FOXA2","FOXD1","TSPAN8","ICAM1"), 
            Eset = Eset,
            additional = "GRADE",
            exclude = c("GX"))

Calculate a Multivariate Analysis

To exclude data: Subset the ExpressionSet if anything should be excluded from analysis

x = c("FOXA2","EEF1A2","SPDEF","GENDER")
Multivariate(x = x,
             Eset = Eset,
             factor_list= F,
             average = "mean",
             optimal=T,
             plot_cutpoint=F,
             return_df=F,
             coef=T,
             survival="overall",
             combined=F
)

# Return List of Factors used for the analysis:
Multivariate(x = x, 
             Eset = Eset,
             factor_list= T, 
             average = "mean",
             optimal=T,
             plot_cutpoint=F,
             return_df=F,
             coef=T,
             survival="overall",
             combined=F
             )


w2felix/TCGADataFelix documentation built on May 19, 2019, 8:24 a.m.