In w2felix/TCGADataFelix: Analyzing the TCGA Data according to gene expression and survival

knitr::opts_chunk$set(echo = TRUE)

General Purpose

The package was written to simplify the analysis of the TCGA / GDAC Datasets, as most of the availible packages don't work so far. The analysis is generally based on the explanation of TriS on Biostars.org: https://www.biostars.org/p/153013/

So what kind of Data needs to be downloaded?

Using the GDAC firehose and cBioportal one can get access to the curated level 3 data of the TCGA / GDAC consortium and download it, nevertheless, the data from www.cBioportal.org has apperently the more recent data for mutations, and that's why you can have to define in the parameters, which kind of data you are using.

source = "cBio" for cBioportal Data, that was not zscored (will be done by the script with internally but not against normal samples as it is done by cBioportal) (http://www.cbioportal.org/data_sets.jsp),

source = "cBio_z_score" for the data that was zscored by cBioportal

and source = "firehose" for the firehose data (http://gdac.broadinstitute.org/).

Using Firehose: access the RNASeq Level 3 RSEM genes normalized expression Data
Using cBioportal: z-Score normalized gene expression data (mRNA z-Scores (RNA Seq V2 RSEM) compared to the expression distribution of each gene tumors that are diploid for this gene)
Using cBioportal: Expression levels for 20532 genes in 606 ccrcc cases (RNA Seq V2 RSEM)

Using Data from Firehose

For the clinical data please download:

clinical merged data

For the expression data please download:

"illuminahiseq rnaseqv Level 3 RSEM genes normalized" data

If you want to include mutational data, please download:

Mutation Packager Oncotated Calls

Using Data from cBioportal

For the clinical data please use:

data_bcr_clinical_data_patient.txt

For the expression data please use either:

data_RNA_Seq_v2_mRNA_median_Zscores.txt
or data_RNA_Seq_v2_expression_median.txt

If you want to include mutational data, please use:

Firehose: Have a folder with the patient mutation files and link to that mutations\
cBioportal: data_mutations_extended.txt

How to use the package:

I will show the example for the cBioportal Data, but working with the data from Firehose is quite similar (I will explain the differences when needed)

Load the package

library(TCGADataFelix)

  # for the expressionset functions:
  library(Biobase)
  # for the modification of the output
  library(ggplot2)

Define filepath for TCGA patient data downloaded from:

filepath <- "~/R/Projects/TCGA-Elisa/tcga_KIRC_cBio/"

# The filename of the clinical data
clinical_data <- "data_bcr_clinical_data_patient.txt"  

# The filename of the expression data
expression_data <- "data_RNA_Seq_v2_expression_median.txt"

For the mutation Data, cBioportal and Firehose Data differs, as Firehose gives one file for each patient in a folder and cBioportal one file with all mutations and patients in.

Importing Firehose mutational data:

# The Folder of the mutation data (each patient has one file)
mutation_data <- "KIRC.Mutation_Packager_Oncotated_Calls/"

Importing cBioportal mutational data:

# The File of the mutation data
mutation_data <- "data_mutations_extended.txt"

Load the expression Data using build_TCGA_Eset():

values for source: firehose for Firehose data cBio for not zScored cBioportal data * cBio_z_score for zScored cBioportal data

values for skip:

So far you need to test which skip is the right one for the clinical data. Check in the clinical data file at which row the header for the clinical data is and use this row number minus 1 as the value for skip.

source <- "cBio"
Eset <- build_TCGA_Eset(clinical_file = paste(filepath,clinical_data, sep = ""),
                        expression_file = paste(filepath,expression_data, sep = ""),
                        source = source,
                        skip = 4
                        )

Load the mutation data using fetch_TCGA_mutations()

fetch_TCGA_mutations(path = paste(filepath,mutation_data,sep=""),
                     source = source
                     )

The Mutations are now loaded into a new variable: mutationdata

Every mutation from one patient is one row.
Some mutations are duplicated!, so take care to remove duplicates by using unique(). add_mutations() takes care of that.

Newly added columns to the clinical data by the add_mutations() function

Columns are added to: pData(Eset)

To get an overview over the type and number of mutations in the dataset. Attention: Some patients and mutations are duplicated.

head(sort(table(mutationdata$Hugo_Symbol), decreasing=T), n= 30L)

Define now the mutations, you want to analyze and load them into the ExpressionSet using add_mutation()

If more than one mutation was entered, a additional combined mutation column will be added

multiple = "all" -> a patient that has all mutations present
multiple = "one" -> mutation is in at least one patient

mutations <- c("VHL","PBRM1")
mutset <- add_mutation(Eset,
                       mutations,
                       multiple="all")

Use and analyze the data

Define now, what you want to analyze as a factor for the survival data:

gene <- "FOXD1"

Do a survival analysis using Survival_adaptable()

Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 4,
                   average = "median",
                   optimal=T,
                   plot_cutpoint=F,
                   plot_title = paste("Kaplan Meier Estimator of", gene, "Expression and", paste(mutations, collapse = " & "), "\nstatus in ccRCC (KIRC, TCGA)")
                   )

How to continue

We can analyze now various different things, e.g. expression of different genes or categories from the clinical data table, such as treatment schemes etc.

Examples for clinical and expression and mutation:

Variable | input ----------| --------- gene | one gene using "gene" gene | multiple genes using c("gene1","gene2","gene3",...) clinical | a factor from the clinical dataset, e.g.: "GRADE" clinical | multiple factors from the clinical dataset, e.g.: c("GRADE","AGE") mutation | mutations added to the dataset by add_mutation(), e.g. c("TP53", "VHL")

Examples for exclude

You can exclude single factors from clinical by using the parameter exclude = "exclude1" or multiple factors using exclude = c("exlude1","exclude2")

e.g.:

clinical = "GRADE"
exclude = c("G1","GX")

This would exclude the histologic grades g1 and gx from additional

clinical = c("GRADE","AGE")
exclude = c("G1","GX", "old")

This would exclude histological grades G1 and GX and old from the clinical data.

The data has to have at least two factors remaining for survival analysis

Example 1:

Survival_adaptable(clinical = c("GRADE","GENDER"), 
                   Eset = mutset,
                   xlabel="Days",
                   legend_position = "right",
                   legend_rows = 4,
                   exclude=c("[Not Available]","GX","G1","MALE"),
                  plot_title = paste("Kaplan Meier Estimator of histologic grade and patient gender \nin ccRCC (KIRC, TCGA)")
)

Example 2:

Survival_adaptable(clinical = c("GRADE"), Eset = mutset,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 4,
                   mutation = mutations,
                   exclude=c("G1","GX","G2", "[Not Available]"),
                   plot_title = paste("Kaplan Meier Estimator of all high grade tumors with ", paste(mutations, collapse = " & "), "\nstatus in ccRCC (KIRC, TCGA)")
                   )

Possible Factors

The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_fit = T

factor_list <- Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   clinical = "GRADE",
                   average = "median",
                   optimal=T,
                   factor_list = T
                   )
factor_list

Subsetting the ExpressionSet

It can seem quite complicated to use all the different exclude and show_only etc. Therefore you can also subset the ExpressionSet first by using

subset_eset <- mutset[,!mutset$AJCC_PATHOLOGIC_TUMOR_STAGE %in% "Stage IV"]

Add additional categories, e.g. age:

pData(mutset)[,"Age_category"] <- ifelse(pData(mutset)$AGE > 65,"old","young")

Example:

Survival_adaptable(clinical = c("Age_category"), 
                   Eset = mutset,
                   p.val=TRUE,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 4,
                   mutation = mutations
                   )

Modify the output

One can modify the output by putting the graph into a new variable and use it then as a r BiocStyle::CRANpkg("ggplot2") object, using r BiocStyle::CRANpkg("ggpubr"), see: ggpubr reference

For example:

caption = "A user defined caption"
Font of the caption: font.caption = c(10, "plain", "orange"),
legend.title = "override the title of the legend"

The legend can be put in two or more rows using the legend_rows parameter.

The risk table does not work with ggpar()

library(ggpubr)

p <- Survival_adaptable(clinical = c("Age_category"), 
                        Eset = mutset,
                   p.val=FALSE,
                   xlabel="Days",
                   legend_position = "bottom",
                   legend_rows = 3,
                   average = "median",,
                   mutation = mutations,
                   risk_table = FALSE,
                   plot_title = "A dummy Title"
)

p <- ggpar(
  p,
  caption       = "A user defined caption",
  font.caption  = c(10, "plain", "orange"),    
  font.title    = c(16, "bold", "darkblue"),         
  font.subtitle = c(12, "bold.italic", "purple"), 
  font.x        = c(14, "bold.italic", "red"),          
  font.y        = c(10, "bold.italic", "darkred"),      
  font.xtickslab = c(12, "plain", "darkgreen"),
  font.legend   = c(8, "plain", "red"),
  ggtheme       = theme(plot.margin=unit(c(1,1,2,1),"cm")),
  legend.title = "override the title of the legend"
)

p

Return Survival Dataframe

The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_df = T

survival_df <- Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   average = "median",
                   optimal=T,
                   return_df = T
                   )
head(survival_df)

Return Cox Regression Fit

The dataframe that was used to calculate the Cox Regression can be returned by using the parameter return_fit = T.

It is stored in a list:

[[1]] = Fit
[[2]] = coxph
[[3]] = p.value by survdiff()
[[4]] = survival quantiles (5%, 25%, 50%, 75%, 95%)

survival_fit <- Survival_adaptable(expression = c(gene), 
                   Eset = mutset,
                   mutation = mutations,
                   average = "median",
                   optimal=T,
                   return_fit = T
                   )

survival_fit[[1]]$strata

Plot the log hazard ration using smoothCoxph_man

smoothCoxph_man(x = c("BAMBI", "CD44"), Eset=mutset,
                #exclude_values=c("pathologic_stage", "Stage II", "Stage I", "Stage IV"),
                logrisk=T, average="median")

How the exclude works:

c("column of clinical data to look for exclusion", "exclude group 1 from this column", "exclude group 2 from this column", ...)

Calculate the Hazard Ratio for many genes from the dataset:

hazard_list(x = c("FOXA2","FOXD1","TSPAN8","ICAM1"), 
            Eset = Eset,
            additional = "GRADE",
            exclude = c("GX"))

Calculate a Multivariate Analysis

To exclude data: Subset the ExpressionSet if anything should be excluded from analysis

x = c("FOXA2","EEF1A2","SPDEF","GENDER")
Multivariate(x = x,
             Eset = Eset,
             factor_list= F,
             average = "mean",
             optimal=T,
             plot_cutpoint=F,
             return_df=F,
             coef=T,
             survival="overall",
             combined=F
)

# Return List of Factors used for the analysis:
Multivariate(x = x, 
             Eset = Eset,
             factor_list= T, 
             average = "mean",
             optimal=T,
             plot_cutpoint=F,
             return_df=F,
             coef=T,
             survival="overall",
             combined=F
             )

w2felix/TCGADataFelix documentation built on May 19, 2019, 8:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

w2felix/TCGADataFelix
Analyzing the TCGA Data according to gene expression and survival

In w2felix/TCGADataFelix: Analyzing the TCGA Data according to gene expression and survival

General Purpose

Using Data from Firehose

Using Data from cBioportal

How to use the package:

Importing Firehose mutational data:

Importing cBioportal mutational data:

Load the expression Data using build_TCGA_Eset():

Load the mutation data using fetch_TCGA_mutations()

Newly added columns to the clinical data by the add_mutations() function

Use and analyze the data

Do a survival analysis using Survival_adaptable()

How to continue

Examples for clinical and expression and mutation:

Examples for exclude

Possible Factors

Subsetting the ExpressionSet

Add additional categories, e.g. age:

Modify the output

Return Survival Dataframe

Return Cox Regression Fit

Plot the log hazard ration using smoothCoxph_man

Calculate the Hazard Ratio for many genes from the dataset:

Calculate a Multivariate Analysis

R Package Documentation

Browse R Packages

We want your feedback!

w2felix/TCGADataFelix Analyzing the TCGA Data according to gene expression and survival

In w2felix/TCGADataFelix: Analyzing the TCGA Data according to gene expression and survival

General Purpose

Using Data from Firehose

Using Data from cBioportal

How to use the package:

Importing Firehose mutational data:

Importing cBioportal mutational data:

Load the expression Data using build_TCGA_Eset():

Load the mutation data using fetch_TCGA_mutations()

Newly added columns to the clinical data by the add_mutations() function

Use and analyze the data

Do a survival analysis using Survival_adaptable()

How to continue

Examples for clinical and expression and mutation:

Examples for exclude

Possible Factors

Subsetting the ExpressionSet

Add additional categories, e.g. age:

Modify the output

Return Survival Dataframe

Return Cox Regression Fit

Plot the log hazard ration using smoothCoxph_man

Calculate the Hazard Ratio for many genes from the dataset:

Calculate a Multivariate Analysis

R Package Documentation

Browse R Packages

We want your feedback!

w2felix/TCGADataFelix
Analyzing the TCGA Data according to gene expression and survival