TCGAbiolinks version bump with new functions

knitr::opts_chunk$set(dpi = 300)
knitr::opts_chunk$set(cache=FALSE)
knitr::opts_chunk$set(echo = TRUE)
New functionalities: TCGAbiolinks version bump
**TCGAbiolinks** is providing functions to query and download TARGET data and also get SummarizedExperiments objects ready to use for downstream analysis. Moreover, a new design has been suggested for function **TCGAanalyze_DEA** to allow more user customization and support of **limma** pipeline wzxhzdk:1

Support of Therapeutically Applicable Research To Generate Effective Treatments (TARGET) data:

TARGET applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. Investigators form a collaborative network to facilitate discovery of molecular targets and translate those findings into the clinic. TARGET is managed by NCI's Office of Cancer Genomics and Cancer Therapy Evaluation Program.

Querying, downloading, and preparing TARGET data:

Below is an example to run to query, download, and prepare data from the TARGET initiative through tcgabiolinks package:

Available TARGET projects:

projects <- getGDCprojects()$project_id
as.data.frame(grep("TARGET", projects,value = TRUE))
#Downloading and prepare TARGET CASE
query_Target <- GDCquery(project = "TARGET-AML", 
                         data.category = "Transcriptome Profiling", 
                         data.type = "Gene Expression Quantification", 
                         workflow.type = "HTSeq - Counts")

samplesDown_Target <- getResults(query_Target, cols = c("cases"))

dataSmTB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
                                         typesample = "TB")

dataSmNB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
                                         typesample = "TBM")
dataSmTB_short_Target <- dataSmTB_Target[1:10]
dataSmNB_short_Target <- dataSmNB_Target[1:10]

queryDown_Target <- GDCquery(project = "TARGET-AML", 
                             data.category = "Transcriptome Profiling",
                             data.type = "Gene Expression Quantification", 
                             workflow.type = "HTSeq - Counts", 
                             barcode = c(dataSmTB_short_Target, dataSmNB_short_Target))

GDCdownload(queryDown_Target)

### SummarizedExperiment = TRUE
data <- GDCprepare(queryDown_Target)

dataPrep_Target <- TCGAanalyze_Preprocessing(object = data, 
                                             cor.cut = 0.6,
                                             datatype = "HTSeq - Counts")                      

dataNorm_Target <- TCGAanalyze_Normalization(tabDF = dataPrep_Target,
                                             geneInfo = geneInfoHT,
                                             method = "gcContent") 

boxplot(dataPrep_Target, outline = FALSE)

boxplot(dataNorm_Target, outline = FALSE)

dataFilt_Target <- TCGAanalyze_Filtering(tabDF = dataNorm_Target,
                                         method = "quantile", 
                                         qnt.cut =  0.25)   

Preparing BRCA data for downstream analysis: Differential Expression Analysis

CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")

query <- GDCquery(project = CancerProject,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

samplesDown <- getResults(query,cols = c("cases"))

dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]

queryDown <- GDCquery(project = CancerProject, 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))

GDCdownload(query = queryDown)

dataPrep1 <- GDCprepare(query = queryDown, 
                        save = TRUE, 
                        save.filename = "TCGA_BRCA_HTSeq_Countds.rda")

dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts")


dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent")
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "geneLength")

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

UseRaw_afterFilter: Keep raw counts after filtering

After applying filtering and normalization steps, this function reassigns raw counts to the filtered data frame of genesXsamples. This step could be useful for methods requiring to work with raw counts only (i.e edgeR pipeline) Arguments are:

### dataframe will have genes from dataFilt but raw counts from dataPrep
dataPrep_raw <- UseRaw_afterFilter(dataPrep, dataFilt)

TCGA_MolecularSubtype: Query subtypes for cancer data:

This function takes a vector of TCGA barcodes as an input and returns a 2 elements list.

--If more data is needed, please check TabSubtypesCol_merged data frame available in the package data.

##use previously fetched BRCA data:
Pam50.subtypes<-TCGA_MolecularSubtype(colnames(dataFilt))

Differential expression analysis with TCGAanalyze_DEA():

TCGAbiolinks is providing a new design for the TCGAanalyze_DEA() function to perform differential expression analysis (DEA) either contrasting two groups or multiple sets by providing a formula for contrast. Limma pipeline was also added on top of the previously implemented one in EdgeR.

The function takes the following arguments for the version bump:

Limma pipeline

This example will perform differential expression analysis between control and case groups matrices downloaded through tcgabiolinks from GDC portal. Other pipeline option could be edgeR.

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmTP_short],
                            mat2 = dataFilt[,dataSmNT_short],
                            pipeline="limma",
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT", ClinicalDF = data.frame())

Customization of contrast using ---contrast.formula--- argument:

contrast.formula argument is a string formula formatted in a way to be used as an input for DEA using limma pipeline. Elements of contrast in the formula should be one of the unique elements in CondTypes vector. The example below shows the syntax contrast.formula argument should have. We are contrasting here between Pam50 molecular subtypes of some lung cancer samples. N.B: Syntax for the contrast formula should be under this form *formula<-"contrast1=type1-type2, contrast2=type2-type3, contrast3=type1-(type2+type3)/2,..."

CondTypes vector should preferably have no space character in its elements.

### ##download and prepare lung data through GDC### ##
query.lung <- GDCquery(project = "TCGA-LUSC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification", 
                       workflow.type = "HTSeq - Counts")

samplesDown.lusc <- getResults(query.lung,cols=c("cases"))

dataSmTP.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
                                       typesample = "TP")

dataSmNT.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
                                       typesample = "NT")
dataSmTP_short.lusc <- dataSmTP.lusc[1:10]
dataSmNT_short.lusc <- dataSmNT.lusc[1:10]
length(dataSmTP.lusc)


queryDown.lung <- GDCquery(project = "TCGA-LUSC", 
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification", 
                           workflow.type = "HTSeq - Counts", 
                           barcode = c(dataSmTP.lusc, dataSmNT.lusc))

GDCdownload(query = queryDown.lung)

dataPrep1.tcga <- GDCprepare(query = queryDown.lung, 
                             save = TRUE, 
                             save.filename = "TCGA_LUSC_HTSeq_Countds.rda")
dataPrep.tcga <- TCGAanalyze_Preprocessing(object = dataPrep1.tcga, 
                                           cor.cut = 0.6,
                                           datatype = "HTSeq - Counts")

dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataPrep.tcga,
                                           geneInfo = geneInfoHT,
                                           method = "gcContent")

dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataNorm.tcga,
                                           geneInfo = geneInfoHT,
                                           method = "geneLength")
dataFilt.tcga <- TCGAanalyze_Filtering(tabDF = dataNorm.tcga,
                                       method = "quantile", 
                                       qnt.cut =  0.25)

### #Filtering data so all samples have a pam50 subtype for LUSC

diff<-setdiff(colnames(dataFilt.tcga), TCGA_MolecularSubtype(colnames(dataFilt.tcga))$filtered)
TCGA_MolecularSubtype(diff)$subtypes$barcodes

dataFilt.tcga.lusc<-dataFilt.tcga[,diff]
LUSC.subtypes<-TCGA_MolecularSubtype(colnames(dataFilt.tcga.lusc))$subtypes$subtype

### Differential expression analysis after correcting for "Plate" factor.
DEG.lusc <- TCGAanalyze_DEA(MAT=dataFilt.tcga.lusc,
                            pipeline="limma",
                            batch.factors = c("Plate"),
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT", ClinicalDF = data.frame(),
                            Condtypes = LUSC.subtypes,
                            contrast.formula = "LUSC.basalvsLUSC.classical=LUSC.basal - LUSC.classical")

TCGAbatch_correction: Handle batch correction and lima-voom transformation

This function calls ComBat from sva package to handle batch correction, display some plots for exploratory data analysis purposes. More specifically, it shows a plot with black as a kernel estimate of the empirical batch effect density and red as the parametric.

On top of data exploration through ComBat plots, if no batch factor is provided nothing will be performed. Run the code below to see an example:

batch.corrected <- TCGAbatch_Correction(tabDF = dataFilt.tcga, batch.factor = "Plate")

batch.corrected.adjusted <- TCGAbatch_Correction(tabDF = dataFilt.tcga, 
                                                 batch.factor = "Plate", 
                                                 adjustment=c("TSS"))

TCGAbatch_correction: working with unpublished datasets

This function calls ComBat from sva package to handle batch correction, display some plots for exploratory data analysis purposes. More specifically, it shows a plot with black as a kernel estimate of the empirical batch effect density and red as the parametric.

On top of data exploration through ComBat plots, if no batch factor is provided nothing will be performed. Run the code below to see an example:

The following example shows how to consider two cancer type LUAD and LUSC and normal and tumor samples. The end-user can replace LUSC with unpublished data for example.

#dataFilt is a gene expression matrix from pancancer33 RNASeq data that can be generated following our previous described workflow

sampleTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "TP")
sampleNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "NT")
dataFiltTP <- dataFilt[,sampleTP]
dataFiltNT <- dataFilt[,sampleNT]

dataSubt_LUAD <-TCGAquery_subtype(tumor = "LUAD")
sampleStageI_LUAD <- dataSubt_LUAD[dataSubt_LUAD$Tumor.stage %in% c("Stage I","Stage IA","Stage IB"),]
dataFilt.LUAD.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUAD$patient]
dataFilt.LUAD.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUAD$patient]

# here for example the end-user can replace LUSC with unpublished data in a dataframe containing counts.
# and adding the information for LUSC.stageI.TP or LUSC.stageI.NT with unpublished data

dataSubt_LUSC <-TCGAquery_subtype(tumor = "LUSC")
sampleStageI_LUSC <- dataSubt_LUSC[dataSubt_LUSC$T.stage %in% c("T1","T1a","T1b"),]
dataFilt.LUSC.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUSC$patient]
dataFilt.LUSC.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUSC$patient]

countsTable <-cbind(dataFilt.LUAD.stageI.TP,dataFilt.LUAD.stageI.NT,
                    dataFilt.LUSC.stageI.TP, dataFilt.LUSC.stageI.NT)

AnnotationCounts <- matrix(0,ncol(countsTable),2)
colnames(AnnotationCounts) <- c("Samples","Batch")
rownames(AnnotationCounts) <- colnames(countsTable)

AnnotationCounts <- as.data.frame(AnnotationCounts)
AnnotationCounts$Samples <- colnames(countsTable)

AnnotationCounts[colnames(dataFilt.LUAD.stageI.TP),"Batch"] <- "LUAD.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUAD.stageI.NT),"Batch"] <- "LUAD.stageI.NT"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.TP),"Batch"] <- "LUSC.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.NT),"Batch"] <- "LUSC.stageI.NT"

countsCorrected <- TCGAbatch_Correction(tabDF =countsTable,
                                        UnpublishedData = TRUE, 
                                        AnnotationDF = AnnotationCounts)

TCGAtumor_purity: Filter TCGA samples according to tumor purity

Function takes TCGA barcodes as input and filter them according to values specified by user. Default value for all of them is 0. Arguments are estimates using different measures imported from the TCGA consortium, and they are as the following (Refer to [1]):

The function returns a list object with "pure_barcodes"" attribute as a vector of pure samples and "filtered"" attribute as filtered samples with no purity info. --For tumor purity info of all measured TCGA samples, please check Tumor.purity data frame available with the package.--

[1] D. Aran, M. Sirota, and A. J. Butte. Systematic pan-cancer analysis of tumour purity. Nat Commun, 6:8971, 2015.

Purity.BRCA <- TCGAtumor_purity(colnames(dataPrep.tcga), 0, 0, 0, 0, 0.7)

Download GTEx data available through the Recount2 project:

Recount2 project has made gene count data available for the scientific community to download. Recount2 is an online resource consisting of RNA-seq gene and exon counts as well as coverage bigWig files for 2041 different studies. It is the second generation of the ReCount project. The raw sequencing data were processed with Rail-RNA as described in the recount2 paper. The purpose of this function is to get data processed under one single pipeline to avoid any technical variabilities affecting any downstream integrative analysis. The returned object will be a list with elements named "project_tissue" and as a SummarizedExperiment (SE) of gene counts per sample. This is particularly useful to study batch-free data or to increase the pool of healthy/normal samples.

Arguments are:

##Brain data through Recount2

brain.rec <- TCGAquery_recount2(project = "gtex", tissue = "brain")


Try the TCGAbiolinks package in your browser

Any scripts or data that you put into this service are public.

TCGAbiolinks documentation built on Nov. 8, 2020, 5:37 p.m.