The TCGAutils
package completes a suite of Bioconductor packages for
convenient access, integration, and analysis of The Cancer Genome Atlas.
It includes:
0. helpers for working with TCGA through the Bioconductor packages
r Biocpkg("MultiAssayExperiment")
(for coordinated representation and
manipulation of multi-omits experiments) and r Biocpkg("curatedTCGAData")
,
which provides unrestricted TCGA data as MultiAssayExperiment
objects,
0. helpers for importing TCGA data as from flat data structures such as
data.frame
or DataFrame
read from delimited data structures provided by
the Broad Institute’s Firehose, GenomicDataCommons, and
0. functions for interpreting TCGA barcodes and for mapping between
barcodes and Universally Unique Identifiers (UUIDs).
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("TCGAutils")
Required packages for this vignette:
library(TCGAutils) library(curatedTCGAData) library(MultiAssayExperiment) library(RTCGAToolbox) library(BiocFileCache) library(rtracklayer) library(R.utils)
curatedTCGAData
utility functionsFunctions such as getSubtypeMap
and getClinicalNames
provide information
on data inside a r Biocpkg("MultiAssayExperiment")
object downloaded from
r Biocpkg("curatedTCGAData")
. sampleTables
and splitAssays
support useful
operations on these MultiAssayExperiment
objects.
MultiAssayExperiment
objects from curatedTCGAData
For demonstration we download part of the Colon Adenocarcinoma (COAD) dataset
usingcuratedTCGAData
via ExperimentHub
. This command will download any
data type that starts with CN*
such as CNASeq
:
coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD", assays = c("CNASeq", "Mutation", "miRNA*", "RNASeq2*", "mRNAArray", "Methyl*"), dry.run = FALSE)
For a list of all available data types, use dry.run = FALSE
and an
asterisk *
as the assay input value:
curatedTCGAData("COAD", "*")
sampleTables
: what sample types are present in the data?The sampleTables
function gives a tally of available
samples in the dataset based on the TCGA barcode information.
sampleTables(coad)
For reference in interpreting the sample type codes, see the sampleTypes
table:
data("sampleTypes") head(sampleTypes)
splitAssays
: separate the data from different tissue typesTCGA datasets include multiple -omics for solid tumors, adjacent normal
tissues, blood-derived cancers and normals, and other tissue types, which may
be mixed together in a single dataset. The MultiAssayExperiment
object
generated here has one patient per row of its colData
, but each patient may
have two or more -omics profiles by any assay, whether due to assaying of
different types of tissues or to technical replication. splitAssays
separates
profiles from different tissue types (such as tumor and adjacent normal) into
different assays of the MultiAssayExperiment
by taking a vector of sample
codes, and partitioning the current assays into assays with an appended sample
code:
(tnmae <- splitAssays(coad, c("01", "11")))
The r Biocpkg("MultiAssayExperiment")
package then provides functionality to
merge replicate profiles for a single patient (mergeReplicates()
), which
would now be appropriate but would not have been appropriate before
splitting different tissue types into different assays, because that would
average measurements from tumors and normal tissues.
MultiAssayExperiment
also defines the MatchedAssayExperiment
class, which
eliminates any profiles not present across all assays and ensures identical
ordering of profiles (columns) in each assay. In this example, it will match
tumors to adjacent normals in subsequent assays:
(matchmae <- as(tnmae[, , c(4, 6, 7)], "MatchedAssayExperiment"))
Only about 12 participants have both a matched tumor and solid normal sample.
getSubtypeMap
: manually curated molecular subtypesPer-tumor subtypes are saved in the metadata
of the colData
slot of MultiAssayExperiment
objects downloaded from curatedTCGAData
.
These subtypes were manually curated from the supplemental tables of all
primary TCGA publications:
getSubtypeMap(coad)
getClinicalNames
: key "level 4" clinical & pathological dataThe curatedTCGAData
colData
contain hundreds of columns, obtained from
merging all unrestricted levels of clinical, pathological, and biospecimen data.
This function provides the names of "level 4" clinical/pathological variables,
which are the only ones provided by most other TCGA analysis tools.
Users may then use these variable names for subsetting or analysis, and may
even want to subset the colData
to only these commonly used variables.
getClinicalNames("COAD")
Warning: some names may not exactly match the colData
names in the object
due to differences in variable types. These variables are kept separate and
differentiated with x
and y
. For example, vital_status
in this case
corresponds to two different variables obtained from the pipeline. One variable
is interger type and the other character:
class(colData(coad)[["vital_status.x"]]) class(colData(coad)[["vital_status.y"]]) table(colData(coad)[["vital_status.x"]]) table(colData(coad)[["vital_status.y"]])
Such conflicts should be inspected in this manner, and conflicts resolved by choosing the more complete variable, or by treating any conflicting values as unknown ("NA").
This section gives an overview of the operations that can be performed on
a given set of metadata obtained particularly from data-rich objects such
as those obtained from curatedTCGAData
. There are several operations that
work with microRNA, methylation, mutation, and assays that have gene symbol
annotations.
CpGtoRanges
Using the methylation annotations in
IlluminaHumanMethylation450kanno.ilmn12.hg19
and the minfi
package, we
look up CpG probes and convert to genomic coordinates with CpGtoRanges
.
The function provides two assays, one with mapped probes and the other with
unmapped probes. Excluding unmapped probes can be done by setting the
unmapped
argument to FALSE
. This will run for both types of methylation
data (27k and 450k).
methcoad <- CpGtoRanges(coad)
mirToRanges
microRNA assays obtained from curatedTCGAData
have annotated sequences
that can be converted to genomic ranges using the mirbase.db
package.
The function looks up all sequences and converts them to ('hg19') ranges.
For those rows that cannot be found, an 'unranged' assay is introduced
in the resulting MultiAssayExperiment object.
mircoad <- mirToRanges(coad)
qreduceTCGA
The qreduceTCGA
function converts RaggedExperiment
mutation data objects
to RangedSummarizedExperiment
using org.Hs.eg.db
and the qreduceTCGA
utility function from RaggedExperiment
to summarize 'silent' and 'non-silent'
mutations based on a 'Variant_Classification' metadata column in the original
object.
It uses 'hg19' transcript database ('TxDb') package internally to summarize
regions using qreduceAssay
. The current genome build ('hg18') in the data
must be translated to 'hg19'.
In this example, we first set the appropriate build name in the mutation
dataset COAD_Mutation-20160128
according to the
NCBI website
and we then use seqlevelsStyle
to match the UCSC
style in the chain.
rag <- "COAD_Mutation-20160128" # add the appropriate genome annotation genome(coad[[rag]]) <- "NCBI36" # change the style to UCSC seqlevelsStyle(rowRanges(coad[[rag]])) <- "UCSC" # inspect changes seqlevels(rowRanges(coad[[rag]])) genome(coad[[rag]])
Now we use liftOver
from rtracklayer
to translate 'hg18' builds
to 'hg19' using the downloaded chain file.
lifturl <- "http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18ToHg19.over.chain.gz" bfc <- BiocFileCache() qfile <- bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]] cfile <- if (length(qfile) && file.exists(qfile)) { bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]] } else { bfcadd(bfc, "18to19chain", lifturl) } chainfile <- file.path(tempdir(), gsub("\\.gz", "", basename(cfile))) R.utils::gunzip(cfile, destname = chainfile, remove = FALSE) chain <- suppressMessages( rtracklayer::import.chain(chainfile) ) ranges19 <- rtracklayer::liftOver(rowRanges(coad[[rag]]), chain)
The same can be done to convert hg19
to hg38
(the same build that the
Genomic Data Commons uses) with the corresponding chain file:
liftchain <- "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz" bfc <- BiocFileCache() q38file <- bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]] c38file <- if (length(q38file) && file.exists(q38file)) { bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]] } else { bfcadd(bfc, "19to38chain", liftchain) } cloc38 <- file.path(tempdir(), gsub("\\.gz", "", basename(c38file))) R.utils::gunzip(c38file, destname = cloc38, remove = FALSE) chain38 <- suppressMessages( rtracklayer::import.chain(cloc38) ) ## then use the liftOver function using the 'chain38' object ## as above ranges38 <- rtracklayer::liftOver(unlist(ranges19), chain38)
This will give us a list of ranges, each element corresponding to a single row
in the RaggedExperiment
. We remove rows that had no matches in the liftOver
process and replace the ranges in the original RaggedExperiment
with the
replacement method. Finally, we put the RaggedExperiment
object back into the
MultiAssayExperiment
.
re19 <- coad[[rag]][as.logical(lengths(ranges19))] ranges19 <- unlist(ranges19) genome(ranges19) <- "hg19" rowRanges(re19) <- ranges19 # replacement coad[["COAD_Mutation-20160128"]] <- re19 rowRanges(re19)
Now that we have matching builds, we can finally run the qreduceTCGA
function.
coad <- qreduceTCGA(coad, keep.assay = TRUE)
symbolsToRanges
In the cases where row annotations indicate gene symbols, the symbolsToRanges
utility function converts genes to genomic ranges and replaces existing
assays with RangedSummarizedExperiment
objects. Gene annotations are given
as 'hg19' genomic regions.
symbolsToRanges(coad)
A few functions in the package accept either files or classes such as
data.frame
and FirehoseGISTIC
as input and return standard Bioconductor
classes.
makeGRangesListFromExonFiles
The GenomicDataCommons
package can be used to obtain 'legacy' exon
quantification files via:
Note. File downloads disabled on Windows due to long file names.
library(GenomicDataCommons) queso <- files(legacy = TRUE) %>% filter( ~ cases.project.project_id == "TCGA-COAD" & data_category == "Gene expression" & data_type == "Exon quantification") gdc_set_cache(directory = tempdir())
We then use makeGRangesListFromExonFiles
to create a GRangesList
from
vectors of file paths. There are options to provide file names when file names
are too long to download (Windows OS). The nrows
argument only keeps the
first 5 rows in each of the files read in due to invalid character exon ranges.
## FALSE until gdcdata works qu <- manifest(queso) qq <- gdcdata(qu$id[1:4]) makeGRangesListFromExonFiles(qq, nrows = 4)
Note GRangesList
objects must be converted to r Biocpkg("RaggedExperiment")
class to incorporate them into a MultiAssayExperiment
.
Due to file name length, Windows may not be able to read / display all files.
The workaround uses the fileNames
argument from a character vector of file
names and will convert them to TCGA barcodes.
## Load example file found in package pkgDir <- system.file("extdata", package = "TCGAutils", mustWork = TRUE) exonFile <- list.files(pkgDir, pattern = "cation\\.txt$", full.names = TRUE) exonFile ## We add the original file prefix to query for the UUID and get the ## TCGAbarcode filePrefix <- "unc.edu.32741f9a-9fec-441f-96b4-e504e62c5362.1755371." ## Add actual file name manually makeGRangesListFromExonFiles(exonFile, fileNames = paste0(filePrefix, basename(exonFile)))
makeGRangesListFromCopyNumber
Other processed, genomic range-based data from TCGA data can be imported using
makeGRangesListFromCopyNumber
. This tab-delimited data file of copy number
alterations from bladder urothelial carcinoma (BLCA) was obtained from the
Genomic Data Commons and is included in TCGAUtils
as an example:
grlFile <- system.file("extdata", "blca_cnaseq.txt", package = "TCGAutils") grl <- read.table(grlFile) head(grl) makeGRangesListFromCopyNumber(grl, split.field = "Sample") makeGRangesListFromCopyNumber(grl, split.field = "Sample", keep.extra.columns = TRUE)
makeSummarizedExperimentFromGISTIC
This function is only used for converting the FirehoseGISTIC
class of the
r Biocpkg("RTCGAToolbox")
package. It allows the user to obtain thresholded
by gene data, probabilities and peak regions.
tempDIR <- tempdir() co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE, destdir = tempDIR) selectType(co, "GISTIC") class(selectType(co, "GISTIC")) makeSummarizedExperimentFromGISTIC(co, "Peaks")
mergeColData
: expanding the colData
of a MultiAssayExperiment
This function merges a data.frame
or DataFrame
into the
colData
of an existing MultiAssayExperiment
object. It will match
column names and row names to do a full merge of both data sets. This
convenience function can be used, for example, to add subtype information
available for a subset of patients to the colData
. Here is a simplified
example of adding a column to the colData
DataFrame
:
race_df <- DataFrame(race_f = factor(colData(coad)[["race"]]), row.names = rownames(colData(coad))) mergeColData(coad, race_df)
The TCGA project has generated massive amounts of data. Some data can be
obtained with Universally Unique IDentifiers (UUID) and other
data with TCGA barcodes. The Genomic Data Commons provides a JSON API for
mapping between UUID and barcode, but it is difficult for many people to
understand. TCGAutils
makes simple functions available for two-way
translation between vectors of these identifiers.
Here we translate the first two TCGA barcodes of the previous copy-number alterations dataset to UUID:
(xbarcode <- head(colnames(coad)[["COAD_CNASeq-20160128_simplified"]], 4L)) barcodeToUUID(xbarcode)
Here we have a known case UUID that we want to translate into a TCGA barcode.
UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id")
In cases where we want to translate a known file UUID to the associated TCGA
patient barcode, we can use UUIDtoBarcode
.
UUIDtoBarcode("0001801b-54b0-4551-8d7a-d66fb59429bf", from_type = "file_id")
Translating aliquot UUIDs is also possible by providing a known aliquot UUID to
the function and giving a from_type
, "aliquot_ids":
UUIDtoBarcode("d85d8a17-8aea-49d3-8a03-8f13141c163b", from_type = "aliquot_ids")
Additional UUIDs may be supported in future versions.
We can also translate from file UUIDs to case UUIDs and vice versa as long as
we know the input type. We can use the case UUID from the previous example to
get the associated file UUIDs using UUIDtoUUID
. Note that this translation
is a one to many relationship, thus yielding a data.frame
of file UUIDs for a
single case UUID.
head(UUIDtoUUID("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", to_type = "file_id"))
One possible way to verify that file IDs are matching case UUIDS is to
browse to the Genomic Data Commons webpage with the specific file UUID.
Here we look at the first file UUID entry in the output data.frame
:
https://portal.gdc.cancer.gov/files/0ff55a5e-6058-4e0b-9641-e3cb375ff214
In the page we check that the case UUID matches the input.
Several functions exist for working with TCGA barcodes, the main function being
TCGAbarcode
. It takes a TCGA barcode and returns information about
participant, sample, and/or portion.
## Return participant barcodes TCGAbarcode(xbarcode, participant = TRUE) ## Just return samples TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE) ## Include sample data as well TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE) ## Include portion and analyte data TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE, portion = TRUE)
Based on lookup table values, the user can select certain sample types from a vector of sample barcodes. Below we select "Primary Solid Tumors" from a vector of barcodes, returning a logical vector identifying the matching samples.
## Select primary solid tumors TCGAsampleSelect(xbarcode, "01") ## Select blood derived normals TCGAsampleSelect(xbarcode, "10")
data.frame
representation of barcodeThe straightforward TCGAbiospec
function will take the information contained
in the TCGA barcode and display it in data.frame
format with appropriate
column names.
TCGAbiospec(xbarcode)
We provide a convenience function that investigates metadata within
curatedTCGAData
objects to present a plot of molecular alterations
within a paricular cancer. MultiAssayExperiment
objects are required to
have an identifiable 'Mutation' assay (using text search). The variantCol
argument identifies the mutation type column within the data.
Note. Functionality streamlined from the ComplexHeatmap
package.
oncoPrintTCGA(coad, matchassay = rag)
The TCGAutils
package provides several helper datasets for working with TCGA barcodes.
sampleTypes
As shown previously, the reference dataset sampleTypes
defines sample codes
and their sample types (see ?sampleTypes
for source url).
## Obtained previously sampleCodes <- TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE) ## Lookup table head(sampleTypes) ## Match codes found in the barcode to the lookup table sampleTypes[match(unique(substr(sampleCodes, 1L, 2L)), sampleTypes[["Code"]]), ]
Source: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
clinicalNames
- Firehose pipeline clinical variablesclinicalNames
is a list of the level 4 variable names (the most commonly used
clinical and pathological variables, with follow-ups merged) from each
colData
datasets in curatedTCGAData
. Shipped curatedTCGAData
MultiAssayExperiment
objects merge additional levels 1-3 clinical,
pathological, and biospecimen data and contain many more variables than the ones
listed here.
data("clinicalNames") clinicalNames lengths(clinicalNames)
sessionInfo
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.