Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("curatedTCGAData")

Load packages:

library(curatedTCGAData)
library(MultiAssayExperiment)
library(TCGAutils)

Citing curatedTCGAData

Your citations are important to the project and help us secure funding. Please refer to the References section to see the @Ramos2020-ya and @Ramos2017-og citations. For the BibTeX entries, run the citation function (after installation):

citation("curatedTCGAData")
citation("MultiAssayExperiment")

curatedTCGAData uses MultiAssayExperiment to coordinate and represent the data. Please cite the MultiAssayExperiment Cancer Research publication. You can see the PDF of our public data publication on JCO Clinical Cancer Informatics.

Data versions

curatedTCGAData now has a version 2.0.1 set of data with a number of improvements and bug fixes. To access the previous data release, please use version 1.1.38. This can be added to the curatedTCGAData function as:

head(
    curatedTCGAData(
        diseaseCode = "COAD", assays = "*", version = "1.1.38"
    )
)

Version 2.0.1

Here is a list of changes to the data provided in version 2.0.1

Data source

These data were processed by Broad Firehose pipelines and accessed using r BiocStyle::Biocpkg("RTCGAToolbox"). For details on the preprocessing methods used, see the Broad GDAC documentation.

Downloading datasets

To get a neat table of cancer data available in curatedTCGAData, see the diseaseCodes dataset from TCGAutils. Availability is indicated by the Available column in the dataset.

data('diseaseCodes', package = "TCGAutils")
head(diseaseCodes)

Alternatively, you can get the full table of data available using wildcards '*' in the diseaseCode argument of the main function:

curatedTCGAData(
    diseaseCode = "*", assays = "*", version = "2.0.1"
)

To see what assays are available for a particular TCGA disease code, leave the assays argument as a wildcard ('*'):

head(
    curatedTCGAData(
        diseaseCode = "COAD", assays = "*", version = "2.0.1"
    )
)

Caveats for working with TCGA data

Not all TCGA samples are cancer, there are a mix of samples in each of the 33 cancer types. Use sampleTables on the MultiAssayExperiment object along with data(sampleTypes, package = "TCGAutils") to see what samples are present in the data. There may be tumors that were used to create multiple contributions leading to technical replicates. These should be resolved using the appropriate helper functions such as mergeReplicates. Primary tumors should be selected using TCGAutils::TCGAsampleSelect and used as input to the subsetting mechanisms. See the "Samples in Assays" section of this vignette.

ACC dataset example

(accmae <- curatedTCGAData(
    "ACC", c("CN*", "Mutation"), version = "2.0.1", dry.run = FALSE
))

Note. For more on how to use a MultiAssayExperiment please see the MultiAssayExperiment vignette.

Subtype information

Some cancer datasets contain associated subtype information within the clinical datasets provided. This subtype information is included in the metadata of colData of the MultiAssayExperiment object. To obtain these variable names, use the getSubtypeMap function from TCGA utils:

head(getSubtypeMap(accmae))

Typical clinical variables

Another helper function provided by TCGAutils allows users to obtain a set of consistent clinical variable names across several cancer types. Use the getClinicalNames function to obtain a character vector of common clinical variables such as vital status, years to birth, days to death, etc.

head(getClinicalNames("ACC"))

colData(accmae)[, getClinicalNames("ACC")][1:5, 1:5]

Identifying samples in Assays

The sampleTables function gives an overview of sample types / codes present in the data:

sampleTables(accmae)

You can use the reference dataset (sampleTypes) from the TCGAutils package to interpret the TCGA sample codes above. The dataset provides clinically meaningful descriptions:

data(sampleTypes, package = "TCGAutils")
head(sampleTypes)

Separating samples

Often, an analysis is performed comparing two groups of samples to each other. To facilitate the separation of samples, the TCGAsplitAssays function from TCGAutils identifies all sample types in the assays and moves each into its own assay. By default, all discoverable sample types are separated into a separate experiment. In this case we requested only solid tumors and blood derived normal samples as seen in the sampleTypes reference dataset:

sampleTypes[sampleTypes[["Code"]] %in% c("01", "10"), ]

TCGAsplitAssays(accmae, c("01", "10"))

To obtain a logical vector that could be used for subsetting a MultiAsssayExperiment, refer to TCGAsampleSelect. To select only primary tumors, use the function on the colnames of the MultiAssayExperiment:

tums <- TCGAsampleSelect(colnames(accmae), "01")

TCGAprimaryTumors convenience

If interested in only the primary tumor samples, TCGAutils provides a convenient operation to extract primary tumors from the MultiAssayExperiment representation. The TCGAprimaryTumors function will return only samples with primary tumor (either solid tissue or blood) samples using the above operations in the background:

(primaryTumors <- TCGAprimaryTumors(accmae))

To view the results, run sampleTables again on the output:

sampleTables(primaryTumors)

Keeping colData in an extracted Assay

When extracting a single assay from the MultiAssayExperiment the user can conveniently choose to keep the colData from the MultiAssayExperiment in the extracted assay given that the class of the extracted assay supports colData storage and operations. SummarizedExperiment and its derived data representations support this operation. In this example, we extract the mutation data as represented by a RaggedExperiment which was designed to have colData functionality. The default replacement method is to 'append' the MultiAssayExperiment colData to the RaggedExperiment assay. The mode argument can also completely replace the colData when set to 'replace'.

(accmut <- getWithColData(accmae, "ACC_Mutation-20160128", mode = "append"))
head(colData(accmut)[, 1:4])

Example use of RaggedExperiment

The RaggedExperiment representation provides a matrix view of a GRangesList internal representation. Typical use of a RaggedExperiment involves a number of functions to reshape 'ragged' measurements into a matrix-like format. These include sparseAssay, compactAssay, disjoinAssay, and qreduceAssay. See the RaggedExperiment vignette for details. In this example, we convert entrez gene identifiers to numeric in order to show how we can create a sparse matrix representation of any numeric metadata column in the RaggedExperiment.

ragex <- accmae[["ACC_Mutation-20160128"]]
## convert score to numeric
mcols(ragex)$Entrez_Gene_Id <- as.numeric(mcols(ragex)[["Entrez_Gene_Id"]])
sparseAssay(ragex, i = "Entrez_Gene_Id", sparse=TRUE)[1:6, 1:3]

Users who would like to use the internal GRangesList representation can invoke the coercion method:

as(ragex, "GRangesList")

Exporting Data

MultiAssayExperiment provides users with an integrative representation of multi-omic TCGA data at the convenience of the user. For those users who wish to use alternative environments, we have provided an export function to extract all the data from a MultiAssayExperiment instance and write them to a series of files:

td <- tempdir()
tempd <- file.path(td, "ACCMAE")
if (!dir.exists(tempd))
    dir.create(tempd)

exportClass(accmae, dir = tempd, fmt = "csv", ext = ".csv")

This works for all data classes stored (e.g., RaggedExperiment, HDF5Matrix, SummarizedExperiment) in the MultiAssayExperiment via the assays method which converts classes to matrix format (using individual assay methods).

Session Information {.unnumbered}

Click here to expand

sessionInfo()

References {.unnumbered}



waldronlab/curatedTCGAData documentation built on May 6, 2024, 9:16 p.m.