knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, cache = TRUE, tidy = TRUE, width.cutoff = 70 )
This overview provides insight into the available datasets (R package version
r utils::packageVersion("curatedPCaData")
) provided via ExperimentHub
cloud
services. The main data class is a MultiAssayExperiment
(MAE) object
compatible with numerous Bioconductor packages.
3 different omics base data types and accompanying clinical/phenotype data are currently available:
gex.*
assays contain gene expression values, with the suffix wildcard
indicating unit or method for gene expressioncna.*
assays contain copy number values, with the suffix wildcard
indicating method for copy number alterationsmut
assays contain somatic mutation callsMultiAssayExperiment::colData(maeobj)
contains the clinical metadata
curated based on a pre-defined templateTheir availability is subject to the study in question, and you will find
coverage of the omics here-in. Furthermore, derived variables based on these
base data types are provided in the constructed MultiAssayExperiment
(MAE)
class objects.
For a comprehensive guide on how to neatly handle such MAE
objects, refer to
the MultiAssayExperiment user guide (or cheat-sheets): [MAE User Guide]
(https://www.bioconductor.org/packages/devel/bioc/vignettes/MultiAssayExperiment/inst/doc/MultiAssayExperiment.html) .
The curatedPCaData
package contains a collection of manually curated datasets
concerning patients diagnosed with prostate cancer. The datasets within this
package have followed uniform processing and naming conventions to allow users
to more easily reproduce similar analyses between datasets and spend less time
concerned with harmonzing data from different sources.
To get a full list of available datasets see the documentation for getPCa
function, or via querying ExperimentHub
directly for the components used to
construct MultiAssayExperiments
for the studies. However, getPCa
is aimed
to comprehensively provide readily usable multi-omics compatible MAE-objects.
Fetching all datasets available in curatedPCaData
:
library(curatedPCaData) # Use a function to extract all known study short identifiers studies <- curatedPCaData::getPCaStudies() studies # List apply across studies to extract all MAE objects corresponding to the # short identifiers maes <- lapply(studies, FUN=\(id) { curatedPCaData::getPCa(id) }) names(maes) <- studies
The datasets were manually selected based on various criteria, such as:
The function getPCa
utilizes the studies' short name for identifying which
data to extract. An overview into the main datasets is as follows:
# Create a summary table depicting key features in available studies studytable <- curatedPCaData::getPCaSummaryStudies(maes)
knitr::kable(studytable, caption = "Key study characteristics")
Please note that the TCGA PCa dataset is a subset of the TCGA pan-cancer initiative. For a package focused on TCGA exclusively beyond the PRAD subset, see the Bioconductor package [curatedTCGAData] (https://bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html).
The use of curatedPCaData
ought to be cited with [@Laajala2023].
For individual datasets there-in, the following citations are suggested:
The curatedPCaData
-package has been curated with an emphasis on the following
primary clinical metadata, which were extracted and cleaned up always when
available:
data(template_prad) template <- template_prad # Add spaces to |-dividers for linechanges template <- do.call("cbind", lapply(template, FUN = function(x) { gsub("|", " | ", x, fixed = TRUE) })) knitr::kable(template, caption = "Template for prostate adenocarcinoma clinical metadata")
Three primary clinical end-points were utilized and are offered in the clinical metadata in colData for the MAE-objects, if available:
Below are summaries for each of these endpoints for each study. Of note, OS had very few events, thus survival modelling for this end-point may be considered unreliable.
Gleason grades:
# Create a summary table of Gleason grades gleasons <- curatedPCaData::getPCaSummaryTable( maes, var.name = "gleason_grade", vals=5:10 )
knitr::kable(gleasons, caption = "Gleason grades across datasets in curatedPCaData")
Grade groups:
# Create a summary table of grade groups gradegroups <- curatedPCaData::getPCaSummaryTable( maes, var.name = "grade_group", vals=c("<=6", "3+4", "4+3", "7", ">=8") )
knitr::kable(gradegroups, caption = "Grade groups across datasets in curatedPCaData")
Biochemical recurrences:
# Create a summary table of biochemical recurrences recurrences <- curatedPCaData::getPCaSummarySurv( maes, event.name = "disease_specific_recurrence_status", time.name = "days_to_disease_specific_recurrence" )
knitr::kable(recurrences, caption = "Disease recurrence end point across datasets in curatedPCaData")
Overall survival:
# Create a summary table of overall survival survivals <- curatedPCaData::getPCaSummarySurv( maes, event.name = "overall_survival_status", time.name = "days_to_overall_survival" )
knitr::kable(survivals, caption = "Overall survival end point across datasets in curatedPCaData")
The function getPCa
functions as the primary interface with building
MAE-objects from either live download from ExperimentHub
or by loading them
from local cache, if the datasets have been downloaded previously.
The syntax for the function getPCa(dataset, assays, timestamp, verbose, ...)
consists of the following parameters:
dataset
: Primary indicator for which study to query from ExperimentHub
;
notice that this may only be one of the allowed values.assays
: This indicates which MAE-assays are fetched from the candidate
ExperimentList. Two names are always required (and are filled if missing):
colData
which contains information on the clinical metadata, and sampleMap
which maps the rownames of the metadata to columns in the fetched assay data. timestamp
: When data is deposited in the ExperimentHub
resources, they
are time stamped to avoid ambiguity. The timestamps provided in this parameter
are resolved from left to right, and the first deposit stamp is "20230215"
. verbose
: Logical indicator whether additional information should be printed
by getPCa
....
: Further custom parameters passed on to getPCa
.As an example, let us consider querying the TCGA dataset, but suppose only wish to extract the gene expression data, and the immune deconvolution results derived by the method xCell. Further, we'll request risk and AR scores slot. This subset could be retrieved with:
tcga_subset <- getPCa( dataset = "tcga", assays = c("gex.rsem.log", "xcell", "scores"), timestamp = "20230215" ) tcga_subset
The standard way of extracting the latest MAE-object with all available assays is done via querying with just the dataset name:
mae_tcga <- getPCa("tcga") mae_taylor <- getPCa("taylor")
The primary assay names in the MAE objects for gene expression and copy number
alteration will consist of two parts. Mutation data is provided as a
RaggedExperiment
object.
RaggedExperiment
objects as "mut".The standard way for accessing a data slot in MAE could be done for example via:
mae_taylor[["gex.rma"]][1:5, 1:5]
The corresponding clinical variables have an accessor function colData
provided by the MultiAssayExperiment
-package:
MultiAssayExperiment::colData(mae_tcga)[1:2, ]
While it is ideal to make sure user is using the correct namespaces, the
pckgName::
can be omitted as curatedPCaData
imports necessary packages such
as MultiAssayExperiment
and their functions should be available in the
workspace.
In order to access the latest listing of curatedPCaData
related resources
available in ExperimentHub
, consult the metadata.csv
file delivered with
the package:
metadata <- read.csv(system.file("extdata", "metadata.csv", package = "curatedPCaData")) head(metadata)
# Retrieve samples counts across different unique assay names as well as omics # overlap sample counts samplecounts <- curatedPCaData::getPCaSummarySamples(maes)
The sample counts in each 'omics separately is listed below:
knitr::kable(samplecounts$Samples, caption = "Sample N counts in each omics for every MAE object")
However, taking intersections between different omics shows that different samples were analyzed on different platforms - therefore the effective N counts for analyzing multiple 'omics platforms simultaneously is smaller. The overlaps between gene expression (GEX), copy number alteration (CNA), and mutations (MUT) are shown below:
knitr::kable(samplecounts$Overlap, caption = "Sample N counts for intersections between different omics")
In curatedPCaData
we refer to derived variables as further downstream
variables, which have been computed based on primarily data. For most cases,
this was done by extracting key gene information from the gex.*
assays and
pre-computing informative downstream markers as described in their primary
publications.
Tumor progression depends on the immune cell composition in the tumor
microenvironment. The 'immunedeconv'
package consists of different computational methods to computationally estimate
immune cell content using gene expression data. In addition, CIBERTSORTx is
provided externally, as this method required registered access. For user
convenience, it has been run separately and provided as a slot in the MAE
objects. The other methods have been run using the immunedeconv
package
[@Sturm2019] and code for reproducing these derived variables are provided
alongside the package.
In this package, we provide estimates of immune cell content from the following deconvolution methods:
The estimates from each of these methods are stored in the MAE object as a seperate assay as shown for example in the Taylor dataset
mae_taylor
To access the quantiseq results for the Taylor et. al dataset, these pre-computed values can be obtained from the corresponding slot in the MAE-object:
head(mae_taylor[["cibersort"]])[1:5, 1:3]
Similarly to access results from the other immune deconvolution methods, the following assays/experiments are also available:
head(mae_taylor[["quantiseq"]])[1:5, 1:3] head(mae_taylor[["xcell"]])[1:5, 1:3] head(mae_taylor[["epic"]])[1:5, 1:3] head(mae_taylor[["mcp"]])[1:5, 1:3]
Each row of the deconvolution matrix represents the content of a certain immune cell type and the columns represent the patient sample IDs. The variables on the rows are specific for each method. Further, it should be noted that not all methods could be run on all datasets due to lack of overlap in genes of interest.
The slot scores
is used to provide key risk scores or other informative
metrics based on the primary data. These scores can be accessed as a matrix as
if they were variables on an assay with this name:
mae_tcga[["scores"]][, 1:4]
The following PCa risk scores are offered:
(rowname: decipher)
[@Herlemann2019](rowname: oncotype)
[@Knezevic2013](rowname: prolaris)
[@NICE2018]Further, the 20-gene Androgen Receptor (AR) score is calculated as described in the TCGA's Cell 2015 paper:
(rowname: ar_score)
[@TCGA2015]Here, a brief example on how to download and process a single study is provided. The example data is of Taylor et al. [@Taylor2010], also known as the MSKCC dataset.
A character vector with the short study ID is used to download the MAE object; we will focus only on the primary prostate cancer samples and CNA (GISTIC) and GEX:
taylor <- getPCa("taylor", assays = c("gex.rma", "cna.gistic"), sampletypes = "primary" ) class(taylor) taylor
One typical end-point is an object of type Surv
, exported from the
survival
-package. We will create this end-point for biochemical
recurrence:
library(survival) # BCR events taylor_bcr <- colData(taylor)$disease_specific_recurrence_status # BCR events / censoring follow-up time taylor_fu <- colData(taylor)$days_to_disease_specific_recurrence taylor_surv <- Surv(event = taylor_bcr, time = taylor_fu) class(taylor_surv) head(taylor_surv)
With the response vector of type Surv
, one can plot and analyze multiple
survival modelling related tasks.
One of the most common ways to depict survival curves (here BCR events), is a Kaplan-Meier (KM) curve. Gleason grade is known to be a good prognostic factor for BCR, thus a KM curve in respect to biopsy Gleason grade shows differences in prognosis.
For this visualization, package survminer
offers a variety of functions
building on top of ggplot
:
library(survminer) taylor_bcr_gleason <- data.frame(bcr = taylor_surv, gleason = colData(taylor)$gleason_grade) fit <- survfit(bcr ~ gleason, data = taylor_bcr_gleason) ggsurvplot(fit, data = taylor_bcr_gleason, ylab = "Biochemical recurrence free proportion", risk.table = TRUE, size = 1, pval = TRUE, ggtheme = theme_bw() )
For more settings for the KM plots, see documentation for
survminer::ggsurvplot
.
Functions longFormat
and wideFormat
from MultiAssayExperiment
are
essential for extracting multi-omics data and metadata in the right format.
For the purposes of Cox regression, we will utilize wideFormat
:
taylor_coxdat <- MultiAssayExperiment::wideFormat(taylor["PTEN",,], colDataCols = c("age_at_initial_diagnosis", "gleason_grade", "disease_specific_recurrence_status", "days_to_disease_specific_recurrence")) taylor_coxdat <- as.data.frame(taylor_coxdat) taylor_coxdat$y <- Surv( time = taylor_coxdat$days_to_disease_specific_recurrence, event = taylor_coxdat$disease_specific_recurrence_status) head(taylor_coxdat)
We'll construct a simple Cox proportional hazards model with few variables; PTEN is a known tumor suppressor gene, so changes in its copy number or gene expression levels could also play a role in biochemical recurrence.
coxmodel <- coxph(y ~ cna.gistic_PTEN + gex.rma_PTEN + gleason_grade, data = taylor_coxdat) coxmodel
In this case we took the GISTIC normalized PTEN amplification as well as its RMA-normalized gene expression. We notice that both are statistically significant Cox regression coefficients together with Gleason grade.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.