knitr::opts_chunk$set( echo=TRUE, warning=FALSE, message=FALSE, error=FALSE)
The FacileDataSet
is a reference implementation of a multi-assay datastore
that implements the FacileData API. It uses a SQLite database to store
feature- and sample-level metadata, and an HDF5 file to store any number of
assays over its samples. Both of these technologies allow quick and efficient
access to arbitrary subsets of data stored on disk without having to load all
of the data into RAM, making this datastore ideal for housing data like
the entirety of the data from The Cancer Genome Atlas.
The FacileTcgaDataSet package outlines how to assemble the multiple assays from the TCGA (RNA-seq, miRNA-seq, CNV, etc.) into one FacileDataSet, and provides examples of how to access these data using the FacileData API.
We'll be a bit less ambitious in this vignette, and only assemble two different RNA-seq datasets into one FacileDataSet. This won't exercise our ability to query across different assays, but we will use this dataset for downstream examples here, and in vignettes in other facile packages.
We will use this vignette to assemble these data in a stepwise fashion,
highlighting some best practices for doing so. These maneuvers will also be
encapsulated in the assemble_example_dataset()
in this package, for other
resources to use, as well.
:::note We will be working on a FacileBioc package that will provide the FacileData API directly over native Bioconductor assay containers. This will allow users to work with these objects directly with out explicitly converting them into a FacileDataSet first. Until then, follow the instructions here to convert data you already have into a FacileDataSet. :::
The easiest way to create a FacileDataSet
is to pass a named list of
well-groomed bioconductor assay containers, like a
SummarizedExperiment
or DGEList
(or ExpressionSet
) to the
as.FacileDataSet
function.
A FacileDataSet
is stored on disk as a well-structured directory. You can
look at the FacileDataSet Architecture section
for the full details, but the important information to know is that:
SummarizedExperiment
are stored in
an HDF5 file.fData
and pData
,
respectively) are stored in an SQLite database.pData
/colData
data.frames of each SummarizedExperiment
are stored
in an SQLite table in an entity-attribute-value-table (EAV) format in order to
support covariates that are both common to all SummarizedExperiment
s and
unique to others.meta.yaml
file that stores important information about the
dataset itself. Perhaps most important is the sample_covariates
section
that dictates how each covariate stored in the EAV table is decoded back
into its native R-representation. More details are provided in the
Entity-Attribute-Value Table section.Here we will use two already-assembled (but unrelated) RNA-seq datasets easily available within the bioconductor ecosystem for data assembly:
The airway dataset. This package provides a RangedSummarizedExperiment object of read counts in genes for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone.
The parathyroidSE dataset. This package provides RangedSummarizedExperiment objects of read counts in genes and exonic parts for paired-end RNA-Seq data from experiments on primary cultures of parathyroid tumors.
:::tip If you are looking for help wrangling raw RNA-seq data into something like a SummarizedExperiment object, we currently favor processing the reads using either salmon or kallisto, then importing these results using the tximport package. :::
library(SummarizedExperiment) # Note that you will need the airway and parathyroidSE packages installed. # If you don't have those, install them like so: # # BiocManager::install(c("airway", "parathyroidSE")) data("airway", package = "airway") data("parathyroidGenesSE", package = "parathyroidSE") # We load this after the Bioconductor stuff, because there are some facile # generic functions that collide w/ Bioconductor ones. `samples` is defined # in Biocbase, for instance. library(FacileData) library(dplyr)
Keep in mind that we will be assembling these different datasets into one container that will combine the assay data from both experiments into one, as well as their gene- and sample- level metadata.
An often under-valued part of the high-througput experiments we tend
to generate is how we document the phenotypic variables over our samples, ie.
the things that populate the colData()
of a SummarizedExperiment. This
part is particularly important when assembling different datasets into one.
Put simply, we will want to ensure that where there is overlap in the "pnenotypic space" between different SummarizedExperiments, their variable (column) names are the same, and their encodings (values in the columns) are the same when they are talking about the same thing. We'll call this process "harmonization".
As an example, one batch of experiments might have a "sex"
covariate, encoded
with "male"
and "female"
entries. Another might encode "sex"
using "m"
and "f"
values. Yet another might have labelled "sex"
as "gender"
, and
encoded the values as "MALE"
and "FEMALE"
. When assembling these different
experiment togheter, you want to ensure that the "sex"
covariate is named
the same across the colData()
objects of the experiments you are assembling,
and further that all of its entires are encoded uniformly as "male"
and
"female"
.
:::note
There is no requirement in this sample-data harmonization process to ensure
that every experiment has the same covariates. Some experiments do not track
the same phenotypic variables as others. This is fine: every
SummarizedExperiment does not need the same elements in its colData()
. We
just want to ensure that when there is overlap, the column names and encodings
match across datasets.
:::
Another thing to consider when encoding covariate values is to ensure that
they are, themselves, encoded as valid R variable names. For instance,
suppose we have a time
factor, with levels like "24h"
, and "48h"
, these
should be re-encoded to something like "h24"
and "h48"
. Here's why:
often times while you're running different sorts of exploratory data analyses,
through some sequence of "gather"
-ing, "spread"
-ing, and joining
sample-level data, it is not uncommon for factor levels to end up as columns of
a data.frame. Ensuring that the factor loevels are "proper R variable" names up
front avoids problems you might run into when using non standard evaluation with
whatever dplyr mojo, model.matrix(...)
, or whatever else these data often find
themselves subject to.
Let's take a look at the colData()
from our two datasets, and see how we might
harmonize them:
as.data.frame(colData(airway))
as.data.frame(colData(parathyroidGenesSE))
Here are some things to consider:
There are only a subset of columns from the two experiments that tell us something about the characteristics of their samples, such as the "cell", and "dex" columns of the airway dataset, and patient, treatment, and time, column of the parathyroid dataset. The "albut" column in the airway dataset is non-informative in this subset of the data, since it is always the same value, so we'll drop that, too.
When combining these two experiments, we can probably consider the "dex" column in the airway dataset as a "treatment", so we can combine them with the same column in the parathyroid dataset. Further, the encodings of the dex column might be transformed from "untrt" and "trt" to "control" and "dex".
The "time" column in the parathyroid experiment should be translated into
valid R variable names, so we'll convert 24h
to hrs24
. The entries
in the parathyroid "patient" column will also be munged in a similar fashion.
We've found naming covariates of this type as "subject_id" covers many cases
when working across different experiments, so we'll rename it to that.
The airway dataset is from cell lines, and the parathyroid dataset is from people. We will introduce a "sample_type" covariate encoded with "cell_line" and "primary" values.
The rest of the colData()
columns will be dropped from both experiments.
With these things in mind, let's create new SummarizedExperiment objects, with
harmonized colData()
.
airway
se.airway <- airway colData(se.airway) <- local({ cd <- colData(airway) |> as.data.frame() |> transmute( sample_type = "cell_line", cell_line = cell, treatment = ifelse(dex == "untrt", "control", "dex")) |> DataFrame() rownames(cd) <- colnames(se.airway) cd })
parathyroid
se.parathyroid <- parathyroidGenesSE colData(se.parathyroid) <- local({ cd <- colData(parathyroidGenesSE) |> as.data.frame() |> transmute( sample_type = "primary", subject_id = paste0("patient_", patient), treatment = tolower(as.character(treatment)), time = paste0("hrs", sub("h$", "", time))) |> DataFrame() rownames(cd) <- colnames(se.parathyroid) cd })
When you are assembling assay data together across different experiments, you need to decide which assays are the same, and which are different. For instance, if you have a series of experiments with paired-end RNA-seq data, and another set of experiments with single-end RNA-seq data: these aren't the same assays, are they? ... or?
You might decide to assemble these data together using two different assay
types, one named "gene_count_PE"
and the other "gene_count_SE"
. There will
be a disjoint set of samples that have assay data from one or the other. You
then might decide to make a third assay type, "gene_count_norm"
which somehow
normalizes these data together and make that available for exploratory analysis.
In this example, for the sake of simplicity we will assume that the airway
and
parathyroid
experiments were run on the same assay, and load them up together.
This also means that they must share the same feature space for the assay, and
we need to ensure that the rownames()
of the two objects are equal.
Neither dataset provides gene information for its rows aside from the Ensembl
identifiers. Because these were both quantitated against v75 (ish) of Ensembl's
gene annotations, we'll iuse the biomaRt package to retrieve the
information from the assembly close to that (biomaRt::listEnsemblArchives()
points us to the URL we should query).
bm <- loadNamespace("biomaRt") mart <- bm$useMart( host = "feb2014.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl") mart.info <- bm$getBM( attributes = c("ensembl_gene_id", "hgnc_symbol", "gene_biotype"), filters = "ensembl_gene_id", values = rownames(se), mart = mart)
The query above retrieves the data we need, but it can take a long time to run. To save time while compiling this vignette, we have serialized the result of the query, and will use that here.
mart.info <- local({ fn <- system.file("extdata", "ensembl-v75-gene-info.csv.gz", package = "FacileData") read.csv(gzfile(fn, "rt"), stringsAsFactors = FALSE) })
The column names of the gene.info
table need to be tweaked, and we will only
keep the rows (genes) that are shared between our two datasets.
:::note This mismatched gene matching wouldn't be an issue if these data were from the same assay and processed in the same way. When you are assembling assay data into single assay of a FacileDataSet, the implicit assumption is that all of the data were processed in the same way. :::
shared.ids <- intersect(rownames(se.airway), rownames(se.parathyroid)) gene.info <- mart.info |> transmute(feature_id = ensembl_gene_id, feature_type = "ensgid", name = hgnc_symbol, meta = gene_biotype, source = "Ensembl_v75") |> filter(feature_id %in% shared.ids) |> distinct(feature_id, .keep_all = TRUE) |> DataFrame() rownames(gene.info) <- gene.info[["feature_id"]]
Now we can cleanup the rowdata of our SummarizedExperiment and make them row-wise concordant.
se.airway <- se.airway[rownames(gene.info),] rowData(se.airway) <- gene.info se.parathyroid <- se.parathyroid[rownames(gene.info),] rowData(se.parathyroid) <- gene.info
Like we've been saying all along ... all you need is a named list of well-curated bioconductor experiments, and here we are!
We initialize a new FacileDataSet with our list of SummarizedExperiments (you can even just have one SummarizedExperiment), and specify which assay you are dropping in first.
se.all <- list(airway = se.airway, parathyroid = se.parathyroid) xfds <- as.FacileDataSet( se.all, path = tempfile(), dataset_name = "ExampleRnaFacileDataSet", assay_name = "gene_counts", assay_description = "Gene counts from their respective Bioconductor packages", assay_type = "rnaseq", organism = "Homo sapiens")
Let's look at how the correlation of TLN1 and ITGB1 change across datasets
library(ggplot2) theme_set(theme_bw()) dat <- samples(xfds) |> with_assay_data(c("ENSG00000137076", "ENSG00000150093")) |> with_sample_covariates("treatment") ggplot(dat, aes(ITGB1, TLN1)) + geom_point(aes(color = treatment)) + facet_wrap(~ dataset)
bdat <- samples(xfds) |> fetch_assay_data(c("ENSG00000137076", "ENSG00000150093"), normalized = TRUE) |> with_sample_covariates("treatment") ggplot(bdat, aes(x = feature_name, y =value)) + geom_boxplot(outlier.shape = NA) + geom_jitter(aes(color = treatment), width = 0.2) + facet_wrap(~ dataset)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.