In facilebio/FacileData: A fluent API for accessing multi-assay high-throughput genomics data

The content here was taken from a pre-release version of the FacileDataSet-assembly.Rmd vignette.

I need to pick through the text below to see if there are informative tidbits I missed in the new version of the FacileDataSet-assembly vignette before nuking this text outright.

A FacileDataSet should be prepared as a named list of SummarizedExperiments (or DGELists, ExpressionSets, etc). It is expected that these objects will share some, but not necessarily all, of their sample annotation columns (colData or pData). Columns with the same names should also have the same encoding, factor levels, etc.

Annotations on the SummarizedExperiments themselves and on the colData columns are also used to fill out the meta.yaml file described below.

ExpressionSet pData data.frames should have an attribute named 'label', which will be a named character vector with a description for each column. In the case of a SummarizedExperiment, the colData should have a named list in the metadata slot with a character description of each column.
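As a minimal sketch of the ExpressionSet case, the snippet below attaches a 'label' attribute to a plain data.frame standing in for pData. The column names and descriptions are made up for illustration, and the final check is only a suggestion, not something the package is known to run.

```r
# Hypothetical sample annotation table (column names are illustrative only).
pdat <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  sex = factor(c("f", "m", "f"), levels = c("f", "m")),
  age = c(54, 61, 47),
  stringsAsFactors = FALSE
)

# Named character vector: one description per pData column.
attr(pdat, "label") <- c(
  sample_id = "Unique sample identifier",
  sex = "Sex of the patient",
  age = "Age at diagnosis (years)"
)

# Sanity check: list any column that lacks a description.
missing_labels <- setdiff(colnames(pdat), names(attr(pdat, "label")))
```

An empty missing_labels means every column carries a description.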

ExpressionSets should have a short textual description of the facet/dataset in the annotation slot. Similarly, SummarizedExperiments should have a list in the metadata slot with url and description for the facet/dataset.

The fData or mcols for the ExpressionSet or SummarizedExperiment, respectively, should have feature_type, feature_id, name, meta, effective_length and source columns. See the manpage for as.FacileDataSet for more details.

Well Groomed pData

Given a list of SummarizedExperiments, we collect the individual pData()/colData(). When colnames() match across these data.frames, we assume that the covariates mean the same thing and are encoded using the same scheme (i.e. factor levels match).

Columns whose names differ between pData data.frames, even by a single letter or only in case, are treated as different covariates.
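The matching rule above can be checked before assembly. The following is a rough sketch in base R; check_shared_columns is a hypothetical helper written for this example, not a function from the package. It reports columns that appear in more than one data.frame but differ in class or factor levels.

```r
# Hypothetical helper: return names of shared columns whose encoding differs.
check_shared_columns <- function(pdats) {
  all_cols <- unique(unlist(lapply(pdats, colnames)))
  problems <- character()
  for (cname in all_cols) {
    # Column names are compared exactly, so "age" and "Age" are not shared.
    has_col <- vapply(pdats, function(x) cname %in% colnames(x), logical(1))
    if (sum(has_col) < 2) next
    cols <- lapply(pdats[has_col], function(x) x[[cname]])
    classes <- vapply(cols, function(v) class(v)[1], character(1))
    same_class <- length(unique(classes)) == 1
    same_levels <- TRUE
    if (same_class && is.factor(cols[[1]])) {
      lvls <- lapply(cols, levels)
      same_levels <- all(vapply(lvls, identical, logical(1), lvls[[1]]))
    }
    if (!same_class || !same_levels) problems <- c(problems, cname)
  }
  problems
}

# Two made-up annotation tables: "sex" is shared but uses different levels.
a <- data.frame(sex = factor(c("f", "m")), age = c(50, 60))
b <- data.frame(sex = factor(c("male", "female")), Age = c(41, 38))
mismatched <- check_shared_columns(list(a = a, b = b))
```

Here only sex is flagged. The age and Age columns are never compared because their names differ in case.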

FacileDataSet Architecture

Hybrid Data Storage

The FacileDataSet stores the following data:

A data.sqlite SQLite database that stores feature- and sample-level metadata.
A data.h5 HDF5 file that stores a multitude of dense assay matrices that are generated from the assays performed on the samples in the FacileDataSet.
A meta.yaml file that contains information about the FacileDataSet. To better understand the structure and contents of this file, you can refer to the following:
a. The included testdata/expected-meta.yaml file, which is an exemplar file for the testdata/TestFacileTcgaDataSet, which consists of data extracted from two datasets (BLCA and BRCA) from the TCGA.
b. The help file provided by the eav_metadata_create function, which describes in greater detail how we track a dataset's sample-level covariates (aka "pData" in the Bioconductor world).
In the meantime, a short description of the entries found in the meta.yaml file is provided here:
- name: the name of the dataset (e.g. "FacileTCGADataSet")
- organism: "Homo sapiens", "Mus musculus", etc.
- default_assay: the name of the assay to use by default if none is specified in calls to [fetch_assay_data()], [with_assay_data()], etc. (kind of like how "exprs" is the default assay used when working with a [Biobase::ExpressionSet])
- datasets: a section that enumerates the datasets included internally. The datasets are further enumerated.
- sample_covariates: a section that enumerates the covariates that are tracked over the samples inside the FacileDataSet (i.e. a mapping of the pData for the samples). Reference ?create_eav_metadata for more information.
A custom-annotation directory, which stores custom sample_covariate (aka "pData") information that analysts can identify and describe during the course of an analysis, or even add from external sources. Although this directory is required in the directory structure of a valid FacileDataSet, the FacileDataSet() constructor can be called with a custom anno.dir parameter so that custom annotations are stored elsewhere.
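To make the layout above concrete, here is a small sketch that checks whether a directory contains the four expected entries. check_fds_layout is a hypothetical helper for this example; the file and directory names are the ones listed above, and the mock files created here are empty placeholders, not a valid FacileDataSet.

```r
# Hypothetical helper: check that the four entries described above exist.
check_fds_layout <- function(path) {
  files <- c("data.sqlite", "data.h5", "meta.yaml")
  c(
    vapply(files, function(f) file.exists(file.path(path, f)), logical(1)),
    `custom-annotation` = dir.exists(file.path(path, "custom-annotation"))
  )
}

# Build an empty mock directory in tempdir() to exercise the check.
fds_dir <- file.path(tempdir(), "MockFacileDataSet")
dir.create(file.path(fds_dir, "custom-annotation"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(fds_dir, c("data.sqlite", "data.h5", "meta.yaml")))

layout_ok <- check_fds_layout(fds_dir)
```

The result is a named logical vector, so any missing piece can be identified by name.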

SQLite Schema

Entity-Attribute-Value Table

Sample covariates (aka pData) are encoded in an entity-attribute-value (EAV) table. Metadata about these covariates are stored in a meta.yaml file in the FacileDataSet directory, which enables the FacileDataSet to cast the value stored in the EAV table to its native R type. The eav_metadata_create function generates the list-of-list structure to represent the sample_covariates section of the meta.yaml file.
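The idea can be illustrated with base R alone. In this sketch, to_eav and cast_covariate are hypothetical helpers, and the class names ("categorical", "real") and the covdefs list are assumptions standing in for the sample_covariates metadata; the actual table columns and class names used by the package may differ.

```r
# Made-up wide sample annotation table.
pdat <- data.frame(
  dataset = "BLCA",
  sample_id = c("s1", "s2"),
  sex = factor(c("f", "m")),
  age = c(54, 61),
  stringsAsFactors = FALSE
)

# Assumed covariate metadata, standing in for the sample_covariates section.
covdefs <- list(
  sex = list(class = "categorical", levels = c("f", "m")),
  age = list(class = "real")
)

# Wide -> long: one row per (sample, covariate); values stored as character.
to_eav <- function(x, id_cols = c("dataset", "sample_id")) {
  vars <- setdiff(colnames(x), id_cols)
  rows <- lapply(vars, function(v) {
    data.frame(x[id_cols], variable = v, value = as.character(x[[v]]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Cast the character values of one covariate back to an R type,
# using the metadata to decide which type that is.
cast_covariate <- function(eav, name, defs) {
  vals <- eav$value[eav$variable == name]
  def <- defs[[name]]
  switch(def$class,
    categorical = factor(vals, levels = def$levels),
    real = as.numeric(vals),
    vals)
}

eav <- to_eav(pdat)
age <- cast_covariate(eav, "age", covdefs)
sex <- cast_covariate(eav, "sex", covdefs)
```

Because every value is stored as a character string, the metadata is what allows age to come back as numeric and sex as a factor with the original levels.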

For simple pData covariates, each column is treated independently from the rest. Some types of covariates require multiple columns for proper encoding, such as survival information, which requires a pair of values that indicate the "time to event" and the status of the event (death or censored). In these cases, the caller needs to provide an entry in the covariate_def list that describes which pData columns (arguments) go into the single facile covariate value.

Please refer to the Encoding Survival Covariates section for a more detailed description of how to encode survival information into the EAV table using the covariate_def parameter. Further examples of how to encode other complex attributes will be added as they are required, but you can reference the Encoding Arbitrarily Complex Covariates section for some more information.

Encoding Survival Covariates:

UPDATE: Survival covariates can now be encoded simply as a survival::Surv object and provided as a column in the pData data.frame. The following describes the original, and still supported, method.

Survival data in R is typically encoded by two vectors: one that indicates the "time to event" (tte), and a second that indicates whether the denoted tte is an "event" (1) or "censored" (0).

Normally these vectors appear as two columns in an experiment's pData, and therefore need to be encoded into the FacileDataSet's EAV table. To do so, the pair of vectors are turned into a signed numeric value. The absolute value of the numeric indicates the "time to event" and the sign of the value indicates its censoring status.
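A minimal sketch of such a signed encoding follows. The text does not say which sign marks censoring, so this example assumes that positive values are events and negative values are censored. The function names are hypothetical and are not the package's own encoder.

```r
# Assumption for this sketch: positive = event, negative = censored.
encode_signed_survival <- function(time, event) {
  # A time of zero carries no sign, so this sketch requires time > 0.
  stopifnot(all(time > 0), all(event %in% c(0, 1)))
  ifelse(event == 1, time, -time)
}

# Recover the two original vectors from the signed value.
decode_signed_survival <- function(x) {
  data.frame(time = abs(x), event = as.integer(x > 0))
}

tte_OS <- c(120, 365, 48)
event_OS <- c(1, 0, 1)

encoded <- encode_signed_survival(tte_OS, event_OS)
decoded <- decode_signed_survival(encoded)
```

With these inputs the second patient is censored at 365 days, so its encoded value is -365.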

Let's assume we have tte_OS and event_OS columns that are used to encode a patient's overall survival (time and censor status). To store this as an "OS" covariate in the EAV table, a covariate_def list-of-list definition that captures this encoding would look like this:

covariate_def <- list(
  OS=list(
    class="right_censored",
    arguments=c(time="tte_OS", event="event_OS"),
    label="Overall Survival",
    type="clinical",
    description="Overall survival in days"))

Note how the name of the list entry in covariate_def defines the name of the covariate in the FacileDataSet. The class entry for the OS definition indicates the type of variable this is. The arguments entry lists the columns in the pData that are combined to make this value, and names(arguments) correspond to the parameters of the [eav_encode_right_censored()] function. The analogous meta.yaml entry in the sample_covariates section for the "OS" covariate_def entry looks like so:

sample_covariates:
  OS:
    class: right_censored
    label: "Overall Survival"
    type: "clinical"
    description: "Overall survival in days"
    colnames: ["tte_OS", "event_OS"]
    argnames: ["time", "event"]
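The correspondence between the covariate_def entry and the meta.yaml entry can be sketched as a plain list transformation. def_to_meta_entry is a hypothetical helper written for this example; it only mirrors the two snippets shown here and is not the package's own conversion code.

```r
covariate_def <- list(
  OS = list(
    class = "right_censored",
    arguments = c(time = "tte_OS", event = "event_OS"),
    label = "Overall Survival",
    type = "clinical",
    description = "Overall survival in days"))

# Hypothetical helper: the values of `arguments` become colnames and
# the names of `arguments` become argnames.
def_to_meta_entry <- function(def) {
  list(
    class = def$class,
    label = def$label,
    type = def$type,
    description = def$description,
    colnames = unname(def$arguments),
    argnames = names(def$arguments)
  )
}

sample_covariates <- lapply(covariate_def, def_to_meta_entry)
```

The resulting nested list has the same shape as the YAML entry and could be serialized with a YAML writer of your choice.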

Encoding Arbitrarily Complex Covariates:

To encode a new type of complex covariate from a wide pData data.frame, we need to:

Specify a new class (like "right_censored") for use within a FacileDataSet.
Define an eav_encode_<class>(arg1, arg2, ...) function which takes the R data vectors (arg1, arg2) and converts them into a single value for the EAV table.
Define an eav_decode_<class>(x, attrname, def, ...) function which takes the single value in the EAV table and casts it back into the R-native data vector(s).
- x is the vector of (character) values from the EAV table
- attrname is the name of the covariate in the EAV table
- def is the definition-list for this covariate.
- ... allows each decode function to be further customized.
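As an illustration of the steps above, here is a sketch for a made-up "interval" class that packs a lower and an upper bound into one string. The class name, the ";" separator, and the returned column naming are all assumptions for this example; only the function naming pattern and the decode signature come from the text.

```r
# Encode: two numeric vectors -> one character value per sample.
eav_encode_interval <- function(lower, upper) {
  stopifnot(length(lower) == length(upper))
  paste(lower, upper, sep = ";")
}

# Decode: character values from the EAV table -> R-native vectors.
# `def` is accepted to match the expected signature but unused here.
eav_decode_interval <- function(x, attrname, def, ...) {
  parts <- strsplit(x, ";", fixed = TRUE)
  out <- data.frame(
    lower = as.numeric(vapply(parts, `[`, character(1), 1)),
    upper = as.numeric(vapply(parts, `[`, character(1), 2))
  )
  # Prefix the columns with the covariate name, e.g. dose_lower.
  colnames(out) <- paste(attrname, colnames(out), sep = "_")
  out
}

encoded <- eav_encode_interval(c(1.5, 2), c(3, 4.25))
decoded <- eav_decode_interval(encoded, "dose", def = list(class = "interval"))
```

Storing numbers as text can lose precision for values with many significant digits, so a real encoder may want explicit formatting.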

HDF5 File Assay Data

The HDF5 file has one group (the HDF5 equivalent of a directory) per assay. Each of these groups holds one matrix per dataset for the given assay.

For instance, the FacileTCGADataSet HDF5 file has this structure:

. data.h5
├── rnaseq
│   ├── ACC
│   ├── BLCA
│   ├── BRCA
│   ├── CESC
│   ├── ...
├── cnv_score
│   ├── ACC
│   ├── BLCA
│   ├── BRCA
│   ├── CESC
│   ├── ...
├── mirnaseq
│   ├── ACC
│   ├── BLCA
│   ├── BRCA
│   ├── CESC
│   ├── ...

facilebio/FacileData documentation built on Feb. 23, 2024, 9:14 a.m.
