Although there are several good classes for storing gene expression data, in many cases it's not ideal to share serialized versions of these classes. The package is meant for intermediate storage and exchange of gene expression data, sample and feature covariates, and pointers to raw data. It may eventually support idempotent conversion to and from HDF5, text files and GEO formats. Emphasis is placed on harmonizing covariates between studies, so a controlled vocabulary is available and its use is encouraged. Its design case was single cell gene expression experiments, but it is hoped that it will be useful in other contexts.
Good question. MAGE-TAB is getting pretty creaky. GEO/SOFT format almost works, but only for sample-level covariates. Modeling individual cells in it requires some abuse, and many datasets now only offer a link to the processed data. Both formats are also missing some experiment- and sample-level covariates that are important (to me).
Intended to describe:
- technical aspects of the assay, such as platform and chemistry, in greater detail than GEO
- upstream computational aspects, such as aligner, read trimming, deduplication. These two might be scrapable from the Protocol.
- cellular covariates such as batch, treatment, sort info
- sample covariates such as organism, tissue, cell line, age, sex. Many are available in GEO. Use MeSH/EFO where appropriate; the package ontoCat can read them. Key:value rather than tabular?
- feature covariates: Genome/transcriptome, id type (ENTREZGENE, ENSEMBL, ...)
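The open question of key:value versus tabular storage for sample covariates can be made concrete with a small sketch. The Python below is an illustration only, not part of the package: `samples` and `to_table` are hypothetical names, and the covariate values are made up. Each sample keeps only the keys it actually has, and a table is derived on demand by taking the union of keys and filling gaps with `None`.

```python
# Hypothetical sample covariates stored as key:value records.
# Each sample carries only the keys that apply to it.
samples = {
    "sample1": {"organism": "Homo sapiens", "tissue": "blood", "sex": "F"},
    "sample2": {"organism": "Mus musculus", "cell_line": "3T3"},
}


def to_table(records):
    """Flatten key:value records into (header, rows); missing keys become None."""
    # The column set is the union of all keys seen in any record.
    columns = sorted({key for rec in records.values() for key in rec})
    header = ["sample"] + columns
    rows = [
        [name] + [rec.get(col) for col in columns]
        for name, rec in sorted(records.items())
    ]
    return header, rows


header, rows = to_table(samples)
```

The key:value form tolerates studies that record different covariates, while the derived table is convenient for text export; the trade-off is that the table becomes sparse as studies diverge.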
These should be crowd-sourced dynamically through a Google Docs sheet, then validated and incorporated into the package namespace as data, via data-raw, when the package is built.
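As a rough sketch of what the validation step might look like, the Python below checks a CSV export of a crowd-sourced sheet before it is bundled as data. Everything here is an assumption made for illustration: the `category` and `term` column names, the example rows, and the `validate_vocab` helper are hypothetical, and the package's real build step (in data-raw) may work quite differently.

```python
import csv
import io

# Hypothetical CSV export of the crowd-sourced sheet.
# The column names and the example terms are assumptions.
SHEET_CSV = """category,term
platform,Fluidigm C1
platform,Drop-seq
aligner,STAR
aligner,star
organism,
"""


def validate_vocab(csv_text):
    """Return (vocab, problems): accepted terms per category, plus rejected rows."""
    vocab, problems, seen = {}, [], set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        category = (row.get("category") or "").strip()
        term = (row.get("term") or "").strip()
        key = (category, term.lower())
        if not category or not term:
            problems.append((line_no, "empty category or term"))
        elif key in seen:
            problems.append((line_no, "duplicate term (case-insensitive)"))
        else:
            seen.add(key)
            vocab.setdefault(category, []).append(term)
    return vocab, problems


vocab, problems = validate_vocab(SHEET_CSV)
```

Reporting rejected rows with their line numbers, rather than silently dropping them, makes it easier to send corrections back to the shared sheet.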
- ReadPGEX -> .txt, .hdf5
- WritePGEX -> .txt, .hdf5
- GuessPlatform(character, vocab)
- GuessSample(character, vocab)
- GuessCell(character, vocab)
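The source does not say how GuessPlatform, GuessSample and GuessCell would map free text onto the vocabulary. One possible approach, sketched here in Python purely as an assumption, is fuzzy string matching with the standard library's `difflib`. The `guess_term` helper, the cutoff value and the example platform list are all hypothetical.

```python
import difflib


def guess_term(text, vocab, cutoff=0.6):
    """Return the vocabulary term closest to free text, or None if nothing is close."""
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {term.lower(): term for term in vocab}
    matches = difflib.get_close_matches(
        text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# Hypothetical platform vocabulary, for illustration only.
platforms = ["Fluidigm C1", "Drop-seq", "Smart-seq2"]

exact = guess_term("fluidigm c1 ", platforms)   # differs only in case and whitespace
fuzzy = guess_term("dropseq", platforms)        # missing hyphen
unknown = guess_term("zzzz", platforms)         # nothing similar in the vocabulary
```

Returning `None` for unrecognized text keeps the guess explicit, so a curator can decide whether the term belongs in the vocabulary.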