suppressPackageStartupMessages(library("tidies"))
suppressPackageStartupMessages(library("BiocStyle"))
suppressPackageStartupMessages(library("MSnbase"))

Introduction

The tidies package (a contraction of tidy eSet) applies tidy principles, as defined in the tidyverse packages, to omics-type data based on the eSet class, with (at least for now) an emphasis on quantitative proteomics data.

High throughput data and the eSet class

The motivation to store omics data in dedicated containers is to coordinate the high throughput data (e.g., gene or protein expression), the sample annotation (phenotype data) and the feature annotation (feature data).

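A typical omics data structure, as defined by the eSet class, is represented below. Its main components are the expression data (features along the rows, samples along the columns), the sample metadata (one row per sample) and the feature metadata (one row per feature).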

A typical omics data object

The coordinated nature of the high throughput data guarantees that the dimensions of the different slots always match during data manipulation (i.e., the columns of the expression data match the rows of the sample metadata, and the rows of the expression data match the rows of the feature metadata). The metadata slots can grow additional covariates (columns) without affecting the other structures.
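The idea of coordinated slots can be sketched with plain base R objects. This is only an illustration under simplifying assumptions, not how the eSet class is implemented: the toy list, its component names (exprs, pdata, fdata) and the helper functions subset_toy and is_coordinated are all made up for this example.

```r
## Toy container: 3 features x 2 samples (all names here are made up)
toy <- list(
    exprs = matrix(1:6, nrow = 3,
                   dimnames = list(c("f1", "f2", "f3"), c("s1", "s2"))),
    pdata = data.frame(group = c("A", "B"), row.names = c("s1", "s2")),
    fdata = data.frame(charge = c(2, 3, 2), row.names = c("f1", "f2", "f3")))

## Subset features (i) and samples (j) in all three components at once
subset_toy <- function(x, i, j) {
    list(exprs = x$exprs[i, j, drop = FALSE],
         pdata = x$pdata[j, , drop = FALSE],
         fdata = x$fdata[i, , drop = FALSE])
}

## The components are coordinated when feature and sample names line up
is_coordinated <- function(x) {
    identical(rownames(x$exprs), rownames(x$fdata)) &&
        identical(colnames(x$exprs), rownames(x$pdata))
}

small <- subset_toy(toy, i = c("f1", "f3"), j = "s2")
is_coordinated(small)

## Adding a sample covariate leaves the other components untouched
small$pdata$batch <- "b1"
is_coordinated(small)
```

A dedicated class such as eSet performs this kind of bookkeeping for the user, so that subsetting one component without the others is not possible by accident.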

To illustrate such an omics data container, we'll use the msnset object that comes with the MSnbase package; its numbers of features and samples are returned by dim(msnset) below.

library("MSnbase")
data(msnset)

## Some test sample groups
msnset$group <- c("A", "A", "B", "B")
dim(msnset)

The expression data:

head(exprs(msnset))

The sample metadata:

pData(msnset)

The feature metadata:

fData(msnset)[1:10, 1:5]
## all feature variables
fvarLabels(msnset)

Tidy tools

The tidy data definition and the tidy tools manifesto lay out the principles that packages in the tidyverse follow. tidies isn't part of the tidyverse, but aims to apply these same principles to omics data.

The tidies package

The goal of this package is to support dplyr functions such as select, filter, group_by, summarise, ... directly on the omics data containers described above. The functions act on the respective variable and expression slots and preserve the object's class.

Another approach is to convert the omics objects to tidy tibbles and work directly with these. This can easily be done with the Bioconductor biobroom package, which converts the features-by-samples expression data into a long tibble with one row per feature and sample, i.e. prod(dim(msnset)) rows:

library("biobroom")
tidy(msnset, addPheno = TRUE)

Using biobroom::tidy drops the feature data, which is sometimes important. It is possible to create a data frame that contains these metadata using the following approach. First, combine the expression data and the feature data into a single wide data frame using ms2df:

x <- MSnbase::ms2df(msnset)

which can then be converted to a long, tidy form with

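## Feature variable names: these columns are kept as they are
fv <- fvarLabels(msnset)
library("tidyr")
## Pivot all remaining (sample) columns into sample/exprs pairs
x <- tidyr::pivot_longer(x,
                         cols = -all_of(fv),
                         names_to = "sample",
                         values_to = "exprs")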

x

to produce a data frame with nrow(x) rows, corresponding to the original nrow(msnset) features for each of the ncol(msnset) samples, together with the corresponding feature metadata.
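The row count of the long form follows directly from the reshaping: each feature contributes one row per sample, and its feature metadata are repeated on each of these rows. The following base R sketch reproduces this on a toy wide table. It deliberately avoids tidyr, and the data and column names (s1, s2, charge) are made up for illustration.

```r
## Toy wide table: two sample columns plus one feature variable (made-up names)
wide <- data.frame(s1 = c(1, 2, 3),
                   s2 = c(4, 5, 6),
                   charge = c(2, 3, 2),
                   row.names = c("f1", "f2", "f3"))
fv <- "charge"                          # feature variable columns
samples <- setdiff(names(wide), fv)     # expression columns

## Stack the sample columns; feature variables are repeated for each sample
long <- do.call(rbind, lapply(samples, function(s) {
    data.frame(wide[, fv, drop = FALSE],
               feature = rownames(wide),
               sample = s,
               exprs = wide[[s]],
               row.names = NULL)
}))

long

## One row per feature and sample
nrow(long) == nrow(wide) * length(samples)
```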

Given that this coercion is often useful, it is implemented in as_tibble (see the example below).

Using tidies

We start by loading the tidies package (which also automatically loads and attaches magrittr for the %>% operator).

library("tidies")

Select feature or sample variables

Select sample variables (updates only the phenotypic data)

msnset %>%
    select(group) %>%
    pData

Note that the output of select(group) is an MSnSet; we pipe it directly into pData to show that only that variable was retained.

Select feature variables (updates only the feature data)

## All feature variables
fvarLabels(msnset)

## Select a single feature variable
msnset %>%
    select(charge) %>%
    fvarLabels

## Select feature variables using a pattern
msnset %>%
    select(starts_with("Protein")) %>%
    fvarLabels

Select sample and feature variables

msnset %>%
    select(group) %>%
    select(starts_with("Prot"))

Order data by its feature or sample variables

Arrange columns/samples

msnset %>%
    arrange(desc(group)) %>%
    pData

Arrange rows/features and select feature variables

msnset %>%
    arrange(charge) %>%
    select(charge) %>%
    fData %>%
    head

Return features and samples with matching conditions

Filter using feature variables

msnset %>%
    filter(ProteinAccession == "ENO") %>%
    exprs

Filter using phenotypic (samples) variables

msnset %>%
    filter(group == "A") %>%
    exprs %>%
    head

Filter on both feature and sample variables

msnset %>%
    filter(group == "A") %>%
    filter(ProteinAccession == "ENO") %>%
    exprs

Group by one or more feature or sample variables

Group by features

msnset %>%
    group_by(ProteinAccession) %>%
    show

Group by samples

msnset %>%
    group_by(group) %>%
    show

Group by features and samples

msnset %>%
    group_by(ProteinAccession) %>%
    group_by(group) %>%
    show

Summarise the expression values of a dataset

Grouping and summarising by features

msnset %>%
    group_by(charge) %>%
    summarise(median(exprs, na.rm = TRUE)) %>%
    exprs

msnset %>%
    group_by(ProteinAccession) %>%
    summarise(median(exprs, na.rm = TRUE)) %>%
    exprs %>%
    head
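To make the effect of grouping and summarising by a feature variable concrete, here is a base R sketch on a toy expression matrix. It assumes that the summary is computed separately for each sample within each feature group; it illustrates the idea only and is not the tidies implementation. The matrix e and the charge vector are made up.

```r
## Toy expression matrix: 4 features x 2 samples (made-up values and names)
e <- matrix(c(1, 3, 5, 7,
              2, 4, 6, NA),
            nrow = 4,
            dimnames = list(paste0("f", 1:4), c("s1", "s2")))
charge <- c(2, 2, 3, 3)   # feature variable used for grouping

## For each sample (column), take the median within each charge group
by_charge <- apply(e, 2, function(col) {
    tapply(col, charge, median, na.rm = TRUE)
})

## One row per group, one column per sample
by_charge
```

The result has one row per charge state and keeps one column per sample, so the number of features shrinks while the samples are unchanged.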

Grouping and summarising by samples

msnset %>%
    group_by(group) %>%
    summarise(mean(exprs, na.rm = TRUE)) %>%
    exprs %>%
    head

Grouping by features and samples

msnset %>%
    group_by(charge) %>%
    summarise(mean(exprs)) %>%
    group_by(group) %>%
    summarise(max(exprs, na.rm = TRUE)) %>%
    exprs

In the following example, we show how dplyr and MSnbase functions (here filterNA, combineFeatures and normalise) operate seamlessly together and can be mixed and matched within a chain of operations:

msnset %>%
    filterNA() %>%
    combineFeatures(method = "median", fcol = "ProteinAccession") %>%
    group_by(group) %>%
    summarise(mean(exprs)) %>%
    normalise(method = "quantiles") %>%
    filter(ProteinAccession %in% c('ENO', 'BSA')) %>%
    exprs

In this last example, we use as_tibble and pipe the mutated data directly into ggplot2:

library("ggplot2")
msnset %>%
    as_tibble %>%
    mutate(rt = cut(retention.time, 7)) %>%
    ggplot(aes(x = sample, y = exprs)) +
    geom_boxplot() +
    facet_grid(charge ~ rt)

Future work

See issues, and the TODO issue in particular, for current and future work. Depending on interest, the functionality presented here could be extended to other data types such as SummarizedExperiment objects.

A technical issue with depending on both Bioconductor and the tidyverse is the recurrent name clashes:

Bioconductor and tidyverse conflicts

The combine function, for example, is defined as a generic with signature combine(x, y, ...) in BiocGenerics, while it is combine(...) in dplyr.
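How such a clash plays out, and how a namespace-qualified call works around it, can be shown with base R alone. The sketch below does not use combine itself: it uses filter from the stats package, which is attached by default, and a made-up replacement function with the same name that masks it.

```r
## stats::filter() (linear filtering) is attached by default in every session
x <- c(1, 2, 3, 4)

## A later definition with the same name masks it (made-up function)
filter <- function(x, keep) x[keep]
filter(x, x > 2)

## The namespace-qualified call still reaches the original function:
## here a two-point moving average, with NA for the first value
as.numeric(stats::filter(x, rep(1 / 2, 2), sides = 1))
```

The same pkg::fun form applies when two attached packages export the same name, since which definition an unqualified call finds depends on the order in which the packages were attached.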



lgatto/tidyms documentation built on May 11, 2020, 9:30 a.m.