In martijnvanattekum/cleanse: Tidyverse functions for SummarizedExperiment

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The SummarizedExperiment (se) class offers a useful way to store multiple row and column data along with the values from an experiment and is widely used in computational biology. Although subsetting se's is possible with base R notation (ie using []), se's lack methods for dplyr functions such as filter and select, which excludes se's from easily being used in pipes. This package offers methods for the dplyr functions and will automatically dispatch your se to the relevant function.

As se's contain a data frame for both the rowData and colData, a major difference using these functions in cleanse, is that we need to specify whether we apply our function to the row or the col of the se. Cleanse will then take care of updating the se.

The package contains an example se called seq_se. The example contains dummy data of expression and copy number values for 10 genes and 48 different conditions. To get an overview of the different options available in the se, use print_options:

library(cleanse)

data(seq_se)
print_options(seq_se)

As you can see, seq_se contains expression and copy_number data for samples taken from 3 different sites, which were treated with either treatment A or B for 0 or 4 hours for 4 different patients. The genes that were sequenced are from 3 different gene groups: IL, NOTCH, and TLR. Each of the rowData / colData columns (eg patient, site, gene_group) is refered to as a variable henceforth.

Using dplyr functions to subset the se

These functions either return a subset of the se or a rearranged se. For each of them, the underlying assay values are updated accordingly, so everything is kept in sync.

The filter function subsets rows / cols from the se based on conditional filtering of a variable contained in either rowData or colData respectively. Experiment values in the assay are dropped concurrently with the update of the colData/rowData.

genes_subset_se <- seq_se %>% filter(row, gene_group == "IL")
print_options(genes_subset_se)  #note the change in available gene_groups
dim(seq_se)
dim(genes_subset_se)

sample_subset_se <- seq_se %>% filter(col, treatment == "B", site %in% c("brain", "skin"))
print_options(sample_subset_se) #note the change in available treatments and sites
dim(seq_se)
dim(sample_subset_se)

To subset a se by position, slice can be used:

seq_se %>% cleanse::slice(col, 1:10) #select the first 10 columns

arrange is used to consecutively sort a data frame by a variable. In this case, we arrange the rows, first by gene_name and next by gene_group.

seq_se_reordered <- seq_se %>% arrange(row, gene_name, gene_group)

The slice_sample() function behaves similar to dplyr's equivalent by selecting random rows or cols:

slice_sample(seq_se, col, n=3)#note the change in dim

slice_sample(seq_se, row, prop=.2) #note the change in dim

Using dplyr functions to change the se's metadata

These functions return a se with updated rowData or colData. The dimensions and assay values of the se will not be changed by these functions.

select can be used to select variables, and rename will rename these variables.

# remove the time variable after filtering for time == 0
seq_se_min_time <- seq_se %>% cleanse::filter(col, time == 0) %>% cleanse::select(col, -time)
print_options(seq_se_min_time)  #note the time variable has disappeared from the colData

# rename the time variable after changing it to minutes
seq_se_ren_time <- seq_se %>% cleanse::mutate(col, time = (time * 60)) %>% cleanse::rename(col, time_mins = time)
print_options(seq_se_ren_time)  #note the time variable is now called time_mins

mutate adds or changes variables. If we for instance want to change the time for the samples from hours to minutes, we can do

seq_se_mins <- seq_se %>% mutate(col, time = (time * 60))
seq_se$time
seq_se_mins$time

or to combine the gene groups and gene names to one new variable in the rowData

seq_se_gene_comb <- seq_se %>% 
  mutate(row, group_and_name = paste(gene_group, gene_name, sep = "_"))

A non-dplyr function that will change the metadata is drop_metadata: This function will drop all rowData and colData variables that have only 1 value. Typically used after subsetting:

seq_se_dropped <- seq_se %>% 
  filter(col, time == 4) %>% 
  drop_metadata()
print_options(seq_se_dropped) #note the time variable from colData is dropped as all values were 4