In robertamezquita/distillr: What the Package Does (One Line, Title Case)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The distillr package compares epigenomic regions using a non-parametric test statistic.

Creating Simulated Data

Here we create some simulated data that spans many regions across three groups of samples with differing numbers of replicates within them.

While the data can be most simply across separate tables - the counts in a matrix (mat), sample (column) groupings in a data frame (cdf), and a feature (row) groupings/coordinates in another object such as a GRanges object (rr) - these three can be combined into a singular representation via SingleCellExperiment. This eases manipulation of the data by keeping all relationships in-tact, no matter any downstream subsetting operations that may take place.

Note however that the distillr methods take both matrices as well as SingleCellExperiment objects.

library(distillr) # devtools::load_all()
library(SingleCellExperiment)
library(tidyverse)

## Simulate a dataset ==========================================================
## create different numbers per group in this case
## todo: vary nbin per region

## data sim parameters
n_groups <- 3
nbin_per_region <- rep(100, n_groups)
nregion <- rep(50, n_groups)
n_per_group <- c(3, 4, 5)
lambda_per_group <- c(10, 2, 5)
labels_per_group <- c('A', 'B', 'C')

## Simulator
.sim_data <- function(nbin_per_region, nregion, nsample_per_group, lambda,
                      group_id, row_id = 'bin') {
    mat <- matrix(rpois(nbin_per_region * nregion * nsample_per_group, lambda),
                  nrow = nbin_per_region * nregion,
                  ncol = nsample_per_group)
    colnames(mat) <- paste0(group_id, '_', 1:nsample_per_group)
    rownames(mat) <- paste0(row_id, '_', 1:(nbin_per_region * nregion))
    return(mat)
}

## Create assay matrix
mat_l <- purrr::pmap(list(nbin_per_region, nregion, n_per_group,
                          lambda_per_group, labels_per_group),
                     .sim_data)
mat <- do.call(cbind, mat_l)

## coldata
cdf <- DataFrame(sample = colnames(mat),
                 group = rep(labels_per_group, times = n_per_group))


## rowranges
rr <- makeGRangesFromDataFrame(data.frame(
    seqnames = 'chr1',
    start = seq(1, by = 25, length.out = nbin_per_region[1] * nregion[1]),
    end = seq(25, by = 25, length.out = nbin_per_region[1] * nregion[1]),
    region_id = paste0('region_', rep(1:nregion[1], each = nbin_per_region[1]))
), keep.extra.columns = TRUE)

## rowdata - specify rowRanges groups (region IDs)
rd <- DataFrame(
    region_id = paste0('region_', rep(1:nregion[1], each = nbin_per_region[1]))
)

## Create SCE
sce <- SingleCellExperiment(
    assays = list(counts = mat),
    colData = cdf,
    rowRanges = rr
)

Testing `distillr`

To test out the main package functionality in comparing regions, we set up the following test run comparing groups 'A' and 'B'.

First we set the key parameters to the method - the band and quantile, and the groups to compare.

## Parameters
band <- 5
quantile = 0.9
compare <- c('A', 'B')

Then we can test out the method with just a simple matrix input and character vectors describing the column and row groupings via the groups and regions input.

## Create inputs for regular ANY method ----------------------------------------
x <- counts(sce) # or x = sce
groups <- colData(sce)$group
regions <- rowData(sce)$region_id

out <- distill(x,
               compare, groups,
               regions,
               band, quantile)

Finally, we can test the method against the SingleCellExperiment input, specifying the groupings that are already defined within the objects rowData and colData components.

## Create input for SCE method -------------------------------------------------
group_by <- 'group'
region_by <- 'region_id'
assay.type <- 'counts'

## Running and rerunning with SCE, saving results into metadata
out_sce <- distill(sce,
                   compare = compare, group_by = group_by,
                   region_by = region_by,
                   band = 10, quantile = 0.8,
                   assay = assay.type)

Testing with Previous Data and Method `ChIPtest`

To compare with the previously published method, below is the verification of the similarity between the two methods.

library(ChIPtest)
data(data1)
data(data4)

## Convert data to standard "ANY" form
.convert_data <- function(d, s = 'Sample', r_prefix = 'bin') {
    dr <- c()
    for (i in 1:nrow(d)) {
        dr <- c(dr, as.vector(d[i, ]))
    }
    df <- data.frame(dr)
    colnames(df) <- s
    rownames(df) <- paste0(r_prefix, '_', 1:nrow(df))
    return(df)
}

d1 <- .convert_data(data1, 's1')
d4 <- .convert_data(data4, 's4')
dm <- cbind(d1, d4) 
rd <- data.frame(region_id = paste0('region_', rep(1:nrow(data1), each = ncol(data1))))
cd <- data.frame(sample = c('s1', 's4'), group = c('g1', 'g2'))


stat <- distill(dm, c('g1', 'g2'), cd$group, rd$region_id, band = 5, quantile = 0.8)
stat_ct <- TS_kernel(data4 - data1, 5, 0.8) # save as above except for sign