binNdimensions: Generating and Aggregating Data Within N-dimensional Bins
In BRGenomics: Tools for the Efficient Analysis of High-Resolution Genomics Data


Divide data along different dimensions into equally spaced bins, and summarize the datapoints that fall into any of these n-dimensional bins.

binNdimensions(
  dims.df,
  nbins = 10,
  use_bin_numbers = TRUE,
  ncores = getOption("mc.cores", 2L)
)

aggregateByNdimBins(
  x,
  dims.df,
  nbins = 10,
  FUN = mean,
  ...,
  ignore.na = TRUE,
  drop = FALSE,
  empty = NA,
  use_bin_numbers = TRUE,
  ncores = getOption("mc.cores", 2L)
)

densityInNdimBins(
  dims.df,
  nbins = 10,
  use_bin_numbers = TRUE,
  ncores = getOption("mc.cores", 2L)
)

`dims.df`	A dataframe containing one or more columns of numerical data for which bins will be generated.
`nbins`	Either a number giving the number of bins to use for all dimensions (default = 10), or a vector containing the number of bins to use for each dimension of input data given.
`use_bin_numbers`	A logical indicating whether ordinal bin numbers should be returned (`TRUE`, the default), or whether the center value of each bin should be returned in place of the bin number (`FALSE`). For instance, if the first bin encompasses data from 1 to 3, then a 1 is returned when `use_bin_numbers = TRUE`, and a 2 is returned when `use_bin_numbers = FALSE`.
`ncores`	Number of cores to use for computations.
`x`	The name of the dimension in `dims.df` to aggregate, or a separate numerical vector or dataframe of data to be aggregated. If `x` is a numerical vector, each value in `x` corresponds to a row of `dims.df`, and so `length(x)` must be equal to `nrow(dims.df)`. Likewise, if `x` is a dataframe, `nrow(x)` must equal `nrow(dims.df)`. Supplying a dataframe for `x` has the advantage of simultaneously aggregating different sets of data, and returning a single dataframe.
`FUN`	A function to use for aggregating data within each bin.
`...`	Additional arguments passed to `FUN`.
`ignore.na`	Logical indicating if `NA` values of `x` should be ignored. Default is `TRUE`.
`drop`	A logical indicating if empty bin combinations should be removed from the output. By default (`FALSE`), all possible combinations of bins are returned, and empty bins contain a value given by `empty`.
`empty`	When `drop = FALSE`, the value returned for empty bins. By default, empty bins return `NA`. However, in many circumstances (e.g. if `FUN = sum`), the empty value should be `0`.
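
The two settings of `use_bin_numbers` can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper `bin_vector` is hypothetical, and its treatment of values that fall exactly on a bin boundary is an assumption made for illustration.

```r
# Hypothetical helper (not part of BRGenomics): bin one numeric vector into
# `nbins` equally spaced bins spanning min(x) to max(x).
bin_vector <- function(x, nbins = 10, use_bin_numbers = TRUE) {
  breaks <- seq(min(x), max(x), length.out = nbins + 1)
  # rightmost.closed = TRUE places max(x) in the last bin
  bin <- findInterval(x, breaks, rightmost.closed = TRUE)
  if (use_bin_numbers) return(bin)
  # midpoint of each bin, looked up by bin number
  centers <- (breaks[-1] + breaks[-length(breaks)]) / 2
  centers[bin]
}

x <- c(1, 2.5, 4, 7, 9)  # with nbins = 4, the bins span 1-3, 3-5, 5-7, 7-9
bin_numbers <- bin_vector(x, nbins = 4)
bin_centers <- bin_vector(x, nbins = 4, use_bin_numbers = FALSE)
```

Here the first bin spans 1 to 3, so the values 1 and 2.5 map to bin number 1, or to the bin center 2 when centers are requested.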

These functions take in data along one or more dimensions and, for each dimension, divide the data into equally sized bins spanning the minimum value to the maximum value. For instance, if each row of dims.df were a gene, the columns (the different dimensions) would be various quantitative measures of that gene, e.g. expression level, number of exons, length, etc. If plotted in Cartesian coordinates, each gene would be a single datapoint, and each measurement would be a separate dimension.

binNdimensions returns the bin numbers themselves. The output dataframe has the same dimensions as the input dims.df, but each input value has been replaced by its bin number (an integer). If use_bin_numbers = FALSE, the center points of the bins are returned instead of the bin numbers.
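
As a rough illustration of the shape of this output, the following base R sketch bins every column of a small dataframe independently. It is only an approximation of the behavior described above, not the package's code: `bin_columns` and the `toy` data are hypothetical, and parallelization (`ncores`) is omitted.

```r
# Hypothetical base R sketch (not BRGenomics code): replace every value in a
# data frame with the number of the equally spaced bin it falls into.
bin_columns <- function(dims.df, nbins = 10) {
  nbins <- rep_len(nbins, ncol(dims.df))  # one bin count per dimension
  binned <- Map(function(x, n) {
    breaks <- seq(min(x), max(x), length.out = n + 1)
    findInterval(x, breaks, rightmost.closed = TRUE)
  }, dims.df, nbins)
  as.data.frame(binned)
}

toy <- data.frame(expr = c(0, 5, 10, 20), len = c(100, 200, 300, 900))
binned <- bin_columns(toy, nbins = c(2, 4))  # 2 bins for expr, 4 for len
```

The result has one row per input row and one column per input dimension, with the original column names preserved.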

aggregateByNdimBins summarizes some input data x in each combination of bins, i.e. in each n-dimensional bin. Each row of the output dataframe is a unique combination of the input bins (i.e. each n-dimensional bin), and the output columns are identical to those in dims.df, with the addition of one or more columns containing the aggregated data in each n-dimensional bin. If the input x was a vector, the column is named "value"; if the input x was a dataframe, the column names from x are maintained.
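
The idea of one output row per bin combination, with empty combinations retained, can be sketched in base R. This is an illustration under stated assumptions rather than the package's method: the `bins` and `value` objects are made-up data that have already been binned, and `aggregate`, `expand.grid`, and `merge` stand in for whatever the package does internally.

```r
# Hypothetical base R sketch (not BRGenomics code): aggregate a value within
# every combination of bins, keeping empty combinations.
bins <- data.frame(a = c(1L, 1L, 2L, 2L), b = c(1L, 1L, 1L, 1L))  # 2 bins each
value <- c(10, 20, 30, 50)

# summarize the occupied bin combinations
occupied <- aggregate(value, by = bins, FUN = mean)
names(occupied)[ncol(occupied)] <- "value"

# all possible bin combinations; unoccupied ones become NA after the merge
all_bins <- expand.grid(a = 1:2, b = 1:2)
result <- merge(all_bins, occupied, all.x = TRUE)
result <- result[order(result$b, result$a), ]
```

Replacing the remaining `NA` entries with another value (e.g. `0` when summing) mirrors the role of the `empty` argument, and removing those rows mirrors `drop = TRUE`.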

densityInNdimBins returns a dataframe just like aggregateByNdimBins, except the "value" column contains the number of observations that fall into each n-dimensional bin.
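
Counting observations per n-dimensional bin can likewise be sketched in base R with `table`. The data and the helper `as_bin_factor` are hypothetical, and the returned bin columns are factors here, which may differ from what the package returns.

```r
# Hypothetical base R sketch (not BRGenomics code): count observations in
# every combination of bins, including combinations with no observations.
bins <- data.frame(a = c(1L, 1L, 2L, 2L), b = c(1L, 1L, 1L, 2L))
nbins <- 2

# factors with explicit levels make table() report empty combinations as 0
as_bin_factor <- function(x) factor(x, levels = seq_len(nbins))
counts <- as.data.frame(table(lapply(bins, as_bin_factor)),
                        responseName = "value")
```

The counts in the "value" column sum to the number of input rows, since every observation falls into exactly one bin combination.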

A dataframe.

Mike DeBerardine

library(BRGenomics) # load the package providing the functions and data below

data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts

#--------------------------------------------------#
# find counts in promoter, early genebody, and near CPS
#--------------------------------------------------#

pr <- promoters(txs_dm6_chr4, 0, 100)
early_gb <- genebodies(txs_dm6_chr4, 500, 1000, fix.end = "start")
cps <- genebodies(txs_dm6_chr4, -500, 500, fix.start = "end")

df <- data.frame(counts_pr = getCountsByRegions(PROseq, pr),
                 counts_gb = getCountsByRegions(PROseq, early_gb),
                 counts_cps = getCountsByRegions(PROseq, cps))

#--------------------------------------------------#
# divide genes into 20 bins for each measurement
#--------------------------------------------------#

bin3d <- binNdimensions(df, nbins = 20, ncores = 1)

length(txs_dm6_chr4)
nrow(bin3d)
bin3d[1:6, ]

#--------------------------------------------------#
# get number of genes in each bin
#--------------------------------------------------#

bin_counts <- densityInNdimBins(df, nbins = 20, ncores = 1)

bin_counts[1:6, ]

#--------------------------------------------------#
# get mean cps reads in bins of promoter and genebody reads
#--------------------------------------------------#

bin2d_cps <- aggregateByNdimBins("counts_cps", df, nbins = 20,
                                 ncores = 1)

bin2d_cps[1:6, ]

subset(bin2d_cps, is.finite(counts_cps))[1:6, ]

#--------------------------------------------------#
# get median cps reads for those bins
#--------------------------------------------------#

bin2d_cps_med <- aggregateByNdimBins("counts_cps", df, nbins = 20,
                                     FUN = median, ncores = 1)

bin2d_cps_med[1:6, ]

subset(bin2d_cps_med, is.finite(counts_cps))[1:6, ]