shrinkDataFrame: Shrink data.frame by row groups
In jmw86069/jamses: Jam SummarizedExperiment Stats

shrinkDataFrame

R Documentation

Description

Shrink data.frame by row groups

Usage

shrinkDataFrame(
  x,
  groupBy,
  na.rm = TRUE,
  string_func = function(x) jamba::cPasteSU(x, na.rm = TRUE),
  num_func = function(x) {
    mean(x, na.rm = TRUE)
  },
  add_string_cols = NULL,
  num_to_string_func = as.character,
  keep_na_groups = TRUE,
  include_num_reps = FALSE,
  collapse_method = 2,
  verbose = FALSE,
  ...
)

Arguments

`x`	`data.frame` (or equivalent)
`groupBy`	`character` or `factor` vector in one of two forms: (1) a `character` vector naming one or more columns in `colnames(x)`, in which case the values in those columns define the row groups; or (2) a `character` or `factor` vector with length equal to `nrow(x)`, in which case its values define the row groups directly.
`string_func`	`function`, default uses `jamba::cPasteSU()`, used for `character` or `factor` columns. Note that string columns are handled differently than `numeric` columns by applying vectorized operations across the complete set of rows in one step, rather than calling `data.table` on each subgroup.
`num_func`	`function`, default `function(x)mean(x, na.rm=TRUE)`, used for `numeric` columns. Note that this function is applied to each row group by `data.table`, and is typically very efficient for `numeric` values.
`add_string_cols`	`character` with optional `numeric` columns that should be handled as if they were `character` columns. Default `NULL`.
`num_to_string_func`	`function` used for `add_string_cols` when converting `numeric` columns to `character`. The default `as.character()` retains the full `numeric` value. To limit the output to three significant digits, it may be useful to supply something like `function(x)signif(x, digits=3)` or `function(x)format(x, digits=3)`.
`keep_na_groups`	`logical`, default TRUE, whether to convert `NA` values in row groups to `""` so they are retained in the output. You may want to use `keep_na_groups=FALSE` when there are a large number of un-annotated rows that should not be aggregated together. This situation may occur if converting a probe to a gene symbol, where a subset of probes cannot be converted to a gene symbol and instead receive `NA`.
`include_num_reps`	`logical` indicating whether to add a column `"num_reps"` to the output, with the `integer` number of rows in each row group.
`collapse_method`	`integer`, default 2, indicating the internal collapse method used. Experimental. Method `1` collapses each `numeric` column independently. Method `2` collapses together each set of `numeric` columns that use the same numeric shrink function; when all `numeric` columns use the same shrink function, they are all calculated in a single step, which is typically much faster.
`verbose`	`logical` indicating whether to print verbose output.
`...`	additional arguments are ignored.
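
The effect of `keep_na_groups` and `include_num_reps` can be pictured with a small base-R sketch. It does not use the package's code: the `groups` and `values` vectors are hypothetical, `tapply()` stands in for the internal grouping, and it assumes that rows with `NA` groups are dropped when `keep_na_groups=FALSE`, which the argument description implies but does not state outright.

```r
# Hypothetical annotation: two probes have no gene symbol (NA)
groups <- c("ACTB", "ACTB", NA, "GAPDH", NA)
values <- c(1.0, 3.0, 5.0, 2.0, 7.0)

# tapply() drops NA groups by default, so un-annotated rows disappear;
# this is the assumed analogue of keep_na_groups=FALSE
dropped <- tapply(values, groups, mean)

# Converting NA to "" keeps those rows, aggregated into a single "" group,
# as described for keep_na_groups=TRUE
groups_kept <- ifelse(is.na(groups), "", groups)
kept <- tapply(values, groups_kept, mean)

# Number of rows in each row group, analogous to the "num_reps" column
num_reps <- tapply(values, groups_kept, length)

print(dropped)
print(kept)
print(num_reps)
```

In this sketch the two un-annotated rows are averaged into one `""` group, which is the aggregation the description warns about when many rows lack an annotation.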

Details

The purpose is to shrink a data.frame so that it has one row per row group. Row groups can be defined by a single column of identifiers or by multiple columns. The challenge is to apply a relevant function to each column, since the data.frame is expected to contain a mix of numeric, character, and factor columns.

The default behavior:

- numeric columns are summarized with mean(x, na.rm=TRUE), so that NA values are ignored when there are non-NA values present.
- character columns are combined using unique, sorted character strings.
  - This step uses jamba::cPasteSU() where the S activates sorting using jamba::mixedSort(), and U calls unique().
  - To retain all values, remove the U and call jamba::cPasteS()
  - To skip the sort, remove the S and call jamba::cPasteU()
  - To keep all values, and skip sorting, call jamba::cPaste()
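
The default behavior described above can be approximated in base R as a rough sketch. This is not how the package implements it: `tapply()` replaces the `data.table` and vectorized steps, the helper `paste_sorted_unique()` is hypothetical, plain `sort()` stands in for `jamba::mixedSort()` (so a value such as `"probe10"` would sort before `"probe2"` here), and the `","` delimiter is an assumption.

```r
# Small test data: six probes mapped to three gene symbols
testdf <- data.frame(check.names = FALSE,
   SYMBOL = rep(c("ACTB", "GAPDH", "PPIA"), c(2, 3, 1)),
   `logFC B-A` = c(1.4, 1.4, 2.3, NA, 2.5, 5.1),
   probe = paste0("probe", 1:6))

# numeric column: mean per row group, ignoring NA values
num_summary <- tapply(testdf[["logFC B-A"]], testdf$SYMBOL,
   function(x) mean(x, na.rm = TRUE))

# character column: unique, sorted values pasted into one string
# (hypothetical helper; "," delimiter is an assumption)
paste_sorted_unique <- function(x) {
   paste(sort(unique(x[!is.na(x)])), collapse = ",")
}
chr_summary <- tapply(testdf$probe, testdf$SYMBOL, paste_sorted_unique)

# assemble one row per row group
sketch_df <- data.frame(check.names = FALSE,
   SYMBOL = names(num_summary),
   `logFC B-A` = as.numeric(num_summary),
   probe = as.character(chr_summary[names(num_summary)]))
print(sketch_df)
```

The `NA` in the GAPDH group is ignored, so its mean is taken over the two remaining values.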

Examples

testdf <- data.frame(check.names=FALSE,
   SYMBOL=rep(c("ACTB", "GAPDH", "PPIA"), c(2, 3, 1)),
   `logFC B-A`=c(1.4, 1.4, 2.3, NA, 2.5, 5.1),
   probe=paste0("probe", 1:6))
shrink_df(testdf, by="SYMBOL")

shrink_df(testdf, by="SYMBOL", num_func=mean)

shrink_df(testdf, by="SYMBOL", add_string_cols="logFC B-A")

# build a taller test set: 10,000 copies of testdf, each with distinct symbols
testdftall <- do.call(rbind, lapply(1:10000, function(i){
   idf <- testdf
   idf$SYMBOL <- paste0(idf$SYMBOL, "_", i)
   idf
}))
shrunk_tall <- shrink_df(testdftall,
   by="SYMBOL")
head(shrunk_tall, 6)

shrunk_tall2 <- jamses::shrinkDataFrame(testdftall,
   groupBy="SYMBOL")
head(shrunk_tall2, 6)
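
As a companion to the `add_string_cols` example, the following base-R sketch illustrates how the choice of `num_to_string_func` might change the string produced for a numeric column. The helper `to_string()` is hypothetical and is not part of the package; it assumes `NA` values are removed before conversion and that values are joined with `","`.

```r
# Hypothetical numeric values from one row group
vals <- c(2.34567, NA, 2.5)

# Hypothetical helper: convert numeric values to strings, then combine
# the unique, sorted strings (NA removal and "," are assumptions)
to_string <- function(x, num_to_string_func = as.character) {
   x <- x[!is.na(x)]
   paste(sort(unique(as.character(num_to_string_func(x)))), collapse = ",")
}

# default conversion retains the full numeric value
full <- to_string(vals)

# limit the output to three significant digits
rounded <- to_string(vals, function(x) signif(x, digits = 3))

print(full)
print(rounded)
```

Rounding before the unique step can also merge values that differ only beyond the retained digits, which shortens the combined string.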

jmw86069/jamses documentation built on Nov. 4, 2024, 9:25 p.m.