autoSquash: Automated data squashing

autoSquash R Documentation

Automated data squashing

Description

autoSquash squashes data by calling squashData once for each count (N), removing the need to call squashData repeatedly on the same data set.

Usage

autoSquash(
  data,
  keep_pts = c(100, 75, 50, 25),
  cut_offs = c(500, 1000, 10000, 1e+05, 5e+05, 1e+06, 5e+06),
  num_super_pts = c(50, 75, 150, 500, 750, 1000, 2000, 5000)
)

Arguments

data

A data frame (typically from processRaw) containing columns named N, E, and (possibly) weight. Can contain additional columns, which will be ignored.

keep_pts

A vector of whole numbers for the number of points to leave unsquashed for each count (N). See the 'Details' section.

cut_offs

A vector of whole numbers for the cutoff values of unsquashed data used to determine how many "super points" to end up with after squashing each count (N). See the 'Details' section.

num_super_pts

A vector of whole numbers for the number of "super points" to end up with after squashing each count (N). Length must be 1 more than length of cut_offs. See the 'Details' section.

Details

See squashData for details on squashing a given count (N).

The elements in keep_pts determine how many points are left unsquashed for each count (N). The first element in keep_pts is used for the smallest N (usually 1). Each successive element is used for each successive N. Once the last element is reached, it is used for all other N.
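The rule above can be sketched in base R. This is only an illustration, not the package's internal code: the helper name keep_for_counts is hypothetical, and it assumes that "successive N" means the next distinct count present in the data.

```r
# Hypothetical helper (not part of openEBGM): shows how the elements of
# keep_pts could be matched to the sorted distinct counts.
keep_for_counts <- function(counts, keep_pts = c(100, 75, 50, 25)) {
  n_vals <- sort(unique(counts))
  # Position of each distinct count, capped at the last element of keep_pts
  # so that the last element is reused for all remaining counts.
  idx <- pmin(seq_along(n_vals), length(keep_pts))
  stats::setNames(keep_pts[idx], n_vals)
}

# Six distinct counts, four keep_pts values: the last value (25) is reused.
keep_for_counts(c(1, 1, 2, 3, 3, 4, 5, 6))
```

With the default keep_pts, the counts 1 to 4 get 100, 75, 50, and 25 unsquashed points, and every larger count also gets 25.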

For counts that are squashed, cut_offs and num_super_pts determine how the points are squashed. For instance, by default, if a given N contains fewer than 500 points to be squashed, then those points are squashed to 50 "super points".
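The lookup from cut_offs to num_super_pts can be sketched with base R's findInterval. The helper name super_pts_for is hypothetical and the code is not taken from the package. The treatment of a value exactly equal to a cutoff is an assumption, since the text above only specifies the "fewer than" case.

```r
# Hypothetical helper (not part of openEBGM): picks the number of super
# points for a count from the number of points that remain to be squashed.
super_pts_for <- function(n_to_squash,
                          cut_offs = c(500, 1000, 10000, 1e+05, 5e+05, 1e+06, 5e+06),
                          num_super_pts = c(50, 75, 150, 500, 750, 1000, 2000, 5000)) {
  # num_super_pts needs one more element than cut_offs (one per interval).
  stopifnot(length(num_super_pts) == length(cut_offs) + 1)
  # findInterval() counts how many cutoffs are <= n_to_squash; adding 1
  # gives the matching position in num_super_pts.
  # Assumption: a value exactly equal to a cutoff falls in the higher bin.
  num_super_pts[findInterval(n_to_squash, cut_offs) + 1]
}

# Fewer than 500 points maps to the first element of num_super_pts (50).
super_pts_for(c(499, 750, 2e+06, 1e+07))
```

The length check mirrors the requirement that num_super_pts has one more element than cut_offs: k cutoffs split the range into k + 1 intervals, each needing its own super point count.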

Value

A data frame with column names N, E, and weight containing the reduced data set.

References

DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.

See Also

processRaw for data preparation and squashData for squashing individual counts

Examples

data.table::setDTthreads(2)  #only needed for CRAN checks
data(caers)
proc <- processRaw(caers)
table(proc$N)

squash1 <- autoSquash(proc)
ftable(squash1[, c("N", "weight")])

## Not run: squash2 <- autoSquash(proc, keep_pts = c(50, 5))
## Not run: ftable(squash2[, c("N", "weight")])

## Not run: 
  squash3 <- autoSquash(proc, keep_pts = 100,
                        cut_offs = c(250, 500),
                        num_super_pts = c(20, 60, 125))

## End(Not run)
## Not run: ftable(squash3[, c("N", "weight")])


openEBGM documentation built on Sept. 15, 2023, 1:08 a.m.