
View source: R/f_dataSquashing.R

`squashData` squashes data by binning expected counts, *E*, for a
given actual count, *N*, using bin means as the expected counts for
the reduced data set. The squashed points are weighted by bin size. Data
can be squashed to reduce the computational burden of estimating the
hyperparameters (see DuMouchel and Pregibon, 2001).

```
squashData(data, count = 1, bin_size = 50, keep_pts = 100,
           min_bin = 50, min_pts = 500)
```

| Argument | Description |
|---|---|
| `data` | A data frame (typically from `processRaw`) containing the *N* and *E* columns. |
| `count` | A non-negative scalar whole number for the count size, *N*, to squash. |
| `bin_size` | A scalar whole number (>= 2). |
| `keep_pts` | A non-negative scalar whole number for the number of points with the largest expected counts to leave unsquashed. Used to help prevent “oversquashing”. |
| `min_bin` | A positive scalar whole number for the minimum number of bins needed. Used to help prevent “oversquashing”. |
| `min_pts` | A positive scalar whole number for the minimum number of original (unsquashed) points needed for squashing. Used to help prevent “oversquashing”. |

`squashData` can be used iteratively (`count = 1`, then `count = 2`, etc.), passing the output of one call as the `data` argument of the next.

The *N* column in `data` will be coerced using `as.integer`, and *E* will
be coerced using `as.numeric`. Missing data are not allowed.
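Because of this coercion, it can help to inspect the columns before squashing. The snippet below is a small base R illustration of how `as.integer` and `as.numeric` behave, together with a check for missing values. The data frame `raw` and its values are made up for the example.

```r
# Made-up input with N stored as numeric and E stored as character.
raw <- data.frame(N = c(0, 1, 2.9), E = c("0.5", "1.25", "3"),
                  stringsAsFactors = FALSE)

n_int <- as.integer(raw$N)   # truncates toward zero: 2.9 becomes 2
e_num <- as.numeric(raw$E)   # parses numeric strings

# Strings that cannot be parsed become NA (with a warning), which would
# violate the no-missing-data requirement.
bad <- suppressWarnings(as.numeric(c("1.5", "abc")))

anyNA(n_int) || anyNA(e_num)  # FALSE: this input has no missing values
anyNA(bad)                    # TRUE
```

Note that `as.integer` silently truncates non-whole values, so a count column containing fractions is worth checking before it is passed on.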

Since the distribution of expected counts, *E*, tends to be
skewed to the right, the largest *E*s are not squashed by default.
This behavior can be changed by setting the `keep_pts` argument to
zero (0); however, this is not recommended. Squashing the largest *E*s
could result in a large loss of information, so it is recommended to use a
value of 100 or more for `keep_pts`.

Values for `keep_pts`, `min_bin`, and `min_pts` should typically be at
least as large as the default values.
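The binning idea described above can be illustrated in base R. This is a minimal sketch of the concept only, not the package's implementation of `squashData`: the helper name `squash_sketch` is hypothetical, it ignores `min_bin` and `min_pts`, it assumes the input carries no existing weights, and forming bins from consecutive sorted *E* values is an assumption made for the illustration.

```r
# Hypothetical helper illustrating squashing for a single count size.
# Assumes columns N and E, no missing values, and no pre-existing weights.
squash_sketch <- function(data, count = 1, bin_size = 50, keep_pts = 100) {
  N <- as.integer(data$N)
  E <- as.numeric(data$E)
  stopifnot(!anyNA(N), !anyNA(E))

  is_target <- N == count
  e_sorted <- sort(E[is_target])            # ascending, largest E last
  n_target <- length(e_sorted)

  n_keep <- min(keep_pts, n_target)         # largest E values stay unsquashed
  n_squash <- n_target - n_keep
  to_squash <- e_sorted[seq_len(n_squash)]
  kept <- e_sorted[seq_len(n_keep) + n_squash]

  # Assign consecutive sorted values to bins of at most bin_size points.
  bin_id <- ceiling(seq_len(n_squash) / bin_size)
  bins <- split(to_squash, bin_id)
  bin_mean <- unname(vapply(bins, mean, numeric(1)))
  bin_n <- unname(lengths(bins))

  # Each bin becomes one point at the bin mean, weighted by bin size.
  squashed <- data.frame(N = rep(count, length(bin_mean) + n_keep),
                         E = c(bin_mean, kept),
                         weight = c(bin_n, rep(1, n_keep)))
  # Points with other counts pass through with weight 1.
  untouched <- data.frame(N = N[!is_target], E = E[!is_target],
                          weight = rep(1, sum(!is_target)))
  rbind(squashed, untouched)
}

# Made-up data: seven points with N = 1 and two points with N = 2.
dat <- data.frame(N = c(rep(1, 7), 2, 2),
                  E = c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 5.0, 0.5, 0.7))
reduced <- squash_sketch(dat, count = 1, bin_size = 3, keep_pts = 1)
reduced
```

In this sketch the weights sum to the original number of points and the weighted sum of *E* matches the unsquashed total, which is why weighting by bin size keeps the reduced data set representative of the original.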

A data frame with column names *N*, *E*, and
*weight* containing the reduced data set.

DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for
Multi-item Associations." In *Proceedings of the Seventh ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining*, KDD '01,
pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.

`processRaw` for data preparation and `autoSquash` for automatically
squashing an entire data set in one function call.

```
library(openEBGM)

set.seed(483726)
# Toy data: 26 points with actual counts N and simulated expected counts E
dat <- data.frame(var1 = letters[1:26], var2 = LETTERS[1:26],
                  N = c(rep(0, 11), rep(1, 10), rep(2, 4), rep(3, 1)),
                  E = round(abs(c(rnorm(11, 0), rnorm(10, 1), rnorm(4, 2),
                                  rnorm(1, 3))), 3)
)

# Squash iteratively: N = 0 first, then N = 1 (the default count), then N = 2
(zeroes <- squashData(dat, count = 0, bin_size = 3, keep_pts = 1,
                      min_bin = 2, min_pts = 2))
(ones <- squashData(zeroes, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))
(twos <- squashData(ones, count = 2, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))

# Compare keep_pts values (number of largest E values left unsquashed)
squashData(zeroes, bin_size = 2, keep_pts = 0,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 1,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 2,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 3,
           min_bin = 2, min_pts = 2)
```
