squashData | R Documentation |
squashData squashes data by binning expected counts, E, for a
given actual count, N, using bin means as the expected counts for
the reduced data set. The squashed points are weighted by bin size. Data
can be squashed to reduce the computational burden of estimating the
hyperparameters (see DuMouchel and Pregibon, 2001).
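The binning idea described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: squash_sketch is a hypothetical helper, it handles a single count level only, and how it treats a short final bin is an assumption made for illustration.

```r
# Hypothetical helper (not part of the package): squash the expected counts E
# of points that all share the same actual count N.
squash_sketch <- function(E, bin_size = 50, keep_pts = 100) {
  E <- sort(as.numeric(E))               # ascending, so the largest Es come last
  n <- length(E)
  n_keep <- min(keep_pts, n)             # largest points left unsquashed
  n_squash <- n - n_keep
  if (n_squash == 0) {
    return(data.frame(E = E, weight = rep(1, n)))
  }
  to_squash <- E[seq_len(n_squash)]
  kept <- E[n_squash + seq_len(n_keep)]
  # Assumption: consecutive sorted points share a bin; the last bin may be short
  bin_id <- ceiling(seq_along(to_squash) / bin_size)
  bin_mean <- as.numeric(tapply(to_squash, bin_id, mean))
  bin_n <- as.numeric(tapply(to_squash, bin_id, length))
  data.frame(E = c(bin_mean, kept), weight = c(bin_n, rep(1, n_keep)))
}

E <- c(0.2, 0.4, 0.6, 0.8, 1.0, 5.0, 9.0)
res <- squash_sketch(E, bin_size = 2, keep_pts = 2)
res
```

In this sketch the weights sum to the original number of points, and the weighted sum of the bin means equals the sum of the original expected counts, which is why the weight column matters for later estimation.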
squashData(
data,
count = 1,
bin_size = 50,
keep_pts = 100,
min_bin = 50,
min_pts = 500
)
data | A data frame (typically from
count | A non-negative scalar whole number for the count size, N, used for binning
bin_size | A scalar whole number (>= 2)
keep_pts | A nonnegative scalar whole number for number of points with the largest expected counts to leave unsquashed. Used to help prevent “oversquashing”.
min_bin | A positive scalar whole number for the minimum number of bins needed. Used to help prevent “oversquashing”.
min_pts | A positive scalar whole number for the minimum number of original (unsquashed) points needed for squashing. Used to help prevent “oversquashing”.
Can be used iteratively (count = 1, then 2, etc.).
The N column in data will be coerced using as.integer, and E will be coerced using as.numeric. Missing data are not allowed.
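A quick pre-check along these lines can catch problems before squashing. This is only a sketch in base R; the data frame raw and its values are made up for illustration, and the function performs its own coercion regardless.

```r
# Made-up input in which N and E arrived as character columns
raw <- data.frame(
  N = c("0", "1", "2"),
  E = c("0.5", "1.25", "3"),
  stringsAsFactors = FALSE
)

# Apply the same coercions the documentation describes
raw$N <- as.integer(raw$N)
raw$E <- as.numeric(raw$E)

# Missing values are not allowed, so fail early if coercion produced any NA
stopifnot(!anyNA(raw$N), !anyNA(raw$E))
str(raw)
```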
Since the distribution of expected counts, E, tends to be
skewed to the right, the largest Es are not squashed by default.
This behavior can be changed by setting the keep_pts argument to
zero (0); however, this is not recommended. Squashing the largest Es
could result in a large loss of information, so it is recommended to use a
value of 100 or more for keep_pts.
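A small made-up example shows why binning the right tail is costly. The values and the bin size of 2 below are assumptions chosen for illustration; the code only measures how far apart the points inside each bin are.

```r
# Made-up, right-skewed expected counts, already sorted
E <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2, 10)

# Put consecutive points into bins of two
bin_id <- ceiling(seq_along(E) / 2)

# Within-bin spread: the distance between the smallest and largest E in a bin
spread <- as.numeric(tapply(E, bin_id, function(x) diff(range(x))))
bin_mean <- as.numeric(tapply(E, bin_id, mean))

data.frame(bin = seq_along(spread), bin_mean = bin_mean, spread = spread)
```

The last bin replaces two very different points (2 and 10) with a single mean of 6, while the lower bins lose almost nothing. Leaving the largest points unsquashed avoids that loss.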
Values for keep_pts, min_bin, and min_pts should typically be at least as large as the default values.
A data frame with column names N, E, and weight containing the reduced data set.
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
processRaw for data preparation and autoSquash for automatically squashing an entire data set in one function call.
set.seed(483726)
# Toy data: 26 var1-var2 pairs with actual counts N and expected counts E
dat <- data.frame(
  var1 = letters[1:26], var2 = LETTERS[1:26],
  N = c(rep(0, 11), rep(1, 10), rep(2, 4), rep(3, 1)),
  E = round(abs(c(rnorm(11, 0), rnorm(10, 1), rnorm(4, 2), rnorm(1, 3))), 3),
  stringsAsFactors = FALSE
)

# Squash iteratively, one count at a time (count = 0, then 1, then 2)
(zeroes <- squashData(dat, count = 0, bin_size = 3, keep_pts = 1,
                      min_bin = 2, min_pts = 2))
(ones <- squashData(zeroes, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))
(twos <- squashData(ones, count = 2, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))

# Compare the effect of keep_pts when squashing the points with N = 1
squashData(zeroes, bin_size = 2, keep_pts = 0,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 1,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 2,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 3,
           min_bin = 2, min_pts = 2)