knitr::opts_chunk$set(echo = TRUE) library(ggplot2) library(reshape) library(rbtt) library(microbenchmark) library(parallel) options(digits = 4)
rbtt is an alternative bootstrap-based $t$-test aiming to reduce type-I error for non-negative, zero-inflated data
Tu & Zhou (1999) showed that comparing the means of populations whose data-generating distributions are non-negative with excess zero observations is a problem of great importance in the analysis of medical cost data. In the same study, Tu & Zhou discuss that it can be difficult to control type-I error rates of general-purpose statistical tests for comparing the means of these particular data sets. This package allows users to perform a modified bootstrap-based t-test that aims to better control type-I error rates in these situations.
Let's say we have some non-negative data with clumping at zero:
x <- rbinom(50, 1, 0.5) * rlnorm(50, 0, 1) y <- rbinom(150, 1, 0.3) * rlnorm(150, 2, 1)
names(x) <- rep("x", length(x)) names(y) <- rep("y", length(y)) data <- c(x, y) ggdf <- data.frame("value" = data, "variable" = names(data)) ggplot(ggdf, aes(x = value, fill = variable)) + geom_density(alpha = 0.5, position = "identity") + xlim(0, 30) + ggtitle("Densities of x and y (truncated plot at value = 30)")
Then we may compute rbtt-based $t$-tests to compare the means:
# Use ‘method = 1’ for a two-sample, two-sided rbtt under the equal variance assumption, rbtt(x, y, n.boot=999, method = 1) # Use ’method = 2' for a two-sample, one-sided rbtt without the equal variance assumption rbtt(x, y, n.boot=999, method = 2)
Alternatively, you can specify method = "both"
to perform both methods simultaneously (this is also done by default).
# Compare speed when using single-core versus multiple-core rbtt on 99999 bootstrap resamples system.time(rbtt(x, y, n.boot = 99999, method = 1, n.cores = 1)) system.time(rbtt(x, y, n.boot = 99999, method = 1, n.cores = 3))
First, we perform some simulations.
n.sim <- 999 t.test.results <- numeric(n.sim) rbtt.results <- numeric(n.sim) pval.table.list <- mclapply(1:n.sim, function(i) { # True means are equal x <- rbinom(50, 1, 0.5) * rlnorm(50, 1.15, 1) y <- rbinom(150, 1, 0.5) * rlnorm(150, 1.15, 1) t.test.result <- t.test(x, y)$p.value rbtt.result <- rbtt(x, y, n.boot = 999, method = 1)$p.value return(c(t.test.result, rbtt.result)) }, mc.cores = 4) pval.table <- do.call(rbind, pval.table.list)
Now, let's evaluate the type-I error of these simulations using a significance level of 0.05.
# t.test type-I error with significance level of 0.05: sum(pval.table[,1] < 0.05) / n.sim # rbtt type-I error with significance level of 0.05: sum(pval.table[,2] < 0.05) / n.sim
More accurate p-values and type-I error estimates can be obtained by increasing n.boot
and n.sim
, respectively
Ian Waudby-Smith (University of Waterloo)
Dr. Pengfei Li (University of Waterloo)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.