sim_nb: Simulate data from a NB distribution

View source: R/sim_nb.r

sim_nbR Documentation

Simulate data from a NB distribution

Description

Simulate data from two independent negative binomial (NB) distributions. For paired data, see sim_bnb().

Usage

sim_nb(
  n1,
  n2 = n1,
  mean1,
  mean2,
  ratio,
  dispersion1,
  dispersion2 = dispersion1,
  nsims = 1L,
  return_type = "list",
  max_zeros = 0.99,
  ncores = 1L
)

Arguments

n1

(integer: ⁠[2, Inf)⁠)
The sample size(s) of group 1.

n2

(integer: n1; ⁠[2, Inf)⁠)
The sample size(s) of group 2.

mean1

(numeric: ⁠(0, Inf)⁠)
The mean(s) of group 1 (\mu_1).

mean2, ratio

(numeric: ⁠(0, Inf)⁠)
Only specify one of these arguments.

  • mean2: The mean(s) of group 2 (\mu_2).

  • ratio: The ratio(s) of means for group 2 with respect to group 1 \left( r = \frac{\mu_2}{\mu_1} \right).

mean2 = ratio * mean1

dispersion1

(numeric: ⁠(0, Inf)⁠)
The dispersion parameter(s) of group 1 (\theta_1). See 'Details' and 'Examples'.

dispersion2

(numeric: dispersion1; ⁠(0, Inf)⁠)
The dispersion parameter(s) of group 2 (\theta_2). See 'Details' and 'Examples'.

nsims

(Scalar integer: 1L; ⁠[1,Inf)⁠)
The expected number of simulated data sets. If nsims > 1, the data is returned in a list-column of a depower simulation data frame. nsims may be reduced depending on max_zeros.

return_type

(string: "list"; c("list", "data.frame"))
The data structure of the simulated data. If "list" (default), a list object is returned. If "data.frame" a data frame in tall format is returned. The list object provides computational efficiency and the data frame object is convenient for formulas. See 'Value'.

max_zeros

(Scalar numeric: 0.99; ⁠[0, 1]⁠)
The maximum proportion of zeros each group in a simulated dataset is allowed to have. If the proportion of zeros is greater than this value, the corresponding data is excluded from the set of simulations. This is most likely to occur when the sample size is small and the dispersion parameter is small.

ncores

(Scalar integer: 1L; ⁠[1,Inf)⁠)
The number of cores (number of worker processes) to use. Do not set greater than the value returned by parallel::detectCores(). May be helpful when the number of parameter combinations is large and nsims is large.

Details

The negative binomial distribution has many parameterizations. In the regression modeling context, it is common to specify the distribution in terms of its mean and dispersion. We use the following probability mass function:

\begin{aligned} P(X = x) &= \dbinom{x + \theta - 1}{x} \left( \frac{\theta}{\theta + \mu} \right)^{\theta} \left( \frac{\mu}{\mu + \theta} \right)^x \\ &= \frac{\Gamma(x + \theta)}{x! \Gamma(\theta)} \left( \frac{\theta}{\theta + \mu} \right)^{\theta} \left( \frac{\mu}{\mu + \theta} \right)^{x} \\ &= \frac{\Gamma(x + \theta)}{(\theta + \mu)^{\theta + x}} \frac{\theta^{\theta}}{\Gamma(\theta)} \frac{\mu^{x}}{x!} \end{aligned}

where x \in \mathbb{N}^{\geq 0}, \theta \in \mathbb{R}^{> 0} is the dispersion parameter, and \mu \in \mathbb{R}^{> 0} is the mean. This is analogous to the typical formulation where X is counting x failures given \theta successes and p = \frac{\theta}{\theta + \mu} is the probability of success on each trial. It follows that E(X) = \mu and Var(X) = \mu + \frac{\mu^2}{\theta}. The \theta parameter describes the 'dispersion' among observations. Smaller values of \theta lead to overdispersion and larger values of \theta decrease the overdispersion, eventually converging to the Poisson distribution.

Described above is the 'indirect quadratic parameterization' of the negative binomial distribution, which is commonly found in the R ecosystem. However, it is somewhat counterintuitive because the smaller \theta gets, the larger the overdispersion. The 'direct quadratic parameterization' of the negative binomial distribution may be found in some R packages and other languages such as SAS and Stata. The direct parameterization is defined by substituting \alpha = \frac{1}{\theta} (\alpha > 0) which results in Var(X) = \mu + \alpha\mu^2. In this case, the larger \alpha gets the larger the overdispersion, and the Poisson distribution is a special case of the negative binomial distribution where \alpha = 0.

A general class of negative binomial models may be defined with mean \mu and variance \mu + \alpha\mu^{p}. The 'linear parameterization' is then found by setting p=1, resulting in Var(X) = \mu + \alpha\mu. It's common to label the linear parameterization as 'NB1' and the direct quadratic parameterization as 'NB2'.

See 'Details' in sim_bnb() for additional information on the gamma-Poisson mixture formulation of the negative binomial distribution.

Value

If nsims = 1 and the number of unique parameter combinations is one, the following objects are returned:

  • If return_type = "list", a list:

    Slot Name Description
    1 Simulated counts from group 1.
    2 Simulated counts from group 2.
  • If return_type = "data.frame", a data frame:

    Column Name Description
    1 item Subject/item indicator.
    2 condition Group/condition indicator.
    3 value Simulated counts.

If nsims > 1 or the number of unique parameter combinations is greater than one, each object described above is returned in a list-column named data in a depower simulation data frame:

Column Name Description
1 n1 Sample size of group 1.
2 n2 Sample size of group 2.
3 mean1 Mean for group 1.
4 mean2 Mean for group 2.
5 ratio Ratio of means (group 2 / group 1).
6 dispersion1 Dispersion parameter for group 1.
7 dispersion2 Dispersion parameter for group 2.
8 nsims Number of valid simulation iterations.
9 distribution Distribution sampled from.
10 data List-column of simulated data.

References

\insertRef

yu_2017depower

\insertRef

rettiganti_2012depower

\insertRef

aban_2009depower

\insertRef

hilbe_2011depower

\insertRef

hilbe_2014depower

\insertRef

cameron_2013depower

See Also

sim_bnb(), stats::rnbinom()

Examples

#----------------------------------------------------------------------------
# sim_nb() examples
#----------------------------------------------------------------------------
library(depower)

# Independent two-sample NB data returned in a data frame
sim_nb(
  n1 = 10,
  mean1 = 5,
  ratio = 1.6,
  dispersion1 = 0.5,
  dispersion2 = 0.5,
  nsims = 1,
  return_type = "data.frame"
)

# Independent two-sample NB data returned in a list
sim_nb(
  n1 = 10,
  mean1 = 5,
  ratio = 1.6,
  dispersion1 = 0.5,
  dispersion2 = 0.5,
  nsims = 1,
  return_type = "list"
)

# Two simulations of independent two-sample data
# returned as a list of data frames
sim_nb(
  n1 = 10,
  mean1 = 5,
  ratio = 1.6,
  dispersion1 = 0.5,
  dispersion2 = 0.5,
  nsims = 2,
  return_type = "data.frame"
)

# Two simulations of independent two-sample data
# returned as a list of lists
sim_nb(
  n1 = 10,
  mean1 = 5,
  ratio = 1.6,
  dispersion1 = 0.5,
  dispersion2 = 0.5,
  nsims = 2,
  return_type = "list"
)

#----------------------------------------------------------------------------
# Visualization of the NB distribution as dispersion varies between groups.
#----------------------------------------------------------------------------
disp <- expand.grid(c(1, 10, 100), c(1, 10, 100))
set.seed(1234)
data <- mapply(
  FUN = function(disp1, disp2) {
    d <- sim_nb(
      n1 = 1000,
      mean1 = 10,
      ratio = 1.5,
      dispersion1 = disp1,
      dispersion2 = disp2,
      nsims = 1,
      return_type = "data.frame"
    )
    cbind(dispersion1 = disp1, dispersion2 = disp2, d)
  },
  disp1 = disp[[1]],
  disp2 = disp[[2]],
  SIMPLIFY = FALSE
)

data <- do.call(what = "rbind", args = data)

ggplot2::ggplot(
  data = data,
  mapping = ggplot2::aes(x = value, fill = condition)
) +
  ggplot2::facet_grid(
    rows = ggplot2::vars(.data$dispersion2),
    cols = ggplot2::vars(.data$dispersion1),
    labeller = ggplot2::labeller(
      .rows = ggplot2::label_both,
      .cols = ggplot2::label_both
    )
  ) +
  ggplot2::geom_density(alpha = 0.3) +
  ggplot2::coord_cartesian(xlim = c(0, 50)) +
  ggplot2::labs(
    x = "Value",
    y = "Density",
    fill = "Condition",
    caption = "Mean1=10, Mean2=15, ratio=1.5"
  )


depower documentation built on April 3, 2025, 9:23 p.m.