poisthin: Apply Poisson thinning to a matrix of count data.
In seqgendiff: RNA-Seq Generation/Modification for Simulation

poisthin

R Documentation

Apply Poisson thinning to a matrix of count data.

Description

This is now defunct. Please try out select_counts and thin_2group.

Usage

poisthin(
  mat,
  nsamp = nrow(mat),
  ngene = ncol(mat),
  gselect = c("max", "random", "rand_max", "custom", "mean_max"),
  gvec = NULL,
  skip_gene = 0L,
  signal_fun = stats::rnorm,
  signal_params = list(mean = 0, sd = 1),
  prop_null = 1,
  alpha = 0,
  group_assign = c("frac", "random", "cor"),
  group_prop = 0.5,
  corvec = NULL
)

Arguments

`mat`	A matrix of count data. The rows index the individuals and the columns index the genes.
`nsamp`	The number of samples to select from `mat`.
`ngene`	The number of genes to select from `mat`.
`gselect`	How should we select the subset of genes? Should we choose the `ngene` most median expressed genes (`"max"`), a random sample of the genes (`"random"`), a random sample of the most expressed genes (`"rand_max"`), a user-provided list (`"custom"`), or by maximum mean expression level (`"mean_max"`)? If `"custom"`, then `gvec` should be specified. Expression levels of a gene are measured by median expression across individuals with ties broken by mean expression.
`gvec`	A logical of length `ncol(mat)`. A `TRUE` in position `i` indicates inclusion into the smaller dataset. Hence, `sum(gvec)` should equal `ngene`.
`skip_gene`	The number of maximally expressed genes to skip. Not used if `gselect = "custom"`.
`signal_fun`	A function that returns the signal. This should take as input `n` for the number of samples to return and then return only a vector of samples.
`signal_params`	A list of additional arguments to pass to `signal_fun`.
`prop_null`	The proportion of genes that are null.
`alpha`	If `b` is an effect and `s` is an empirical standard deviation, then we model `b/s^\alpha` as being exchangeable.
`group_assign`	How should we assign groups? Exactly specifying the proportion of individuals in each group (`"frac"`), with a Bernoulli distribution (`"random"`), or correlated with latent factors (`"cor"`)? If `group_assign = "cor"`, then you have to specify `corvec`. If `group_assign = "frac"` or `group_assign = "random"`, then the proportion of samples in each group is specified with the `group_prop` argument.
`group_prop`	The proportion of individuals that are in group 1. This proportion is deterministic if `group_assign = "frac"`, and is the expected proportion if `group_assign = "random"`. This argument is not used if `group_assign = "cor"`.
`corvec`	A vector of correlations. `corvec[i]` is the correlation of the latent group assignment vector with the ith latent confounder. Only used if `group_assign = "cor"`. This vector is constrained so that `crossprod(corvec) < 1`. The number of latent factors is taken to be the length of corvec. Note that the correlations of the latent factors with the observed group-assignment vector (instead of the latent group-assignment vector) will be `corvec * sqrt(2 / pi)`.

Details

Given a matrix of RNA-seq counts, this function will randomly select two groups of samples and add signal to a known proportion of the genes. This signal is the log (base 2) effect size of the group indicator in a linear model. The user may specify the distribution of the effects.

The Poisson thinning approach first randomly assigns samples to be in one of two groups. Then, given this assignment, will Binomially sample counts with a sample size of the gene expression counts and a probability that is a function of the effect size. For details, see Gerard and Stephens (2021).

Value

A list with the following elements:

Y: A matrix of altered counts with nsamp rows and ngene columns.
X: A design matrix. The first column contains a vector ones (for an intercept term) and the second column contains an indicator for group membership.
beta: The approximately true effect sizes of log(Y) ~ X\beta.
corassign: The output from the call to corassign. Only returned if group_assign = "cor".

Author(s)

David Gerard

References

Gerard, D., and Stephens, M. (2021). "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.5705/ss.202018.0345")}.

Examples

## Simulate data from given matrix of counts
## In practice, you would obtain Y from a real dataset, not simulate it.
set.seed(1)
nsamp <- 10
ngene <- 1000
Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene)


## Apply thinning
poisout <- poisthin(mat           = t(Y),
                    nsamp         = 9,
                    ngene         = 999,
                    signal_fun    = stats::rnorm,
                    signal_params = list(mean = 0, sd = 1),
                    prop_null     = 0.9)

## Dimension of count matrix is smaller.
dim(poisout$Y)

## Can verify signal was added by estimating it with lm().
betahat <- coef(lm(log2(poisout$Y + 1) ~ poisout$X[, 2]))[2, ]
plot(poisout$beta, betahat, xlab = "Coefficients", ylab = "Estimates")
abline(0, 1, col = 2, lty = 2)

seqgendiff documentation built on June 22, 2024, 7 p.m.