error_hierarchicell: Compute Type 1 Error for Single Cell Expression Case-Control...


View source: R/04_compute_error.R

Description

Computes type 1 error for single cell data that is cell-type specific, hierarchical, and compositional. This function computes type 1 error with the single-cell differential expression analysis tool 'MAST', using random effects to account for the correlation structure that exists among measures from cells within an individual. The type 1 error calculations borrow information from the input data (or the package default data) to simulate data under a variety of pre-determined conditions. These conditions include fold change, number of genes, number of samples (i.e., independent experimental units), and the mean number of cells per individual.

Usage

error_hierarchicell(
  data_summaries,
  method = "MAST_RE",
  n_genes = 10000,
  n_per_group = 3,
  n_cases = n_per_group,
  n_controls = n_per_group,
  cells_per_case = 100,
  cells_per_control = 100,
  ncells_variation_type = "Poisson",
  pval = 0.05,
  foldchange = 1,
  decrease_dropout = 0,
  alter_dropout_cases = 0
)

Arguments

data_summaries

an R object that has been output by the package's compute_data_summaries function. No default

method

a name. The method for differential expression to be used for the computation of error. Possible methods include: MAST with random effects ("MAST_RE"), MAST ("MAST"), MAST with batch effect correction ("MAST_Combat"), GLM assuming a Tweedie distribution ("GLM_tweedie"), GLMM assuming a Tweedie distribution ("GLMM_tweedie"), generalized estimating equations ("GEE1"), ROTS ("ROTS"), or Monocle ("Monocle"). Defaults to "MAST_RE", which is the currently recommended analysis pipeline for single-cell data. See de_methods for more details on each of the methods.

n_genes

an integer. The number of genes you would like to simulate for your dataset. Too large a number may cause memory failure and may slow the simulation down tremendously. We recommend an integer less than 40,000. Defaults to 10,000.

n_per_group

an integer. The number of independent samples per case/control group for simulation. Creates a balanced design; for unbalanced designs, specify n_cases and n_controls separately. If not specifying a foldchange, the number of cases and controls does not matter. Defaults to 3.

n_cases

an integer. The number of independent case samples for simulation. Defaults to n_per_group.

n_controls

an integer. The number of independent control samples for simulation. Defaults to n_per_group.

cells_per_case

an integer. The mean number of cells per case you would like to simulate. Too large a number may cause memory failure and may slow the simulation down tremendously. We recommend an integer less than 300, but more is possible. We note that anything greater than 100 brings only marginal improvements in type 1 error. Defaults to 100.

cells_per_control

an integer. The mean number of cells per control you would like to simulate. Too large a number may cause memory failure and may slow the simulation down tremendously. We recommend an integer less than 300, but more is possible. We note that anything greater than 100 brings only marginal improvements in type 1 error. Defaults to 100.

ncells_variation_type

either "Poisson", "NB", or "Fixed". Controls how the number of cells per individual is drawn. "Fixed" sets it to exactly the specified number of cells per individual. "Poisson" lets it vary slightly, following a Poisson distribution with lambda equal to the specified number of cells per individual. "NB" draws it from a negative binomial distribution with mean equal to the specified number of cells and dispersion size equal to one. Defaults to "Poisson".
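
To make the three options concrete, here is a minimal base-R sketch of draws matching the description above. It is not the package's internal code; the variable names (n_individuals, mean_cells, and the cells_* vectors) are hypothetical and chosen only for illustration.

```r
# Illustrative only: base-R draws that mirror the three options as described
# in the argument text. This is not taken from the hierarchicell source.
set.seed(1)
n_individuals <- 4    # hypothetical number of individuals
mean_cells    <- 100  # plays the role of cells_per_case / cells_per_control

# "Fixed": every individual gets exactly the specified number of cells
cells_fixed <- rep(mean_cells, n_individuals)

# "Poisson": counts vary slightly around the mean (variance equals the mean)
cells_poisson <- rpois(n_individuals, lambda = mean_cells)

# "NB": negative binomial with the same mean and dispersion size of one
cells_nb <- rnbinom(n_individuals, size = 1, mu = mean_cells)

rbind(Fixed = cells_fixed, Poisson = cells_poisson, NB = cells_nb)
```

With a dispersion size of one, the negative binomial variance is mu + mu^2, so "NB" yields far more uneven cell counts across individuals than "Poisson" does.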

pval

a number. The significance threshold (alpha) to use. Defaults to 0.05. Can also be a vector of p-value thresholds, up to a length of 5.
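
As a rough illustration of what a vector of thresholds means here, the sketch below computes an empirical type 1 error rate as the share of null p-values falling below each alpha. It assumes the usual definition of type 1 error and uses simulated uniform p-values as a stand-in for a well-calibrated test; null_pvalues, alphas, and type1_error are hypothetical names, not objects returned by the package.

```r
# Illustrative only: empirical type 1 error at several thresholds.
# Assumption: under the null, a well-calibrated test gives p-values that are
# uniform on [0, 1], so runif() stands in for real test output here.
set.seed(1)
null_pvalues <- runif(10000)       # hypothetical p-values, one per gene
alphas <- c(0.05, 0.01, 0.001)     # a vector of thresholds (at most 5)

# Proportion of null genes that would be called significant at each alpha
type1_error <- sapply(alphas, function(a) mean(null_pvalues < a))
names(type1_error) <- alphas
type1_error
```

For a well-calibrated method each rate should sit close to its alpha; rates well above alpha indicate inflated type 1 error.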

foldchange

a number between 1 and 10. The amount of fold change to simulate a difference in expression between case and control groups. The foldchange changes genes in either direction, so a foldchange of 2 would cause the mean expression in cases to be either twice the amount or half the amount for any particular gene. Defaults to 1.
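
The "either direction" behaviour can be sketched in a few lines of base R. This is only an illustration of the description above: the gene names, the control means, and the random choice of direction per gene are assumptions, not the package's implementation.

```r
# Illustrative only: applying a fold change of 2 in either direction.
set.seed(1)
foldchange <- 2
control_means <- c(geneA = 10, geneB = 40, geneC = 5)  # hypothetical values

# Assumption: each gene is randomly assigned up (+1) or down (-1)
direction <- sample(c(1, -1), length(control_means), replace = TRUE)

# Case mean is either twice or half the control mean for each gene
case_means <- control_means * foldchange^direction
case_means / control_means
```

With foldchange = 1 the ratio is 1 for every gene, which is the null setting that a type 1 error calculation relies on.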

decrease_dropout

a numeric proportion between 0 and 1. The proportion by which you would like to simulate decreasing the amount of dropout in your data. For example, if you would like to simulate a decrease in the amount of dropout in your data by twenty percent, then 0.2 would be appropriate. This component of the simulation allows the user to adjust the proportion of dropout if they believe future experiments or runs will have improved calling rates (due to improved methods or improved cell viability) and thereby lower dropout rates. Defaults to 0.

alter_dropout_cases

a numeric proportion between 0 and 1. The proportion by which you would like to simulate decreasing the amount of dropout between case and control groups. For example, if you would like to simulate a decrease in the amount of dropout in your cases by twenty percent, then 0.2 would be appropriate. This component of the simulation allows the user to adjust the proportion of dropout if they believe the stochastic expression of a gene will differ between cases and controls. For a two-part hurdle model, like the one MAST implements, this will increase your ability to detect differences. Defaults to 0.

Details

Before running the error_hierarchicell function, run the filter_counts function followed by the compute_data_summaries function to build an R object in the format that this simulation function expects.

Value

The estimated error under the specified conditions when using 'MAST' with random effects to account for the correlation structure that exists among measures from cells within an individual.

Note

Data should be only for cells of the specific cell-type you are interested in simulating or computing type 1 error for. Data should also contain as many unique sample identifiers as possible. If you are inputting data that has fewer than 5 unique values for sample identifier (i.e., independent experimental units), then the empirical estimation of the inter-individual heterogeneity is going to be very unstable. Finding a dataset with enough samples will be difficult at this time, but over time (as experiments grow in sample size and the number of publicly available single-cell RNAseq datasets increases), this should improve dramatically.

Examples

clean_expr_data <- filter_counts()
data_summaries <- compute_data_summaries(clean_expr_data)
error_hierarchicell(data_summaries,
                   n_genes = 100,
                   n_per_group = 4,
                   cells_per_case = 50,
                   cells_per_control = 50)

kdzimm/hierarchicell documentation built on Dec. 21, 2021, 5:23 a.m.