sccomp_glm: DEPRECATED - sccomp_glm main

View source: R/methods_OLD_framework.R

sccomp_glm R Documentation

DEPRECATED - sccomp_glm main

Description

The function for linear modelling takes as input a table of cell counts with three columns (a cell-group identifier, a sample identifier and an integer count) plus the factor columns (continuous or discrete). The user can define a linear model with an input R formula, where the first factor is the factor of interest. Alternatively, sccomp accepts single-cell data containers (Seurat, SingleCellExperiment, cell metadata or group-size). In this case, sccomp derives the count data from cell metadata.

Usage

sccomp_glm(
  .data,
  formula_composition = ~1,
  formula_variability = ~1,
  .sample,
  .cell_group,
  .count = NULL,
  contrasts = NULL,
  prior_mean_variable_association = list(intercept = c(5, 2), slope = c(0, 0.6),
    standard_deviation = c(20, 40)),
  check_outliers = TRUE,
  bimodal_mean_variability_association = FALSE,
  enable_loo = FALSE,
  cores = detectCores(),
  percent_false_positive = 5,
  approximate_posterior_inference = "none",
  test_composition_above_logit_fold_change = 0.1,
  .sample_cell_group_pairs_to_exclude = NULL,
  verbose = FALSE,
  noise_model = "multi_beta_binomial",
  exclude_priors = FALSE,
  use_data = TRUE,
  mcmc_seed = sample(1e+05, 1),
  max_sampling_iterations = 20000,
  pass_fit = TRUE
)

Arguments

.data

A tibble including a cell_group name column | sample name column | read counts column (optional depending on the input class) | factor columns.
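As an illustration of the expected layout, the following base R sketch builds a count table from cell-level metadata. The data and the column names (sample, cell_group, type, count) are hypothetical, and this is not how sccomp derives counts internally; it only shows the shape of table that .data describes.

```r
# Hypothetical cell-level metadata: one row per cell.
cell_metadata <- data.frame(
  sample     = c("s1", "s1", "s1", "s2", "s2", "s2", "s2"),
  cell_group = c("B", "T", "T", "B", "B", "T", "NK"),
  type       = c("healthy", "healthy", "healthy",
                 "cancer", "cancer", "cancer", "cancer")
)

# Count cells for every sample / cell_group combination (zeros included).
counts <- as.data.frame(
  table(sample = cell_metadata$sample, cell_group = cell_metadata$cell_group),
  responseName = "count",
  stringsAsFactors = FALSE
)

# Re-attach the sample-level factor column.
sample_factors <- unique(cell_metadata[, c("sample", "type")])
counts <- merge(counts, sample_factors, by = "sample")

# The count column should be of class integer.
counts$count <- as.integer(counts$count)
print(counts)
```

Each row is one sample/cell-group pair, with the factor of interest repeated for every row of the same sample.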

formula_composition

A formula. The formula describing the model for differential abundance, for example ~treatment.

formula_variability

A formula. The formula describing the model for differential variability, for example ~treatment. In most cases, if differential variability is of interest, the formula should only include the factor of interest, as a large amount of data is needed to estimate variability for each factor.

.sample

A column name as symbol. The sample identifier.

.cell_group

A column name as symbol. The cell_group identifier.

.count

A column name as symbol. The cell_group abundance (read count). Used only when the input is a data frame of counts. The variable in this column should be of class integer.

contrasts

A vector of character strings. For example, if your formula is ~ 0 + treatment and the factor treatment has values yes and no, your contrast could be contrasts = c("treatmentyes - treatmentno").
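The names used inside a contrast string follow the usual R design-matrix convention of factor name plus level. A minimal base R sketch, assuming a hypothetical two-level treatment factor and assuming that sccomp names its parameters the way model.matrix() names its columns (which the example above suggests):

```r
# Hypothetical sample-level design with a two-level factor.
design <- data.frame(treatment = factor(c("no", "yes", "no", "yes")))

# With ~ 0 + treatment there is no intercept, so each level gets its own column.
X <- model.matrix(~ 0 + treatment, data = design)
parameter_names <- colnames(X)
print(parameter_names)

# A contrast string is an arithmetic expression over those column names.
my_contrasts <- c("treatmentyes - treatmentno")
```

Inspecting the column names this way before writing the contrast helps avoid misspelled parameter names.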

prior_mean_variable_association

A list of the form list(intercept = c(5, 2), slope = c(0, 0.6), standard_deviation = c(20, 40)). For the intercept and slope parameters, we specify the mean and standard deviation; for the standard deviation, we specify the shape and rate. This is used to incorporate prior knowledge about the mean/variability association of cell-type proportions.
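To get a feel for what the default values imply, the sketch below draws from distributions parameterised the same way. The distribution families are an assumption made for illustration (normal for intercept and slope, gamma for the standard deviation); the text above only states which pairs of numbers are supplied, not the exact families used in the Stan model.

```r
set.seed(1)

prior <- list(
  intercept          = c(5, 2),     # mean, standard deviation
  slope              = c(0, 0.6),   # mean, standard deviation
  standard_deviation = c(20, 40)    # shape, rate
)

# Assumption: normal(mean, sd) for the intercept and slope.
intercept_draws <- rnorm(10000, mean = prior$intercept[1], sd = prior$intercept[2])
slope_draws     <- rnorm(10000, mean = prior$slope[1], sd = prior$slope[2])

# Assumption: gamma(shape, rate) for the standard deviation.
sd_draws <- rgamma(10000,
                   shape = prior$standard_deviation[1],
                   rate  = prior$standard_deviation[2])

# The mean of a gamma distribution is shape / rate.
sd_prior_mean <- prior$standard_deviation[1] / prior$standard_deviation[2]
print(sd_prior_mean)
```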

check_outliers

A boolean. Whether to check for outliers before the fit.

bimodal_mean_variability_association

A boolean. Whether to model the mean-variability as bimodal, as often needed in the case of single-cell RNA sequencing data, and not usually for CyTOF and microbiome data. The plot summary_plot()$credible_intervals_2D can be used to assess whether the bimodality should be modelled.

enable_loo

A boolean. Enable model comparison by the R package LOO. This is helpful when you want to compare the fit between two models, for example, analogously to ANOVA, between a one-factor model and an intercept-only model.

cores

An integer. How many cores to use for parallel calculations.

percent_false_positive

A real number between 0 and 100, exclusive. This is used to identify outliers with a specific false positive rate.

approximate_posterior_inference

A character string (default "none"). Whether the inference of the joint posterior distribution should be approximated with variational Bayes. Approximation reduces execution time.

test_composition_above_logit_fold_change

A positive real number. It is the effect threshold used for the hypothesis test. A value of 0.2 corresponds to a relative change in cell proportion of about 10% for a cell type with a baseline proportion close to 50%. That is, a cell type goes from 45% to 50%. When the baseline proportion is closer to 0 or 1, this effect threshold keeps the same value on the unconstrained logit scale.
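The arithmetic behind this threshold can be checked with the base R functions qlogis() (logit) and plogis() (inverse logit). The sketch below is only a worked illustration of the logit scale; the variable names are hypothetical and it does not call sccomp.

```r
# Logit difference between a 45% and a 50% proportion.
baseline <- 0.45
shifted  <- 0.50
logit_fold_change <- qlogis(shifted) - qlogis(baseline)
print(logit_fold_change)  # approximately 0.2

# The same 0.2 shift on the logit scale, applied to a rare cell type (5%).
rare_baseline <- 0.05
rare_shifted  <- plogis(qlogis(rare_baseline) + 0.2)
print(rare_shifted)       # a much smaller change in absolute proportion
```

A fixed logit threshold therefore translates into a smaller absolute change in proportion for cell types near 0 or 1 than for cell types near 50%.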

.sample_cell_group_pairs_to_exclude

A column name that includes a boolean variable for the sample/cell-group pairs to be ignored in the fit. This argument is intended for advanced users.

verbose

A boolean. Prints progression.

noise_model

A character string. The two noise models available are multi_beta_binomial (default) and dirichlet_multinomial.

exclude_priors

A boolean. Whether to run a prior-free model, for benchmarking purposes.

use_data

A boolean. Whether to run the model data-free. This can be used for a prior predictive check.

mcmc_seed

An integer. Used for Markov-chain Monte Carlo reproducibility. By default a random number is sampled from 1 to 100000. This itself can be controlled by set.seed().
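Because the default is the expression sample(1e+05, 1) shown in Usage, calling set.seed() beforehand fixes the value that is drawn. A small base R check of that behaviour (the seed value 123 is arbitrary):

```r
# Drawing the default seed twice after the same set.seed() call
# gives the same integer.
set.seed(123)
seed_a <- sample(1e+05, 1)

set.seed(123)
seed_b <- sample(1e+05, 1)

print(identical(seed_a, seed_b))
```

Passing an explicit integer as mcmc_seed achieves the same reproducibility without relying on the state of the global random number generator.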

max_sampling_iterations

An integer. This limits the maximum number of iterations when a large dataset is used, to bound the computation time.

pass_fit

A boolean. Whether to pass the Stan fit as an attribute of the output. Because the Stan fit can be very large, setting this to FALSE lowers the memory footprint of the saved output.

Value

A nested tibble tbl, with the following columns:

  • cell_group - column including the cell groups being tested

  • parameter - The parameter being estimated, from the design matrix described with the input formula_composition and formula_variability

  • factor - The factor in the formula corresponding to the covariate, if it exists (e.g. it does not exist in the case of Intercept or contrasts, which are usually combinations of parameters)

  • c_lower - lower (2.5%) quantile of the posterior distribution for a composition (c) parameter.

  • c_effect - mean of the posterior distribution for a composition (c) parameter.

  • c_upper - upper (97.5%) quantile of the posterior distribution for a composition (c) parameter.

  • c_pH0 - Probability of the null hypothesis (no difference) for a composition (c). This is not a p-value.

  • c_FDR - False-discovery rate of the null hypothesis (no difference) for a composition (c).

  • c_n_eff - Effective sample size - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).

  • c_R_k_hat - R statistic, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).

  • v_lower - Lower (2.5%) quantile of the posterior distribution for a variability (v) parameter

  • v_effect - Mean of the posterior distribution for a variability (v) parameter

  • v_upper - Upper (97.5%) quantile of the posterior distribution for a variability (v) parameter

  • v_pH0 - Probability of the null hypothesis (no difference) for a variability (v). This is not a p-value.

  • v_FDR - False-discovery rate of the null hypothesis (no difference), for a variability (v).

  • v_n_eff - Effective sample size for a variability (v) parameter - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).

  • v_R_k_hat - R statistic for a variability (v) parameter, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).

  • count_data - Nested input count data.
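The lower, effect and upper columns are summaries of posterior draws. The following base R sketch shows how such summaries are computed from a vector of draws; the draws are simulated here (they are not produced by sccomp), and the column names simply mirror the composition columns listed above.

```r
set.seed(1)

# Simulated posterior draws for one hypothetical composition parameter.
draws <- rnorm(4000, mean = 0.8, sd = 0.3)

# 2.5% quantile, posterior mean and 97.5% quantile.
summary_row <- data.frame(
  c_lower  = unname(quantile(draws, 0.025)),
  c_effect = mean(draws),
  c_upper  = unname(quantile(draws, 0.975))
)
print(summary_row)
```

The interval between c_lower and c_upper is the central 95% credible interval of the parameter.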

Examples


data("counts_obj")

estimate =
  sccomp_glm(
    counts_obj,
    ~ type,
    ~ 1,
    sample,
    cell_group,
    count,
    check_outliers = FALSE,
    cores = 1
  )


stemangiola/sccomp documentation built on May 17, 2024, 6:24 a.m.