lr.twosigma: Convenient wrapper function for performing joint likelihood...

View source: R/lr.twosigma.R

lr.twosigmaR Documentation

Convenient wrapper function for performing joint likelihood ratio tests using the TWO-SIGMA model.

Description

Convenient wrapper function for performing joint likelihood ratio tests using the TWO-SIGMA model.

Usage

lr.twosigma(
  count_matrix,
  mean_covar,
  zi_covar,
  covar_to_test,
  mean_re = FALSE,
  zi_re = FALSE,
  id,
  return_full_fits = TRUE,
  adhoc = FALSE,
  adhoc_thresh = 0.1,
  silent = FALSE,
  disp_covar = NULL,
  weights = rep(1, ncol(count_matrix)),
  control = glmmTMBControl(),
  ncores = 1,
  cluster_type = "Fork",
  chunk_size = 10,
  lb = FALSE
)

Arguments

count_matrix

Matrix of non-negative integer read counts, with rows corresponding to genes and columns corresponding to cells. It is recommended to make the rownames the gene names for better output.

mean_covar

Covariates for the (conditional) mean model. Must be a matrix (without an intercept column) or a vector if a single covariate is being tested.

zi_covar

Covariates for the zero-inflation model. Must be a matrix (without an intercept column) or a vector if a single covariate is being tested.

covar_to_test

Either a string indicating the column name of the covariate to test or an integer referring to its column position in BOTH the mean_covar and zi_covar matrices (if the two matrices differ using a string name is preferred). Argument is ignored if mean_covar and zi_covar are both a single covariate (that covariate is assumed of interest).

mean_re

Should random intercepts be included in the (conditional) mean model?

zi_re

Should random intercepts be included in the zero-inflation model?

id

Vector of individual-level ID's. Used for random effect prediction and the adhoc method but required regardless.

return_full_fits

If TRUE, fit objects of class glmmTMB are returned. If FALSE, only objects of class summary.glmmTMB are returned. The latter require a much larger amount of memory to store.

adhoc

Should the adhoc method be used by default to judge if random effects are needed?

adhoc_thresh

Value below which the adhoc p-value is deemed significant (and thus RE are deemed necessary). Only used if adhoc==TRUE.

silent

If TRUE, progress is not printed.

disp_covar

Covariates for a log-linear model for the dispersion. Either a matrix or = 1 to indicate an intercept only model.

weights

weights, as in glm. Defaults to 1 for all observations and no scaling or centering of weights is performed. See ?glmmTMBControl.

control

Control parameters for optimization in glmmTMB.

ncores

Number of cores used for parallelization. Defaults to 1, meaning no parallelization of any kind is done.

cluster_type

Whether to use a "cluster of type "Fork" or "Sock". On Unix systems, "Fork" will likely improve performance. On Windows, only "Sock" will actually result in parallelized computing.

chunk_size

Number of genes to be sent to each parallel environment. Parallelization is more efficient, particularly with a large count matrix, when the count matrix is 'chunked' into some common size (e.g. 10, 50, 200). Defaults to 10.

lb

Should load balancing be used for parallelization? Users will likely want to set to FALSE for improved performance.

Value

A list with the following elements:

  • fit_null: Model fits under the null hypothesis. If return_summary_fits=TRUE, returns a list of objects of class summary.glmmTMB. If return_summary_fits=FALSE, returns a list of model fit objects of class glmmTMB. In either case, the order matches the row order of count_matrix, and the names of the list elements are taken as the rownames of count_matrix.

  • fit_alt: Model fits under the alt hypothesis of the same format as fit_null.

  • LR_stat: Vector of Likelihood Ratio statistics. A value of 'NA' implies a convergence issue or other model fit problem.

  • LR_p.val: Vector of Likelihood Ratio p-values. A value of 'NA' implies a convergence issue or other model fit problem.

  • adhoc_include_RE: Logical vector indicator whether the adhoc method determined random effects needed. If adhoc=F, then a vector of NA's.

Details

This function assumes that the variable being tested is in both components of the model (and thus that the zero-inflation component exists and contains more than an Intercept). Users wishing to do fixed effect testing in other cases or specify custom model formulas they will need to construct the statistics themselves using either two separate calls to twosigma or the lr.twosigma_custom function. If adhoc=TRUE, any input in mean_re and zi_re will be ignored. If either model fails to converge, or the LR statistic is negative, both the statistic and p-value are assigned as NA.

Examples

# Set Parameters to Simulate Some Data

nind<-10;ncellsper<-rep(50,nind)
sigma.a<-.5;sigma.b<-.5;phi<-.1
alpha<-c(1,0,-.5,-2);beta<-c(2,0,-.1,.6)
beta2<-c(2,1,-.1,.6)
id.levels<-1:nind;nind<-length(id.levels)
id<-rep(id.levels,times=ncellsper)
sim.seed<-1234

# Simulate individual level covariates

t2d_sim<-rep(rbinom(nind,1,p=.4),times=ncellsper)
cdr_sim<-rbeta(sum(ncellsper),3,6)
age_sim<-rep(sample(c(20:60),size=nind,replace = TRUE),times=ncellsper)

# Construct design matrices

Z<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(Z)<-c("t2d_sim","age_sim","cdr_sim")
X<-cbind(scale(t2d_sim),scale(age_sim),scale(cdr_sim))
colnames(X)<-c("t2d_sim","age_sim","cdr_sim")

# Simulate Data

sim_dat<-matrix(nrow=2,ncol=sum(ncellsper))
for(i in 1:nrow(sim_dat)){
   sim_dat[i,]<-simulate_zero_inflated_nb_random_effect_data(ncellsper,X,Z,alpha,beta2
   ,phi,sigma.a,sigma.b,id.levels=NULL)$Y
}
rownames(sim_dat)<-paste("Gene",1:2)

# Run lr.twosigma

lr.twosigma(count=sim_dat[1,,drop=FALSE],mean_covar = X,zi_covar = Z,id=id,covar_to_test = 1)

edvanburen/twosigma documentation built on July 3, 2023, 3:39 p.m.