treecor_expr: Pipeline for TreeCorTreat using gene expression as features

View source: R/treecor_expr.R

treecor_expr R Documentation

Description

Pipeline for TreeCorTreat using gene expression as features

Usage

treecor_expr(
  expr,
  hierarchy_list,
  cell_meta,
  sample_meta,
  response_variable,
  method = "aggregate",
  formula = NULL,
  separate = T,
  analysis_type = "cancor",
  num_cancor_components = 1,
  num_permutations = 1000,
  alternative = "two.sided",
  num_PCs = 2,
  num_hvgs = NULL,
  threshold_pct_samples = 100,
  method_threshold = "global",
  filter_prop = 0.1,
  pseudobulk_list = NULL,
  ncores = parallel::detectCores(),
  verbose = T
)

Arguments

expr

A gene expression count matrix. It can be extracted directly from a Seurat object using seurat_object@assays$RNA@counts.

hierarchy_list

A hierarchy list obtained by running extract_hrchy_string() or extract_hrchy_seurat(). It contains four elements: 'edges', 'layout', 'immediate_children' and 'leaves_info'.

cell_meta

Cell-level metadata, where each row is a cell. Must contain these columns: barcode, celltype and sample.

sample_meta

Sample-level metadata, where each row is a sample. Must contain a 'sample' column, plus additional variables such as covariates or the outcome of interest.
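As a rough illustration of the two metadata tables described above, the following sketch builds toy inputs in base R. All values (barcodes, cell type names, and the 'severity' and 'age' columns) are made up for illustration and are not taken from the package.

```r
# Toy cell-level metadata: one row per cell, with the three required columns.
cell_meta <- data.frame(
  barcode  = paste0("cell", 1:6),
  celltype = c("T", "T", "B", "B", "T", "B"),
  sample   = c("s1", "s1", "s1", "s2", "s2", "s2"),
  stringsAsFactors = FALSE
)

# Toy sample-level metadata: one row per sample, with a 'sample' column plus
# a hypothetical outcome ('severity') and a hypothetical covariate ('age').
sample_meta <- data.frame(
  sample   = c("s1", "s2"),
  severity = c(1, 3),
  age      = c(54, 61),
  stringsAsFactors = FALSE
)

# Basic sanity checks on the required columns before running the pipeline.
stopifnot(all(c("barcode", "celltype", "sample") %in% colnames(cell_meta)))
stopifnot("sample" %in% colnames(sample_meta))
stopifnot(all(cell_meta$sample %in% sample_meta$sample))
```

Checking the required columns up front tends to give clearer errors than letting a missing column surface deep inside the pipeline.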

response_variable

A vector of response variables (for example, 'severity', as used in the Examples section).

method

A character string indicating which approach is used to summarize features. One of 'concat_leaf', 'concat_immediate_children' or 'aggregate' (default).

formula

An object of class 'formula': a symbolic description of the model to be fitted, adjusting for confounders.

separate

A TRUE (default) or FALSE indicator, specifying how to evaluate multivariate outcomes.

  • TRUE: evaluate each phenotype of a multivariate outcome separately (equivalent to running this pipeline once per univariate phenotype).

  • FALSE: evaluate the multivariate phenotype jointly.

analysis_type

Either 'cancor' (canonical correlation, the default) or 'regression' (F-statistic), used to evaluate the association between gene expression and the samples' phenotype.

num_cancor_components

Number of canonical components to be extracted. Only used when separate = FALSE.

num_permutations

Number of permutations (default: 1000).

alternative

A character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less".
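To make the roles of num_permutations and alternative concrete, here is a minimal sketch of a generic permutation test in base R. This page does not state the exact p-value formula used by treecor_expr, so the definitions below are assumptions that follow the common convention. The data and the helper perm_p_value() are hypothetical.

```r
set.seed(3)
# Made-up data: a feature 'x' and a phenotype 'y' for 30 samples.
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)

obs <- cor(x, y)                       # observed statistic
num_permutations <- 1000
# Null distribution: recompute the statistic after shuffling the phenotype.
perm <- replicate(num_permutations, cor(x, sample(y)))

# Hypothetical helper: permutation p-value under each alternative.
perm_p_value <- function(obs, perm,
                         alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = mean(abs(perm) >= abs(obs)),
    greater   = mean(perm >= obs),
    less      = mean(perm <= obs)
  )
}

p_two     <- perm_p_value(obs, perm, "two.sided")
p_greater <- perm_p_value(obs, perm, "greater")
p_less    <- perm_p_value(obs, perm, "less")
```

With a finite number of permutations the smallest attainable p-value is limited by num_permutations, which is one reason to increase it when many nodes are tested.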

num_PCs

Number of principal components (PCs) used in the canonical correlation calculation.
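The sketch below illustrates the general idea of relating the top PCs of a feature matrix to a phenotype via canonical correlation, using only base R (prcomp() and cancor() from the stats package). It is not the package's internal implementation. The feature matrix, the phenotype and any centering or scaling choices are assumptions made for illustration.

```r
set.seed(4)
n_samples <- 12
# Made-up sample-by-gene matrix (samples in rows, as prcomp() expects).
feat <- matrix(rnorm(n_samples * 50), nrow = n_samples)
severity <- rnorm(n_samples)           # hypothetical univariate phenotype

num_PCs <- 2
# Top PCs of the centered feature matrix: one row per sample.
pcs <- prcomp(feat, center = TRUE, scale. = FALSE)$x[, seq_len(num_PCs), drop = FALSE]

# Canonical correlation between the top PCs and the phenotype.
cc <- cancor(pcs, matrix(severity, ncol = 1))
first_cancor <- cc$cor[1]
```

With a univariate phenotype there is only one canonical correlation; more components become available when several phenotypes are analyzed jointly.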

num_hvgs

Number of highly variable genes extracted from sample-level pseudobulk. Highly variable genes (HVGs) are defined as genes with positive residuals after fitting a lowess curve of each gene's standard deviation against its mean. The HVGs are ordered by residual in descending order. Default is NULL, which includes all genes with positive residuals. A reasonable positive number (e.g. 1000) can be specified instead.
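The HVG definition above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the pseudobulk matrix is simulated, the lowess() settings are the defaults, and the real pipeline may normalize the data differently before this step.

```r
set.seed(1)
# Simulated pseudobulk: 200 genes (rows) x 10 samples (columns); values are made up.
pb <- matrix(rexp(200 * 10, rate = 1), nrow = 200,
             dimnames = list(paste0("gene", 1:200), paste0("s", 1:10)))

gene_mean <- rowMeans(pb)
gene_sd   <- apply(pb, 1, sd)

# Fit a lowess curve of standard deviation against mean across genes.
fit <- lowess(gene_mean, gene_sd)
# lowess() returns points sorted by x, so interpolate back to each gene's mean.
fitted_sd <- approx(fit$x, fit$y, xout = gene_mean, ties = mean)$y
resid <- gene_sd - fitted_sd

# Keep genes with positive residuals, ordered by decreasing residual.
hvgs <- names(sort(resid[resid > 0], decreasing = TRUE))

# Optionally cap the number of HVGs (mirrors a non-NULL num_hvgs).
num_hvgs <- 50
if (!is.null(num_hvgs)) hvgs <- head(hvgs, num_hvgs)
```

Leaving num_hvgs as NULL in this sketch would keep every gene with a positive residual, which matches the default described above.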

threshold_pct_samples

Threshold for selecting cell clusters that are present in at least threshold_pct_samples% of all samples. Ranges from 0 to 100: 100 includes only cell clusters found in all samples, and 0 includes all cell clusters. For any non-leaf node, there is a trade-off between the number of cell clusters and the number of samples included.

method_threshold

A character string specifying how the threshold (i.e. 'threshold_pct_samples') is applied. Must be one of 'global' or 'local'.

  • global (default): a minimum sample size is calculated at the global level and only affects non-leaf nodes whose sample size exceeds the global cutoff.

  • local: threshold_pct_samples is applied to every hierarchy and affects every non-leaf node.

filter_prop

A number between 0 and 1 used to filter lowly expressed genes across samples (default: 0.1). A gene is retained if at least this proportion of samples have a log2-normalized count greater than 0.01.
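The filtering rule above amounts to a per-gene proportion check. Here is a minimal sketch on a made-up log2-normalized matrix. The matrix and gene names are hypothetical, and the normalization step itself is not shown.

```r
# Made-up log2-normalized pseudobulk: 4 genes (rows) x 10 samples (columns).
lognorm <- matrix(0, nrow = 4, ncol = 10,
                  dimnames = list(paste0("gene", 1:4), paste0("s", 1:10)))
lognorm["gene1", ]    <- 2        # above 0.01 in all samples
lognorm["gene2", 1:3] <- 1.5      # above 0.01 in 30% of samples
lognorm["gene3", ]    <- 0.005    # never above 0.01
# gene4 stays at zero in every sample

filter_prop <- 0.1
# Proportion of samples in which each gene exceeds the 0.01 cutoff.
prop_expressed <- rowMeans(lognorm > 0.01)
keep <- prop_expressed >= filter_prop

kept_genes <- rownames(lognorm)[keep]
kept_genes   # "gene1" "gene2"
```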

pseudobulk_list

A list of sample-level pseudobulk data, one element per node. Default is NULL. Each element is a data frame with rows representing genes and columns representing samples. Users can supply their own processed pseudobulk list (e.g. after ComBat or another batch correction method) via this parameter. Note that the names of the list must match the ids extracted from hierarchy_list$layout[,id].
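Before passing a custom pseudobulk_list, it may help to confirm that its names line up with the node ids. The sketch below uses a hypothetical stand-in for hierarchy_list$layout, with made-up ids 'T', 'B' and 'All', and toy data frames. Whether every node needs an entry is not stated on this page, so the check for missing nodes is an assumption.

```r
# Hypothetical stand-in for hierarchy_list$layout; only the 'id' column is used.
layout <- data.frame(id = c("T", "B", "All"), stringsAsFactors = FALSE)

# Toy pseudobulk for one node: genes in rows, samples in columns.
make_pb <- function() {
  data.frame(s1 = c(1, 0), s2 = c(3, 2), row.names = c("geneA", "geneB"))
}
pseudobulk_list <- list(T = make_pb(), B = make_pb(), All = make_pb())

# Compare list names against node ids in both directions.
missing_nodes <- setdiff(layout$id, names(pseudobulk_list))
extra_nodes   <- setdiff(names(pseudobulk_list), layout$id)
stopifnot(length(missing_nodes) == 0, length(extra_nodes) == 0)
```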

ncores

Number of cores to be used. If ncores > 1, the computation runs in parallel.

verbose

Whether to show progress.

Value

A list that contains:

  • canonical_corr: A data frame containing the summary statistic (e.g. canonical correlation or F-statistic), p-value, adjusted p-value and label information for each node.

  • pc_ls: A list of the top n PC matrices, one per node.

Author(s)

Boyang Zhang <bzhang34@jhu.edu>, Hongkai Ji

Examples

# default setting
result <- treecor_expr(expr, hierarchy_list, cell_meta, sample_meta, response_variable = 'severity')
# obtain summary statistic for each cell type
result$canonical_corr # or result[[1]]
# extract PC matrix for celltype 'T'
result$pc_ls[['T']]

byzhang23/TreeCorTreat documentation built on May 7, 2024, 8:37 a.m.