censcyt: Run 'censcyt' pipeline

View source: R/censcyt_wrapper.R

censcytR Documentation

Run censcyt pipeline

Description

Wrapper function to run complete censcyt pipeline

Usage

censcyt(
  d_input,
  experiment_info = NULL,
  marker_info = NULL,
  design = NULL,
  formula = NULL,
  contrast,
  analysis_type = c("DA"),
  method_DA = c("censcyt-DA-censored-GLMM"),
  markers_to_test = NULL,
  clustering_to_use = NULL,
  cols_to_include = NULL,
  subsampling = FALSE,
  n_sub = NULL,
  seed_sub = NULL,
  transform = TRUE,
  cofactor = 5,
  cols_clustering = NULL,
  xdim = 10,
  ydim = 10,
  meta_clustering = FALSE,
  meta_k = 40,
  seed_clustering = NULL,
  min_cells = 3,
  min_samples = NULL,
  normalize = FALSE,
  norm_factors = "TMM",
  verbose = TRUE,
  mi_reps = 10,
  imputation_method = c("km", "km_exp", "km_wei", "km_os", "rs", "mrl", "cc", "pmm"),
  BPPARAM = BiocParallel::SerialParam()
)

Arguments

d_input

Input data. Must be either: (i) a flowSet-class or list of flowFrame-classs, DataFrames, data.frames, or matrices as input (one flowFrame or list item per sample) (see prepareData); or (ii) a CATALYST daFrame (containing cluster labels in rowData; see vignette for an example).

experiment_info

data.frame, DataFrame, or {tbl_df} of experiment information, for example sample IDs and group IDs. Must contain a column named sample_id. See prepareData. (Not required when providing a CATALYST daFrame for d_input.)

marker_info

data.frame, DataFrame, or tbl_df of marker information for each column of data. This should contain columns named marker_name and marker_class. The columns contain: (i) marker names (and any other column names); and (ii) a factor indicating the marker class for each column (with entries "type", "state", or "none"). See prepareData. (Not required when providing a CATALYST daFrame for d_input.)

design

Design matrix, created with createDesignMatrix. See createDesignMatrix.

formula

Model formula object, created with createFormula. See createFormula.

contrast

Contrast matrix, created with createContrast. See createContrast.

analysis_type

Type of differential analysis to perform: differential abundance (DA) of cell populations. The only option at the moment is "DA". See testDA_censoredGLMM.

method_DA

Method to use for calculating differential abundance (DA) tests. Currently the only option is testDA_censoredGLMM. Default = testDA_censoredGLMM.

markers_to_test

(Optional) Logical vector specifying which markers to test for differential expression (from the set of markers stored in the assays of d_medians; for method testDS_limma or testDS_LMM). Default = all 'cell state' markers, which are identified by the logical vector id_state_markers stored in the meta-data of d_medians. See testDS_limma or testDS_LMM.

clustering_to_use

(Optional) Column name indicating which set of cluster labels to use for differential testing, when input data are provided as a CATALYST daFrame object containing multiple sets of cluster labels. (In this case, the metadata of the daFrame object is assumed to contain a data frame named cluster_codes; clustering_to_use is the column name of the selected column in cluster_codes. If clustering_to_use is provided, an identifier clustering_name to identify this column will also be saved in the metadata of the output object.) Default = NULL, in which case cluster labels stored in column named cluster_id in the rowData of the daFrame object are used.

cols_to_include

Logical vector indicating which columns to include from the input data. Default = all columns. See prepareData.

subsampling

Whether to use random subsampling to select an equal number of cells from each sample. Default = FALSE. See prepareData.

n_sub

Number of cells to select from each sample by random subsampling, if subsampling = TRUE. Default = number of cells in smallest sample. See prepareData.

seed_sub

Random seed for subsampling. Set to an integer value to generate reproducible results. Default = NULL. See prepareData.

transform

Whether to apply 'arcsinh' transform. This may be set to FALSE if the input data has already been transformed. Default = TRUE. See transformData.

cofactor

Cofactor parameter for 'arcsinh' transform. Default = 5, which is appropriate for mass cytometry (CyTOF) data. For fluorescence flow cytometry, cofactor = 150 is recommended instead. See transformData.

cols_clustering

Columns to use for clustering. Default = NULL, in which case markers identified as 'cell type' markers (with entries "type") in the vector marker_class in the column meta-data of d_se will be used. See generateClusters.

xdim

Horizontal length of grid for self-organizing map for FlowSOM clustering (number of clusters = xdim * ydim). Default = 10 (i.e. 100 clusters). See generateClusters.

ydim

Vertical length of grid for self-organizing map for FlowSOM clustering (number of clusters = xdim * ydim). Default = 10 (i.e. 100 clusters). See generateClusters.

meta_clustering

Whether to include FlowSOM 'meta-clustering' step. Default = FALSE. See generateClusters.

meta_k

Number of meta-clusters for FlowSOM, if meta-clustering = TRUE. Default = 40. See generateClusters.

seed_clustering

Random seed for clustering. Set to an integer value to generate reproducible results. Default = NULL. See generateClusters.

min_cells

Filtering parameter. Default = 3. Clusters are kept for differential testing if they have at least min_cells cells in at least min_samples samples. See testDA_censoredGLMM.

min_samples

Filtering parameter. Default = number of samples / 2, which is appropriate for two-group comparisons (of equal size). Clusters are kept for differential testing if they have at least min_cells cells in at least min_samples samples. See testDA_censoredGLMM.

normalize

Whether to include optional normalization factors to adjust for composition effects. Default = FALSE. See testDA_censoredGLMM.

norm_factors

Normalization factors to use, if normalize = TRUE. Default = "TMM", in which case normalization factors are calculated automatically using the 'trimmed mean of M-values' (TMM) method from the edgeR package. Alternatively, a vector of values can be provided (the values should multiply to 1). See testDA_censoredGLMM.

verbose

Whether to print status messages during each step of the pipeline. Default = TRUE.

mi_reps

Number of imputations in multiple imputation. Default = 10. See testDA_censoredGLMM.

imputation_method

Method to be used in the imputation step. One of km,km_exp,km_wei,km_os, rs, mrl, cc, pmm. See testDA_censoredGLMM.

BPPARAM

Specification of parallelization option as one of BiocParallelParam if BiocParallel is available otherwise no parallelization is used. e.g. MulticoreParam-class(workers=2) for parallelization with two cores. Default is SerialParam-class() (no parallelization).

Details

This wrapper function runs the complete diffcyt analysis pipeline where the only difference is the analysis step which uses the functions from censcyt (which is currently only testDA_censoredGLMM).

For more details about the functions for the individual steps, see diffcyt, the diffcyt vignette, the censcyt package vignette and the function help pages. The following is a slightly adapted summary from diffcyt:

Running the individual functions may provide additional flexibility, especially for complex analyses.

The input data can be provided as a flowSet-class or a list of flowFrame-classs, DataFrames, data.frames, or matrices (one flowFrame or list item per sample). Alternatively, it is also possible to provide the input as a daFrame object from the CATALYST Bioconductor package (Chevrier, Crowell, Zanotelli et al., 2018). This can be useful when initial exploratory analyses and clustering have been performed using CATALYST; the daFrame object from CATALYST (containing cluster labels in the rowData) can then be provided directly to the censcyt functions for differential testing.

Minimum required arguments when not providing a flowSet-class or list of flowFrame-classs, DataFrames, data.frames, or matrices:

  • d_input

  • experiment_info

  • marker_info

  • either design or formula (depending on the differential testing method used)

  • contrast

  • analysis_type

Minimum required arguments when providing a CATALYST daFrame object:

  • d_input

  • either design or formula (depending on the differential testing method used)

  • contrast

  • analysis_type

Value

Returns a list containing the results object res, as well as the data objects d_se, d_counts, d_medians, d_medians_by_cluster_marker, and d_medians_by_sample_marker. (If a CATALYST daFrame object was used as input, the output list contains objects res, d_counts, and d_medians.)

Examples

# Function to create random data (one sample)
fcs_sim <- function(n = 2000, mean = 0, sd = 1, ncol = 10, cofactor = 5) {
  d <- matrix(sinh(rnorm(n*ncol, mean, sd)) * cofactor,ncol=ncol)
  for(i in seq_len(ncol)){
    d[seq(n/ncol*(i-1)+1,n/ncol*(i)),i] <- sinh(rnorm(n/ncol, mean+5, sd)) * cofactor
  }
  colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol))
  d
}

# Create random data (without differential signal)
set.seed(123)
d_input <- lapply(1:50, function(i) fcs_sim())

# simulate survival time
d_surv <- simulate_singlecluster(50, formula(Y~Surv(X,I)))[c("X","I","TrVal")]

# Add differential abundance (DA) signal
for(i in 1:50){
  # number of cells in cluster 1 
  n_da <- round(sqrt(2000*d_surv$TrVal[i]))*9
  # set to no expression
  tmpd <- matrix(sinh(rnorm(n_da*10, 0, 1)) * 5, ncol=10)
  # increase expresion for cluster 1
  tmpd[ ,1] <- sinh(rnorm(n_da, 5, 1)) * 5
  d_input[[i]][seq_len(n_da), ] <- tmpd
}

experiment_info <- data.frame(
  sample_id = factor(paste0("sample", 1:50)),
  survival_time = d_surv$X,
  event_indicator= d_surv$I,
  stringsAsFactors = FALSE
)

marker_info <- data.frame(
  channel_name = paste0("channel", sprintf("%03d", 1:10)),
  marker_name = paste0("marker", sprintf("%02d", 1:10)),
  marker_class = factor(c(rep("type", 10)),
                        levels = c("type", "state", "none")),
  stringsAsFactors = FALSE
)

# Create formula
da_formula <- createFormula(experiment_info, cols_fixed="survival_time",
                            cols_random = "sample_id",event_indicator = "event_indicator")
# Create contrast matrix
contrast <- diffcyt::createContrast(c(0, 1))

# Test for differential abundance (DA) of clusters
out_DA <- censcyt(d_input, experiment_info, marker_info,
                  formula = da_formula, contrast = contrast,
                  analysis_type = "DA", method_DA = "censcyt-DA-censored-GLMM",
                  seed_clustering = 123, verbose = FALSE, mi_reps = 3,
                  BPPARAM=BiocParallel::MulticoreParam(workers = 1),
                  imputation_method = "mrl",meta_clustering = TRUE, meta_k = 10)

# Display results for top DA clusters
diffcyt::topTable(out_DA, format_vals = TRUE)


# Plot heatmap for DA tests
diffcyt::plotHeatmap(out_DA, analysis_type = "DA")


retogerber/censcyt documentation built on Feb. 7, 2023, 9:56 a.m.