diffcyt: Run 'diffcyt' pipeline

View source: R/diffcyt_wrapper.R

diffcytR Documentation

Run 'diffcyt' pipeline

Description

Wrapper function to run complete 'diffcyt' pipeline

Usage

diffcyt(
  d_input,
  experiment_info = NULL,
  marker_info = NULL,
  design = NULL,
  formula = NULL,
  contrast,
  analysis_type = c("DA", "DS"),
  method_DA = c("diffcyt-DA-edgeR", "diffcyt-DA-voom", "diffcyt-DA-GLMM"),
  method_DS = c("diffcyt-DS-limma", "diffcyt-DS-LMM"),
  markers_to_test = NULL,
  clustering_to_use = NULL,
  cols_to_include = NULL,
  subsampling = FALSE,
  n_sub = NULL,
  seed_sub = NULL,
  transform = TRUE,
  cofactor = 5,
  cols_clustering = NULL,
  xdim = 10,
  ydim = 10,
  meta_clustering = FALSE,
  meta_k = 40,
  seed_clustering = NULL,
  min_cells = 3,
  min_samples = NULL,
  normalize = FALSE,
  norm_factors = "TMM",
  trend_method = "none",
  block_id = NULL,
  trend = TRUE,
  weights = TRUE,
  plot = FALSE,
  path = ".",
  verbose = TRUE
)

Arguments

d_input

Input data. Must be either: (i) a flowSet or list of flowFrames, DataFrames, data.frames, or matrices as input (one flowFrame or list item per sample) (see prepareData); or (ii) a CATALYST daFrame (containing cluster labels in rowData; see vignette for an example).

experiment_info

data.frame, DataFrame, or tbl_df of experiment information, for example sample IDs and group IDs. Must contain a column named sample_id. See prepareData. (Not required when providing a CATALYST daFrame for d_input.)

marker_info

data.frame, DataFrame, or tbl_df of marker information for each column of data. This should contain columns named marker_name and marker_class. The columns contain: (i) marker names (and any other column names); and (ii) a factor indicating the marker class for each column (with entries "type", "state", or "none"). See prepareData. (Not required when providing a CATALYST daFrame for d_input.)

design

Design matrix, created with createDesignMatrix. See createDesignMatrix.

formula

Model formula object, created with createFormula. See createFormula.

contrast

Contrast matrix, created with createContrast. See createContrast.

analysis_type

Type of differential analysis to perform: differential abundance (DA) of cell populations, or differential states (DS) within cell populations. Options are "DA" and "DS". See testDA_edgeR, testDA_voom, testDA_GLMM, testDS_limma, or testDS_LMM.

method_DA

Method to use for calculating differential abundance (DA) tests. Options are "diffcyt-DA-edgeR", "diffcyt-DA-voom", and "diffcyt-DA-GLMM". Default = "diffcyt-DA-edgeR". See testDA_edgeR, testDA_voom, or testDA_GLMM.

method_DS

Method to use for calculating differential state (DS) tests. Options are "diffcyt-DS-limma" and "diffcyt-DS-LMM". Default = "diffcyt-DS-limma". See testDS_limma or testDS_LMM.

markers_to_test

(Optional) Logical vector specifying which markers to test for differential expression (from the set of markers stored in the assays of d_medians; for method testDS_limma or testDS_LMM). Default = all 'cell state' markers, which are identified by the logical vector id_state_markers stored in the meta-data of d_medians. See testDS_limma or testDS_LMM.

clustering_to_use

(Optional) Column name indicating which set of cluster labels to use for differential testing, when input data are provided as a CATALYST daFrame object containing multiple sets of cluster labels. (In this case, the metadata of the daFrame object is assumed to contain a data frame named cluster_codes; clustering_to_use is the column name of the selected column in cluster_codes. If clustering_to_use is provided, an identifier clustering_name to identify this column will also be saved in the metadata of the output object.) Default = NULL, in which case cluster labels stored in column named cluster_id in the rowData of the daFrame object are used.

cols_to_include

Logical vector indicating which columns to include from the input data. Default = all columns. See prepareData.

subsampling

Whether to use random subsampling to select an equal number of cells from each sample. Default = FALSE. See prepareData.

n_sub

Number of cells to select from each sample by random subsampling, if subsampling = TRUE. Default = number of cells in smallest sample. See prepareData.

seed_sub

Random seed for subsampling. Set to an integer value to generate reproducible results. Default = NULL. See prepareData.

transform

Whether to apply 'arcsinh' transform. This may be set to FALSE if the input data has already been transformed. Default = TRUE. See transformData.

cofactor

Cofactor parameter for 'arcsinh' transform. Default = 5, which is appropriate for mass cytometry (CyTOF) data. For fluorescence flow cytometry, we recommend cofactor = 150 instead. See transformData.

cols_clustering

Columns to use for clustering. Default = NULL, in which case markers identified as 'cell type' markers (with entries "type") in the vector marker_class in the column meta-data of d_se will be used. See generateClusters.

xdim

Horizontal length of grid for self-organizing map for FlowSOM clustering (number of clusters = xdim * ydim). Default = 10 (i.e. 100 clusters). See generateClusters.

ydim

Vertical length of grid for self-organizing map for FlowSOM clustering (number of clusters = xdim * ydim). Default = 10 (i.e. 100 clusters). See generateClusters.

meta_clustering

Whether to include FlowSOM 'meta-clustering' step. Default = FALSE. See generateClusters.

meta_k

Number of meta-clusters for FlowSOM, if meta-clustering = TRUE. Default = 40. See generateClusters.

seed_clustering

Random seed for clustering. Set to an integer value to generate reproducible results. Default = NULL. See generateClusters.

min_cells

Filtering parameter. Default = 3. Clusters are kept for differential testing if they have at least min_cells cells in at least min_samples samples. See testDA_edgeR, testDA_voom, testDA_GLMM, testDS_limma, or testDS_LMM.

min_samples

Filtering parameter. Default = number of samples / 2, which is appropriate for two-group comparisons (of equal size). Clusters are kept for differential testing if they have at least min_cells cells in at least min_samples samples. See testDA_edgeR, testDA_voom, testDA_GLMM, testDS_limma, or testDS_LMM.

normalize

Whether to include optional normalization factors to adjust for composition effects. Default = FALSE. See testDA_edgeR, testDA_voom, or testDA_GLMM.

norm_factors

Normalization factors to use, if normalize = TRUE. Default = "TMM", in which case normalization factors are calculated automatically using the 'trimmed mean of M-values' (TMM) method from the edgeR package. Alternatively, a vector of values can be provided (the values should multiply to 1). See testDA_edgeR, testDA_voom, or testDA_GLMM.

trend_method

Method for estimating dispersion trend; passed to function estimateDisp from edgeR package (for method testDA_edgeR). Default = "none". (See estimateDisp help file from edgeR package for other options.) See testDA_edgeR.

block_id

(Optional) Vector or factor of block IDs (e.g. patient IDs) for paired experimental designs, to be included as random effects (for method testDA_voom or testDS_limma). If provided, the block IDs will be included as random effects using the limma duplicateCorrelation methodology. Alternatively, block IDs can be included as fixed effects in the design matrix (createDesignMatrix). See testDA_voom or testDS_limma.

trend

(Optional) Whether to fit a mean-variance trend when calculating moderated tests with function eBayes from limma package (for method testDS_limma). When trend = TRUE, this is known as the limma-trend method (Law et al., 2014; Phipson et al., 2016). Default = TRUE. See testDS_limma.

weights

(Optional) Whether to include precision weights (for method testDS_limma or testDS_LMM). For method testDS_limma, cluster cell counts will be used as precision weights (across all samples and clusters); this allows the limma model fitting functions to account for uncertainty due to the total number of cells per sample (library sizes) and total number of cells per cluster. For methods testDS_LMM, cluster cell counts will be used as precision weights within each model (across samples, i.e. within the model for each cluster); these represent the relative uncertainty in calculating each median value (within each model). Default = TRUE. See testDS_limma or testDS_LMM.

plot

Whether to save diagnostic plots (for method testDA_voom or testDS_limma). Default = FALSE. See testDA_voom or testDS_limma.

path

Path for diagnostic plots, if plot = TRUE (for method testDA_voom or testDS_limma). Default = current working directory. See testDA_voom or testDS_limma.

verbose

Whether to print status messages during each step of the pipeline. Default = TRUE.

Details

This wrapper function runs the complete 'diffcyt' analysis pipeline, by calling the functions for the individual steps in the pipeline in the correct sequence.

For more details about the functions for the individual steps, see the package vignette and the function help pages. Running the individual functions may provide additional flexibility, especially for complex analyses.

The input data can be provided as a flowSet or a list of flowFrames, DataFrames, data.frames, or matrices (one flowFrame or list item per sample). Alternatively, it is also possible to provide the input as a daFrame object from the CATALYST Bioconductor package (Chevrier, Crowell, Zanotelli et al., 2018). This can be useful when initial exploratory analyses and clustering have been performed using CATALYST; the daFrame object from CATALYST (containing cluster labels in the rowData) can then be provided directly to the diffcyt functions for differential testing.

Minimum required arguments when not providing a flowSet or list of flowFrames, DataFrames, data.frames, or matrices:

  • d_input

  • experiment_info

  • marker_info

  • either design or formula (depending on the differential testing method used)

  • contrast

  • analysis_type

Minimum required arguments when providing a CATALYST daFrame object:

  • d_input

  • either design or formula (depending on the differential testing method used)

  • contrast

  • analysis_type

Value

Returns a list containing the results object res, as well as the data objects d_se, d_counts, d_medians, d_medians_by_cluster_marker, and d_medians_by_sample_marker. (If a CATALYST daFrame object was used as input, the output list contains objects res, d_counts, and d_medians.) The structure of res depends on the differential testing method used. See testDA_edgeR, testDA_voom, testDA_GLMM, testDS_limma, or testDS_LMM.

Examples

# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, 
# see the package vignette.

# Function to create random data (one sample)
d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) {
  d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor
  colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol))
  d
}

# Create random data (without differential signal)
set.seed(123)
d_input <- list(
  sample1 = d_random(), 
  sample2 = d_random(), 
  sample3 = d_random(), 
  sample4 = d_random()
)

# Add differential abundance (DA) signal
ix_DA <- 801:900
ix_cols_type <- 1:10
d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10)
d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10)

# Add differential states (DS) signal
ix_DS <- 901:1000
ix_cols_DS <- 19:20
d_input[[1]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10)
d_input[[2]][ix_DS, ix_cols_type] <- d_random(n = 1000, mean = 3, ncol = 10)
d_input[[3]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12)
d_input[[4]][ix_DS, c(ix_cols_type, ix_cols_DS)] <- d_random(n = 1200, mean = 3, ncol = 12)

experiment_info <- data.frame(
  sample_id = factor(paste0("sample", 1:4)), 
  group_id = factor(c("group1", "group1", "group2", "group2")), 
  stringsAsFactors = FALSE
)

marker_info <- data.frame(
  channel_name = paste0("channel", sprintf("%03d", 1:20)), 
  marker_name = paste0("marker", sprintf("%02d", 1:20)), 
  marker_class = factor(c(rep("type", 10), rep("state", 10)), 
                        levels = c("type", "state", "none")), 
  stringsAsFactors = FALSE
)

# Create design matrix
design <- createDesignMatrix(experiment_info, cols_design = "group_id")

# Create contrast matrix
contrast <- createContrast(c(0, 1))

# Test for differential abundance (DA) of clusters (using default method 'diffcyt-DA-edgeR')
out_DA <- diffcyt(d_input, experiment_info, marker_info, 
                  design = design, contrast = contrast, 
                  analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", 
                  seed_clustering = 123, verbose = FALSE)

# Test for differential states (DS) within clusters (using default method 'diffcyt-DS-limma')
out_DS <- diffcyt(d_input, experiment_info, marker_info, 
                  design = design, contrast = contrast, 
                  analysis_type = "DS", method_DS = "diffcyt-DS-limma", 
                  seed_clustering = 123, verbose = FALSE)

# Display results for top DA clusters
topTable(out_DA, format_vals = TRUE)

# Display results for top DS cluster-marker combinations
topTable(out_DS, format_vals = TRUE)

# Plot heatmap for DA tests
plotHeatmap(out_DA, analysis_type = "DA")

# Plot heatmap for DS tests
plotHeatmap(out_DS, analysis_type = "DS")


lmweber/diffcyt documentation built on March 19, 2024, 5:24 a.m.