generateClusters: Generate clusters

View source: R/3_generateClusters.R

generateClustersR Documentation

Generate clusters

Description

Generate high-resolution clusters for diffcyt analysis

Usage

generateClusters(
  d_se,
  cols_clustering = NULL,
  xdim = 10,
  ydim = 10,
  meta_clustering = FALSE,
  meta_k = 40,
  seed_clustering = NULL,
  ...
)

Arguments

d_se

Transformed input data, from prepareData and transformData.

cols_clustering

Columns to use for clustering. Default = NULL, in which case markers identified as 'cell type' markers (with entries "type") in the vector marker_class in the column meta-data of d_se will be used.

xdim

Horizontal length of grid for self-organizing map for FlowSOM clustering (number of clusters = xdim * ydim). Default = 10 (i.e. 100 clusters).

ydim

Vertical length of grid for self-organizing map for FlowSOM clustering (number of clusters = xdim * ydim). Default = 10 (i.e. 100 clusters).

meta_clustering

Whether to include FlowSOM 'meta-clustering' step. Default = FALSE.

meta_k

Number of meta-clusters for FlowSOM, if meta-clustering = TRUE. Default = 40.

seed_clustering

Random seed for clustering. Set to an integer value to generate reproducible results. Default = NULL.

...

Other parameters to pass to the FlowSOM clustering algorithm (through the function BuildSOM).

Details

Performs clustering to group cells into clusters representing cell populations or subsets, which can then be further analyzed by testing for differential abundance of cell populations or differential states within cell populations. By default, we use high-resolution clustering or over-clustering (i.e. we generate a large number of small clusters), which helps ensure that rare populations are adequately separated from larger ones.

Data is assumed to be in the form of a SummarizedExperiment object generated with prepareData and transformed with transformData.

The input data object d_se is assumed to contain a vector marker_class in the column meta-data. This vector indicates the marker class for each column ("type", "state", or "none"). By default, clustering is performed using the 'cell type' markers only. For example, in immunological data, this may be the lineage markers. The choice of cell type markers is an important design choice for the user, and will depend on the underlying experimental design and research questions. It may be made based on prior biological knowledge or using data-driven methods. For an example of a data-driven method of marker ranking and selection, see Nowicka et al. (2017), F1000Research.

By default, we use the FlowSOM clustering algorithm (Van Gassen et al. 2015, Cytometry Part A, available from Bioconductor) to generate the clusters. We previously showed that FlowSOM gives very good clustering performance for high-dimensional cytometry data, for both major and rare cell populations, and is extremely fast (Weber and Robinson, 2016, Cytometry Part A).

The clustering is run at high resolution to give a large number of small clusters (i.e. over-clustering). This is done by running only the initial 'self-organizing map' clustering step in the FlowSOM algorithm, i.e. without the final 'meta-clustering' step. This ensures that small or rare populations are adequately separated from larger populations, which is crucial for detecting differential signals for extremely rare populations.

The minimum spanning tree (MST) object from BuildMST is stored in the experiment metadata slot in the SummarizedExperiment object d_se, and can be accessed with metadata(d_se)$MST.

Value

d_se: Returns the SummarizedExperiment input object, with cluster labels for each cell stored in an additional column of row meta-data. Row meta-data can be accessed with rowData. The minimum spanning tree (MST) object is also stored in the metadata slot, and can be accessed with metadata(d_se)$MST.

Examples

# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, 
# see the package vignette.

# Function to create random data (one sample)
d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) {
  d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor
  colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol))
  d
}

# Create random data (without differential signal)
set.seed(123)
d_input <- list(
  sample1 = d_random(), 
  sample2 = d_random(), 
  sample3 = d_random(), 
  sample4 = d_random()
)

experiment_info <- data.frame(
  sample_id = factor(paste0("sample", 1:4)), 
  group_id = factor(c("group1", "group1", "group2", "group2")), 
  stringsAsFactors = FALSE
)

marker_info <- data.frame(
  channel_name = paste0("channel", sprintf("%03d", 1:20)), 
  marker_name = paste0("marker", sprintf("%02d", 1:20)), 
  marker_class = factor(c(rep("type", 10), rep("state", 10)), 
                        levels = c("type", "state", "none")), 
  stringsAsFactors = FALSE
)

# Prepare data
d_se <- prepareData(d_input, experiment_info, marker_info)

# Transform data
d_se <- transformData(d_se)

# Generate clusters
d_se <- generateClusters(d_se)


lmweber/diffcyt documentation built on March 19, 2024, 5:24 a.m.