Differential discovery with `CATALYST`
In CATALYST: Cytometry dATa anALYSis Tools

knitr::opts_chunk$set(cache = TRUE)

Most of the pipeline and visualizations presented herein have been adapted from @Nowicka2019-F1000's "CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets" available here.

# load required packages
library(CATALYST)
library(cowplot)
library(flowCore)
library(diffcyt)
library(scater)
library(SingleCellExperiment)

Example data

PBMC_fs:
a flowSet holding PBMCs samples from 4 patients, each containing between 500 and 1000 cells. For each sample, the expression of 10 cell surface and 14 signaling markers was measured before (REF) and upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor crosslinking for 30', resulting in a total of 8 samples.
This data set represents a subset of data originating from @Bodenmiller2012 that was also used in the citrus paper [@Bruggner2014-CITRUS].
PBMC_panel:
a data.frame containing each marker's column name in the FCS file (fcs_colname column), its targeted protein marker (antigen column), and the marker_class ("type" or "state").
PBMC_md:
a data.frame where rows correspond to samples, and columns specify each sample's file_name, sample_id, condition, and patient_id.

# load example data
data(PBMC_fs, PBMC_panel, PBMC_md)
PBMC_fs
head(PBMC_panel)
head(PBMC_md)

The code snippet below demonstrates how to construct a flowSet from a set of FCS files. However, we also give the option to directly specify the path to a set of FCS files (see next section).

# download exemplary set of FCS files
url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow"
zip <- "PBMC8_fcs_files.zip"
download.file(paste0(url, "/", zip), destfile = zip, mode = "wb")
unzip(zip)

# read in FCS files as flowSet
fcs <- list.files(pattern = ".fcs$")
fs <- read.flowSet(fcs, transformation = FALSE, truncate_max_range = FALSE)

Data preparation

Data used and returned throughout differential analysis are held in objects of the r BiocStyle::Biocpkg("SingleCellExperiment") class. To bring the data into the appropriate format, prepData() requires the following inputs:

x: a flowSet holding the raw measurement data, or a character string that specifies a path to a set of FCS files.
panel: a 2 column data.frame that contains for each marker of interest i) its column name in the raw input data, and ii) its targeted protein marker.
md: a data.frame with columns describing the experimental design.

Optionally, features will specify which columns (channels) to keep from the input data. Here, we keep all measurement parameters (default value features = NULL).

(sce <- prepData(PBMC_fs, PBMC_panel, PBMC_md))

We provide flexibility in the way the panel and metadata table can be set up. Specifically, column names are allowed to differ from the example above, and multiple factors (patient ID, conditions, batch etc.) can be specified. Arguments panel_cols and md_cols should then be used to specify which columns hold the required information. An example is given below:

# alter panel column names
panel2 <- PBMC_panel
colnames(panel2)[1:2] <- c("channel_name", "marker")

# alter metadata column names & add 2nd condition
md2 <- PBMC_md
colnames(md2) <- c("file", "sampleID", "cond1", "patientID")
md2$cond2 <- rep(c("A", "B"), 4)

# construct SCE
prepData(PBMC_fs, panel2, md2, 
    panel_cols = list(channel = "channel_name", antigen = "marker"),
    md_cols = list(file = "file", id = "sampleID", 
        factors = c("cond1", "cond2", "patientID")))

Note that, independent of the input panel and metadata tables, the constructor will fix the names of mandatory slots for latter data accession (sample_id in the rowData, channel_name and marker_name in the colData). The md table will be stored under experiment_info inside the metadata.

Clustering

`cluster`: FlowSOM clustering & ConsensusClusterPlus metaclustering

r BiocStyle::Biocpkg("CATALYST") provides a simple wrapper to perform high resolution FlowSOM clustering and lower resolution ConsensusClusterPlus metaclustering. By default, the data will be initially clustered into xdim = 10 x ydim = 10 = 100 groups. Secondly, the function will metacluster populations into 2 through maxK (default 20) clusters. To make analyses reproducible, the random seed may be set via seed. By default, if the colData(sce)$marker_class column is specified, the set of markers with marker class "type" will be used for clustering (argument features = "type"). Alternatively, the markers that should be used for clustering can be specified manually.

sce <- cluster(sce, features = "type", 
    xdim = 10, ydim = 10, maxK = 20, 
    verbose = FALSE, seed = 1)

Let K = xdim x ydim be the number of r BiocStyle::Biocpkg("FlowSOM") clusters. cluster will add information to the following slots of the input SingleCellExperiment:

rowData:
cluster_id: cluster ID as inferred by r BiocStyle::Biocpkg("FlowSOM"). One of 1, ..., K.
colData:
marker_class: factor "type" or "state". Specifyies whether a marker has been used for clustering or not, respectively.
metadata:
SOM_codes: a table with dimensions K x (# type markers). Contains the SOM codes.
cluster_codes: a table with dimensions K x (maxK + 1). Contains the cluster codes for all metaclusterings.
delta_area: a ggplot object (see below for details).

`mergeClusters`: Manual cluster merging

Provided with a 2 column data.frame containing old_cluster and new_cluster IDs, mergeClusters allows for manual cluster merging of any clustering available within the input SingleCellExperiment (i.e. the xdim x ydim r BiocStyle::Biocpkg("FlowSOM") clusters, and any of the 2-maxK r BiocStyle::Biocpkg("ConsensusClusterPlus") metaclusters). For latter accession (visualization, differential testing), the function will assign a unique ID (specified with id) to each merging, and add a column to the cluster_codes inside the metadata slot of the input SingleCellExperiment.

data(merging_table)
head(merging_table)
sce <- mergeClusters(sce, k = "meta20", table = merging_table, id = "merging1")
head(cluster_codes(sce))[, seq_len(10)]

Delta area plot

The delta area represents the amount of extra cluster stability gained when clustering into k groups as compared to k-1 groups. It can be expected that high stability of clusters can be reached when clustering into the number of groups that best fits the data. The "natural" number of clusters present in the data should thus corresponds to the value of k where there is no longer a considerable increase in stability (pleateau onset). For more details, the user can refer to the original description of the consensus clustering method [@Monti2003-ConsensusClusterPlus].

# access & render delta area plot
# (equivalent to metadata(sce)$delta_area)
delta_area(sce)

Visualization

`plotCounts`: Number of cells measured per sample

The number of cells measured per sample may be plotted with plotCounts. This plot should be used as a guide together with other readouts to identify samples where not enough cells were assayed. Here, the grouping of samples (x-axis) is controlled by group_by; bars can be colored by a an additional cell metadata variable (argument color_by):

plotCounts(sce, 
    group_by = "sample_id", 
    color_by = "condition")

As opposed to plotting absolute cell counts, argument prop can be used to visualize relative abundances (frequencies) instead:

plotCounts(sce, 
    prop = TRUE,
    group_by = "condition", 
    color_by = "patient_id")

`pbMDS`: Pseudobulk-level MDS plot

A multi-dimensional scaling (MDS) plot on aggregated measurement values may be rendered with pbMDS. Such a plot will give a sense of similarities between cluster and/or samples in an unsupervised way and of key difference in expression before conducting any formal testing.

Arguments by, assay and fun control the aggregation strategy, allowing to compute pseudobulks by sample (by = "sample_id"), cluster (by = "cluster_id") or cluster-sample instances (by = "both") using the specified assay data and summarry statistic (argument fun)[^1]. When by != "sample_id", i.e., when aggregating by cluster or cluster-sample, argument k specifies the clustering to use. The features to include in the computation of reduced dimensions may be specified via argument features.

[^1]: By default, median expression values are computed.

Arguments color_by, label_by, shape_by can be used to color, label, shape pseudobulk instances by cell metadata variables of interest. Moreover, size_by = TRUE will scale point sizes proportional to the number of cells that went into aggregation. Finally, a custom color palette may be supplied to argument pal.

Ex. 2: MDS on sample-level pseudobulks

A multi-dimensional scaling (MDS) plot on median marker expression by sample has the potential to reveal global proteomic differences across conditions or other experimental metadata. Here, we color points by condition (to reveal treatment effects) and further shape them by patient (to highlight patient effects). In our example, we can see a clear horizontal (MDS dim. 1) separation between reference (REF) and stimulation condition (BCRXL), while patients are, to a lesser extent, separated vertically (MDS dim. 2):

pbMDS(sce, shape_by = "patient_id", size_by = FALSE)

Ex. 1: MDS on pseudobulks by cluster-sample

Complementary to the visualize above, we can generate an MDS plot on pseudobulks computed for each cluster-sample instance. Here, we color point by cluster (to highlight similarity between cell subpopulations), and shape them by condition (to reveal subpopulation-specific expression changes across conditions). In this example, we can see that cluster-sample instances of the same cell subpopulations group together. Meanwhile, most subpopulations exhibit a shift between instances where samples come from different treatment groups:

pbMDS(sce, by = "both", k = "meta12", 
    shape_by = "condition", size_by = TRUE)

`clrDR`: Reduced dimension plot on CLR of proportions

A dimensionality reduction plot on centered log-ratios (CLR) of sample/cluster proportions across clusters/samples can be rendered with clrDR. Here, we view each sample (cluster) as a vector of cluster (sample) proportions. Complementary to pbMDS, such a plot will give a sense of similarities between samples/clusters in terms of their composition.

Centered log-ratio
Let $s_i=s_1,...,s_S$ denote one of $S$ samples, $k_i=k_1,...,k_K$ one of $K$ clusters, and $p_k(s_i)$ be the proportion of cells from sample $s_i$ in cluster $k$. The centered log-ratio (CLR) on a given sample's cluster composition is then defined as:

$$\text{clr}{sk} = \log p_k(s_i) - \frac{1}{K}\sum{i=1}^K \log p_k(s_i)$$

Thus, each sample $s$ gives a vector with length $K$ with mean 0, and the CLRs computed across all instances can be represented as a matrix with dimensions $S \times K$.
We can embed the CLR matrix into a lower dimensional space in which points represent samples; or embed its transpose, in which case points represent samples. Distances in this lower-dimensional space will then represent the similarity in cluster compositions between samples and the in sample compositions between clusters, respectively.

Dimensionality reduction
In principle, clrDR allows any dimension reduction to be applied on the CLRs, with dims (default c(1, 2)) specifying which dimensions to visualize. The default method dr = "PCA" will include the percentage of variance explained by each principal component (PC) in the axis labels. Noteworthily, distances between points in the lower-dimensional space are meaningful only for linear DR methods (PCA and MDS), and results obtained from other methods should be interpreted with caution. The output plot's aspect ratio should thus be kept as is for PCA and MDS; meanwhile, non-linear DR methods can use aspect.ratio = 1, rendering a squared plot.

Interpreting loadings
For dr = "PCA", PC loadings will be represented as arrows that may be interpreted as follows: 0° (180°) between vectors indicates a strong positive (negative) relation between them, while vectors that are orthogonal to one another (90°) are roughly independent.
When a vector points towards a given quadrant, the variability in proportions for the points within this quadrant are largely driven by the corresponding variable. Here, only the relative orientation of loading vectors to each other and to the PC axes is meaningful; however, the sign of loadings (i.e., whether an arrow points left or right) can be flipped when re-computing PCs.

Aesthetics
Cell metadata variables to color points and PC loading arrows by are determined by arguments point/arrow_col, with point/arrow_pal specifying the color palettes to use for each layer. For example, rather than coloring samples by their unique identifiers, we may choose to use their condition for coloring instead to highlight differences across groups. In addition, point sizes may be scaled by the number of cells in a given sample/cluster (when by = "sample/cluster_id") via setting size_by = TRUE.
Argument arrow_len (default 0.5) controls the length of PC loading vectors relative to the largest absolute xy-coordinate. When specified, PC loading vectors will be re-scaled to improve their visibility: A value of 1 will stretch vectors such that the largest loading will touch on the outer most point. Importantly, while absolute arrow lengths are not interpretable, their relative length is.

Ex. 1: CLR on cluster proportions across samples

We here visualize the first two PCs computed on CLRs of sample proportions across clusters: small distances between samples mean similar cluster compositions between them, while large distances are indicative of differences in cluster proportions. In our example, PC 1 clearly separates treatment groups:

clrDR(sce, by = "sample_id", k = "meta12")

Ex. 2: CLR on sample proportions across clusters

As an alternative to the plot above, we can visualize the first two PCs computed on the CLR matrix' transpose. Here, we can observe that most of the variability in cluster-compositions across samples is driven by BCRXL samples (PC1):

clrDR(sce, by = "cluster_id", arrow_col = "condition", size_by = FALSE)

`plotExprHeatmap`: Heatmap of aggregated marker expressions {#plotExprHeatmap}

plotExprHeatmap allows generating heatmaps of aggregated (pseudobulk) marker expressions. Argument assay (default "exprs") and fun (default "median") control the aggregation. Depending on argument by, aggregation will be performed by "sample_id", "cluster", or "both".

Scaling
plotExprHeatmap supports various scaling strategies that are controlled by arguments scale and q, and will greatly alter the visualization and its interpretation[^2]:

[^2]: Regardless of the chosen scaling strategy, row and column clusterings will be performed on unscaled data.

When scale = "first", the specified assay data will be scaled between 0 and 1 using lower (q) and upper (1-q) quantiles as boundaries, where q = 0.01 by default. This way, while losing information on absolut expression values, marker expressions will stay comparable, and visualization is improved in cases where the expression range varies greatly between markers.
When scale = "last", assay data will be aggregated first and scaled subsequently. Thus, each marker's value range will be [0,1]. While all comparability between markers is lost, such scaling will improve seeing differences across, e.g., samples or clusters.
When scale = "never", no scaling (and quantile trimming) is applied. The resulting heatmap will thus display raw pseudobulk data (e.g., median expression values).

Hierarchical clustering
Various arguments control whether rows/columns should be hierarchically clustered (and re-ordered) accordingly (row/col_clust), and whether to include the resulting dendrograms in the heatmap (row/col_dend). Here, clustering is performed using dist followed by hclust on the assay data matrix with the specified method as distance metric (default euclidean) and linkage for agglomeration (default average).

Cell count annotation
Optionally, argument bars specifies whether to include a barplot of cell counts, which can further be annotated with relative abundances (%) via perc = TRUE. These will correspong to cell counts by cluster (when by != "sample_id"), and by sample otherwise.

Sample and cluster annotations
By default (row/col_anno = TRUE), for axes corresponding to samples (y-axis for by = "sample_id" and x-axis for by = "both"), annotations will be drawn for all non-numeric cell metadata variables. Alternatively, a specific subset of annotations can be included for only a subset of variables by specifying row/col_anno to be a character vector in \code{names(colData(x))} (see examples).
For axes corresponding to clusters (y-axis for by = "cluster_id" and "both"), annotations will be drawn for the specified clustering(s) (arguments k and m).

The following examples shall cover the 3 different modes of plotExprHeatmap:

# scale each marker between 0 and 1 
# (after aggregation & without trimming)
plotExprHeatmap(sce, features = "state", 
    scale = "last", q = 0, bars = FALSE)

When by != "sample_id", the clustering to use for aggregation is specified via k, and an additional metaclusting may be included as an annotation (argument m):

# medians of scaled & trimmed type-marker expressions by cluster
plotExprHeatmap(sce, features = "type",
    by = "cluster_id", k = "meta12", m = "meta8",
    scale = "first", q = 0.01, perc = TRUE, col_dend = FALSE)

Finally, we can visualize the aggregated expression of specific markers by cluster and sample via by = "both"[^3]:

[^3]: In this case, only a single features is allowed as input.

# raw (not scaled, not trimmed) 
# median expression by cluster-sample
plotExprHeatmap(sce, features = "pS6", by = "both", k = "meta8", 
    scale = "never", col_clust = FALSE, row_anno = FALSE, bars = FALSE)

`plotPbExprs`: Pseudobulk expression boxplot

A combined boxplot and jitter of aggregated marker intensities can be generated via plotPbExprs(). Here, argument features (default "state", which is equivalent to state_markers(sce)) controls which markers to include. features = NULL will include all markers (and is equivalent to rownames(sce)).
The specified assay values (default "exprs") will be aggregated using fun (default "median") as summary statistic, resulting in one pseudobulk value per sample, or cluster-sample (when one of facet_by, color_by or group_by is set to "cluster_id").

In order to compare medians for each cluster, and potentially identify changes across conditions early on, we specify facet = "cluster_id":

plotPbExprs(sce, k = "meta8", facet_by = "cluster_id", ncol = 4)

Alternatively, we can facet the above plot by antigen in order to compare marker expressions calculated over all cells across conditions:

plotPbExprs(sce, facet_by = "antigen", ncol = 7)

Thirdly, we can investigate how consistent type-markers are expressed across clusters. To this end, we specify color_by = "cluster_id" in order to aggregate expression values by both clusters and samples. The resulting plot gives an indication how good our selection of type markers is: Ideally, their expression should be fairly specific to a subset of clusters.

plotPbExprs(sce, k = "meta10", features = "type", 
  group_by = "cluster_id", color_by = "sample_id", 
  size_by = TRUE, geom = "points", jitter = FALSE, ncol = 5)

Finally, we can investigate how variable state-marker expressions are across clusters by setting group_by = "cluster_id" (to aggregate by both, samples and clusters) and color_by = "condition" (to additionally group samples by treatment). This type of visualization yields similar information as the plot above: We can observe that there are global shifts in expression for a set of markers (e.g., pNFkB, pp38), and rather subpopulation-specific changes for others (e.g., pS6).

plotPbExprs(sce, k = "meta6", features = "state",  
    group_by = "cluster_id", color_by = "condition", ncol = 7)

`plotClusterExprs`: Marker-densities by cluster

Distributions of marker intensities (arcsinh-transformed) across cell populations of interest can be plotted with plotClusterExprs. We specify features = "type" (equivalent to type_markers(sce)), to include type-markers only. Here, blue densities (top row) are calculated over all cells and serve as a reference.

plotClusterExprs(sce, k = "meta8", features = "type")

`plotAbundances`: Relative population abundances

Relative population abundances for any clustering of interest can be plotted with plotAbundances. Argument by will specify whether to plot proportions for each sample or cluster; group_by determines the grouping within each panel as well ascolor coding.

If by = "sample_id", the function displays each sample's cell type composition, and the size of a given stripe reflects the proportion of the corresponding cell type the given sample. Argument group_by then specifies the facetting.
If by = "cluster_id", argument group_by then specifies the grouping and color coding.

plotAbundances(sce, k = "meta12", by = "sample_id", group_by = "condition")

plotAbundances(sce, k = "meta8", by = "cluster_id", 
    group_by = "condition", shape_by = "patient_id")

`plotFreqHeatmap`: Heatmap of cluster fequencies {#plotFreqHeatmap}

Complementary to plotAbundances, a heatmap of relative cluster abundances by cluster can be generated with plotFreqHeatmap. By default (normalize = TRUE), frequencies will we standarized for each cluster, across samples. Analogous to plotExprHeatmap (Sec. \@ref(plotExprHeatmap)), arguments row/col_clust/dend control hierarchical clustering or rows (clusters) and columns (samples), and whether the resulting dendrograms should be display; k specifies the clustering across which to compute abundances, and m a secondary clustering for display. Again, bars and perc can be used to include a labelled cell count barplot.

# complete example
plotFreqHeatmap(sce, 
    k = "meta8", m = "meta5",
    hm_pal = rev(hcl.colors(10, "RdBu")),
    k_pal = hcl.colors(7, "Zissou 1"),
    m_pal = hcl.colors(4, "Temps"),
    bars = TRUE, perc = TRUE)

# minimal example
plotFreqHeatmap(sce, k = "meta10", 
    normalize = FALSE, bars = FALSE,
    row_anno = FALSE, col_anno = FALSE,
    row_clust = FALSE, col_clust = FALSE,
    hm_pal = c("grey95", "black"))

`plotMultiHeatmap`: Multi-panel Heatmaps

plotMultiHeatmap provides flexible options to combine expression and cluster frequency heatmaps generated with plotExprHeatmap (Sec. \@ref(plotExprHeatmap)) and plotFreqHeatmap (Sec. \@ref(plotFreqHeatmap)), respectively, with arguments hm1 and hm2 controlling the panel contents.

Panel contents
In its 1st panel, plotMultiHeatmap will display pseudobulks by cluster. Here, hm1 may be used to specify a set of markers (subset of rownames(sce)), or "type"/"state" for type/state_markers(sce) when marker_classes(sce) have been specified. The 1st heatmap can be turned off altogether by setting hm1 = FALSE.
Anologously, for hm2 = "type"/"state", an expression heatmap of type-/state-markers will be displayed as 2nd heatmap (see Sec. \@ref(plotMultiHeatmap-example1)). hm2 = "abundances" will render cluster frequencies by sample (see Sec. \@ref(plotMultiHeatmap-example2)). As opposed to argument hm1, however, when hm2 specifies one or multiple marker(s), a separate heatmap of pseudobulks by cluster-sample will be drawn for each marker (see Sec. \@ref(plotMultiHeatmap-example3)).

Row and column annotations
The clustering to aggregate by is specified with argument k. Optionally, a metaclustering of interest may be provided via m; here, clustering m is merely included as an additional annotation (and not used for any computation). Thus, m serves to visually inspect the quality a lower-resolution clustering or manual merging.
When an x-axis corresponds to samples, plotMultiHeatmap will include column annotations for cell metadata variables (columns in colData) that map uniquely to each sample (e.g., condition, patient ID). These annotations can be omitted via col_anno = FALSE, or reduced (e.g., col_anno = "condition" to include only a single annotation); by default (col_anno = TRUE), all available metadata is included.

Argument handling
Most of plotMultiHeatmap's arguments take a single value as input. For example, row annotations and dendrograms are automatically removed for all but the 1st panel. Nevertheless, a subset of arguments may be set uniquely for each heatmap; these are:

scale and q controlling the scaling strategy for expression heatmaps
col_clust/dend specifying whether or not to column-cluster each heatmap and draw the resulting dendrograms

When any of these arguments is of length one, the specified value will be recycled for both heatmaps. E.g., setting scale = "never" will have all expression heatmaps show unscaled data.

Ex. 1: Type- & state-markers {#plotMultiHeatmap-example1}

To demonstrate plotMultiHeatmap's basic functionality and handling of arguments, we plot expression heatmaps for both type- (hm1 = "type") and state-markers (hm2 = "state"), and choose to include a column dendrogram for the 2nd heatmap only:

# both, median type- & state-marker expressions
plotMultiHeatmap(sce, 
    hm1 = "type", hm2 = "state", 
    k = "meta12", m = "meta8",
    col_dend = c(FALSE, TRUE))

Ex. 2: CDx markers & cluster frequencies {#plotMultiHeatmap-example2}

As a second example, we plot an expession heatmap for a selection of markers (here, those starting with "CD": hm1 = c("CDx", "CDy", ...)[^4] next to the relative cluster abundances across samples (hm2 = "abundances"). We also add a barplot for the cell counts in each cluster (bars = TRUE) along with labels for their relative abundance (perc = TRUE):

[^4]: hm1 = NULL would include all markers.

# 1st: CDx markers by cluster; 
# 2nd: population frequencies by sample
cdx <- grep("CD", rownames(sce), value = TRUE)
plotMultiHeatmap(sce, k = "meta6",
    hm1 = cdx, hm2 = "abundances", 
    bars = TRUE, perc = TRUE, row_anno = FALSE)

Ex. 3: Selected markers {#plotMultiHeatmap-example3}

In this final example, we view a selection of markers side-by-side, and omit the 1st panel (hm1 = FALSE). We also retain the ordering of samples (column order) across panels (col_clust = FALSE). In this case, plotMultiHeatmap will drop column names (sample IDs) for all but the first panel to avoid repeating these labels and overcrowding the plot. Lastly, we set scale = "never" to visualize raw (unscaled) pseudobulks (= median expressions by cluster-sample):

# plot selected markers side-by-side;
# omit left-hand side heatmap
plotMultiHeatmap(sce, 
    k = "meta8", scale = "never",
    hm1 = FALSE, hm2 = c("pS6", "pp38", "pBtk"),
    row_anno = FALSE, col_clust = FALSE,
    hm2_pal = c("grey95", "black"))

Dimensionality reduction

The number of cells in cytometry data is typically large, and for visualization of cells in a two-dimensional space it is often sufficient to run dimension reductions on a subset of the data. Thus, CATALYST provides the wrapper function runDR to apply any of the dimension reductions available from BiocStyle::Biocpkg("scater") using

the subset of features specified via argument features; either a subset of rownames(.) or, e.g., "type" for type_markers(.) (if marker_classes(.) have been specified).
the subset of cells specified via argument cells; either NULL for all cells, or n to sample a random subset of n cells per sample.

To make results reproducible, the random seed should be set via set.seed prior to computing reduced dimensions:

set.seed(1601)
sce <- runDR(sce, dr = "UMAP", cells = 500, features = "type")

Alternatively, dimension reductions can be computed using one of r BiocStyle::Biocpkg("scater")'s runX functions (X = "TSNE", "UMAP", ...). Note that, by default, scater expects expression values to be stored in the logcounts assay of the SCE; specification of exprs_values = "exprs" is thus required:

sce <- runUMAP(sce, exprs_values = "exprs")

DRs available within the SCE can be viewed via reducedDimNames and accessed with reducedDim(s):

# view & access DRs
reducedDimNames(sce)
head(reducedDim(sce, "UMAP"))

While r BiocStyle::Biocpkg("scater")'s plotReducedDim function can be used to visualize DRs, CATALYST provides the plotDR wrapper, specifically to allow for coloring cells by the various clusterings available, and to support facetting by metadata factors (e.g., experimental condition, sample IDs):

# color by marker expression & split by condition
plotDR(sce, color_by = c("pS6", "pNFkB"), facet_by = "condition")

# color by 8 metaclusters & split by sample ID
p <- plotDR(sce, color_by = "meta8", facet_by = "sample_id")
p$facet$params$ncol <- 4; p

Filtering

SCEs constructed with prepData can be filtered using the filterSCE function, which allows for filtering of both cells and markers according to conditional statements in dplyr-style. When filtering on cluster_ids, argument k specifies which clustering to use (the default NULL uses colData column "cluster_id"). Two examples are given below:

u <- filterSCE(sce, patient_id == "Patient1")
table(u$sample_id)

u <- filterSCE(sce, k = "meta8",
    cluster_id %in% c(1, 3, 8))
plot_grid(
    plotDR(sce, color_by = "meta8"),
    plotDR(u, color_by = "meta8"))

Differental testing with `r BiocStyle::Biocpkg("diffcyt")`

r BiocStyle::Biocpkg("CATALYST") has been designed to be compatible with the r BiocStyle::Biocpkg("diffcyt") package [@Weber2019-diffcyt], which implements statistical methods for differential discovery in high-dimensional cytometry (including flow cytometry, mass cytometry or CyTOF, and oligonucleotide-tagged cytometry) using high-resolution clustering and moderated tests. The input to the r BiocStyle::Biocpkg("diffcyt") pipeline can either be raw data, or a SingleCellExperiment object. We give an exmaple of the latter below.
Please refer to the r BiocStyle::Biocpkg("diffcyt") vignette and R documentation (??diffcyt) for more detailed information.

# create design & constrast matrix
design <- createDesignMatrix(ei(sce), cols_design = "condition")
contrast <- createContrast(c(0, 1))

# test for
# - differential abundance (DA) of clusters
# - differential states (DS) within clusters
res_DA <- diffcyt(sce, clustering_to_use = "meta10",
    analysis_type = "DA", method_DA = "diffcyt-DA-edgeR",
    design = design, contrast = contrast, verbose = FALSE)
res_DS <- diffcyt(sce, clustering_to_use = "meta10",
    analysis_type = "DS", method_DS = "diffcyt-DS-limma",
    design = design, contrast = contrast, verbose = FALSE)

# extract result tables
tbl_DA <- rowData(res_DA$res)
tbl_DS <- rowData(res_DS$res)

`plotDiffHeatmap`: Heatmap of differential testing results

Differential testing results returned by r BiocStyle::Biocpkg("diffcyt") can be displayed with the plotDiffHeatmap function.

For differential abundance (DA) tests, plotDiffHeatmap will display relative cluster abundances by samples; for differential state (DS) tests, plotDiffHeatmap will display aggregated marker expressions by sample.

Filtering
The results to retain for visualization can be filtered via

fdr: threshold on adjusted p-values below which to keep a result
lfc: thershold on absolute logFCs above which to keep a result

The number of top findings to display can be specified with top_n (default 20). When all = TRUE, significance and logFC filtering will be skipped, and all top_n results are shown.

Annotations
Analogous to plotFreq/Expr/MuliHeatmap, when col_anno = TRUE, plotDiffHeatmap will include column annotations for cell metadata variables (columns in colData) that map uniquely to each sample (e.g., condition, patient ID). These annotations can be omitted via col_anno = FALSE, or reduced (e.g., col_anno = "condition" to include only a single annotation).
When row_anno = TRUE, cluster (DA) and cluster-marker instances (DS) will be marked as significant if their adjusted p-value falls below the threshold value specified with fdr. A second annotation will be drawn for the logFCs.

Normalization
When normalize = TRUE, the heatmap will display Z-score normalized values. For DA, cluster frequencies will be arcsine-square-root scaled prior to normalization. While losing information on absolution frequency/expression values, this option will make differences across samples and conditions more notable.

Ex. 1: DA testing results

We here set all = TRUE to display top-20 DA analysis results, without filtering on adjusted p-values and logFCs. Since differential testing was performed on 10 clusters only, this will simply include all available results.
By setting fdr = 0.05 despite not filtering on significance, we can control the right-hand side annotation:

plotDiffHeatmap(sce, tbl_DA, all = TRUE, fdr = 0.05)

Ex. 2: DS testing results

Via setting fdr = 0.05, we here display the top DS analysis results in terms of significance. Alternative to the example above, we sort these according their logFCs (sort_by = "lfc"), and include only a selected sample annotation (col_anno = "condition"):

plotDiffHeatmap(sce, tbl_DS, fdr = 0.05, 
    sort_by = "lfc", col_anno = "condition")

Ex. 3: Filtering results

As an alternative to leaving the selection of markers and clusters to their ordering (significance), we can visualize a specific subset of results (e.g., a selected marker or cluster) using filterSCE:

# include all results for selected marker
plotDiffHeatmap(sce["pp38", ], tbl_DS, all = TRUE, col_anno = FALSE)

# include all results for selected cluster
k <- metadata(res_DS$res)$clustering_name
sub <- filterSCE(sce, cluster_id == 8, k = k)
plotDiffHeatmap(sub, tbl_DS, all = TRUE, normalize = FALSE)

Ex. 4: Customizing appearance

Heatmap and annotation colors are controlled via arguments hm_pal and fdr/lfc_pal, respectively. Here's an example how these can be modified:

plotDiffHeatmap(sce, tbl_DA, all = TRUE, col_anno = FALSE,
    hm_pal = c("gold", "white", "navy"),
    fdr_pal = c("grey90", "grey50"),
    lfc_pal = c("red3", "grey90", "green3"))

Exporting FCS files

Conversion from SCE to flowFrames/flowSet, which in turn can be writting to FCS files using r Biocpkg("flowCore")'s write.FCS function, is not straightforward. It is not recommended to directly write FCS via write.FCS(flowFrame(t(assay(sce)))), as this can lead to invalid FCS files or the data being shown on an inappropriate scale in e.g. Cytobank. Instead, CATALYST provides the sce2fcs function to facilitate correct back-conversion.

sce2fcs allows specification of a colData column to split the SCE by (argument split_by), e.g., to split the data by cluster; whether to keep or drop any cell metadata (argument keep_cd) and dimension reductions (argument keep_dr) available within the object; and which assay data to use (argument assay)[^5]:

[^5]: Only count-like data should be written to FCS files and is guaranteed to show with approporiate scale in Cytobank!

# store final clustering in cell metadata
sce$mm <- cluster_ids(sce, "merging1")
# convert to 'flowSet' with one frame per cluster 
(fs <- sce2fcs(sce, split_by = "mm"))
# split check: number of cells per barcode ID
# equals number of cells in each 'flowFrame'
all(c(fsApply(fs, nrow)) == table(sce$mm))
# store identifiers (= cluster names)
(ids <- c(fsApply(fs, identifier)))

Having converted out SCE to a flowSet, we can write out each of its flowFrames to an FCS file with a meaningul filename that retains the cluster of origin:

for (id in ids) {
    # subset 'flowFrame' for cluster 'id'
    ff <- fs[[id]]                      
    # specify output name that includes ID
    fn <- sprintf("manuel_merging_%s.fcs", id) 
    # construct output path
    fn <- file.path("...", fn)   
    # write frame to FCS
    write.FCS(ff, fn)                   
}

Using other clustering algorithms

While r BiocStyle::Biocpkg("FlowSOM") has proven to perform well in systematic comparisons of clustering algorithms for CyTOF data [@Weber2016-clustering; @Freytag2018-clustering], it is not the only method out there. Here we demonstrate how clustering assignments from another clustering method, say, r Githubpkg("JinmiaoChenLab", "Rphenograph"), could be incorporated into the SCE to make use of the visualizations available in CATALYST. Analogous to the example below, virtually any clustering algorithm could be applied, however, with the following limitation:

The ConsensusClusterPlus metaclusterings applied to the initial FlowSOM clustering by CATALYST's cluster function have a hierarchical cluster structure. Thus, clustering IDs can be matched from a higher resolution (e.g. 100 SOM clusters) to any lower resolution (e.g., 2 through 20 metaclusters). This is not guaranteed for other clustering algorithms. Thus, we store only a single resolution in the cell metadata column cluster_id, and a single column under metadata slot cluster_codes containing the unique cluster IDs. Adding additional resolutions to the cluster_codes will fail if cluster IDs can not be matched uniquely across conditions, which will be the case for any non-hierarchical clustering method.

# subset type-marker expression matrix
es <- assay(sce, "exprs")
es <- es[type_markers(sce), ]

# run clustering method X
# (here, we just split the cells into 
# equal chunks according to CD33 expression)
cs <- split(seq_len(ncol(sce)), cut(es["CD33", ], nk <- 10))
kids <- lapply(seq_len(nk), function(i) {
    rep(i, length(cs[[i]]))
})
kids <- factor(unlist(kids))

# store cluster IDs in cell metadata & codes in metadata
foo <- sce
foo$cluster_id[unlist(cs)] <- unlist(kids)
metadata(foo)$cluster_codes <- data.frame(
    custom = factor(levels(kids), levels = levels(kids)))

# tabulate cluster assignments
table(cluster_ids(foo, "custom"))

Customizing visualizations

Most of CATALYST's plotting functions return ggplot objects whose aesthetics can (in general) be modified easily. However, while e.g. theme aesthetics and color scales can simply be added to the plot, certain modifications can be achieved only through overwriting elements stored in the object, and thus require a decent understanding of its structure.

Other functions (plotExprHeatmap, plotMultiHeatmap and plotDiffHeatmap) generate objects of the Heatmap of HeatmapList class from the r BiocStyle::Biocpkg("ComplexHeatmap") package, and are harder to modify once created. Therefore, CATALYST tries to expose a reasonable amount of arguments to the user that control key aesthetics such as the palettes used for coloring clusters and heatmaps.

The examples below serve to illustrate how some less exposed ggplot aesthetics can be modified in retrospect, and the effects of different arguments that control visualization of r BiocStyle::Biocpkg("ComplexHeatmap") outputs.

Modifying `ggplot`s

p <- plotMedExprs(sce, k = "meta4", facet_by = "cluster_id")
# facetting layout is 2x2; plot all side-by-side instead
p$facet$params$nrow <- 1
# remove points
p$layers <- p$layers[-1]
# overwrite default colors
p <- p + scale_color_manual(values = c("royalblue", "orange"))
# remove x-axis title, change angle & decrease size of labels
(p + labs(x = NULL) + theme(axis.text.x = element_text(angle = 90, size = 8)))

Modifying `ComplexHeatmap`s

plotMultiHeatmap(sce,
    k = "meta8", 
    m = "meta4",
    hm2 = "abundances",
    # include all dendrograms
    row_dend = TRUE, 
    col_dend = TRUE, 
    # exclude sample annotations
    col_anno = FALSE,
    # primary & merging cluster palettes
    k_pal = hcl.colors(8, "Vik"),     
    m_pal = hcl.colors(4, "Tropic"), 
    # 1st & 2nd heatmap coloring
    hm1_pal = c("grey95", "blue"),  
    hm2_pal = c("grey95", "red3"))

# minimal heatmap
plotExprHeatmap(sce,
    row_anno = FALSE,   # don't annotate samples
    row_clust = FALSE,  # keep samples in original order
    col_clust = FALSE,  # keep markers in original order
    bin_anno = FALSE,   # don't annotate bins
    bars = FALSE,       # don't include sample sizes
    scale = "last",     # aggregate, then scale
    hm_pal = hcl.colors(10, "YlGnBu", rev = TRUE))

# complete heatmap
plotExprHeatmap(sce, row_anno = TRUE,   # annotate samples
    row_clust = TRUE, col_clust = TRUE, # cluster samples/markers
    row_dend = TRUE, col_dend = TRUE,   # include dendrograms
    bin_anno = TRUE,          # annotate bins with value
    bars = TRUE, perc = TRUE, # include barplot of sample sizes
    hm_pal = c("grey95", "orange"))

Combining `ComplexHeatmap`s

While plotMultiHeatmap provides a convenient way to combine pseudoublk expression heatmaps across clusters or cluster-sample combinations with heatmaps of relative cluster abundances, Heatmap objects can be, in principle, combined arbitrarily with a few notes of caution:

each Heatmap should use the same clustering for aggregation (argument k).
each Heatmap should have unique identifier (slot @name); otherwise, a warning is given.
each Heatmap's legend should have a unique title (slot @matrix_color_mapping@name);
otherwise, legends with a title that is already in use will be dropped.

Ex. 1: type- & state-markers + cluster frequencies

# specify clustering to aggregate by
k <- "meta11"

# median type-marker expression by cluster
p1 <- plotExprHeatmap(sce, features = "type",
    by = "cluster_id", k = k, m = "meta7")

# median state-marker expression by cluster
p2 <- plotExprHeatmap(sce, features = "state",
    by = "cluster_id", k = k, row_anno = FALSE)

# relative cluster abundances by sample
p3 <- plotFreqHeatmap(sce, k = k, perc = TRUE,
    row_anno = FALSE, col_clust = FALSE)

# make legend titles unique
p1@name <- p1@matrix_color_mapping@name <- "type"
p2@name <- p2@matrix_color_mapping@name <- "state"

p1 + p2 + p3

Ex. 2: frequencies + selected markers + all markers

# specify clustering to aggregate by
k <- "meta9"

# relative cluster abundances by sample
p <- plotFreqHeatmap(sce, k = k, 
    bars = FALSE, hm_pal = c("white", "black"),
    row_anno = FALSE, col_clust = FALSE, col_anno = FALSE)

# specify unique coloring
cs <- c(pp38 = "maroon", pBtk = "green4")

# median expression of selected markers by cluster-sample
for (f in names(cs)) {
    q <- plotExprHeatmap(sce, features = f, 
        by = "both", k = k, scale = "never",
        row_anno = FALSE, col_clust = FALSE, 
        hm_pal = c("white", cs[f]))
    # make identifier & legend title unique
    q@name <- q@matrix_color_mapping@name <- f
    # remove column annotation names
    for (i in seq_along(q@top_annotation@anno_list))
        q@top_annotation@anno_list[[i]]@name_param$show <- FALSE
    # remove redundant sample names
    q@column_names_param$show <- FALSE
    p <- p + q
}

# add heatmap of median expression across all features
p + plotExprHeatmap(sce, features = NULL,
    by = "cluster_id", k = k, row_anno = FALSE)

Session information

sessionInfo()

References

Any scripts or data that you put into this service are public.

CATALYST documentation built on Nov. 8, 2020, 6:53 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

CATALYST Cytometry dATa anALYSis Tools

Differential discovery with `CATALYST` In CATALYST: Cytometry dATa anALYSis Tools

Example data

Data preparation

Clustering

cluster: FlowSOM clustering & ConsensusClusterPlus metaclustering

mergeClusters: Manual cluster merging

Delta area plot

Visualization

plotCounts: Number of cells measured per sample

pbMDS: Pseudobulk-level MDS plot

Ex. 2: MDS on sample-level pseudobulks

Ex. 1: MDS on pseudobulks by cluster-sample

clrDR: Reduced dimension plot on CLR of proportions

Ex. 1: CLR on cluster proportions across samples

Ex. 2: CLR on sample proportions across clusters

plotExprHeatmap: Heatmap of aggregated marker expressions {#plotExprHeatmap}

plotPbExprs: Pseudobulk expression boxplot

plotClusterExprs: Marker-densities by cluster

plotAbundances: Relative population abundances

plotFreqHeatmap: Heatmap of cluster fequencies {#plotFreqHeatmap}

plotMultiHeatmap: Multi-panel Heatmaps