dr_to_fcs: Calculate dimension reduction and cluster annotation with...

View source: R/dr_to_fcs.R

dr_to_fcs    R Documentation

Calculate dimension reduction and cluster annotation with data from one or more flow frames and add these parameters to a (concatenated) fcs file

Description

Prepare a list of flowframes (ff.list) with fcexpr::wsp_get_ff or fcexpr::inds_get_ff. The objects returned from these functions are compatible with fcexpr::dr_to_fcs. The ff.list must contain a list of flowframes with untransformed fluorescence intensities (FI) and may optionally contain an additional list of flowframes with transformed FI. The transformation is your choice and may be logicle, arcsinh or any other; it may be individual to channels and/or flowframes. You have to calculate the transformation outside this function (dr_to_fcs) though. If a list of transformed flowframes is provided, it is used for dimension reduction, clustering etc. If only a list of untransformed flowframes is provided, that one is used. Both the transformed and the untransformed values can be written to the concatenated fcs file (see function arguments). Calculated dimension reductions and cluster annotations are also written as channels into that fcs file. Some dimension reduction and cluster detection algorithms may become slow with a large number of events (e.g. > 1e6, see Details). In order to get a quick impression of what the algorithms can pull out for you, you may use the 'downsample' argument in fcexpr::wsp_get_ff or fcexpr::inds_get_ff to conveniently sample a random subset of events from each fcs file (flowframe).
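
A minimal sketch of this workflow is given below; 'ffs' stands for the list returned by fcexpr::wsp_get_ff or fcexpr::inds_get_ff, and the channel names/descriptions are hypothetical placeholders:

# minimal sketch; 'ffs' comes from fcexpr::wsp_get_ff or fcexpr::inds_get_ff,
# the channel descriptions and names below are hypothetical placeholders
out <- fcexpr::dr_to_fcs(ff.list = ffs,
                         channels = c(CD3 = "APC-A", CD8 = "BV421-A"),
                         run.umap = TRUE,
                         run.som = TRUE,
                         save.path = NULL) # set a path to write the concatenated fcs file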

Usage

dr_to_fcs(
  ff.list,
  channels = NULL,
  add.sample.info = NULL,
  scale.whole = c("z.score", "min.max", "none"),
  scale.samples = c("none", "z.score", "min.max"),
  run.harmony = F,
  run.pca = F,
  run.lda = F,
  run.umap = T,
  run.tsne = F,
  run.som = T,
  run.gqtsom = F,
  metaclustering.on = NULL,
  run.louvain = F,
  run.kmeans = F,
  run.minibatchkmeans = F,
  run.kmeans_arma = F,
  run.kmeans_rcpp = F,
  run.leiden = F,
  run.hclust = F,
  run.flowClust = F,
  run.MUDAN = F,
  n.pca.dims = 0,
  clustering.for.marker.calc = NULL,
  calc.global.markers = F,
  calc.pairwise.markers = F,
  extra.cols = NULL,
  mc.cores = 1,
  save.to.disk = c("fcs", "rds"),
  save.path = file.path(getwd(), paste0(substr(gsub("\\.", "",
    make.names(as.character(Sys.time()))), 2, 15), "_FCS_dr")),
  exclude.extra.channels = ifelse(length(ff.list) == 1 && names(ff.list) ==
    "transformed", "cluster.id", "FSC|SSC|Time|cluster.id"),
  write.transformed.channels.to.FCS = T,
  write.untransformed.channels.to.FCS = T,
  write.scaled.channels.to.FCS = F,
  timeChannel = "Time",
  transformation_name = "trans",
  return_processed_raw_data_only = F,
  seed = 42,
  ...
)

Arguments

ff.list

a list of flowFrames as returned by fcexpr::wsp_get_ff (compensated with the compensation matrix defined in FlowJo, by default) or by fcexpr::inds_get_ff (read directly from FCS files, not compensated by default)

channels

a named vector of channels to use for dimension reduction. Values can be channel names (e.g. v-450LP..., b-520...) or channel descriptions (e.g. CD3 or LAG3-PeCy7). Names are used as the new descriptions in the fcs file to be written; if no names are provided, the names of the very first FCS file are used.

add.sample.info

named list of additional channels to identify samples or group them in FlowJo; e.g. for 9 fcs files to be concatenated: add.sample.info = list(condition = c(1,2,3,1,2,3,1,2,3), donor = c(1,1,1,2,2,2,3,3,3))

scale.whole

if and how to scale channels after concatenation of flowframes in ff.list

scale.samples

if and how to scale channels of flowframes in ff.list individually before concatenation

run.harmony

attempt batch correction using harmony::HarmonyMatrix; if TRUE, harmony__meta_data has to be provided in ..., indicating the groups to be batch corrected; harmony is run before run.pca; to calculate a PCA before harmony, pass harmony__do_pca = T and optionally a number of PCs with harmony__npcs. Set run.pca = F when a PCA is already computed within harmony. See the sketch at the end of this section for how such arguments may be combined.

run.pca

run principal component analysis, either on its own or prior to other dimension reductions (UMAP, SOM, tSNE, ...)

run.lda

run Linear Discriminant Analysis before dimension reduction; should be F (FALSE) or the name of a clustering calculated beforehand, e.g. louvain_0.8, leiden_1.1, kmeans_12 etc.; the respective clustering calculation has to be requested via ...

run.umap

logical, whether to calculate UMAP dimension reduction with uwot::umap

run.tsne

logical, whether to calculate tsne dimension reduction with Rtsne::Rtsne

run.som

logical, whether to calculate SOM dimension reduction with EmbedSOM::SOM followed by EmbedSOM::EmbedSOM

run.gqtsom

logical, whether to calculate GQTSOM dimension reduction with EmbedSOM::GQTSOM followed by EmbedSOM::EmbedSOM

metaclustering.on

SOM or GQTSOM; run clustering algorithms not on the original data but on the SOM map (so-called metaclustering); SOM or GQTSOM may be selected for metaclustering; this is expected to substantially speed up calculation on large data sets; if NULL, clustering is performed on the original data; recommendation: when applying metaclustering, choose run.hclust = T and supply hclust__method = "average" and cutree__k = c(xx), with xx being the number of desired/expected clusters (a sketch is given in Details)

run.louvain

logical, whether to detect clusters (communities, groups) of cells with the louvain algorithm, as implemented in Seurat::FindClusters (subsequent to SNN detection by Seurat::FindNeighbors)

run.kmeans

logical, whether to detect clusters with stats::kmeans

run.minibatchkmeans

logical, whether to detect clusters with ClusterR::MiniBatchKmeans

run.kmeans_arma

logical, whether to detect clusters with ClusterR::KMeans_arma

run.kmeans_rcpp

logical, whether to detect clusters with ClusterR::KMeans_rcpp

run.leiden

logical, whether to detect clusters (communities, groups) of cells with the leiden algorithm, using leiden::leiden (subsequent to SNN detection by Seurat::FindNeighbors)

run.hclust

logical, whether to detect clusters with stats::dist, stats::hclust and stats::cutree

run.flowClust

logical, whether to detect clusters with flowClust::flowClust

run.MUDAN

logical, whether to detect clusters with MUDAN::getComMembership

n.pca.dims

number of principal components to calculate

clustering.for.marker.calc

if NULL, nothing is calculated; otherwise marker features (stained markers) are determined by Wilcoxon test using presto::wilcoxauc for the provided clustering(s). Each cluster is tested against all other events, and clusters are compared pairwise. The respective clustering calculation has to be requested in ...; e.g. if louvain__resolution = 0.5 is provided, set clustering.for.marker.calc = "louvain_0.5"; if in addition leiden__resolution_parameter = 0.7 is provided, set clustering.for.marker.calc = c("louvain_0.5", "leiden_0.7") (see the sketch at the end of this section).

calc.global.markers

logical, whether to calculate global cluster markers, i.e. each cluster vs. all other cells

calc.pairwise.markers

logical, whether to calculate pairwise cluster markers, i.e. each cluster vs. each other cluster

extra.cols

numeric vector of an extra column (or a matrix of multiple columns) with arbitrary information to add to the final fcs file; the number of rows has to equal the total number of rows of all flowframes provided; column names will become the channel names in the FCS file; this could be a previously calculated dimension reduction or cluster annotation.

mc.cores

number of cores to use for parallel computation of cluster markers and multiple cluster resolutions; mc.cores is passed to parallel::mclapply or parallel::mcmapply; limited to parallel::detectCores()-1

save.to.disk

what to save to disk: the (concatenated) FCS file with appended channels and/or an rds file containing a list of several elements

save.path

where to save elements specified in save.to.disk; set to NULL to have nothing written to disk

exclude.extra.channels

when scaled and transformed channels are written to the FCS file, some channels may be redundant and only occupy disk space; those are specified here and matched with grepl

write.transformed.channels.to.FCS

logical, whether to save transformed channels (e.g. logicle or arcsinh) to the FCS file

write.untransformed.channels.to.FCS

logical, whether to save untransformed (inverse) channels to the FCS file

write.scaled.channels.to.FCS

logical, whether to save scaled channels (scale.whole, scale.samples) to the FCS file

timeChannel

name of the Time channel, which is excluded from all analyses and calculations; if NULL, automatic detection is attempted

transformation_name

name of the applied transformation (will appear in FCS file as channel desc)

return_processed_raw_data_only

logical; if TRUE, no dimension reduction etc. is calculated and only the preprocessed data are returned for external calculations or tryouts

seed

set a seed to make dimension reductions reproducible

...

additional parameters passed to the calculations of UMAP, tSNE, SOM, GQTSOM, EmbedSOM, louvain, leiden, harmony, flowClust, hclust, MUDAN and kmeans; provide arguments as follows: UMAP__n_neighbors = c(15,20,25), tsne__theta = 0.3, etc. See the respective help files to find out which arguments can be passed: uwot::umap, Rtsne::Rtsne, EmbedSOM::SOM, EmbedSOM::GQTSOM, EmbedSOM::EmbedSOM, harmony::HarmonyMatrix, flowClust::flowClust; for louvain: Seurat::FindNeighbors and Seurat::FindClusters; for leiden: Seurat::FindNeighbors and leiden::leiden; for hclust: stats::dist and stats::hclust; for MUDAN: MUDAN::getComMembership; for kmeans: stats::kmeans, ClusterR::MiniBatchKmeans, ClusterR::KMeans_rcpp, ClusterR::KMeans_arma. A hedged example is sketched below.
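
A hedged sketch of how such arguments may be combined in one call; 'ffs', 'channels' and 'batch_vec' are hypothetical placeholders:

out <- fcexpr::dr_to_fcs(ff.list = ffs,
                         channels = channels,
                         run.umap = TRUE,
                         UMAP__n_neighbors = 20,          # passed to uwot::umap
                         run.harmony = TRUE,
                         harmony__meta_data = batch_vec,  # groups to batch-correct (placeholder)
                         run.louvain = TRUE,
                         louvain__resolution = 0.5,       # passed to Seurat::FindClusters
                         clustering.for.marker.calc = "louvain_0.5",
                         save.path = NULL)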

Details

Logicle transformation of FI has been described here: Parks et al., 2006, PMID: 16604519, DOI: 10.1002/cyto.a.20258. Different transformations have been compared, for instance, here: Transformations.

Dimension reduction (low dimensional embedding) algorithms are: UMAP, tSNE, SOM, GQTSOM, PCA. tSNE is expected to be too slow for more than 1e4 or 1e5 cells (depending on the precision set by tsne__theta). UMAP has been developed more recently, is well-known and quicker. SOM and GQTSOM are very quick and similar to flowSOM, but much more convenient to use in programming as they accept matrices, whereas flowSOM strictly requires flowFrames. SOMs also enable the so-called metaclustering, which massively speeds up cluster detection; the cost in precision has to be inspected manually. PCA will be orders of magnitude faster than UMAP, especially on large numbers of events (1e6 and above). On the other hand, it will not produce as nicely separated clusters, as it is a linear algorithm.

Cluster (community) detection algorithms are louvain, leiden, kmeans, hclust, flowClust and MUDAN. They assign a common discrete label to events with similar properties in the high dimensional space. kmeans will be the quickest. Set metaclustering.on to 'SOM' or 'GQTSOM' to enable quick metaclustering; in that case any of the clustering algorithms is performed on the SOM map. Louvain is slower but may yield better results on the original data (no metaclustering). For a low number of events (e.g. below 1e6), louvain will perform in a reasonable amount of time and could be used right away in parallel to kmeans. leiden requires the original python library and is approximately similar to louvain with respect to calculation time, maybe a bit slower. leiden is an enhancement of louvain, though louvain does not produce "wrong" results per se. flowClust, MUDAN and hclust have not been tested extensively. For kmeans, hclust and flowClust the number of expected clusters can be provided. louvain and leiden take a resolution parameter (lower resolution -> fewer clusters). MUDAN takes a number of nearest neighbors (fewer neighbors -> more clusters).

An initial PCA may not be required if the selected channels are chosen thoughtfully (only those with stained markers). If you decide, though, to simply use all channels (with and without stained markers) for dimension reduction, PCA will automatically put the focus on the channels with the highest variation (so those which contain a staining on a subset of events). In this case you may set n.pca.dims roughly equal to the number of channels which contain a staining.
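
For large data sets, the metaclustering route recommended above could look roughly as sketched below; 'ffs', 'channels' and the cluster number 20 are placeholders:

out <- fcexpr::dr_to_fcs(ff.list = ffs,
                         channels = channels,
                         run.som = TRUE,
                         metaclustering.on = "SOM",  # cluster the SOM map instead of all events
                         run.hclust = TRUE,
                         hclust__method = "average", # passed to stats::hclust
                         cutree__k = c(20),          # desired/expected number of clusters (placeholder)
                         save.path = NULL)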

Value

A list with 3 elements: (i) the matrix of fluorescence intensities and appended information (dimension reduction, clustering); this is the table that is written into the newly generated fcs file. (ii) A character vector of meaningful column names which may be used for the table in R (rather for convenience). (iii) Tables of marker features (each cluster vs. all other events, and all clusters pairwise).
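
Assuming the call was assigned to 'out', the elements may be accessed as sketched here:

mat <- out[[1]]      # matrix of intensities plus appended dim reduction / cluster channels
cols <- out[[2]]     # meaningful column names for use in R
markers <- out[[3]]  # marker feature tables (see the examples below)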

Examples

## Not run: 
############################
### Plot cluster markers ###
############################
dr <- fcexpr::dr_to_fcs(ff.list = ffs,
                        channels = channels,
                        louvain__resolution = 0.5,
                        run.lda = "louvain_0.5",
                        clustering.for.marker.calc = c("louvain_0.5"),
                        save.path = NULL)
marker <- dr[[3]][[1]][[1]]
marker$channel_desc2 <- sapply(strsplit(marker$channel_desc, "_"), "[", 1)
marker <-
 marker %>%
 dplyr::mutate(pvalue = ifelse(pvalue == 0, 1e-300, pvalue)) %>%
 dplyr::group_by(channel_desc2) %>%
 dplyr::mutate(mean_scaled = fcexpr:::min.max.normalization(mean))

ggplot(marker, aes(x = as.factor(cluster), y = channel_desc2, fill = -log10(pvalue))) +
 geom_tile(color = "black") +
 theme_bw() +
 geom_text(aes(label = diff_sign)) +
 scale_fill_viridis_c()

ggplot(marker, aes(x = as.factor(cluster), y = channel_desc2, fill = mean_diff)) +
 geom_tile(color = "black") +
 theme_bw() +
 geom_text(aes(label = diff_sign)) +
 scale_fill_viridis_c()

ggplot(marker, aes(x = as.factor(cluster), y = channel_desc2, fill = mean_scaled)) +
 geom_tile(color = "black") +
 theme_bw() +
 scale_fill_viridis_c()

ggplot(marker, aes(x = mean_diff, y = -log10(pvalue), label = channel_desc2)) + #color = channel_desc2,
 geom_point() +
 theme_bw() +
 labs(title = "cluster markers (vs all other cells each)") +
 ggrepel::geom_text_repel(max.overlaps = 20, show.legend = F) +
 theme(panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank(), strip.background = element_rect(fill = "hotpink2")) +
 geom_vline(xintercept = 0, col = "tomato2", linetype = "dashed") +
 geom_hline(yintercept = 100, col = "tomato2", linetype = "dashed") +
 facet_wrap(vars(cluster))

 ##############################################
 ### find clusters which may be subject #######
 ### to bi- or multimodality in any channel ###
 ##############################################
 # make one data frame
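 # 'out' below is assumed to be the saved rds result (or return value) of dr_to_fcs,
 # carrying the marker tables in its 'marker' element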
 marker_all <- purrr::map_dfr(setNames(names(out[["marker"]]), names(out[["marker"]])),
 function(x) out[["marker"]][[x]][["marker_table"]], .id = "clustering")

 # sort by diptest p value; a low p indicates bi- or multimodality
 marker_all <- dplyr::arrange(marker_all, diptest_p)
 # see ?diptest::dip.test
 # a low p-value indicates bi- or multimodality (multiple peaks)
 # a high p-value (close to 1) indicates unimodality
 # multimodality indicates heterogeneity within the cluster
 # and may justify separating that cluster further into sub-clusters
 # this depends on the interpretation of the scientist though


 ##############################################
 ### overlay gated populations from flowjo ####
 ### on dimension reduction plot (SOM/UMAP) ###
 ##############################################

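# 'ff' is assumed to be the object returned by fcexpr::wsp_get_ff, here carrying
# per-population gating indices in ff[["indices"]]; 'out' again refers to the dr_to_fcs result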
common_cols <- Reduce(intersect, purrr::map(ff[["indices"]], colnames))
ff[["indices"]] <- purrr::map(ff[["indices"]], function(x) x[,which(colnames(x) %in% common_cols)])
ind_mat <- do.call(rbind, ff[["indices"]])

som <- ggplot(out[["df"]], aes(SOM_1, SOM_2, color = as.factor(cutree_30))) +
 geom_point(size = 0.5) +
 geom_density2d(data = . %>% dplyr::filter(ind_mat[,which(basename(colnames(ind_mat)) == "NK cells")]),
                color = "black", contour_var = "ndensity", breaks = c(0.1, 0.2, 0.4, 0.6, 0.8)) +
 guides(color = guide_legend(override.aes = list(size = 2)))


## End(Not run)
