In saeyslab/CytoNorm: Normalisation of cytometry data measured across multiple batches

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this vignette, you learn how CytoNorm can be used to normalize a dataset where there are no controls available.

Ideally, controls should have been taken along over the course of the study to have a better understanding of the technical variation and to train and/or evaluate normalization models eventually. These controls should be the same sample measured multiple times so that the variation present is only the result of technical issues. However, sometimes it is not feasible to include dedicated controls. For example, if a study runs over a long period of time and concerns fresh samples, it is often not possible to measure an additional control each time. Other times, it might be the case that there is simply no appropriate control available; think of rare samples or studies with a specific stimulation or treatment. In the original CytoNorm publication, it was explained that CytoNorm requires acontrol sample per batch on which the normalization model is trained. Although we believe that using dedicated controls to build and validate the normalization model is preferable, we understand that control samples are not always available and that this limits the applicability of CytoNorm.

Here, we explain how CytoNorm can still be used even without controls. The key is to treat an aggregate of the batch as a proxy for a control sample of that batch. Either the aggregates are built internally when multiple files per label are given when training the model, or they can be created prior to the CytoNorm call. This has the advantage of more control on the number of cells per file.

Prepare the CytoNorm normalization step

Load CytoNorm and other necessary packages and create directories

library(CytoNorm)
library(flowCore)
library(FlowSOM)
library(ggpubr)

dir.create("RDS")

Download and organize the data

As example dataset, we will use flow cytometry PBMC data that was collected in the context of a clinical study to evaluate adjuvant dendritic cell based vaccination for malignant peritoneal mesothelioma. The raw fcs files were compensated and transformed, and margin events, low quality events and non-B cells were removed using a preprocessing workflow that was described in Analyzing high-dimensional cytometry data using FlowSOM.The preprocessed and cleaned dataset can be downloaded from flowRepository dataset FR-FCM-Z6UT. Once the data is downloaded, put it in a folder in your working directory, e.g. "Data/Preprocessed".

The dataset consists of 48 fcs files (8 patients, 2 panels and 3 consecutive time points per patient-panel combination) and there are two levels of batch effects:

Panel, both panels have common set of backbone markers
B cell panel 1 (P1)
B cell panel 2 (P2)
Patient cohort, corresponds with analysis day
Inclusion cohort 1 (C1)
Inclusion cohort 2 (C2)

This results in four batches of data with 12 fcs files each (P1_C1, P1_C2, P2_C1 and P2_C2).

dir_prepr <- "Data/Preprocessed"

# Organize preprocessed data in batches
files_P1_C1 <- list.files(path = dir_prepr,
                          pattern = "ID[1-4]_Panel1_TP[1-3].fcs",
                          full.names = TRUE)
files_P2_C1 <- list.files(path = dir_prepr,
                          pattern = "ID[1-4]_Panel2_TP[1-3].fcs",
                          full.names = TRUE)

# Read in 1 file per panel to get channel information
ff_P1 <- read.FCS(files_P1_C1[1])
ff_P2 <- read.FCS(files_P2_C1[1])

# Retrieve channel information
cellTypeMarkers <- c("CD27", "CD38", "IgD", "IgM") 
cellTypeChannels <- GetChannels(object = ff_P1, markers = cellTypeMarkers)
P1_cellStateMarkers <- c("CD5", "CD268", "CD24", "PDL1", "PD1", "CD86", "KI67")
P1_cellStateChannels <- GetChannels(object = ff_P1, markers = P1_cellStateMarkers)
P2_cellStateMarkers <- c("CD40", "CD21", "HLA-DR", "ICOSL", "PD1", "CD86", "KI67")
P2_cellStateChannels <- GetChannels(object = ff_P2, markers = P2_cellStateMarkers)

In this vignette, we will normalize the cell type markers of the two panels of the first patient cohort towards each other, i.e. P1_C1 and P2_C1.

Train the CytoNorm model

Create aggregate files that will be used as a proxi for a control file

``` {r Prepare aggregates}

Create aggregate per batch with 10000 cells from each file

set.seed(1) agg_P1_C1 <- AggregateFlowFrames(fileNames = files_P1_C1, cTotal = length(files_P1_C1)10000) agg_P2_C1 <- AggregateFlowFrames(fileNames = files_P2_C1, cTotal = length(files_P2_C1)10000)

Save aggregates

write.FCS(agg_P1_C1, "Data/Preprocessed/agg_P1_C1.fcs") write.FCS(agg_P2_C1, "Data/Preprocessed/agg_P2_C1.fcs")

### Train the CytoNorm model
```r
model1_withoutControl <- CytoNorm.train(files = flowSet(agg_P1_C1, agg_P2_C1),
                                        labels = c("P1", "P2"),
                                        channels = cellTypeChannels,
                                        transformList = NULL,
                                        seed = 1,
                                        verbose = TRUE,
                                        plot = TRUE,
                                        FlowSOM.params = list(nCells = 1e+06, 
                                                              xdim = 7, ydim = 7, 
                                                              nClus = 3, 
                                                              scale = FALSE))
saveRDS(object = model1_withoutControl, file = "RDS/model1_withoutControl.rds")

The files parameter in the CytoNorm.train call requires the paths to the control fcs files or a flowSet with the control samples in it and the labels argument is used to define the batch labels for the control files. Here however, an aggregate or concatenated fcs file is given per batch instead of a control file. The channels that should be normalized should be specified in the channels argument and -if needed- a transformation object to transform the data can be given at transformList. Optionally, a seed can be determined to make the results reproducible, progress updates can be printed via verbose and plot = TRUE will generate and save visualizations of the internal FlowSOMmodel and CytoNorm splines. The dimensions of the self-organizing map (SOM) of the FlowSOM model, number of cells to use and number of meta-clusters, should be specified in the FlowSOM.params argument. Here, we had a moderate number of markers for the clustering and normalization and based on our panel we did not expect many distinct cell populations. Therefore, and after inspection of the visualizations listed below, we opted for a seven-by-seven SOM grid and three meta-clusters. In contrast, the files argument in CytoNorm.normalize() takes all the files that need to be normalized with the trained model that is provided to the model parameter. Batch identities per file should be specified in labels and similar to the function to train the model, progress updates can be printed via verbose. If a transformation object is provided at transformList, an object to reverse the transformation should also be given at transformList.reverse so that the fcs files can be saved in the original space. A prefix can be specified to mark the normalized fcs files when they are saved in the outputDir.

Evaluate the CytoNorm assumption: check the CVs

# Evaluate the CVs
CVs1 <- testCV(fsom = model1_withoutControl$fsom, 
               cluster_values = 3:20, plot = FALSE)

PlotOverviewCV(fsom = model1_withoutControl$fsom, cv_res = CVs1,
               show_cv = 0.8, max_cv = 1.5)

Apply CytoNorm model

# Normalize files
CytoNorm.normalize(model = model1_withoutControl,
                   files = c(files_P1_C1, files_P2_C1),
                   labels = c(rep("P1", length(files_P1_C1)),
                              rep("P2", length(files_P2_C1))),
                   transformList = NULL,
                   verbose = TRUE,
                   prefix = "",
                   transformList.reverse = NULL, 
                   outputDir = "Data/Normalized_cellType")

Evaluate the quality

Density plots

# List files
files_P1_C1_norm <- list.files(path = "Data/Normalized_cellType", 
                               pattern = "ID[1-4]_Panel1_TP[1-3].fcs", full.names = TRUE)
files_P2_C1_norm <- list.files(path = "Data/Normalized_cellType", 
                               pattern = "ID[1-4]_Panel2_TP[1-3].fcs", full.names = TRUE)

# Make plots
p <- plotDensities(input = list("P1_C1" = agg_P1_C1,
                                "P2_C1" = agg_P2_C1,
                                "P1_C1_norm" = files_P1_C1_norm,
                                "P2_C1_norm" = files_P2_C1_norm),
                   channels = cellTypeChannels,
                   colors = c("P1_C1" = "#900002", 
                              "P2_C1" = "#0046A1"), 
                   model = model1_withoutControl)

# Arrange figure
p_ <- ggarrange(ggarrange(plotlist = p[1:length(p)-1], 
                          ncol = (length(p)-1)/(2*length(cellTypeChannels)), 
                          nrow = 2*length(cellTypeChannels)),
                widths = c(1,3),
                p$legend, nrow = 2, heights = c(10,1))
print(p_)

Spline plots

plotSplines(model = model1_withoutControl, channels = cellTypeChannels, 
            groupClusters = TRUE)
plotSplines(model = model1_withoutControl, channels = cellTypeChannels)

EMD and MAD

Make aggregates and get aggregate manual labels

# Get manual labels per preprocessed file
gating <- readRDS("Data/Preprocessed/attachments/gating.rds")
manualLabels <- list()

dictionary <- c("Unlabeled" = "unlabeled",
                "Trans" = "Transitional B cells",
                "Q1: CD27- , IgD+ (naive B)" = "Naive B cells",
                "Q2: CD27+ , IgD+ (mem)" = "IgD+ memory B cells",
                "CD27, IgM+ (mem)" = "IgM+ memory B cells", 
                "CD27, IgM- (class switched mem)" = "Class switched memory B cells", 
                "Q4: CD27- , IgD-" = "Double negative B cells")

for (file in names(gating)){
  manual <- ManualVector(manualMatrix = gating[[file]], 
                         cellTypes = names(dictionary)[-1])
  manualLabels[[paste0(file, ".fcs")]] <- factor(unname(dictionary[manual]),                                                 
                                                 levels = unname(dictionary))
}

# Aggregates
dir.create("tmp/before")
dir.create("tmp/after")
manual_agg <- list()

aggregates <- list("agg_P1_C1" = files_P1_C1,
                   "agg_P2_C1" = files_P2_C1)

for (agg in names(aggregates)){
  writeLines(agg)
  files <- aggregates[[agg]]
  set.seed(2023)
  before <- AggregateFlowFrames(fileNames = files, silent = TRUE,
                                cTotal = length(files) * 10000)

  write.FCS(before, paste0("tmp/before/", agg, ".fcs"))  
  after <- before

  manual <- c()
  for (i_file in unique(before@exprs[,"File"])){
    ff <- read.FCS(paste0("Data/Normalized_cellType/", basename(aggregates[[agg]][i_file])))
    i_IDs <- before@exprs[before@exprs[,"File"] == i_file,"Original_ID2"]
    after@exprs[before@exprs[,"File"] == i_file,1:(ncol(after@exprs)-3)] <- ff@exprs[i_IDs, ]
    labels <- as.character(manualLabels[[basename(aggregates[[agg]][i_file])]])
    manual <- c(manual, labels[i_IDs])
  }    

  manualLabels[[paste0(agg, ".fcs")]] <- factor(manual, levels = levels(manualLabels[[1]]))
  write.FCS(after, paste0("tmp/after/", agg, ".fcs"))
}
saveRDS(manualLabels, "Data/Preprocessed/attachments/manualLabels.rds")

Define cell types and files for model evaluation

cellTypes <- c("unlabeled",
               "All B cells",
               "Transitional B cells",
               "Naive B cells",
               "IgD+ memory B cells",
               "IgM+ memory B cells", 
               "Class switched memory B cells", 
               "Double negative B cells")

models <- list("Model1_before" = list("files" = c(files_P1_C1, "tmp/before/agg_P1_C1.fcs", 
                                                   "tmp/before/agg_P2_C1.fcs", files_P2_C1),
                                       "channels" = cellTypeChannels),
                "Model1_after" = list("files" = c(files_P1_C1_norm, "tmp/after/agg_P1_C1.fcs", 
                                                  "tmp/after/agg_P2_C1.fcs", files_P2_C1_norm),
                                       "channels" = cellTypeChannels))

Calculate EMD and MAD

EMDs <- list()
MADs <- list()
for (model in names(models)){
  print(model)
  files <- models[[model]][["files"]]
  channels <- models[[model]][["channels"]]

  EMDs[[model]] <- emdEvaluation(files = files,
                                 channels = channels,
                                 manual = manualLabels, return_all = TRUE, 
                                 manualThreshold = 20)
  MADs[[model]] <- CytoNorm:::madEvaluation(files = files[!startsWith(files, "tmp")], 
                                 channels = channels, return_all = TRUE,
                                 manual = manualLabels, transformList = NULL)
}

EMD_scores <- list()
for (subset in names(models)){
  dist <- EMDs[[subset]]$distances
  EMD_scores[[subset]] <- matrix(NA,
                                 nrow = length(dist), ncol = length(dist[[1]]),
                                 dimnames = list(names(dist),
                                                 names(dist[[1]])))
  for (cellType in names(dist)){
    for (marker in names(dist[[cellType]])){
      EMD_scores[[subset]][cellType, marker] <- dist[[cellType]][[marker]][13, 14]
      }
  }
}

Make figures

markerlevels <- NULL
plotlist <- list()

# EMD plot
EMD_before <- EMD_scores[["Model1_before"]][-2,]
EMD_after <- EMD_scores[["Model1_after"]][-2,]

EMD_df <- data.frame("before" = c(EMD_before),
                     "after" = c(EMD_after),
                     "feature" = paste(rownames(EMD_before), 
                                       rep(colnames(EMD_before), 
                                           each = nrow(EMD_before)), sep = "_"))
if (is.null(markerlevels)){
  markerlevels <- unique(unique(sub(".*_", "", EMD_df$feature)))
}
EMD_df$marker <- factor(sub(".*_", "", EMD_df$feature), levels = markerlevels)
EMD_df$celltype <- factor(sub("_.*", "", EMD_df$feature), levels = c("AllCells", cellTypes))
max <- max(EMD_df[,1:2], na.rm = T)
p_emd <- ggplot(EMD_df, aes(x = after, y = before)) +
  xlim(0,max) + ylim(0,max) +
  geom_point(aes(color = marker, shape = celltype)) +
  geom_abline() +
  scale_shape_manual(values = c(16, 17, 15, 3, 7, 8, 6)) +
  ggtitle("EMD") +
  theme_minimal()

leg_emd <- get_legend(p_emd)

p_emd <- p_emd  +
    theme(legend.position = "none")
plotlist[[length(plotlist)+1]] <- p_emd    

names(markerlevels) <- sub(".*<(.*)>.*", "\\1", markerlevels)

# MAD plot
MAD_before <- MADs[["Model1_before"]][["comparison"]][-2,]
MAD_after <- MADs[["Model1_after"]][["comparison"]][-2,]
MAD_df <- data.frame("before" = c(MAD_before),
                     "after" = c(MAD_after),
                     "feature" = paste(rownames(MAD_before), 
                                       rep(colnames(MAD_before), 
                                           each = nrow(MAD_before)), sep = "_"))
MAD_df$marker <- factor(markerlevels[sub(".*_", "", MAD_df$feature)], levels = markerlevels)
MAD_df$celltype <- factor(sub("_.*", "", MAD_df$feature), c("AllCells", cellTypes))
max <- max(MAD_df[,1:2], na.rm = T)
p_mad <- ggplot(MAD_df, aes(x = after, y = before)) +
  xlim(0,max) + ylim(0,max) +
  geom_point(aes(color = marker, shape = celltype)) +
  geom_abline() +
  scale_shape_manual(values = c(16, 17, 15, 3, 7, 8, 6)) +
  ggtitle("MAD") +
  theme_minimal()

leg_mad <- get_legend(p_mad)

p_mad <- p_mad  +
    theme(legend.position = "none")

plotlist[[length(plotlist)+1]] <- p_mad

ggarrange(ggarrange(plotlist = plotlist),
          as_ggplot(leg_emd), ncol=2, widths = c(4,1.5))

saeyslab/CytoNorm documentation built on June 15, 2025, 11:40 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

saeyslab/CytoNorm
Normalisation of cytometry data measured across multiple batches

In saeyslab/CytoNorm: Normalisation of cytometry data measured across multiple batches

In this vignette, you learn how CytoNorm can be used to normalize a dataset where there are no controls available.

Prepare the CytoNorm normalization step

Load CytoNorm and other necessary packages and create directories

Download and organize the data

Train the CytoNorm model

Create aggregate files that will be used as a proxi for a control file

Create aggregate per batch with 10000 cells from each file

Save aggregates

Evaluate the CytoNorm assumption: check the CVs

Apply CytoNorm model

Evaluate the quality

Density plots

Spline plots

EMD and MAD

Make aggregates and get aggregate manual labels

Define cell types and files for model evaluation

Calculate EMD and MAD

Make figures

R Package Documentation

Browse R Packages

We want your feedback!

saeyslab/CytoNorm Normalisation of cytometry data measured across multiple batches

In saeyslab/CytoNorm: Normalisation of cytometry data measured across multiple batches

In this vignette, you learn how CytoNorm can be used to normalize a dataset where there are no controls available.

Prepare the CytoNorm normalization step

Load CytoNorm and other necessary packages and create directories

Download and organize the data

Train the CytoNorm model

Create aggregate files that will be used as a proxi for a control file

Create aggregate per batch with 10000 cells from each file

Save aggregates

Evaluate the CytoNorm assumption: check the CVs

Apply CytoNorm model

Evaluate the quality

Density plots

Spline plots

EMD and MAD

Make aggregates and get aggregate manual labels

Define cell types and files for model evaluation

Calculate EMD and MAD

Make figures

R Package Documentation

Browse R Packages

We want your feedback!

saeyslab/CytoNorm
Normalisation of cytometry data measured across multiple batches