The goal of cytotypr is to identify flow cytometry cell populations efficiently using either Fluorescent Minus One controls (FMOs) or distinct population differences.
An in-depth description of the pipeline can be found in our paper Cyto-Feature Engineering: A Pipeline for Flow Cytometry Analysis to Uncover Immune Populations and Associations with Disease
For additional examples, check out the “vignettes” folder which contains 4 additional examples
You can install the development version of cytotypr from GitHub with:
# install.packages("devtools")
devtools::install_github("aef1004/cytotypr")
This is a basic example which shows you how to obtain basic results and plots for flow cytometry data using FMOs:
library(cytotypr)
List of additional packages needed to run this example
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)
library(ggplot2)
library(readxl)
library(broom)
library(purrr)
The data that is being used in this example is from the paper “Cyto-feature engineering….” The samples are from the lungs of C57BL/6 mice that were either sham-vaccinated or vaccinated with an Mycobacterium tuberculosis vaccine, Bacillus Calmette–Guérin (BCG). The flow cytometry panel is used to elucidate specific T cell populations.
For the purpose of this basic example, the flow cytometry samples have been read in as a flowset and initial gating has already been applied to limit the data to measurements of live, singlet lymphocyte cells.
We first want to convert the data to a “tidy data” format, to allow us to work with “tidyverse” tools for further analysis and visualization.
Apply the tidy_flow_set function to the ‘flowSet’ of gated FMO data to
output a dataframe:
FMO_gated_data <- tidy_flow_set(flowset_FMO_gated_data)
FMO_gated_data
#> # A tibble: 417,910 × 49
#> filename Time `SSC-H` `SSC-A` `FSC-H` `FSC-A` `BV421-H` `Pacific Blue-H`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 CD103.fcs 130 105692 126259. 354442 645303. -58.8 295.
#> 2 CD103.fcs 137 270246 315833. 1094791 1587586. -6837. 181.
#> 3 CD103.fcs 139 326197 377253. 1312044 1802283. -6375. 2438.
#> 4 CD103.fcs 149 74523 85250. 295947 557216. 285. 542.
#> 5 CD103.fcs 159 103797 110015. 293302 485668. -419. 250.
#> 6 CD103.fcs 215 188622 219861. 355468 550662. -589. 1265.
#> 7 CD103.fcs 233 275485 317618. 1264605 1573930. -3984. 770.
#> 8 CD103.fcs 242 262851 303086 931615 1199019. -5463. -680.
#> 9 CD103.fcs 501 205731 360208. 578193 675543. -782. 865.
#> 10 CD103.fcs 590 1510416 2105650. 1396843 1871814. -56700. 192.
#> # … with 417,900 more rows, and 41 more variables: BV480-H <dbl>,
#> # BV510-H <dbl>, BV570-H <dbl>, BV605-H <dbl>, BV650-H <dbl>, BV711-H <dbl>,
#> # BV785-H <dbl>, BB515-H <dbl>, Alexa Fluor 532-H <dbl>, PE-H <dbl>,
#> # PE-Dazzle594-H <dbl>, PE-Cy5-H <dbl>, PE-Cy5.5-H <dbl>,
#> # PerCP-eFluor 710-H <dbl>, PE-Cy7-H <dbl>, APC-H <dbl>, APC-R700-H <dbl>,
#> # APC-Fire 750-H <dbl>, AF-H <dbl>, BV421-A <dbl>, Pacific Blue-A <dbl>,
#> # BV480-A <dbl>, BV510-A <dbl>, BV570-A <dbl>, BV605-A <dbl>, …
Note that when you plot the FMOs here, you should see all of the FMOs that you want to use. If you don’t see all of them, ensure that all ofyour filenames and column names for each of the markers matches exactly. For example, if the filename says “CD103_f” but the corresponding column name for that marker is “CD103”, you need to either change the filename or column name so that they are exactly the same.
# note that here the filename and the column marker names need to match exactly
df_FMO_gated_data <- FMO_gated_data %>%
dplyr::select(ends_with("-A"), -`FSC-A`, `SSC-A`, filename) %>%
dplyr::rename(`FoxP3` = "APC-A",
`CD44` = "APC-Fire 750-A",
`CD103` = "APC-R700-A",
`CD3` = "Alexa Fluor 532-A",
`Sca1` = "BB515-A",
`IL_10` = "BV421-A",
`CD4` = "BV480-A",
`CD69` = "BV510-A",
`CD8` = "BV570-A",
`CTLA4` = "BV605-A",
`CD27` = "BV650-A",
`CD153` = "BV711-A",
`KLRG1` = "BV785-A",
`IL_17` = "PE-A",
`CD122` = "PE-Cy5-A",
`IFN` = "PE-Cy7-A",
`CD62L` = "PE-Dazzle594-A",
`TNF` = "Pacific Blue-A",
`CD28` = "PE-Cy5.5-A",
`PD1` = "PerCP-eFluor 710-A") %>%
na.omit()%>%
dplyr::filter(`SSC-A` != max(`SSC-A`)) %>%
dplyr::mutate(filename = str_replace(filename, ".fcs", "")) %>%
dplyr::mutate(filename = str_replace(filename, "IFNG", "IFN"))
FMO_filtered_data <- filter_FMO(df_FMO_gated_data)
add_quantile <- get_99(FMO_filtered_data)
plot_FMOs(FMO_filtered_data, add_quantile)
#> Warning: Removed 25721 rows containing missing values (geom_vline).

Pull out the gated data
# tidy the flowset and convert to a dataframe
df_all_gated <- tidy_flow_set(flowset_gated_data)
unique(df_all_gated$filename)
#> [1] "D30_Lung_Group_1_A.fcs" "D30_Lung_Group_1_B.fcs" "D30_Lung_Group_1_C.fcs"
#> [4] "D30_Lung_Group_2_A.fcs" "D30_Lung_Group_2_B.fcs" "D30_Lung_Group_2_C.fcs"
#> [7] "D60_Lung_Group_1_A.fcs" "D60_Lung_Group_1_B.fcs" "D60_Lung_Group_1_C.fcs"
#> [10] "D60_Lung_Group_2_A.fcs" "D60_Lung_Group_2_B.fcs" "D60_Lung_Group_2_C.fcs"
#> [13] "D90_Lung_Group_1_A.fcs" "D90_Lung_Group_1_B.fcs" "D90_Lung_Group_1_C.fcs"
#> [16] "D90_Lung_Group_2_A.fcs" "D90_Lung_Group_2_B.fcs" "D90_Lung_Group_2_C.fcs"
Feature cut the data to get all of the possible populations for each file with the cell count and number of cells. Feature engineering is based on the 99%.
Remove FoxP3 (APC-A) Remove CD69 (BV510-A)- spreading from Sca1 makes the marker unusable
The “count_calc” function at the end calculates the cell counts and percentage of cells in each sample for each population. The dataframe that is input into this function should only contain the markers that you’re interesting in looking at, and should remove SSC-A, FSC-A, etc.
all_fe <- df_all_gated %>%
mutate(Timepoint = str_extract(filename, "D[0-9]*")) %>%
mutate(Group = str_extract(filename, "Group[:punct:][0-9][:punct:][A-Z]"),
Group = str_replace(Group, "Group_", "")) %>%
unite(filename, c("Timepoint", "Group")) %>%
dplyr::filter(`SSC-A` != max(`SSC-A`)) %>%
select(ends_with("-A"), -`FSC-A`, -`Zombie Nir-A`, -`AF-A`, -`SSC-A`, filename) %>%
dplyr::rename(`FoxP3` = "APC-A",
`CD44` = "APC-Fire 750-A",
`CD103` = "APC-R700-A",
`CD3` = "Alexa Fluor 532-A",
`Sca1` = "BB515-A",
`IL_10` = "BV421-A",
`CD4` = "BV480-A",
`CD69` = "BV510-A",
`CD8` = "BV570-A",
`CTLA4` = "BV605-A",
`CD27` = "BV650-A",
`CD153` = "BV711-A",
`KLRG1` = "BV785-A",
`IL_17` = "PE-A",
`CD122` = "PE-Cy5-A",
`IFN` = "PE-Cy7-A",
`CD62L` = "PE-Dazzle594-A",
`TNF` = "Pacific Blue-A",
`CD28` = "PE-Cy5.5-A",
`PD1` = "PerCP-eFluor 710-A") %>%
select(-CD69, -FoxP3) %>%
mutate(CD3 = fe(add_quantile, CD3, "CD3"),
CD4 = fe(add_quantile, CD4, "CD4"),
CD8 = fe(add_quantile, CD8, "CD8"),
CD44 = fe(add_quantile, CD44, "CD44"),
CD103 = fe(add_quantile, CD103, "CD103"),
Sca1 = fe(add_quantile, Sca1, "Sca1"),
IL_17 = fe(add_quantile, IL_17, "IL_17"),
CTLA4 = fe(add_quantile, CTLA4, "CTLA4"),
CD27 = fe(add_quantile, CD27, "CD27"),
CD153 = fe(add_quantile, CD153, "CD153"),
KLRG1 = fe(add_quantile, KLRG1, "KLRG1"),
IFN = fe(add_quantile, IFN, "IFN"),
CD122 = fe(add_quantile, CD122, "CD122"),
PD1 = fe(add_quantile, PD1, "PD1"),
CD62L = fe(add_quantile, CD62L, "CD62L"),
IL_10 = fe(add_quantile, IL_10, "IL_10"),
CD28 = fe(add_quantile, CD28, "CD28"),
TNF = fe(add_quantile, TNF, "TNF")) %>%
count_calc()
Initial identification of populations plot
We first want to view all of the different cell phenotypes within the data
# this is the order of markers that we want for all of our plots
order_of_markers <- c("CD3", "CD4", "CD8", "CD44", "CD103", "Sca1", "IL_17","CTLA4",
"CD27", "CD153", "KLRG1", "IFN", "CD122", "PD1", "CD62L",
"IL_10", "CD28","TNF")
# to view all of the possible combinations
total_phenotypes <- filter_for_total_pheno(all_fe, marker_order = order_of_markers)
heatmap_all_pheno(total_phenotypes)

# gives the total number of populations
nrow(total_phenotypes)
#> [1] 10912
After identifying all phenotypes, we can filter the data to see the ones that we’re interested in, for example, CD3+ cells that constitute >0.5% of total live leukocytes in a sample.
# view the specific cell phenotypes we're interested in
sample_populations <- all_fe %>%
dplyr::filter(CD3 == 1 & percentage > 0.5) %>%
filter_pops()
sample_populations_all_groups <- identified_pop_perc(sample_populations, all_fe,
marker_vector = order_of_markers)
#> Joining, by = "population"
#> Joining, by = c("filename", "IL_10", "TNF", "CD4", "CD8", "CTLA4", "CD27", "CD153", "KLRG1", "Sca1", "CD3", "IL_17", "CD62L", "CD122", "CD28", "PD1", "IFN", "CD103", "CD44")
Plot sample populations
simple_pop_df <- sample_populations %>%
column_to_rownames("population")
heatmap_subset_pheno(simple_pop_df, order_of_markers)

Correlation plot
corr <- calc_corr(sample_populations_all_groups)
melted_corr <- format_corr(corr)
plot_corr(melted_corr) +
ggplot2::xlab("Populations") +
ggplot2::ylab("Populations") +
ggplot2::labs(fill = "Correlation")

Time Series Plot
# take the data for filtered populations and rename so that plots are pretty
pops_for_plots_average <- sample_populations_all_groups %>%
tidyr::separate(filename, into = c("Timepoint", "Group", "Number"),
sep = "_") %>%
dplyr::group_by(population, Timepoint, Group) %>%
dplyr::summarise(average_percent = mean(percentage)) %>%
dplyr::mutate(Group = str_replace(Group, "1", "Control"),
Group = str_replace(Group, "2", "Vaccinated")) %>%
dplyr::mutate(Timepoint = str_extract(Timepoint, "[0-9].+"))%>%
dplyr::mutate(pop = as.numeric(str_extract(population, "[:digit:]+")))
#> `summarise()` has grouped output by 'population', 'Timepoint'. You can override using the `.groups` argument.
ggplot(pops_for_plots_average, aes(x = Timepoint, y = average_percent,
group = Group, color = Group)) +
scale_fill_identity() +
geom_point() +
geom_line() +
facet_wrap("pop", scales = "free", ncol = 7, labeller = label_both) +
xlab("Days Post-Infection") +
ylab("Average Percent of Total Live Leukocytes") +
theme_gray()

Data Visualization
CFUs <- readxl::read_xlsx("./inst/extdata/CFU_data.xlsx") %>%
dplyr::filter(Organ == "Lung") %>%
dplyr::filter(Group == "1" | Group == "2")
# clean the flow data and prepare to join with CFU
pops_for_CFUs <- sample_populations_all_groups %>%
separate(filename, into = c("Timepoint", "Group", "Number"), sep = "_") %>%
dplyr::mutate(Timepoint = str_replace(Timepoint, "D", "")) %>%
dplyr::mutate(Timepoint = as.numeric(Timepoint),
Group = as.numeric(Group))
# join together the CFU data and the population data
pops_CFUs <- inner_join(pops_for_CFUs, CFUs, by = c("Group", "Number", "Timepoint")) %>%
dplyr::mutate(pop = as.numeric(str_extract(population, "[:digit:]+"))) %>%
select(-population)
# calculate statistical significance
fitted_models <- pops_CFUs %>%
group_by(pop) %>%
nest() %>%
mutate(model = map(data, ~lm(CFU ~ percentage, data = .)),
summary_model = map(model, tidy)) %>%
unnest(summary_model) %>%
filter(term == "percentage") %>%
select(pop, estimate, model) %>%
mutate(tidy_model = map(model, broom::glance)) %>%
unnest(tidy_model) %>%
select(pop, adj.r.squared, p.value, estimate) %>%
ungroup()
values_for_plot <- fitted_models %>%
mutate(p.val.adj = p.adjust(p.value, method = "BH",
n = length(fitted_models$p.value)))
values_for_plot %>%
mutate("Significance" = p.val.adj < 0.05) %>%
rename("p-value" = p.value,
"Adjusted p-value" = p.val.adj)
#> # A tibble: 33 × 6
#> pop adj.r.squared `p-value` estimate `Adjusted p-value` Significance
#> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 1 0.624 0.0000581 -0.760 0.000479 TRUE
#> 2 10 0.263 0.0172 -2.52 0.0316 TRUE
#> 3 11 0.525 0.000402 -3.99 0.00189 TRUE
#> 4 12 0.135 0.0738 -2.35 0.0962 FALSE
#> 5 13 -0.0474 0.638 -0.626 0.658 FALSE
#> 6 14 0.220 0.0284 -0.916 0.0407 TRUE
#> 7 15 0.499 0.000627 1.08 0.00259 TRUE
#> 8 16 0.113 0.0937 -1.20 0.115 FALSE
#> 9 17 0.466 0.00107 -4.56 0.00394 TRUE
#> 10 18 0.318 0.00868 -4.53 0.0179 TRUE
#> # … with 23 more rows
Using geom_label instead of stat_smooth_fun
# the y_axis label looks good placed at y = 6 and y = 3 but I'm trying to remove the hard-coding here
r_labeling <- left_join(pops_CFUs, values_for_plot, by = "pop") %>%
group_by(pop) %>%
mutate(average_percentage = mean(percentage)) %>%
mutate(x_axis_label = min(percentage) + (max(percentage) - min(percentage))/2,
y_axis_label = ifelse(estimate <= 0, max(CFU), min(CFU))) %>%
select(pop, average_percentage, adj.r.squared, p.val.adj,
x_axis_label, y_axis_label) %>%
unique()
ggplot(pops_CFUs) +
scale_fill_identity() +
geom_point(aes(percentage, CFU), color = "black") +
geom_smooth(aes(x = percentage, y = CFU), method = "lm", se = FALSE, color = "#3F4788FF") +
geom_text(data = r_labeling, aes(x = x_axis_label, y = y_axis_label,
label = paste("r^2 == ",
round(adj.r.squared, 2))),
parse = TRUE) +
facet_wrap(~pop, scales = "free_x", ncol = 7, labeller = label_both) +
xlab("Percentage of Cells") +
ylab("log10 CFU") +
ggtitle("Population and CFU Linear regression")
#> `geom_smooth()` using formula 'y ~ x'

Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.