knitr::opts_chunk$set(
  collapse = TRUE,
  # eval = FALSE,
  comment = "#>"
)

Introduction {#introduction}

This vignette describes how to use the r BiocStyle::Biocpkg("quantiseqr") package for streamlining your workflow around the quanTIseq method. quanTIseq is a transcriptomics deconvolution method that uses an RNA-seq-derived signature matrix (called TIL10) for quantifying 10 different immune cell types in bulk tumor or blood transcriptomics data. quanTIseq has been extensively validated using real and simulated RNA-seq data, as well as flow cytometry and immunohistochemistry data.

The TIL10 signature can quantify cell fractions for

For detailed information about quanTIseq methodology and its embedded TIL10 signature, please refer to its original publication [@quantiseq2019], or consult the documentation for the quanTIseq pipeline (https://icbi.i-med.ac.at/software/quantiseq/doc/).

quantiseqr returns a cell type by sample quantification of these cell types, either as a simple data frame object, or alternatively, when providing an object derived from the SummarizedExperiment class, adds this information in the colData slot, where it can be further accessed.

Getting started {#gettingstarted}

To install this package, start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("quantiseqr")

Once installed, the package can be loaded and attached to your current workspace as follows:

library("quantiseqr")

In the following chunk, we load a set of additional packages that will be required throughout this vignette.

library("dplyr")
library("ggplot2")
library("tidyr")
library("tibble")
library("GEOquery")
library("reshape2")
library("SummarizedExperiment")

In order to use r BiocStyle::Biocpkg("quantiseqr") in your workflow, one fundamental input is required, to be provided to run_quantiseqr() as expression_data. This is an object containing the gene TPM expression values measured in the sample under investigation, and can be provided in different ways:

Some use cases for quantiseqr

In this section, we illustrate the usage of quantiseqr on a variety of datasets. These differ with respect to their size and samples of origin, and we illustrate how the different parameters of quantiseqr should be set in the different scenarios.

The fundamental input for quantiseqr is a gene expression matrix-like object, with features on the rows, and samples as the columns. quantiseqr can also directly handle SummarizedExperiment objects, as well as ExpressionSet objects, commonly used for microarray data. In case a SummarizedExperiment object is passed, the quantifications of the immune cell composition can be directly returned extending the colData of the provided input.

Use case 1: Metastatic melanoma patients (Racle et al 2017)

quantiseqr ships with an example dataset with samples from four patients with metastatic melanoma published in [@EPIC2017].

The dataset quantiseqr::dataset_racle contains:

We are going to use the bulk RNA-seq data to run the deconvolution methods and will compare the results to the FACS data in the following steps.

Let's inspect the expression matrix first:

data("dataset_racle")
dim(dataset_racle$expr_mat)
knitr::kable(dataset_racle$expr_mat[1:5, ])

The quantification of the immune cell types with quantiseqr can be done as in the chunk below:

ti_racle <- quantiseqr::run_quantiseq(
  expression_data = dataset_racle$expr_mat,
  signature_matrix = "TIL10",
  is_arraydata = FALSE,
  is_tumordata = TRUE,
  scale_mRNA = TRUE
)

The call above means that we are passing the expression data as simple matrix (in dataset_racle$expr_mat) and quantifying the tumor immune cell composition using the (default) TIL10 signature. This is a dataset stemming from tumor RNA-seq samples. Therefore is_tumordata is set to TRUE, whereas is_arraydata is set to FALSE (default). With scale_mRNA set to TRUE (default), we are performing the correction of cell-type-specific mRNA content bias.

The output of quantiseqr can be further processed and visualized in a tabular or graphical manner to facilitate the comparisons across samples/conditions.

The estimates returned by quantiseqr can be interpreted as a cell-type fractions that can be compared between and within samples, making it possible to represent them as a stacked bar chart.

quantiplot(ti_racle)

We observe that

Estimating the amount of "uncharacterized cells" is a novel feature introduced by quanTIseq and EPIC [@quantiseq2019, @EPIC2017]. This estimate often corresponds to the fraction of tumor cells in the sample.

Use case 2: PBMCs from GSE107572 (Finotello et al 2019)

Here we show how to use quantiseqr to deconvolute blood-derived immune-cell mixtures [@quantiseq2019], for which also matching flow cytometry data are available.

This is also presented as an example in [@Plattner2020], please refer to this later publication for additional details on the processing steps.

The example dataset is available online at the Gene Expression Omnibus (accession number GSE107572), and is provided as preprocessed RNA-seq data from blood-derived immune-cell mixtures from nine healthy donors.
Flow cytometry estimates for the according immune subpopulations are also available.

## While downloading by hand is possible, it is recommended to use GEOquery
# wget -c ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE107nnn/GSE107572/suppl/GSE107572%5Ftpm%5FPBMC%5FRNAseq%2Etxt%2Egz
# unzip GSE107572_tpm_PBMC_RNAseq.txt.gz
# read.table("GSE107572_tpm_PBMC_RNAseq.txt", header = TRUE)


# downloading the supplemental files on the fly
tpminfo_GSE107572 <- getGEOSuppFiles("GSE107572",
  baseDir = tempdir(),
  filter_regex = "GSE107572_tpm_PBMC_RNAseq"
)
tpm_location <- rownames(tpminfo_GSE107572)[1]
tpm_location
tpmdata <- read.table(tpm_location, header = TRUE)
tpm_genesymbols <- tpmdata$GENE
tpmdata <- as.matrix(tpmdata[, -1])
rownames(tpmdata) <- tpm_genesymbols

ti_PBMCs <- quantiseqr::run_quantiseq(
  expression_data = tpmdata,
  signature_matrix = "TIL10",
  is_arraydata = FALSE,
  is_tumordata = FALSE,
  scale_mRNA = TRUE
)

To obtain an overview on immune cell type compositions, we can print out the results and plot them with the convenient quantiplot wrapper:

# printing out the percentages for the first 4 samples
signif(ti_PBMCs[1:4, 2:12], digits = 3)
# getting a complete visual overview
quantiplot(ti_PBMCs)

Notably, for these samples, corresponding quantifications of the true cell fractions done by flow cytometry are also available.
In the chunk that follows, we retrieve that information and generate scatter plots to display how the estimated values from quantiseqr correlate with the ground truth values.
This is adapted by [@Plattner2020] to match the output format generated by quantiseqr - and can be used as a template in case other types of matched ground-truth information are available.

We start by retrieving the information from the GSE107572 entry via GEOquery (this will be cached locally after the first execution), and process the information matching it to the quantification of cell fractions we just obtained as ti_PBMCs.

GEOid <- "GSE107572"
gds <- getGEO(GEOid)
GEOinfo <- pData(gds[[1]])
FACSdata <- data.frame(
  B.cells = GEOinfo$`b cells:ch1`,
  T.cells.CD4 = GEOinfo$`cd4+ t cells:ch1`,
  T.cells.CD8 = GEOinfo$`cd8+ t cells:ch1`,
  Monocytes = GEOinfo$`monocytes:ch1`,
  Dendritic.cells = GEOinfo$`myeloid dendritic cells:ch1`,
  NK.cells = GEOinfo$`natural killer cells:ch1`,
  Neutrophils = GEOinfo$`neutrophils:ch1`,
  Tregs = GEOinfo$`tregs:ch1`
)
rownames(FACSdata) <- gsub(
  "Blood-derived immune-cell mixture from donor ", "pbmc", GEOinfo$title
)

rownames(ti_PBMCs) <- gsub("_.*$", "", sub("_", "", rownames(ti_PBMCs)))

ccells <- intersect(colnames(ti_PBMCs), colnames(FACSdata))
csbjs <- intersect(rownames(ti_PBMCs), rownames(FACSdata))

ti_PBMCs <- ti_PBMCs[csbjs, ccells]
FACSdata <- FACSdata[csbjs, ccells]

Then we proceed to plot the agreement between the computed cell fractions and the estimated values extracted from flow cytometry experiments.

palette <- c("#451C87", "#B3B300", "#CE0648", "#2363C5", "#AB4CA1", "#0A839B", "#DD8C24", "#ED6D42")

names(palette) <- c("T.cells.CD4", "Dendritic.cells", "Monocytes", "T.cells.CD8", "Tregs", "B.cells", "NK.cells", "Neutrophils")

par(mfrow = c(3, 3))
colall <- c()
for (i in 1:(ncol(ti_PBMCs) + 1)) {
  if (i <= ncol(ti_PBMCs)) {
    x <- as.numeric(as.character(FACSdata[, i]))
    y <- ti_PBMCs[, i]
    ccell <- colnames(ti_PBMCs)[i]
    col <- palette[ccell]
  } else {
    x <- as.numeric(as.vector(as.matrix(FACSdata)))
    y <- as.vector(as.matrix(ti_PBMCs))
    ccell <- "All cells"
    col <- colall
  }
  res.cor <- cor.test(y, x)
  R <- round(res.cor$estimate, digits = 2)
  p <- format.pval(res.cor$p.value, digits = 2)
  RMSE <- round(sqrt(mean((y - x)^2, na.rm = TRUE)), digits = 2)

  regl <- lm(y ~ x)
  ymax <- max(round(max(y), digits = 2) * 1.3, 0.01)
  xmax <- max(round(max(x), digits = 2), 0.01)
  plot(x, y,
    main = gsub("(\\.)", " ", ccell), pch = 19,
    xlab = "Flow cytometry fractions",
    ylab = "quanTIseq cell fractions",
    col = col, cex.main = 1.3, ylim = c(0, ymax), xlim = c(0, xmax), las = 1
  )
  abline(regl)
  abline(a = 0, b = 1, lty = "dashed", col = "lightgrey")
  text(0, ymax * 0.98, cex = 1, paste0("r = ", R, ", p = ", p), pos = 4)
  text(0, ymax * 0.9, cex = 1, paste0("RMSE = ", RMSE), pos = 4)

  colall <- c(colall, rep(col, length(x)))
}

Use case 3: Expression changes in melanoma patients on vs. pre kinase-inhibitor treatment - GSE75299 (Song et al 2017)

We use here the dataset provided in [@Song2017], where the transcriptomes of cancer cell lines and patients' tumors were characterized with RNA-seq before and during treatment with kinase inhibitors.

The original dataset is available via GEO at the accession GSE75299, but we will be loading a preprocessed version of it, containing the precomputed TPM expression values, made available via ExperimentHub.
This will enable us to performed a paired analysis on the same observation units (the patients) to appreciate differences in the cell proportions induced by the kinase-inhibitor treatment.

library("ExperimentHub")
eh <- ExperimentHub()
quantiseqdata_eh <- query(eh, "quantiseqr")
quantiseqdata_eh

se_Song2017_MAPKi_treatment <- quantiseqdata_eh[["EH6015"]]

se_Song2017_MAPKi_treatment_tiquant <- quantiseqr::run_quantiseq(
  expression_data = se_Song2017_MAPKi_treatment,
  signature_matrix = "TIL10",
  is_arraydata = FALSE,
  is_tumordata = TRUE,
  scale_mRNA = TRUE
)

dim(se_Song2017_MAPKi_treatment_tiquant)
# colData(se_Song2017_MAPKi_treatment_tiquant)
colnames(colData(se_Song2017_MAPKi_treatment_tiquant))

As visible from the output of the last chunk, the cell type composition is stored in the colData slot, if providing a SummarizedExperiment as input.
We first plot the cell fractions by sample, this time with a color palette resembling the one used in [@Plattner2020].

# to extract the TIL10-relevant parts:
ti_quant <- quantiseqr::extract_ti_from_se(se_Song2017_MAPKi_treatment_tiquant)

# to access the full column metadata:
cdata <- colData(se_Song2017_MAPKi_treatment_tiquant)

cellfracs_tidy <- tidyr::pivot_longer(
  as.data.frame(cdata), 
  cols = quanTIseq_TIL10_B.cells:quanTIseq_TIL10_Other)

cellfracs_tidy$name <- factor(gsub("quanTIseq_TIL10_", "", cellfracs_tidy$name),
  levels = c(
    "B.cells", "Macrophages.M1", "Macrophages.M2",
    "Monocytes", "Neutrophils", "NK.cells",
    "T.cells.CD4", "T.cells.CD8", "Tregs",
    "Dendritic.cells", "Other"
  )
)

ggplot(cellfracs_tidy, aes(fill = name, y = value, x = sra_id)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_fill_brewer(palette = "PuOr") +
  xlab("") +
  ylab("Cell Fractions") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# we could have also used the compact wrapper
## quantiplot(se_Song2017_MAPKi_treatment_tiquant)

Similarly, we proceed by splitting the data in pre- and post-treatment, first filtering out the cell line samples (keeping the ones where is_patient is TRUE), then manipulating to long format and explicitly building up the plot object.

prepost_data <- cdata[cdata$is_patient, ]

prepost_data_tidy <- tidyr::pivot_longer(
  as.data.frame(prepost_data), 
  cols = quanTIseq_TIL10_B.cells:quanTIseq_TIL10_Dendritic.cells)

prepost_data_tidy$groups <- factor(prepost_data_tidy$mapki.treatment.ch1, levels = c("none", "on-treatment"))

prepost_data_tidy$name <- factor(gsub("quanTIseq_TIL10_", "", prepost_data_tidy$name),
  levels = c(
    "B.cells", "Macrophages.M1", "Macrophages.M2",
    "Monocytes", "Neutrophils", "NK.cells",
    "T.cells.CD4", "T.cells.CD8", "Tregs",
    "Dendritic.cells"
  )
)

ggplot(prepost_data_tidy, aes(name, value, fill = groups)) +
  geom_boxplot() +
  xlab("") +
  ylab("cell fractions") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Use case 4: Running on simulated data for validation

We will now display the performance of quantiseqr on a dataset for which the ground truth mRNA fractions are known.
The large simulated dataset represents the RNA-seq expression data from breast tumors with different immune-infiltration scenarios, consisting of a total of 1700 samples, that were generated by mixing RNA-seq reads from purified immune cell types and from a MCF7 breast tumor cell line.

These samples were generated considering different immune relative cell proportions, tumor purity values (0:10:100%), and at different sequencing depths (1, 2, 5, 10, 20, 50, and 100 million read pairs).

Please refer to https://icbi.i-med.ac.at/software/quantiseq/doc/ and to [@quantiseq2019] for more details.

This dataset is coupled with a table where the original information on the used true fractions is available, and can be used to benchmark the performance of the deconvolution algorithm.

In this case, we set the scale_mRNA parameter to FALSE because the samples were simulated without modelling any cell-type-specific mRNA bias, so quanTIseq does not have to correct for it. Instead, when analyzing real RNA-seq data, this parameter should always set to TRUE (default) to avoid that some cell types with higher (or lower) total mRNA abundance are systematically under (or over) estimated via deconvolution.

# downloading first the file from https://icbi.i-med.ac.at/software/quantiseq/doc/downloads/quanTIseq_SimRNAseq_mixture.txt

# https://icbi.i-med.ac.at/software/quantiseq/doc/

tpm_1700mixtures <- readr::read_tsv("quanTIseq_SimRNAseq_mixture.txt.gz")
dim(tpm_1700mixtures)

# extracting the gene names, restructuring the matrix by dropping the column
tpm_genesymbols <- tpm_1700mixtures$Gene
tpm_1700mixtures <- as.matrix(tpm_1700mixtures[, -1])
rownames(tpm_1700mixtures) <- tpm_genesymbols

# running quantiseq on that set
# True mRNA fractions were simulated with no total-mRNA bias. Thus, these data should be analyzed specifying the option scale_mRNA set to FALSE
ti_quant_sim1700mixtures <- quantiseqr::run_quantiseq(
  expression_data = tpm_1700mixtures,
  signature_matrix = "TIL10",
  is_arraydata = FALSE,
  is_tumordata = TRUE,
  scale_mRNA = FALSE
)

# save(ti_quant_sim1700mixtures, file = "data/ti_quant_sim1700mixtures.RData")

To avoid the download of a large file, we provide the precomputed object ti_quant_sim1700mixtures in the quantiseqr package - the chunk above is still fully functional once the mixture file has been retrieved.

data(ti_quant_sim1700mixtures)
dim(ti_quant_sim1700mixtures)
head(ti_quant_sim1700mixtures)
quantiplot(ti_quant_sim1700mixtures[1:100, ])

We also read in the true proportions, known by design - this is provided as a text file inside quantiseqr.

true_prop_1700mix <- read.table(
  system.file("extdata", "quanTIseq_SimRNAseq_read_fractions.txt.gz", package = "quantiseqr"),
  sep = "\t", header = TRUE
)
head(true_prop_1700mix)

In the following chunk we perform some preprocessing steps to facilitate the comparison, also in a graphical manner.

# merging the two sets to facilitate the visualization
# colnames(ti_quant_sim1700mixtures) <- paste0("quantiseq_", colnames(ti_quant_sim1700mixtures))
# colnames(true_prop_1700mix) <- paste0("trueprops_", colnames(true_prop_1700mix))

# ti_quant_sim1700mixtures$method <- "quanTIseq"
# true_prop_1700mix$method <- "ground_truth"

colnames(true_prop_1700mix)[1] <- "Sample"
colnames(true_prop_1700mix)[12] <- "Other"

ti_long <- tidyr::pivot_longer(ti_quant_sim1700mixtures,
  cols = B.cells:Other,
  names_to = "cell_type",
  values_to = "value_quantiseq"
)
ti_long$mix_id <- paste(ti_long$Sample, ti_long$cell_type, sep = "_")

tp_long <- pivot_longer(true_prop_1700mix,
  cols = B.cells:Other,
  names_to = "cell_type",
  values_to = "value_trueprop"
)
tp_long$mix_id <- paste(tp_long$Sample, tp_long$cell_type, sep = "_")


ti_tp_merged <- merge(ti_long, tp_long, by = "mix_id")
ti_tp_merged$cell_type.x <- factor(ti_tp_merged$cell_type.x, levels = colnames(true_prop_1700mix)[2:12])

# ti_merged <- rbind(ti_quant_sim1700mixtures,
# true_prop_1700mix)

# ti_merged_long <- pivot_longer(ti_quant_sim1700mixtures, cols = B.cells:Other)

ggplot(
  ti_tp_merged,
  aes(
    x = value_trueprop,
    y = value_quantiseq,
    col = cell_type.x
  )
) +
  geom_point(alpha = 0.5) +
  theme_bw() + 
  labs(
    x = "True fractions",
    y = "quanTIseq cell fractions",
    col = "Cell type"
  )
ggplot(
  ti_tp_merged,
  aes(
    x = value_trueprop,
    y = value_quantiseq,
    col = cell_type.x
  )
) +
  facet_wrap(~cell_type.x, scales = "free") +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, col = "lightgrey") + 
  labs(
    x = "True fractions",
    y = "quanTIseq cell fractions",
    col = "Cell type"
  ) + 
  theme_bw()

This figure aims to replicate with live code the one available as Supplementary Figure 1.

FAQs {#faqs}

Q: Do I have to provide my expression data formatted as TPMs? Why is that so?

A: The expression data is indeed expected to be provided as TPM values. quantiseqr might warn you if you are providing a different format (counts, normalized counts) - this does not mean that it will trigger an error as the computation is still able to proceed.
Still: it is not the recommended way. If using a SummarizedExperiment object coming from Salmon's quantifications, the tximeta/tximport pipeline will provide an assay named "abundance", which would be handled internally by the se_to_matrix() function - you can simply call quantiseqr() and provide the SummarizedExperiment object as main parameter.

Q: Can I use quantiseqr with samples from model systems, i.e. not from human?

A: You might exploit orthology-based conversions among gene identifiers to use quantiseqr e.g. in mouse scenarios. Keep in mind, though, that the TIL10 signature has been explicitly designed and validated on human samples.

Q: My expression data is encoding the features in a different identifier than Gene Symbols. Can I use quantiseqr for that?

A: Sure, just make sure to convert the identifiers beforehand - you can use one of the many options available inside Bioconductor for streamlining this step (e.g. the org.Hs.eg.db and the function AnnotationDbi::mapIds()).

Q: I'm interested in other such deconvolution methods. What other options are available?

A: You can check out the works of [@sturm2019, @Sturm2020] to find a collection of methods, provided in the immunedeconv package, and benchmarked in the above mentioned manuscripts.

Q: Can I provide my own signature like the TIL10 and use that in quantiseqr?

A: No, for now it is only possible to use the TIL10 signature matrix.

Session Info {- .smaller}

sessionInfo()

References {-}



federicomarini/quantiseqr documentation built on May 7, 2024, 9:50 a.m.