In omnideconv/SimBu: Simulate Bulk RNA-seq Datasets from Single-Cell Datasets

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Using scaling factors

This package allows the user to introduce an mRNA bias into pseudo-bulk RNA-seq samples. Different cell-types contain different amounts of mRNA (Dendritic cells for examples contain much more than Neutrophils); this bias can be added into simulations artificially in different ways. \ The scaling factors will always be applied on the single-cell dataset first, altering its expression profiles accordingly, and then the pseudo-bulk samples are generated by summing up the count data from the sampled cells.

# Example data
counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell_", rep(1:300))
colnames(tpm) <- paste0("cell_", rep(1:300))
rownames(counts) <- paste0("gene_", rep(1:1000))
rownames(tpm) <- paste0("gene_", rep(1:1000))
annotation <- data.frame(
  "ID" = paste0("cell_", rep(1:300)),
  "cell_type" = c(rep("T cells CD4", 150), rep("T cells CD8", 150)),
  "spikes" = stats::runif(300),
  "add_1" = stats::runif(300),
  "add_2" = stats::runif(300)
)
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset"
)

Pre-defined scaling factors

Some studies have proposed scaling factors for immune cells, such as EPIC [@Racle2017] or quanTIseq [@Finotello2019], deconvolution tools which correct for the mRNA bias internally using these values:

epic <- data.frame(
  type = c(
    "B cells",
    "Macrophages",
    "Monocytes",
    "Neutrophils",
    "NK cells",
    "T cells",
    "T cells CD4",
    "T cells CD8",
    "T helper cells",
    "T regulatory cells",
    "otherCells",
    "default"
  ),
  mRNA = c(
    0.4016,
    1.4196,
    1.4196,
    0.1300,
    0.4396,
    0.3952,
    0.3952,
    0.3952,
    0.3952,
    0.3952,
    0.4000,
    0.4000
  )
)
epic

quantiseq <- data.frame(
  type = c(
    "B cells",
    "Macrophages",
    "MacrophagesM2",
    "Monocytes",
    "Neutrophils",
    "NK cells",
    "T cells CD4",
    "T cells CD8",
    "T regulatory cells",
    "Dendritic cells",
    "T cells"
  ),
  mRNA = c(
    65.66148,
    138.11520,
    119.35447,
    130.65455,
    27.73634,
    117.71584,
    63.87200,
    70.25659,
    72.55110,
    140.76091,
    68.89323
  )
)
quantiseq

When you want to apply one of these scaling factors into your simulation (therefore in-/decreasing the expression signals for the cell-types), we can use the scaling_factor parameter. Note, that these pre-defined scaling factors only offer values for a certain number of cell types, and your annotation in the provided dataset has to match these names 1:1. All cell types from your dataset which are not present in this scaling factor remain unscaled and a warning message will appear.

sim_epic <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "epic",
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)

We can also try out some custom scaling factors, for example increasing the expression levels for a single cell-type (T cells CD8) by 10-fold compared to the rest. All cell-types which are not mentioned in the named list given to custom_scaling_vector will be transformed with a scaling factor of 1, meaning nothing changes for them.

sim_extreme_b <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "custom",
  custom_scaling_vector = c("T cells CD8" = 10),
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)

Important: Watch out that the cell-type annotation names in your dataset are the same as in the scaling factor! Otherwise the scaling factor will not be applied or even worse, applied to a different cell-type.

Dataset specific scaling factors

You can also choose to calculate scaling factors, which are depending on your provided single-cell dataset. Compared to the previous section, this will give a unique value for each cell rather than a cell-type, making it possibly more sensitive.

Reads and genes

Two straight forward approaches would be the number of reads or number of expressed genes/features. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation.

sim_reads <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "read_number", # use number of reads as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
sim_genes <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "expressed_genes", # use number of expressed genes column as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)

These options would also allow you to use other numerical measurements you have for the single cells as scaling factors, such as weight or size for example. Lets pretend, add_1 and add_2 are such measurements. With the additional_cols parameter, they can be added to the SimBu dataset and we can use them as scaling factor as well:

utils::head(annotation)

ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset",
  additional_cols = c("add_1", "add_2")
) # add columns to dataset
sim_genes <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "add_1", # use add_1 column as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)

Spike-ins

One other numerical measurement can be spike-ins. Usually the number of reads mapped to spike-in molecules per cell is given in the cell annotations. If this is the case, they can be stored in the dataset annotation using the spike_in_col parameter, where you indicate the name of the column from the annotation dataframe in which the spike-in information is stored. To calculate a scaling factor from this, the number of reads are also necessary, so we will add this information as well (as above using the read_number_col parameter).

The scaling factor with spike-ins is calculated as the "% of reads NOT mapped to spike-in reads", or: (n_reads - n_spike_in)/n_reads for each cell. We apply it like this:

ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset",
  spike_in_col = "spikes"
) # give the name in the annotation file, that contains spike-in information
sim_spike <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "spike_in", # use spike-in scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)

Census - estimate mRNA counts per cell

Census is an approach which tries to convert TPM counts into relative transcript counts. This basically means, you get the mRNA counts per cell, which can differ between cell-types. \ [@Qiu2017] state in their paper, that it should only be applied to TPM/FPKM normalized data, but I tried it out with raw expression counts as well, which worked as well. \ Census calculates a vector with a scaling value for each cell in a sample. You can switch this feature on, by setting the scaling_factor parameter to census.

ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  tpm_matrix = tpm,
  name = "test_dataset"
)
sim_census <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "census", # use census scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)

In our analysis we found, that Census is basically a complicated way of estimating the number of expressed genes per cell. It will remain to the user to decide if he/she wants to use census or simply the number of expressed genes (as shown above) as scaling factor.