knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This package allows the user to introduce an mRNA bias into pseudo-bulk RNA-seq samples. Different cell-types contain different amounts of mRNA (Dendritic cells for examples contain much more than Neutrophils); this bias can be added into simulations artificially in different ways. \ The scaling factors will always be applied on the single-cell dataset first, altering its expression profiles accordingly, and then the pseudo-bulk samples are generated by summing up the count data from the sampled cells.
# Example data counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE) tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE) tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm)) colnames(counts) <- paste0("cell_", rep(1:300)) colnames(tpm) <- paste0("cell_", rep(1:300)) rownames(counts) <- paste0("gene_", rep(1:1000)) rownames(tpm) <- paste0("gene_", rep(1:1000)) annotation <- data.frame( "ID" = paste0("cell_", rep(1:300)), "cell_type" = c(rep("T cells CD4", 150), rep("T cells CD8", 150)), "spikes" = stats::runif(300), "add_1" = stats::runif(300), "add_2" = stats::runif(300) ) ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, name = "test_dataset" )
Some studies have proposed scaling factors for immune cells, such as EPIC [@Racle2017] or quanTIseq [@Finotello2019], deconvolution tools which correct for the mRNA bias internally using these values:
epic <- data.frame( type = c( "B cells", "Macrophages", "Monocytes", "Neutrophils", "NK cells", "T cells", "T cells CD4", "T cells CD8", "T helper cells", "T regulatory cells", "otherCells", "default" ), mRNA = c( 0.4016, 1.4196, 1.4196, 0.1300, 0.4396, 0.3952, 0.3952, 0.3952, 0.3952, 0.3952, 0.4000, 0.4000 ) ) epic
quantiseq <- data.frame( type = c( "B cells", "Macrophages", "MacrophagesM2", "Monocytes", "Neutrophils", "NK cells", "T cells CD4", "T cells CD8", "T regulatory cells", "Dendritic cells", "T cells" ), mRNA = c( 65.66148, 138.11520, 119.35447, 130.65455, 27.73634, 117.71584, 63.87200, 70.25659, 72.55110, 140.76091, 68.89323 ) ) quantiseq
When you want to apply one of these scaling factors into your simulation (therefore in-/decreasing the expression signals for the cell-types), we can use the scaling_factor
parameter. Note, that these pre-defined scaling factors only offer values for a certain number of cell types, and your annotation in the provided dataset has to match these names 1:1. All cell types from your dataset which are not present in this scaling factor remain unscaled and a warning message will appear.
sim_epic <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "epic", nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE )
We can also try out some custom scaling factors, for example increasing the expression levels for a single cell-type (T cells CD8) by 10-fold compared to the rest. All cell-types which are not mentioned in the named list given to custom_scaling_vector
will be transformed with a scaling factor of 1, meaning nothing changes for them.
sim_extreme_b <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "custom", custom_scaling_vector = c("T cells CD8" = 10), nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE )
Important: Watch out that the cell-type annotation names in your dataset are the same as in the scaling factor! Otherwise the scaling factor will not be applied or even worse, applied to a different cell-type.
You can also choose to calculate scaling factors, which are depending on your provided single-cell dataset. Compared to the previous section, this will give a unique value for each cell rather than a cell-type, making it possibly more sensitive.
Two straight forward approaches would be the number of reads or number of expressed genes/features. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation.
sim_reads <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "read_number", # use number of reads as scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) sim_genes <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "expressed_genes", # use number of expressed genes column as scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE )
These options would also allow you to use other numerical measurements you have for the single cells as scaling factors, such as weight or size for example. Lets pretend, add_1
and add_2
are such measurements. With the additional_cols
parameter, they can be added to the SimBu dataset and we can use them as scaling factor as well:
utils::head(annotation)
ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, name = "test_dataset", additional_cols = c("add_1", "add_2") ) # add columns to dataset sim_genes <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "add_1", # use add_1 column as scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE )
One other numerical measurement can be spike-ins. Usually the number of reads mapped to spike-in molecules per cell is given in the cell annotations. If this is the case, they can be stored in the dataset annotation using the spike_in_col
parameter, where you indicate the name of the column from the annotation
dataframe in which the spike-in information is stored. To calculate a scaling factor from this, the number of reads are also necessary, so we will add this information as well (as above using the read_number_col
parameter).
The scaling factor with spike-ins is calculated as the "% of reads NOT mapped to spike-in reads", or: (n_reads - n_spike_in)/n_reads
for each cell. We apply it like this:
ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, name = "test_dataset", spike_in_col = "spikes" ) # give the name in the annotation file, that contains spike-in information sim_spike <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "spike_in", # use spike-in scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE )
Census is an
approach which tries to convert TPM counts into relative transcript counts. This
basically means, you get the mRNA counts per cell, which can differ between cell-types. \
[@Qiu2017] state in their paper, that it should only be applied to TPM/FPKM normalized
data, but I tried it out with raw expression counts as well, which worked as well. \
Census calculates a vector with a scaling value for each cell in a sample.
You can switch this feature on, by setting the scaling_factor
parameter to census
.
ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, tpm_matrix = tpm, name = "test_dataset" ) sim_census <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "census", # use census scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE )
In our analysis we found, that Census is basically a complicated way of estimating the number of expressed genes per cell. It will remain to the user to decide if he/she wants to use census or simply the number of expressed genes (as shown above) as scaling factor.
utils::sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.