generateBulkCellMatrix | R Documentation |
Generate training and test cell composition matrices for the simulation of
pseudo-bulk RNA-Seq samples with known cell composition using single-cell
expression profiles. The resulting ProbMatrixCellTypes
object contains a matrix that determines the proportion of the different cell
types that will compose the simulated pseudo-bulk samples. In addition, this
object also contains other information relevant for the process. This
function does not simulate pseudo-bulk samples; that task is performed by the
simBulkProfiles or trainDigitalDLSorterModel functions (see Documentation).
generateBulkCellMatrix(
  object,
  cell.ID.column,
  cell.type.column,
  prob.design,
  num.bulk.samples,
  n.cells = 100,
  train.freq.cells = 2/3,
  train.freq.bulk = 2/3,
  proportions.train = c(10, 5, 20, 15, 35, 15),
  proportions.test = c(10, 5, 20, 15, 35, 15),
  prob.zero = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5),
  balanced.type.cells = FALSE,
  verbose = TRUE
)
object |
|
cell.ID.column |
Name or column number corresponding to the cell names of the expression matrix in the cells metadata. |
cell.type.column |
Name or column number corresponding to the cell type of each cell in the cells metadata. |
prob.design |
Data frame with the expected frequency ranges for each cell type present in the experiment. This information can be estimated from the literature or from the single-cell experiment itself. The data frame must consist of three columns with specific headings (see examples, where the columns are Cell_Type, from and to).
|
num.bulk.samples |
Number of bulk RNA-Seq sample proportions (and thus
simulated bulk RNA-Seq samples) to be generated taking into account
training and test data. We recommend setting this value according to the
number of single-cell profiles available in
|
n.cells |
Number of cells that will be aggregated in order to simulate one bulk RNA-Seq sample (100 by default). |
train.freq.cells |
Proportion of cells used to simulate training pseudo-bulk samples (2/3 by default). |
train.freq.bulk |
Proportion of bulk RNA-Seq samples to the total number
( |
proportions.train |
Vector of six integers that determines the proportions of bulk samples generated by the different methods (see Details and Torroja and Sánchez-Cabo, 2019, for more information). This vector represents proportions, so its entries must add up to 100. By default, a majority of random samples will be generated without using predefined ranges. |
proportions.test |
|
prob.zero |
Probability of producing cell type proportions equal to
zero. It is a vector of six elements corresponding to the six methods of
producing cell type proportions (see |
balanced.type.cells |
Boolean indicating whether the training and test
cells will be split in a balanced way considering the cell types
( |
verbose |
Show informative messages during the execution ( |
First, the available single-cell profiles are split into training and test
subsets (2/3 for training and 1/3 for test by default, see train.freq.cells)
to avoid biasing the results during model evaluation. Next, num.bulk.samples
bulk sample proportions are built and the single-cell profiles to be used to
simulate each pseudo-bulk RNA-Seq sample are set (100 cells per bulk sample by
default, see the n.cells argument). The proportions of training and test
pseudo-bulk samples are set by train.freq.bulk (2/3 for training and 1/3 for
testing by default). Finally, in order to avoid biases due to the composition
of the pseudo-bulk RNA-Seq samples, cell type proportions (w_1,...,w_k,
where k is the number of cell types available in the single-cell profiles)
are randomly generated using six different approaches:
1. Cell proportions are randomly sampled from a truncated uniform distribution with predefined limits according to a priori knowledge of the abundance of each cell type (see the prob.design argument). This information can be inferred from the single-cell experiment itself or from the literature.
2. A second set is generated by randomly permuting cell type labels from a distribution generated by the previous method.
3. Cell proportions are randomly sampled as by method 1 without replacement.
4. Using the last method for generating proportions, cell type labels are randomly sampled.
5. Cell proportions are randomly sampled from a Dirichlet distribution.
6. Pseudo-bulk RNA-Seq samples composed of the same cell type are generated in order to provide 'pure' pseudo-bulk samples.
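To make some of these approaches concrete, the following base R sketch generates proportions in the spirit of the first, second, fifth and sixth approaches. It is only an illustration under stated assumptions and not the package's implementation: the design values, the helper rdirichlet.sketch, the rescaling of the uniform draws to 100 and the Dirichlet parameters are all hypothetical. Note that rescaling may move a proportion outside its predefined range.

```r
set.seed(123)

# Hypothetical design: expected frequency ranges (in percent) per cell type
prob.design <- data.frame(
  Cell_Type = paste0("CellType", seq(3)),
  from = c(1, 30, 10),
  to = c(15, 70, 40)
)
k <- nrow(prob.design)
n.samples <- 5

# Uniform draws within the predefined limits, rescaled so that each
# sample adds up to 100 (the rescaling step is an assumption)
unif.props <- t(replicate(n.samples, {
  w <- runif(k, min = prob.design$from, max = prob.design$to)
  w / sum(w) * 100
}))
colnames(unif.props) <- prob.design$Cell_Type

# Same values with the cell type labels randomly permuted within each sample
perm.props <- t(apply(unif.props, 1, sample))
colnames(perm.props) <- prob.design$Cell_Type

# Dirichlet draws obtained by normalizing independent gamma variates
# (rdirichlet.sketch is a hypothetical helper, not a package function)
rdirichlet.sketch <- function(n, alpha) {
  g <- matrix(
    rgamma(n * length(alpha), shape = rep(alpha, each = n)),
    nrow = n
  )
  g / rowSums(g)
}
dir.props <- rdirichlet.sketch(n.samples, alpha = rep(1, k)) * 100
colnames(dir.props) <- prob.design$Cell_Type

# 'Pure' samples: one sample per cell type made of that type only
pure.props <- diag(100, nrow = k)
colnames(pure.props) <- prob.design$Cell_Type
```

Each row of these matrices is one candidate composition (w_1,...,w_k) expressed in percent.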
If you want to inspect the distribution of cell type proportions generated by
each method during the process, they can be visualized by the showProbPlot
function (see Documentation).
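The splitting steps described at the start of this section can be sketched in base R as follows. This is a minimal illustration, not the package's code: the cell IDs, the cell type labels, the use of ceiling() for rounding and sampling cells with replacement are all assumptions made for the example.

```r
set.seed(123)

# Hypothetical cells and cell types
cells <- paste0("RHC", seq(10))
cell.types <- setNames(rep(c("CellType1", "CellType2"), each = 5), cells)

train.freq.cells <- 2/3
train.freq.bulk <- 2/3
num.bulk.samples <- 10
n.cells <- 100

# Split the single-cell profiles into training and test subsets
# (the rounding rule is an assumption)
n.train.cells <- ceiling(length(cells) * train.freq.cells)
train.cells <- sample(cells, size = n.train.cells)
test.cells <- setdiff(cells, train.cells)

# Split the number of pseudo-bulk samples between training and test
n.train.bulk <- ceiling(num.bulk.samples * train.freq.bulk)
n.test.bulk <- num.bulk.samples - n.train.bulk

# Turn one composition (in percent) into n.cells cell IDs drawn from
# the training subset (sampling with replacement is an assumption)
props <- c(CellType1 = 30, CellType2 = 70)
counts <- round(props / 100 * n.cells)
picked <- unlist(lapply(names(counts), function(ct) {
  pool <- intersect(train.cells, names(cell.types)[cell.types == ct])
  sample(pool, size = counts[[ct]], replace = TRUE)
}))
```

Here picked holds the cells that would be aggregated into one training pseudo-bulk sample; test samples would draw from test.cells in the same way, so no cell contributes to both sets.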
A DigitalDLSorter object with the prob.cell.types slot containing a list
with two ProbMatrixCellTypes objects (training and test). For more
information about the structure of this class, see ?ProbMatrixCellTypes.
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi: 10.3389/fgene.2019.00978
simBulkProfiles
ProbMatrixCellTypes
set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      # 15 genes x 10 cells = 150 values, so no values are recycled
      rpois(150, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
# load the single-cell profiles into a DigitalDLSorter object
DDLS <- loadSCProfiles(
  single.cell.data = sce,
  cell.ID.column = "Cell_ID",
  gene.ID.column = "Gene_ID"
)
# expected frequency ranges (in percent) for each cell type
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
# generate the training and test cell composition matrices
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 10,
  verbose = TRUE
)