generateBulkCellMatrix | R Documentation |
Generate training and test cell composition matrices for the simulation of
pseudo-bulk RNA-Seq samples with known cell composition using single-cell
expression profiles. The resulting ProbMatrixCellTypes
object contains a matrix that determines the proportion of the different cell
types that will compose the simulated pseudo-bulk samples. In addition, this
object also contains other information relevant for the process. This
function does not simulate pseudo-bulk samples; that task is performed by the
simBulkProfiles or trainDDLSModel functions (see Documentation).
generateBulkCellMatrix(
object,
cell.ID.column,
cell.type.column,
prob.design,
num.bulk.samples,
n.cells = 100,
train.freq.cells = 3/4,
train.freq.bulk = 3/4,
proportion.method = c(10, 5, 20, 15, 35, 15),
prob.sparsity = 0.5,
min.zero.prop = NULL,
balanced.type.cells = FALSE,
verbose = TRUE
)
object |
|
cell.ID.column |
Name or number of the column in the cells metadata that contains the cell names used in the expression matrix. |
cell.type.column |
Name or number of the column in the cells metadata that contains the cell type of each cell. |
prob.design |
Data frame with the expected frequency ranges for each cell type present in the experiment. This information can be estimated from the literature or from the single-cell experiment itself. This data frame must consist of three columns with specific headings (see examples):
|
num.bulk.samples |
Number of bulk RNA-Seq sample proportions (and thus
simulated bulk RNA-Seq samples) to be generated taking into account
training and test data. We recommend setting this value according to the
number of single-cell profiles available in
|
n.cells |
Number of cells that will be aggregated in order to simulate one bulk RNA-Seq sample (100 by default). |
train.freq.cells |
Proportion of cells used to simulate training pseudo-bulk samples (3/4 by default). |
train.freq.bulk |
Proportion of bulk RNA-Seq samples to the total number
( |
proportion.method |
Vector of six integers that determines the proportions of bulk samples generated by the different methods (see Details and Torroja and Sanchez-Cabo, 2019, for more information). This vector represents proportions, so its entries must add up to 100. By default, a majority of random samples will be generated without using predefined ranges. |
prob.sparsity |
It only affects the proportions generated from the Dirichlet distribution (the fifth method listed in Details). It determines the probability of having missing cell types in each simulated sample, as opposed to a mixture of all cell types. A higher value for this parameter will result in sparser simulated samples. |
min.zero.prop |
This parameter controls the minimum number of cell types
that will be absent in each simulated sample. If |
balanced.type.cells |
Boolean indicating whether the training and test
cells will be split in a balanced way considering the cell types
( |
verbose |
Show informative messages during the execution ( |
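To make the interplay between num.bulk.samples, train.freq.cells, train.freq.bulk and proportion.method more concrete, the following base R sketch works through the arithmetic with made-up inputs. It is not the package's internal code: the cell names, the use of simple rounding, and the choice to apply proportion.method separately within the training and test subsets are all assumptions made for illustration.

```r
# Illustrative sketch only: not the package's internal implementation.
set.seed(1)

# Hypothetical inputs mirroring the defaults shown in the usage
num.bulk.samples <- 2000
train.freq.cells <- 3/4
train.freq.bulk <- 3/4
proportion.method <- c(10, 5, 20, 15, 35, 15)

# The six entries are percentages, so they must add up to 100
stopifnot(length(proportion.method) == 6, sum(proportion.method) == 100)

# Toy pool of 40 single-cell profiles (names are made up)
cell.ids <- paste0("Cell", seq(40))

# Split the cells so that training cells never appear in the test subset
train.cells <- sample(cell.ids, size = round(length(cell.ids) * train.freq.cells))
test.cells <- setdiff(cell.ids, train.cells)

# Split the requested number of bulk samples between training and test
n.train <- round(num.bulk.samples * train.freq.bulk)
n.test <- num.bulk.samples - n.train

# Bulk samples assigned to each of the six methods within each subset
# (simple rounding is an assumption; these numbers happen to divide exactly)
per.method <- rbind(
  train = round(n.train * proportion.method / 100),
  test = round(n.test * proportion.method / 100)
)
colnames(per.method) <- paste0("method", seq(6))
print(per.method)
```

With values of num.bulk.samples that do not divide evenly, plain rounding may leave the per-method counts one or two samples away from the subset totals, so a real implementation would need some rule for distributing the remainder.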
First, the available single-cell profiles are split into training and test
subsets (3/4 for training and 1/4 for test by default, see train.freq.cells)
to avoid biasing the results during model evaluation. Next, num.bulk.samples
bulk sample proportions are built and the single-cell profiles used to
simulate each pseudo-bulk RNA-Seq sample are selected, with 100 cells per bulk
sample by default (see the n.cells argument). The proportions of training and
test pseudo-bulk samples are set by train.freq.bulk (3/4 for training and 1/4
for testing by default). Finally, in order to avoid biases due to the
composition of the pseudo-bulk RNA-Seq samples, cell type proportions
(w_1, ..., w_k, where k is the number of cell types available in the
single-cell profiles) are randomly generated using six different approaches:
1. Cell proportions are randomly sampled from a truncated uniform distribution with predefined limits according to a priori knowledge of the abundance of each cell type (see prob.design argument). This information can be inferred from the single-cell experiment itself or from the literature.
2. A second set is generated by randomly permuting cell type labels from a distribution generated by the previous method.
3. Cell proportions are randomly sampled as by method 1 without replacement.
4. Using the last method for generating proportions, cell type labels are randomly sampled.
5. Cell proportions are randomly sampled from a Dirichlet distribution.
6. Pseudo-bulk RNA-Seq samples composed of the same cell type are generated in order to provide 'pure' pseudo-bulk samples.
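As a rough illustration of some of the approaches listed above, the base R sketch below generates small proportion matrices for the range-based, label-permutation, Dirichlet and 'pure' cases. It is a simplified reading of the descriptions, not the package's algorithm: the three-type prob.design table, the rescaling of the range-based draws so each row sums to 100, and the Dirichlet concentration of 1 are assumptions.

```r
# Illustrative sketch only: not the package's internal implementation.
set.seed(42)

# Hypothetical prob.design-style table with ranges given as percentages
prob.design <- data.frame(
  Cell_Type = c("CellType1", "CellType2", "CellType3"),
  from = c(1, 30, 10),
  to = c(15, 70, 40)
)
k <- nrow(prob.design)
n.samples <- 5

# Range-based proportions: draw each cell type uniformly within its range,
# then rescale the row to sum to 100 (rescaling can move values slightly
# outside the ranges; handling that is left out of this sketch)
range.props <- t(replicate(n.samples, {
  w <- runif(k, min = prob.design$from, max = prob.design$to)
  100 * w / sum(w)
}))
colnames(range.props) <- prob.design$Cell_Type

# Label permutation: shuffle which cell type receives each proportion
perm.props <- t(apply(range.props, 1, sample))
colnames(perm.props) <- prob.design$Cell_Type

# Dirichlet proportions: normalized gamma draws follow a Dirichlet
# distribution (concentration 1 for every cell type is an assumption)
dirichlet.props <- t(replicate(n.samples, {
  g <- rgamma(k, shape = 1)
  100 * g / sum(g)
}))
colnames(dirichlet.props) <- prob.design$Cell_Type

# 'Pure' samples: one cell type makes up 100% of each sample
pure.props <- diag(100, k)
colnames(pure.props) <- prob.design$Cell_Type

print(round(range.props, 1))
print(pure.props)
```

Every row of each matrix sums to 100, so each row can be read as the percentage composition of one simulated pseudo-bulk sample.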
If you want to inspect the distribution of cell type proportions generated by
each method during the process, they can be visualized with the showProbPlot
function (see Documentation).
A DigitalDLSorter object with prob.cell.types slot containing a list with two
ProbMatrixCellTypes objects (training and test). For more information about
the structure of this class, see ?ProbMatrixCellTypes.
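As a rough picture of what one row of such a proportion matrix implies for a single pseudo-bulk sample, the base R sketch below turns percentages into whole cell counts for n.cells = 100 and picks that many cell names from a toy pool. The pool, the largest-remainder rounding and sampling with replacement are assumptions made for illustration; the source does not state how the package performs these steps.

```r
# Illustrative sketch only: not the package's internal implementation.
set.seed(10)

n.cells <- 100
# One hypothetical row of a proportion matrix (percentages summing to 100)
props <- c(CellType1 = 12.5, CellType2 = 57.5, CellType3 = 30)

# Convert percentages to whole cell counts that sum exactly to n.cells,
# giving leftover cells to the types with the largest fractional parts
raw <- props * n.cells / 100
counts <- floor(raw)
remaining <- n.cells - sum(counts)
if (remaining > 0) {
  top <- order(raw - counts, decreasing = TRUE)[seq_len(remaining)]
  counts[top] <- counts[top] + 1
}

# Toy pool of available cell names per cell type (names are made up)
pool <- list(
  CellType1 = paste0("C1_", seq(20)),
  CellType2 = paste0("C2_", seq(20)),
  CellType3 = paste0("C3_", seq(20))
)

# Pick the cells that would be aggregated into one pseudo-bulk sample;
# replacement is used here because the toy pool is smaller than the counts
selected <- unlist(lapply(names(counts), function(ct) {
  sample(pool[[ct]], size = counts[[ct]], replace = TRUE)
}), use.names = FALSE)

print(counts)
print(length(selected))
```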
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi: 10.3389/fgene.2019.00978
simBulkProfiles
ProbMatrixCellTypes
set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE,
  sc.log.FC = FALSE
)
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 10,
  verbose = TRUE
)