scDeconv: Deconvolve bulk RNA data using scRNA-seq data

View source: R/scdeconv.R

scDeconv R Documentation

Deconvolve bulk RNA data using scRNA-seq data

Description

Deconvolve bulk RNA-seq or bulk RNA microarray data using scRNA-seq data.

Usage

scDeconv(
  Seuratobj,
  targetcelltypes = NULL,
  celltypecolname = "annotation",
  samplebalance = FALSE,
  pseudobulkdat = NULL,
  geneversion = "hg19",
  genekey = "SYMBOL",
  targetdat = NULL,
  targetlogged = FALSE,
  manualmarkerlist = NULL,
  markerremovecutoff = 0.6,
  minrefgenenum = 500,
  saveref = FALSE,
  refcutoff = 0.95,
  refadjustcutoff = 0.4,
  resscale = FALSE,
  plot = FALSE,
  pddat = NULL,
  threads = 1
)

Arguments

Seuratobj

An object of class Seurat generated with the Seurat R package from scRNA-seq data. It should contain read count data, normalized data, and cell meta data. The meta data should contain a column recording the cell type name of each cell.

targetcelltypes

The cell types whose content needs to be deconvolved. If NULL, all the cell types in Seuratobj will be included. Default is NULL.

celltypecolname

The name of the column in the "meta.data" slot of Seuratobj that records the cell type of each cell. Default is "annotation".

samplebalance

When the cell reference matrix is first being built, the scRNA-seq cell counts in Seuratobj are sampled to generate 100 pseudo-bulk RNA-seq samples for each cell type. The number of single cells available for sampling differs between cell types. To adjust for this bias and use the same number of single cells for every cell type when making the pseudo-bulk RNA-seq data, set this parameter to TRUE. Cell types with too many candidate cells are then down-sampled, while those with far fewer cells are over-sampled. Down-sampling is performed with bootstrapping, and over-sampling with SMOTE (Synthetic Minority Over-sampling Technique). This step is time-consuming, so the default is FALSE, meaning no such adjustment is made when sampling the pseudo-bulk samples.
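
As a rough illustration of the sampling idea, the base R sketch below builds pseudo-bulk samples for one cell type by drawing cells with replacement and summing their counts. The toy count matrix, the helper make_pseudobulk, and the choice of 20 cells per sample are all hypothetical; summing sampled cells is an assumption about how a pseudo-bulk sample could be formed, not a description of the package's internal code, and SMOTE over-sampling is not shown.

```r
# Hypothetical toy data: 50 genes x 30 cells of raw counts for one cell type.
set.seed(1)
counts <- matrix(rpois(50 * 30, lambda = 5), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("cell", 1:30)))

# Hypothetical helper: draw cells with replacement (bootstrap) and
# sum their counts gene by gene to form one pseudo-bulk sample.
make_pseudobulk <- function(counts, ncells) {
  idx <- sample(ncol(counts), size = ncells, replace = TRUE)
  rowSums(counts[, idx, drop = FALSE])
}

# 100 pseudo-bulk samples, each built from 20 sampled cells (assumed size).
pseudobulk <- replicate(100, make_pseudobulk(counts, ncells = 20))
dim(pseudobulk)  # genes in rows, pseudo-bulk samples in columns
```

Using the same ncells for every cell type mirrors the intent of samplebalance = TRUE, where each cell type contributes the same number of single cells.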

pseudobulkdat

If the scRNA-seq data passed via Seuratobj is large, generating the pseudo-bulk RNA-seq data becomes time-consuming. If the same scRNA-seq data will be used repeatedly to deconvolve different bulk datasets, it is recommended to run the function prepseudobulk once to generate and save the pseudo-bulk RNA-seq data, and then pass that data to pseudobulkdat. scDeconv then skips its own pseudo-bulk data generation step and uses the supplied data to build the final RNA deconvolution reference. The default is NULL, in which case the generation step is not skipped.

geneversion

Calculating the TPM values of the genes in the reference matrix requires the effective gene lengths. This parameter defines the genome version from which the effective gene lengths are extracted. For human genes, "hg19" or "hg38" can be used; for mouse, "mm10" can be used. Default is "hg19".
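
To show why effective gene lengths are needed, here is a minimal base R sketch of the standard TPM formula. The counts, the efflen vector, and the helper counts_to_tpm are made-up illustrations; in scDeconv the lengths come from the chosen genome version, and the package's exact computation is not documented here.

```r
# Hypothetical counts (genes x samples) and effective gene lengths in bp.
counts <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
efflen <- c(geneA = 1000, geneB = 2000, geneC = 500)

# Hypothetical helper implementing the standard TPM definition.
counts_to_tpm <- function(counts, efflen) {
  # Reads per kilobase: divide each gene's counts by its length in kb.
  rate <- counts / (efflen[rownames(counts)] / 1000)
  # Rescale each sample (column) so that its values sum to one million.
  t(t(rate) / colSums(rate)) * 1e6
}

tpm <- counts_to_tpm(counts, efflen)
colSums(tpm)  # each sample sums to 1e6
```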

genekey

The type of gene ID used in Seuratobj. It is "SYMBOL" in most cases, which is also the default, but it may sometimes be "ENTREZID", "ENSEMBL", or another type.

targetdat

The gene expression data of the target cell mixtures to be deconvolved. It should be a matrix with each column representing one sample and each row representing one gene. Row names are gene IDs and column names are sample IDs. The gene ID type should be the same as that given to the parameter genekey.

targetlogged

Whether the gene expression values in targetdat are log2-transformed. Default is FALSE.
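
The expected layout of targetdat can be sketched in base R as below. The gene symbols, sample names, and values are invented, and the pseudocount of 1 in the log2 version is an assumption for illustration only. A matrix like targetdat would go with targetlogged = FALSE, and one like targetdat_log2 with targetlogged = TRUE.

```r
# Hypothetical bulk expression matrix: genes in rows, samples in columns.
set.seed(2)
targetdat <- matrix(rexp(4 * 3, rate = 0.01), nrow = 4,
                    dimnames = list(c("CD3E", "CD19", "CD14", "NKG7"),
                                    c("sample1", "sample2", "sample3")))

# The same data on a log2 scale (pseudocount of 1 is an assumption).
targetdat_log2 <- log2(targetdat + 1)

# Basic format checks before deconvolution.
is.matrix(targetdat)
rownames(targetdat)  # gene IDs, here of type "SYMBOL"
colnames(targetdat)  # sample IDs
```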

manualmarkerlist

When the reference matrix is made from the scRNA-seq data, the genes specifically and highly expressed in each cell type are deemed markers and used to generate the reference. This selection is not guaranteed to pick up known classical markers. To make sure such markers are considered, provide a list to this parameter, with the cell type names as its element names and vectors of the classical marker gene IDs as its elements. Note that before the final reference is determined, all marker genes go through several filter steps to improve the reference quality, such as the removal of extremely highly expressed genes and of genes contributing to collinearity. The classical markers provided via this parameter are therefore always entered into reference generation, but may still be filtered out before the final reference is made. The default value of this parameter is NULL.
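
A list of the shape described above might look like the following sketch. The cell type names and marker genes are illustrative only; the element names would need to match the cell type names recorded in the celltypecolname column of Seuratobj, and the gene IDs should be of the type given to genekey.

```r
# Hypothetical named list: element names are cell types,
# elements are character vectors of classical marker gene IDs.
manualmarkerlist <- list(
  "T cell"   = c("CD3D", "CD3E"),
  "B cell"   = c("CD19", "MS4A1"),
  "Monocyte" = c("CD14", "LYZ")
)

names(manualmarkerlist)        # cell type names
manualmarkerlist[["B cell"]]   # markers supplied for one cell type
```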

markerremovecutoff

During reference matrix generation from the scRNA-seq data, the markers selected from the scRNA-seq data are looked up in the targetdat matrix, and the first principal component of their expression values there is calculated. Genes in targetdat with a high correlation to this first principal component are also used to make the reference. This parameter sets the cutoff of the correlation coefficient, and the default value is 0.6.
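
The correlation step can be pictured with the base R sketch below, which computes the first principal component of a set of marker genes across samples and correlates other genes with it. The simulated data, the gene names, the standardization of the markers, and the use of the absolute correlation (the sign of a principal component is arbitrary) are all assumptions made for illustration, not the package's confirmed procedure.

```r
set.seed(3)
n <- 20                      # number of bulk samples (hypothetical)
signal <- rnorm(n)           # shared expression pattern of the markers

# Five hypothetical marker genes (rows) following the shared pattern.
markers <- t(sapply(1:5, function(i) signal + rnorm(n, sd = 0.2)))
rownames(markers) <- paste0("marker", 1:5)

# Two hypothetical candidate genes: one follows the pattern, one is noise.
others <- rbind(cand1 = signal + rnorm(n, sd = 0.2),
                cand2 = rnorm(n))

# First principal component of the markers, with samples as observations.
pc1 <- prcomp(t(markers), center = TRUE, scale. = TRUE)$x[, 1]

# Correlate each candidate gene with PC1 and apply the cutoff.
markerremovecutoff <- 0.6
r <- apply(others, 1, function(g) cor(g, pc1))
selected <- names(r)[abs(r) > markerremovecutoff]
selected
```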

minrefgenenum

The genes used to generate the reference matrix go through several filter steps, and in some cases only a small number of them fulfill all the filter conditions. The reference then contains very few genes, which impairs the subsequent deconvolution. To avoid this extreme case, this parameter sets a minimum reference gene number: once filtering has reduced the reference to this number of genes, the filter process stops. The default value is 500.

saveref

Whether to automatically save the final reference matrix and the adjusted cell mixture matrix to be deconvolved as rds files in the working directory. Default is FALSE.

refcutoff

To improve the robustness of the deconvolution result, some extremely highly expressed genes in the reference are filtered out because of their large variance. This cutoff sets the proportion of genes kept in the reference; genes with a higher expression level are removed. The default value is 0.95, meaning the top 5% most highly expressed genes are removed from the reference.
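
A quantile-based filter of this kind can be sketched in base R as follows. The simulated reference matrix is hypothetical, and ranking genes by their mean expression across cell types is an assumption; the package may measure expression level differently.

```r
# Hypothetical reference matrix: 200 genes x 3 cell types.
set.seed(4)
ref <- matrix(rexp(200 * 3, rate = 0.1), nrow = 200,
              dimnames = list(paste0("gene", 1:200),
                              c("typeA", "typeB", "typeC")))

refcutoff <- 0.95

# Assumed expression summary: mean of each gene across cell types.
genemeans <- rowMeans(ref)

# Keep genes at or below the 95th percentile; drop the top 5%.
keep <- genemeans <= quantile(genemeans, probs = refcutoff)
ref_filtered <- ref[keep, , drop = FALSE]
nrow(ref_filtered)  # 190 of the 200 genes remain
```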

refadjustcutoff

For some similar cell types, gene expression in the reference matrix is highly correlated, which makes the downstream deconvolution difficult. To relieve this problem, for each pair of similar cell types, the genes contributing most to their correlation are identified and removed, so that their correlation in the reference is reduced. This parameter sets the cutoff of the cell pair correlation: if a cell pair has a Pearson correlation coefficient greater than this value, the contributing-gene filter process is applied until the coefficient becomes smaller than it. The default is 0.4.
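
The first half of this procedure, finding the cell type pairs that exceed the cutoff, can be sketched in base R as below. The simulated reference is hypothetical, and the sketch only flags the pairs; the iterative removal of contributing genes is not reproduced here.

```r
# Hypothetical reference: typeA and typeB share most of their signal.
set.seed(5)
base <- rexp(100)
ref <- cbind(typeA = base + rexp(100, rate = 5),
             typeB = base + rexp(100, rate = 5),
             typeC = rexp(100))

refadjustcutoff <- 0.4

# Pairwise Pearson correlation between cell types (columns).
cormat <- cor(ref, method = "pearson")

# Flag each pair once (upper triangle) if it exceeds the cutoff.
pairs <- which(cormat > refadjustcutoff & upper.tri(cormat), arr.ind = TRUE)
flagged <- data.frame(type1 = rownames(cormat)[pairs[, "row"]],
                      type2 = colnames(cormat)[pairs[, "col"]],
                      r = cormat[pairs])
flagged
```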

resscale

Whether the cell content result of each sample should be scaled so that the contents of the different cell types sum to 1. Default is FALSE.
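
The scaling itself is simple arithmetic, sketched below in base R. The result matrix and its orientation (samples in rows, cell types in columns) are assumptions for illustration; the actual layout of the scDeconv result may differ.

```r
# Hypothetical unscaled cell contents: samples in rows, cell types in columns.
res <- matrix(c(0.2, 0.1, 0.3, 0.5, 0.1, 0.2), nrow = 2,
              dimnames = list(c("sample1", "sample2"),
                              c("typeA", "typeB", "typeC")))

# Divide each sample's contents by their total so that they sum to 1.
res_scaled <- res / rowSums(res)
rowSums(res_scaled)
```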

plot

Whether to generate a box plot and heatmaps of the deconvolved cell contents. Default is FALSE.

pddat

If plot is TRUE, this parameter can provide the sample group information, so that the box plot also compares the groups for each cell type, and heatmaps with this comparison are generated as well. It should be a data frame recording the sample groups and must include 2 columns: "sampleid", recording the sample IDs, which should be the same as the column names of targetdat, and "Samplegroup", recording the sample group to which each sample belongs. It can also be NULL, meaning all the samples are from the same group; in this case, only a box plot showing the content of each cell type is made when plot is TRUE.
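
A data frame of the required shape can be built as in the sketch below. The sample names and group labels are hypothetical; only the two column names, "sampleid" and "Samplegroup", come from the description above.

```r
# Hypothetical target matrix, used here only for its sample IDs.
targetdat <- matrix(1:12, nrow = 3,
                    dimnames = list(paste0("gene", 1:3),
                                    paste0("sample", 1:4)))

# Sample group table with the two required columns.
pddat <- data.frame(
  sampleid = colnames(targetdat),
  Samplegroup = c("control", "control", "disease", "disease"),
  stringsAsFactors = FALSE
)

# Sanity checks: required columns exist and sample IDs match targetdat.
stopifnot(all(c("sampleid", "Samplegroup") %in% colnames(pddat)))
stopifnot(setequal(pddat$sampleid, colnames(targetdat)))
pddat
```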

threads

Number of threads to use for the computation. Its default value is 1.

Value

A list containing the generated RNA reference, the adjusted target data to be deconvolved, and the cell deconvolution result for the samples. The gene values in the adjusted target data are not log-transformed.


yuabrahamliu/scDeconv documentation built on March 28, 2024, 3:15 p.m.