scRef: Make cell deconvolution reference from scRNA-seq data

View source: R/reference.R

scRef R Documentation

Description

Make cell deconvolution reference matrix from scRNA-seq data.

Usage

scRef(
  Seuratobj,
  targetcelltypes = NULL,
  celltypecolname = "annotation",
  pseudobulknum = 10,
  samplebalance = FALSE,
  pseudobulkpercent = 0.9,
  pseudobulkdat = NULL,
  geneversion = "hg19",
  genekey = "SYMBOL",
  targetdat = NULL,
  targetlogged = FALSE,
  manualmarkerlist = NULL,
  markerremovecutoff = 0.6,
  minrefgenenum = 500,
  savefile = FALSE,
  threads = 1,
  cutoff = 0.95,
  adjustcutoff = 0.4
)

Arguments

Seuratobj

An object of class Seurat generated with the Seurat R package from scRNA-seq data. It should contain read count data, normalized data, and cell metadata. The metadata should contain a column recording the cell type name of each cell.

targetcelltypes

The cell types whose content needs to be deconvolved. If NULL, all the cell types in Seuratobj will be included. Default is NULL.

celltypecolname

The name of the column in the "meta.data" slot of Seuratobj that records the cell type of each cell. Default is "annotation".

pseudobulknum

At the beginning of making the cell reference matrix, the scRNA-seq cell counts contained in Seuratobj are sampled to generate pseudo-bulk RNA-seq samples for each cell type. This parameter defines how many pseudo-bulk RNA-seq samples are generated per cell type. Default is 10.

samplebalance

When generating the pseudo-bulk RNA-seq data, the number of single cells available for sampling usually differs between cell types. To adjust for this bias and use the same number of single cells for every cell type, set this parameter to TRUE. Cell types with too many candidate cells will then be down-sampled, while those with far fewer cells will be over-sampled. Down-sampling is performed by bootstrapping, and over-sampling with SMOTE (Synthetic Minority Over-sampling Technique). This is a time-consuming step, and the default is FALSE.

pseudobulkpercent

If samplebalance is FALSE, a fraction of the single cells of each cell type is randomly sampled for each pseudo-bulk sample. This parameter sets that fraction and should be a number between 0 and 1. If samplebalance is TRUE, bootstrapping and SMOTE are used for the sampling instead and this parameter is ignored. Default is 0.9.
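The fraction-based sampling described for pseudobulknum and pseudobulkpercent can be sketched in base R as follows. This is only an illustration on a toy count matrix, not the package's implementation: the variable names, the use of sampling without replacement, and the summing of counts into one profile are all assumptions.

```r
set.seed(1)

# Toy count matrix: 50 genes x 30 cells, with a hypothetical annotation vector
counts <- matrix(rpois(50 * 30, lambda = 5), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("cell", 1:30)))
annotation <- rep(c("Tcell", "Bcell", "Monocyte"), times = c(15, 10, 5))

pseudobulknum <- 10
pseudobulkpercent <- 0.9

# For one cell type, repeatedly draw a fraction of its cells (without
# replacement) and sum their counts into a single pseudo-bulk profile
make_pseudobulk <- function(celltype) {
  cells <- which(annotation == celltype)
  n <- max(1, floor(length(cells) * pseudobulkpercent))
  sapply(seq_len(pseudobulknum), function(i) {
    picked <- cells[sample.int(length(cells), n)]
    rowSums(counts[, picked, drop = FALSE])
  })
}

celltypes <- unique(annotation)
pseudobulk <- lapply(celltypes, make_pseudobulk)
names(pseudobulk) <- celltypes

dim(pseudobulk$Tcell)  # genes x pseudobulknum
```

Each element of the resulting list is a gene-by-sample matrix for one cell type, with pseudobulknum columns.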

pseudobulkdat

If the scRNA-seq data passed via Seuratobj is large, generating the pseudo-bulk RNA-seq data becomes time-consuming. If the same scRNA-seq data will be used repeatedly to deconvolve different bulk datasets, it is recommended to run the function prepseudobulk once to generate and save the pseudo-bulk RNA-seq data, and then pass that data to this parameter. scRef will then skip its own pseudo-bulk generation step and use the supplied data directly to build the final RNA deconvolution reference. Default is NULL, in which case scRef synthesizes the pseudo-bulk data itself.

geneversion

To calculate the TPM values of the genes in the reference matrix, the effective lengths of the genes are needed. This parameter defines the genome version from which the effective gene lengths are extracted. For human genes, "hg19" or "hg38" can be used; for mouse, "mm10" can be used. Default is "hg19".
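To show why effective gene lengths are required, the standard TPM calculation can be written in a few lines of base R. The counts and lengths below are made-up values, and the object names are hypothetical; how scRef retrieves lengths for a given geneversion is not shown here.

```r
# Hypothetical raw counts (genes x samples) and effective gene lengths in bp
counts <- matrix(c(100, 200, 300,
                   50, 400, 150), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
efflength <- c(geneA = 1000, geneB = 2000, geneC = 500)

# Reads per kilobase of effective length
rpk <- counts / (efflength[rownames(counts)] / 1000)

# Scale each sample so that its values sum to one million
tpm <- sweep(rpk, 2, colSums(rpk), "/") * 1e6

colSums(tpm)  # each sample sums to 1e6, up to floating-point error
```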

genekey

The type of gene ID used in Seuratobj. In most cases it is "SYMBOL", which is also the default, but it may be "ENTREZID", "ENSEMBL", or another type.

targetdat

The gene expression data of the target cell mixture to be deconvolved. It should be a matrix with one column per sample and one row per gene; row names are gene IDs and column names are sample IDs. The gene ID type should match the one given to the parameter genekey. Default is NULL, in which case the reference matrix is generated from the scRNA-seq data in Seuratobj alone. If a matrix is provided, both the reference matrix and this cell mixture matrix are processed further: the 2 matrices are combined to remove their batch difference, more genes are selected into the reference based on their correlation with the selected marker genes in the cell mixture, and so on. Providing the cell mixture matrix is recommended, especially when it comes from RNA microarray rather than RNA-seq data, so that the combination step can reduce the platform difference between RNA microarray and scRNA-seq.
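As a small illustration of the expected layout, a matrix of this shape can be built and checked in base R. The expression values and sample names are invented, and the uniqueness check on gene IDs is an assumption rather than a documented requirement.

```r
# Hypothetical bulk expression values: rows are genes, columns are samples
targetdat <- matrix(c(12.1, 0.0, 35.4,
                      8.7, 1.2, 40.9),
                    nrow = 3,
                    dimnames = list(c("CD3D", "MS4A1", "LYZ"),
                                    c("sample1", "sample2")))

# Basic sanity checks before passing the matrix on
stopifnot(is.matrix(targetdat), is.numeric(targetdat),
          !is.null(rownames(targetdat)), !anyDuplicated(rownames(targetdat)),
          !is.null(colnames(targetdat)))

dim(targetdat)  # genes x samples
```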

targetlogged

Whether the gene expression values in targetdat are log2-transformed. Default is FALSE.

manualmarkerlist

When making the reference matrix, the genes that are specifically and highly expressed in each cell type are treated as markers and used to generate the reference. However, there is no guarantee that known classical markers will be selected. To make sure such markers are used, supply a list to this parameter: the element names are the cell type names, and each element is a vector of gene IDs for the classical markers of that cell type. Note that before the final reference is determined, all marker genes go through several filter steps to improve the reference quality, such as removal of extremely highly expressed genes and of genes contributing to collinearity. The classical markers provided here are therefore always entered into reference generation, but may still be filtered out before the final reference is returned. Default is NULL.
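A list in the format described above might look like the following. The cell type names are hypothetical and would need to match the labels stored in the celltypecolname column; the gene IDs are given as symbols, assuming genekey = "SYMBOL".

```r
# Named list: element names are cell types, elements are marker gene IDs
manualmarkerlist <- list(
  "T cell"   = c("CD3D", "CD3E", "CD2"),
  "B cell"   = c("CD79A", "MS4A1"),
  "Monocyte" = c("CD14", "LYZ")
)

str(manualmarkerlist)
```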

markerremovecutoff

When a gene expression matrix is provided to the parameter targetdat, its expression values are used to compute, for each gene, the Pearson correlation with the first principal component of the scRNA-seq-selected marker genes in this cell mixture matrix. Genes with a high correlation are also used to make the reference. This parameter sets the cutoff on the Pearson correlation coefficient. Default is 0.6.
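The idea of correlating genes with the first principal component of the marker genes can be sketched with prcomp and cor. This is a toy illustration under several assumptions: the marker set is invented, the data are simulated, and the absolute value of the correlation is used because the sign of a principal component is arbitrary. The package may handle these details differently.

```r
set.seed(2)

# Toy cell mixture matrix: 20 genes x 12 samples; a shared signal drives genes 1 to 8
signal <- rnorm(12)
targetdat <- matrix(rnorm(20 * 12, sd = 0.3), nrow = 20,
                    dimnames = list(paste0("gene", 1:20), paste0("s", 1:12)))
targetdat[1:8, ] <- targetdat[1:8, ] + rep(signal, each = 8)

markers <- paste0("gene", 1:5)  # hypothetical markers from the scRNA-seq step
markerremovecutoff <- 0.6

# First principal component of the marker genes across the mixture samples
pc1 <- prcomp(t(targetdat[markers, ]), center = TRUE, scale. = TRUE)$x[, 1]

# Pearson correlation of every non-marker gene with that component
others <- setdiff(rownames(targetdat), markers)
r <- apply(targetdat[others, ], 1, cor, y = pc1)

# Genes passing the cutoff would be added to the candidate reference genes
extra <- names(r)[abs(r) > markerremovecutoff]
extra
```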

minrefgenenum

The genes used to generate the reference matrix go through several filter steps, and in some cases only a small number of them pass all the filter conditions. The reference then contains very few genes, which affects the downstream deconvolution. To avoid this extreme case, this parameter sets a lower bound on the reference gene number: once filtering has reduced the reference to this many genes, the filter process stops. Default is 500.

savefile

Whether to automatically save the final reference matrix, and the adjusted cell mixture matrix (if one was provided to targetdat), as rds file(s) in the working directory. Default is FALSE.

threads

Number of threads to use for the computation. Default is 1.

cutoff

To improve the robustness of the deconvolution result, some extremely highly expressed genes in the reference are filtered out because of their large variance. This parameter sets the fraction of genes kept in the reference; genes with a higher expression level are removed. The default value is 0.95, meaning the top 5% most highly expressed genes are removed from the reference.
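A quantile-based filter of this kind can be sketched as below. The toy reference is simulated, and ranking genes by their maximum expression across cell types is an assumption made for illustration; the documentation does not state which summary of expression the package uses.

```r
set.seed(3)

# Toy reference: 200 genes x 3 cell types
ref <- matrix(rexp(200 * 3, rate = 0.1), nrow = 200,
              dimnames = list(paste0("gene", 1:200),
                              c("Tcell", "Bcell", "Monocyte")))
cutoff <- 0.95

# Summarize each gene by its highest expression in any cell type (assumed criterion)
genemax <- apply(ref, 1, max)

# Keep genes at or below the cutoff quantile, dropping the most highly expressed ones
keep <- genemax <= quantile(genemax, probs = cutoff)
ref_filtered <- ref[keep, , drop = FALSE]

nrow(ref_filtered)
```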

adjustcutoff

For some similar cell types, the gene expression profiles in the reference matrix are highly correlated, which makes the downstream deconvolution difficult. To relieve this problem, for each pair of similar cell types, the genes contributing most to their correlation are identified and removed, so that the correlation between them in the reference is reduced. This parameter sets the cutoff on the cell pair correlation: if a cell pair has a Pearson correlation coefficient greater than this value, the contributing-gene filter is applied until the coefficient falls below it. The default value is 0.4.
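The first half of this step, finding cell type pairs whose correlation exceeds the cutoff, can be sketched in base R as follows. The reference is simulated with two deliberately similar cell types, and all names are hypothetical. The subsequent removal of contributing genes is not shown, since the documentation does not describe how those genes are chosen.

```r
set.seed(4)

# Toy reference with two deliberately similar cell types and one unrelated type
base <- rexp(100)
ref <- cbind(CD4T  = base + rnorm(100, sd = 0.1),
             CD8T  = base + rnorm(100, sd = 0.1),
             Bcell = rexp(100))
rownames(ref) <- paste0("gene", 1:100)
adjustcutoff <- 0.4

# Pearson correlation between cell types; keep each pair once (lower triangle)
cors <- cor(ref)
cors[upper.tri(cors, diag = TRUE)] <- NA

# Pairs whose correlation exceeds the cutoff would need gene filtering
idx <- which(cors > adjustcutoff, arr.ind = TRUE)
similar <- data.frame(type1 = rownames(cors)[idx[, "row"]],
                      type2 = colnames(cors)[idx[, "col"]],
                      r = cors[idx])
similar
```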

Value

A list containing the final reference matrix. If the cell mixture matrix to be deconvolved is provided via the parameter targetdat, an adjusted version of it is also returned as an element of this list. The gene values in this adjusted matrix are not log-transformed.


yuabrahamliu/scDeconv documentation built on March 28, 2024, 3:15 p.m.