epDeconv: Deconvolve bulk DNA methylation data using RNA reference

View source: R/epdeconv.R

epDeconv    R Documentation

Description

Deconvolve bulk DNA methylation data using an RNA reference together with a paired bulk RNA-bulk DNA methylation dataset.

Usage

epDeconv(
  rnaref = NULL,
  Seuratobj = NULL,
  targetcelltypes = NULL,
  celltypecolname = "annotation",
  samplebalance = FALSE,
  pseudobulkdat = NULL,
  geneversion = "hg19",
  genekey = "SYMBOL",
  manualmarkerlist = NULL,
  rnamat,
  methylmat,
  learnernum = 10,
  rnamatlogged,
  resscale = FALSE,
  threads = 1,
  lassoerrortype = "min",
  targetmethyldat = NULL,
  plot = FALSE,
  pddat = NULL,
  targetmethylpddat = NULL
)

Arguments

rnaref

The RNA reference recording the expression signature of each cell type. Each row is one gene and each column is one cell type, and each entry should be a gene TPM value. Column names are cell type names and row names are gene names. The default is NULL, in which case the reference is synthesized from the scRNA-seq data passed to the parameter Seuratobj.
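
As an illustration of the expected layout, the following is a minimal sketch of a reference matrix built by hand. The gene names, cell type names, and TPM values are made up for the example and do not come from the package.

```r
# Toy RNA reference: rows are genes, columns are cell types, entries are TPM values.
# All names and numbers here are illustrative assumptions.
rnaref <- matrix(
  c(120.5,   3.2,  0.8,
      2.1, 310.0,  5.4,
      0.0,   1.7, 95.3,
     45.0,  50.2, 48.9),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("CD3E", "CD19", "CD14", "ACTB"),   # row names: gene names
    c("Tcell", "Bcell", "Monocyte")      # column names: cell type names
  )
)

# Basic sanity checks on the layout described above
stopifnot(is.matrix(rnaref), is.numeric(rnaref))
stopifnot(!is.null(rownames(rnaref)), !is.null(colnames(rnaref)))
stopifnot(all(rnaref >= 0))
```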

Seuratobj

An object of class Seurat, generated with the Seurat R package from scRNA-seq data. It should contain read count data, normalized data, and cell metadata, and the metadata should contain a column recording the cell type of each cell. When rnaref is NULL and this parameter is provided, it is used to make the RNA reference for the downstream deconvolution.

targetcelltypes

When Seuratobj is used to make the RNA reference, this parameter defines the cell types that should be covered by the reference. If NULL, all the cell types in Seuratobj are included. Default is NULL.

celltypecolname

When Seuratobj is used to make the RNA reference, this parameter gives the name of the column in its "meta.data" slot that records the cell type of each cell. Default is "annotation".

samplebalance

When Seuratobj is used to make the RNA reference, the scRNA-seq count data in Seuratobj are first sampled to make 100 pseudo-bulk RNA-seq samples for each cell type. The number of single cells available for sampling usually differs between cell types. To adjust for this bias and use the same number of single cells for every cell type when generating the pseudo-bulk RNA-seq data, set this parameter to TRUE. Cell types with too many candidate cells are then down-sampled, while those with far fewer cells are over-sampled. Down-sampling is performed with bootstrapping, and over-sampling with SMOTE (Synthetic Minority Over-sampling Technique). This step is time-consuming, so the default is FALSE, meaning no such adjustment is performed when generating the pseudo-bulk samples.
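
The idea of drawing an equal number of cells per cell type can be sketched in base R as below. This is only an illustration under stated assumptions, not the package's implementation: the helper make_pseudobulk and the toy data are hypothetical, cells are drawn with replacement for every cell type, and the SMOTE over-sampling step is not reproduced.

```r
set.seed(1)

# Toy count matrix: 5 genes x 60 cells, with unbalanced cell types (50 vs 10 cells)
counts <- matrix(rpois(5 * 60, lambda = 5), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("cell", 1:60)))
celltype <- rep(c("A", "B"), times = c(50, 10))

# Hypothetical helper: for each cell type, draw the same number of cells
# (with replacement) and sum their counts into one pseudo-bulk sample per draw.
make_pseudobulk <- function(counts, celltype, ncells = 20, nsamples = 3) {
  types <- unique(celltype)
  res <- lapply(types, function(ct) {
    pool <- which(celltype == ct)
    sapply(seq_len(nsamples), function(i) {
      idx <- pool[sample.int(length(pool), size = ncells, replace = TRUE)]
      rowSums(counts[, idx, drop = FALSE])
    })
  })
  names(res) <- types
  res
}

# One genes x nsamples matrix per cell type
pb <- make_pseudobulk(counts, celltype)
```

Drawing with replacement lets the small cell type "B" contribute as many cells per pseudo-bulk sample as the large one, at the cost of reusing the same cells.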

pseudobulkdat

If the scRNA-seq data passed via Seuratobj is large, the pseudo-bulk RNA-seq data generation step becomes time-consuming. If the same scRNA-seq data needs to be used repeatedly to deconvolve different bulk datasets, it is recommended to generate and save the pseudo-bulk RNA-seq data once with the function prepseudobulk and then pass it to this parameter, so that epDeconv skips its own pseudo-bulk data generation step and uses the data provided here to generate the final RNA deconvolution reference. The default is NULL, in which case the synthesis step is not skipped.

geneversion

The effective gene lengths are needed to calculate gene TPM values when generating the reference matrix. This parameter defines the genome version from which the effective gene lengths are extracted. For human genes, "hg19" or "hg38" can be used, and for mouse genes, "mm10" can be used. Default is "hg19".
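
To show why the gene lengths matter, here is a minimal sketch of the standard TPM calculation from counts and effective lengths. The helper counts_to_tpm and the toy values are assumptions for illustration, and the package may compute TPM differently in detail.

```r
# Toy data: counts is genes x samples, efflen holds effective gene lengths in bp
counts <- matrix(c(100, 200, 300,    # sample s1
                    50,  50, 400),   # sample s2
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
efflen <- c(g1 = 1000, g2 = 2000, g3 = 3000)

# Hypothetical helper implementing the usual TPM definition
counts_to_tpm <- function(counts, efflen) {
  rate <- counts / efflen[rownames(counts)]   # length-normalized counts per gene
  t(t(rate) / colSums(rate)) * 1e6            # rescale each sample to sum to 1e6
}

tpm <- counts_to_tpm(counts, efflen)
colSums(tpm)   # each sample sums to one million (up to floating point error)
```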

genekey

The type of gene ID used in Seuratobj. It is "SYMBOL" in most cases, which is also the default, but it may sometimes be "ENTREZID", "ENSEMBL", or another type.

manualmarkerlist

When the reference matrix is made from scRNA-seq data, the genes specifically and highly expressed in each cell type are deemed markers and used to generate the reference. However, there is no guarantee that known classical markers will be selected. To make sure such markers are considered, pass a list to this parameter, with the cell type names as its element names and vectors of the gene IDs of the classical markers as its elements. Note that before the final reference is determined, all marker genes go through several filtering steps to improve the reference quality, such as removal of extremely highly expressed genes and of genes contributing to collinearity. The classical markers provided via this parameter are therefore always included as candidates for reference generation, but they may still be filtered out before the final reference is made. The default is NULL.
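
A marker list in the format described above might look like the following sketch. The cell type names are placeholders that would need to match those in Seuratobj, and the gene IDs would need to be of the type given by genekey.

```r
# Illustrative marker list: element names are cell type names,
# elements are character vectors of marker gene IDs (gene symbols here).
manualmarkerlist <- list(
  Tcell    = c("CD3D", "CD3E"),
  Bcell    = c("CD19", "MS4A1"),
  Monocyte = c("CD14", "LYZ")
)

# Every element should be a named character vector of gene IDs
stopifnot(!is.null(names(manualmarkerlist)))
stopifnot(all(vapply(manualmarkerlist, is.character, logical(1))))
```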

rnamat

The RNA data of the paired bulk RNA-bulk DNA methylation dataset. The cell contents of its samples are first deconvolved with the RNA reference provided via rnaref or generated from Seuratobj, and the downstream steps then complete the bulk methylation data deconvolution. It should be a matrix with each column representing a sample and each row representing a gene. Row names are gene names and column names are sample IDs. If the reference matrix was generated with the function scRef, with both the scRNA-seq data and this paired RNA dataset passed to it, the resulting reference matrix can be passed to rnaref and the adjusted paired RNA data returned by scRef can be passed to this parameter.

methylmat

The DNA methylation data of the paired bulk RNA-bulk DNA methylation dataset. It should be a matrix with each column representing a sample and each row representing a feature. Row names are feature names and column names are sample IDs. The sample IDs should be the same as those in rnamat, because the two matrices hold data for paired samples.
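
Before calling the function, the pairing of the two matrices can be checked with a few lines of base R. This is a sketch on toy data. Whether epDeconv itself requires the columns to be in the same order is not stated here, so aligning them is treated as a precaution.

```r
set.seed(1)

# Toy paired matrices whose sample columns are in a different order
rnamat <- matrix(runif(12), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))
methylmat <- matrix(runif(15), nrow = 5,
                    dimnames = list(paste0("cg", 1:5), c("s3", "s1", "s2")))

# Keep the shared samples and put both matrices in the same column order
shared <- intersect(colnames(rnamat), colnames(methylmat))
rnamat <- rnamat[, shared, drop = FALSE]
methylmat <- methylmat[, shared, drop = FALSE]

stopifnot(identical(colnames(rnamat), colnames(methylmat)))
```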

rnamatlogged

A logical value indicating whether the gene values in rnamat are log2 transformed or not.

resscale

Whether the cell content results of each sample should be scaled so that the values of the different cell types sum to 1. Default is FALSE.
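
The scaling described here amounts to dividing each sample's cell contents by their sum. The sketch below assumes a result matrix with samples in rows and cell types in columns. That orientation and the toy values are assumptions, not taken from the package.

```r
# Toy cell content estimates: rows are samples, columns are cell types (assumed layout)
cellconts <- matrix(c(0.30, 0.45, 0.15,
                      0.20, 0.20, 0.70),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("s1", "s2"),
                                    c("Tcell", "Bcell", "Monocyte")))

# Divide each row by its sum so that every sample's cell contents add up to 1
scaled <- cellconts / rowSums(cellconts)
rowSums(scaled)
```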

threads

Number of threads to use for the computation. Default is 1.

lassoerrortype

The base learners of the bagging model that deconvolves the DNA methylation data are LASSO models, and the lambda value (regularization coefficient) of each is selected by a grid search. This parameter determines whether the selected lambda is the one giving the minimum cross-validation error ("min") or one giving an error within 1 standard error of the minimum ("1se"). Default is "min".
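
The difference between the two options can be sketched on a made-up cross-validation summary. The helper pick_lambda and the numbers are hypothetical. For "1se" the sketch follows the common convention of taking the largest lambda within 1 standard error of the minimum, which may or may not match what the package does internally.

```r
# Hypothetical cross-validation summary over a lambda grid (toy numbers)
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cvm    <- c(0.90, 0.60, 0.42, 0.40, 0.41)   # mean CV error for each lambda
cvsd   <- c(0.05, 0.04, 0.03, 0.03, 0.03)   # standard error of each mean

# Hypothetical helper illustrating the two selection rules
pick_lambda <- function(lambda, cvm, cvsd, type = c("min", "1se")) {
  type <- match.arg(type)
  imin <- which.min(cvm)
  if (type == "min") return(lambda[imin])
  # "1se": largest lambda whose CV error is within one SE of the minimum
  max(lambda[cvm <= cvm[imin] + cvsd[imin]])
}

pick_lambda(lambda, cvm, cvsd, "min")   # 0.10
pick_lambda(lambda, cvm, cvsd, "1se")   # 0.25
```

A larger lambda means stronger regularization, so "1se" generally yields a sparser model than "min" at the cost of a slightly higher cross-validation error.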

targetmethyldat

The target cell mixture methylation data to be deconvolved. It should be a matrix with each column representing a sample and each row representing a feature. Row names are feature names and column names are sample IDs. It is recommended to adjust the batch difference between this dataset and methylmat with ComBat in advance, using methylmat as the reference batch, so that the cell deconvolution model trained on methylmat can be applied to these data with the influence of the batch difference minimized. The default is NULL. This parameter does not influence the training of the deconvolution model, and the model returned by this function can still be applied to other cell mixture data via the function methylpredict.

plot

Whether to generate box plots, heatmaps, and scatter plots of the deconvolution results for the paired RNA data, the paired methylation data, and the target methylation data. Default is FALSE.

pddat

If plot is TRUE, this parameter can be used to provide the sample group information of the paired bulk RNA-bulk DNA methylation data, so that the box plots also compare the groups for each cell type and heatmaps with this comparison are also generated. It should be a data frame recording the sample groups and must include 2 columns. One is named "sampleid" and records the sample IDs, which must be the same as the column names of rnamat and methylmat. The other is named "Samplegroup" and records the group to which each sample belongs. It can also be NULL, meaning all the samples are from the same group.
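
A data frame in the required format can be put together as in the sketch below. The sample IDs and group labels are placeholders. In practice the IDs would be taken from the column names of rnamat and methylmat.

```r
# Placeholder sample IDs; in practice use colnames(rnamat)
sampleids <- c("s1", "s2", "s3", "s4")

pddat <- data.frame(
  sampleid    = sampleids,
  Samplegroup = c("Control", "Control", "Disease", "Disease"),
  stringsAsFactors = FALSE
)

# The two required column names are case-sensitive
stopifnot(all(c("sampleid", "Samplegroup") %in% colnames(pddat)))
```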

targetmethylpddat

If plot is TRUE and targetmethyldat is also provided, this parameter can be used to provide the sample group information of the target DNA methylation data. Its format requirements and effect are the same as those of pddat for the paired dataset.

learnernum

The number of base learners in the bagging model that deconvolves the DNA methylation data. Default is 10.

Value

A list containing several slots. These record the deconvolution results for the paired RNA and paired DNA methylation data (slots "rnacellconts" and "methylcellconts"), the base learners of the cell deconvolution model (slots "modellist" and "modelcoeflist"), the weights of the base learners (slots "normweights" and "weights"), the gene subsets used by each RNA data deconvolution base learner (slot "rnageneidxlist"), and the correlation, expressed as R-squared, between the paired RNA and methylation sample cell contents deconvolved by each base learner (slot "rnamethylsqrs"). If target DNA methylation data is provided via the parameter targetmethyldat, a slot recording its cell contents as predicted by the model is also returned (slot "methyltargetcellcounts").


yuabrahamliu/scDeconv documentation built on March 28, 2024, 3:15 p.m.