CellMap: What the Package Does (One Line, Title Case)

View source: R/cellmapTraining.R

cellmapTraining

R Documentation

Create a cellmap profile for desired cell types from sc/sn RNAseq data.

Description

This function creates a cellmap profile that includes the specified cell types from a set of sc/sn RNAseq datasets. TPM from full-length sc/sn RNAseq data or counts from 3’ end sc/sn RNAseq data are recommended as input.

Usage

cellmapTraining(
  strData,
  strPrefix,
  cellTypeMap = NULL,
  cellCol = NULL,
  modelForm = "log2",
  DEGmethod = "edgeR",
  batchMethod = "Full",
  sampleN = 5,
  seqDepth = 2e+06,
  normDepth = 1e+06,
  geneCutoffCPM = 4,
  geneCutoffDetectionRatio = 0.8,
  selFeatureN = 100,
  DEGlogFCcut = 1,
  DEGqvalcut = 0.05,
  DEGbasemeancut = 16,
  mixN = 10,
  setN = 1000,
  topN = 50,
  strBulk = "",
  strRate = "",
  geneNameReady = F,
  ensemblPath = "Data/",
  ensemblV = 97,
  rmseCutoff = 0.1,
  tailR = 0.8,
  maxRMrate = 0.75,
  maxIteration = 30,
  core = 2
)

Arguments

`strData`	A vector of paths to the expression matrices (.rds) of all sc/sn RNAseq datasets. Each expression matrix has genes in rows (official gene symbols are required as the first column) and cells in columns, with cell type and dataset information encoded in the column names (cellType|dataset|…).
`strPrefix`	A string giving the prefix, including the path, of the result files. Three files are produced: two PDF files containing figures of the profile quality and of the performance on pseudo mixtures and on the input bulk data, if provided; and an RDS file containing the profile, which can be passed to the cellmap function.
`cellTypeMap`	A named vector giving the cell types that the cellmap profile should be trained for. The names of the vector are the cell type names defined in the column names of the expression matrices in the RDS files, while the values are the cell type names to be used in the final profile. For instance, Excitatory and Inhibitory cells may both be defined in the data while Neuron is one of the cell types of interest. In that case, the vector c(Excitatory="Neuron", Inhibitory="Neuron") tells cellmap that all Excitatory and Inhibitory cells are now called Neuron. If NULL, all cell types defined in the data matrices are used under their original names. Please note that Neuron and Neurons are considered different cell types. Default is NULL.
`cellCol`	A named vector giving the colors of the cell types. The names of the vector are the cell type names and the values are R colors; hex codes such as “#FFFFFF” are preferred. If NULL, colors are assigned to each cell type automatically. Default is NULL.
`modelForm`	‘linear’ or ‘log2’ for the profile model. ‘log2’ is preferred because a few genes with dominant expression values can otherwise bias the model toward those genes. Default is ‘log2’.
`DEGmethod`	One of ‘edgeR’, ‘DESeq2’, ‘voom’ or ‘Top’: the method for identifying the cell type signature genes. Default is ‘edgeR’.
`batchMethod`	One of ‘Full’, ‘Partial’, ‘Separate’ or ‘None’: the method for batch removal (see the publication for details). In short, ‘Full’ is preferred when the cell types of interest mostly overlap among datasets, while ‘Separate’ is for minimal overlap. Default is ‘Full’.
`sampleN`	A number giving how many pseudo pure samples are generated for each cell type from each dataset. Default is 5.
`seqDepth`	A number giving the total measurements (counts) for a pseudo pure sample. Default is 2e+06 (2M).
`normDepth`	A number giving the sequencing depth to which the pseudo pure samples are normalized, as in CPM. Default is 1e+06 (1M).
`geneCutoffCPM`	A number giving the minimum normalized expression for a gene to be considered. Default is 4.
`geneCutoffDetectionRatio`	A number giving the minimum ratio of datasets in which a gene must be expressed for a cell type. Default is 0.8.
`selFeatureN`	A number giving the maximum number of signature genes for each cell type in an iteration. Default is 100.
`DEGlogFCcut`	A number giving the minimum log fold change for a gene to be considered a signature gene. Default is 1.
`DEGqvalcut`	A number giving the maximum FDR for a gene to be considered a signature gene. Default is 0.05.
`DEGbasemeancut`	A number giving the minimum average expression for the ‘DESeq2’, ‘voom’ or ‘Top’ methods. Default is 16.
`mixN`	A number giving how many pseudo mixtures are generated for each dataset during the pseudo training process. Default is 10.
`setN`	A number giving how many random combinations of pseudo pure samples are drawn in each iteration of the pseudo training process. Default is 1000.
`topN`	A number giving how many of the top-performing (by RMSE) combinations of pseudo pure samples are kept for each pseudo mixture during the pseudo training. Default is 50.
`strBulk`	The path to a bulk expression matrix with known cell type compositions. The expression matrix is tab separated, with genes in rows and samples in columns. Default is ‘’.
`strRate`	The path to the cell type compositions of the above expression matrix. The composition matrix is tab separated, with cell types in rows and samples in columns. Default is ‘’.
`geneNameReady`	A boolean indicating whether the gene names in the above bulk expression matrix are already official symbols. `FALSE` also works when official symbols are used in the expression matrix. Default is `FALSE`, which enables looking up official symbols with the R package `biomaRt`.
`ensemblPath`	The path to a folder where the Ensembl gene definition is or will be saved. The definition file is saved on the first run if it does not already exist. Default is Data/ in the current working directory.
`ensemblV`	The Ensembl version to be used for the input query bulk expression. Default is 97.
`rmseCutoff`	A number giving the maximum RMSE for a bulk sample. All cell types from those bulk samples whose RMSE is larger than this value are included in the next training iteration. Default is 0.1.
`maxRMrate`	The maximum ratio of total pseudo pure combinations to be removed during the bulk training process. Default is 0.75.
`maxIteration`	The maximum number of iterations. If the RMSE of every bulk sample is below rmseCutoff, iteration stops early. Default is 30.
`core`	The number of computation nodes that can be used. Default is 2.
`tailR`	The maximum ratio of pseudo pure combinations to be removed for each bulk sample during the bulk training process. Default is 0.8.
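
The column-name convention described for `strData` and the renaming done through `cellTypeMap` can be illustrated with a small sketch. The cell names, dataset IDs and variable names below are hypothetical, and the code only mimics the described convention in base R; it is not the package's internal implementation, and how cellmapTraining treats cell types missing from the map is not shown here.

```r
# Hypothetical cell (column) names following the cellType|dataset|... convention
cellNames <- c("Excitatory|GSE1|c1", "Inhibitory|GSE1|c2",
               "Astrocytes|GSE2|c3", "Microglia|GSE2|c4")

# Split on the literal "|" (fixed = TRUE avoids regex interpretation)
parts <- strsplit(cellNames, "|", fixed = TRUE)
cellType <- vapply(parts, `[`, character(1), 1)
dataset <- vapply(parts, `[`, character(1), 2)

# Names are the labels found in the data; values are the labels wanted in the profile
cellTypeMap <- c(Excitatory = "Neuron", Inhibitory = "Neuron",
                 Astrocytes = "Astrocytes")

# Look up each cell's profile label; labels absent from the map come back as NA
profileType <- unname(cellTypeMap[cellType])
print(data.frame(cellType, dataset, profileType))
```

Note that the values of the named vector must be quoted strings; an unquoted value such as Neuron would be looked up as an R object.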

Examples

strData <- c(system.file("extdata","GSE103723.rds",package="cellmap"),
  system.file("extdata","GSE104276.rds",package="cellmap"))
# Names are cell types in the data; values (quoted strings) are names used in the profile
cellTypeMap <- c(Macrophage="Microglia",Astrocytes="Astrocytes",Oligodendrocytes="Oligodendrocytes",
  GABAergic="Neuron",Glutamatergic="Neuron",Excitatory="Neuron",Inhibitory="Neuron")
cellmapTraining(strData,strPrefix="~/cellmap_profile",cellTypeMap=cellTypeMap,core=16)
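
As a rough illustration of how a threshold like `rmseCutoff` can be applied, the sketch below computes a per-sample RMSE between known and estimated compositions and flags the samples above the cutoff. It assumes compositions are expressed as fractions and that RMSE is taken across cell types within each sample; the matrices and variable names are hypothetical, and the package's exact computation may differ.

```r
# Hypothetical known compositions: cell types in rows, samples in columns
known <- matrix(c(0.60, 0.30, 0.10,
                  0.20, 0.50, 0.30),
                nrow = 3,
                dimnames = list(c("Neuron", "Astrocytes", "Microglia"),
                                c("S1", "S2")))

# Hypothetical estimated compositions for the same samples
estimated <- matrix(c(0.55, 0.35, 0.10,
                      0.50, 0.30, 0.20),
                    nrow = 3, dimnames = dimnames(known))

# Root mean squared error across cell types, one value per sample
rmse <- sqrt(colMeans((estimated - known)^2))

# Samples whose RMSE exceeds the cutoff would need further training
rmseCutoff <- 0.1
needsMoreTraining <- names(rmse)[rmse > rmseCutoff]
print(round(rmse, 3))
print(needsMoreTraining)
```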

interactivereport/CellMap documentation built on March 17, 2024, 2:01 a.m.