dropSplit: Automatically identify cell-containing and empty droplets for...
In zh542370159/dropSplit: Automatically identify cell-containing and empty droplets for droplet-based scRNAseq data using dropSplit

dropSplit

R Documentation

Automatically identify cell-containing and empty droplets for droplet-based scRNAseq data using dropSplit.

Description

dropSplit is designed to identify true cells from droplet-based scRNAseq data. It consists of four main steps:

Pre-define droplets as 'Cell', 'Uncertain', 'Empty' and 'Discarded' according to the RankMSE curve.
Simulate 'Cell' and 'Uncertain' droplets at the sequencing depth of 'Empty' droplets, for use in model construction and prediction.
Iteratively build the model, classify 'Uncertain' droplets, and update the training set using the newly predicted 'Empty' droplets.
Make the final classification with the optimal model.

dropSplit provides droplet-specific QC metrics, such as CellEntropy and CellGini, that help with cell identification. In general, users can rely on the predefined XGBoost parameters and inspect the features that contribute most to cell identification. The package also provides an automatic XGBoost hyperparameter-tuning function to optimize the model.
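
To give a feel for what metrics named after the Gini coefficient and entropy usually measure, the sketch below computes a generic Gini coefficient and Shannon entropy for each droplet of a toy matrix. This is an illustration under assumptions: the helper names `gini` and `shannon_entropy` and the toy data are hypothetical, and the exact definitions of CellGini and CellEntropy in dropSplit may differ.

```r
# Generic Gini coefficient of one droplet's counts across features
# (0 = counts spread evenly, close to 1 = counts concentrated in few features).
# Hypothetical helper, not a dropSplit function.
gini <- function(x) {
  x <- sort(as.numeric(x))
  n <- length(x)
  total <- sum(x)
  if (total == 0) return(NA_real_)
  sum((2 * seq_len(n) - n - 1) * x) / (n * total)
}

# Shannon entropy (in bits) of the per-feature count proportions.
# Hypothetical helper, not a dropSplit function.
shannon_entropy <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log2(p))
}

# Toy matrix: rows are features, columns are droplets.
toy_counts <- cbind(
  even   = rep(5, 8),
  skewed = c(37, 1, 1, 1, 0, 0, 0, 0)
)

toy_gini    <- apply(toy_counts, 2, gini)
toy_entropy <- apply(toy_counts, 2, shannon_entropy)
print(toy_gini)
print(toy_entropy)
```

In this toy case the evenly covered droplet has the lower Gini value and the higher entropy, which is the kind of contrast such metrics are meant to expose.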

Usage

dropSplit(
  counts,
  do_plot = TRUE,
  Cell_score = 0.9,
  Empty_score = 0.2,
  downsample_times = NULL,
  CE_ratio = 2,
  fill_RankMSE = FALSE,
  smooth_num = 3,
  smooth_window = 100,
  Cell_rank = NULL,
  Uncertain_rank = NULL,
  Empty_rank = NULL,
  Cell_min_nCount = 500,
  Empty_min_nCount = 10,
  Empty_max_num = 50000,
  Gini_control = TRUE,
  Gini_threshold = NULL,
  max_iter = 6,
  preCell_mask = FALSE,
  preEmpty_mask = TRUE,
  FDR = 0.05,
  remove_outliers = FALSE,
  xgb_params = NULL,
  xgb_nrounds = 20,
  xgb_thread = 8,
  xgb_early_stopping_rounds = NULL,
  modelOpt = FALSE,
  verbose = 1,
  seed = 0,
  ...
)

Arguments

`counts`	A `matrix` or `dgCMatrix` object whose columns represent droplets and whose rows represent features.
`do_plot`	Whether to plot during cell calling. Default is `TRUE`.
`Cell_score`	A cutoff value of `dropSplitScore` to determine if a droplet is cell-containing. Range between 0.5 and 1. Default is 0.9.
`Empty_score`	A cutoff value of `dropSplitScore` to determine if a droplet is empty. Range between 0 and 0.5. Default is 0.2.
`downsample_times`	Number of repetitions of downsampling for 'Cell' and 'Uncertain' droplets. If `NULL`, will be determined by `CE_ratio`. Default is `NULL`.
`CE_ratio`	Ratio between down-sampled 'Cell' and 'Empty' droplets. The actual ratio will be slightly higher than the set value. Default is 2.
`fill_RankMSE`	Whether to fill the RankMSE by nCount. Default is `FALSE`.
`smooth_num`	Number of times to smooth the squared error (take the mean within a window of length `smooth_window`). Default is 3.
`smooth_window`	Window length used to smooth the squared error. Default is 100.
`Cell_rank, Uncertain_rank, Empty_rank`	Custom rank values used to label droplets as 'Cell', 'Uncertain' and 'Empty' in the training data. By default they are determined automatically; set them manually when the RankMSE plot suggests the automatic values are wrong.
`Cell_min_nCount`	Minimum nCount for 'Cell' droplets. Default is 500.
`Empty_min_nCount`	Minimum nCount for 'Empty' droplets. Default is 10.
`Empty_max_num`	Number of pre-defined 'Empty' droplets. Default is 50000.
`Gini_control`	Whether to control cell quality by CellGini. Default is `TRUE`.
`Gini_threshold`	A cutoff for the CellGini metric. The higher the cutoff, the more conservative the call and the fewer cells returned. Default is automatic.
`max_iter`	An integer specifying the number of iterations used to rebuild the model with newly defined droplets. Default is 6.
`preCell_mask`	logical; whether to mask pre-defined 'Cell' droplets during prediction. If `TRUE`, XGBoostScore for all droplets pre-defined as 'Cell' will be set to 1. Default is `FALSE`.
`preEmpty_mask`	logical; whether to mask pre-defined 'Empty' droplets during prediction. This behaves slightly differently from `preCell_mask`: if `TRUE`, XGBoostScore will not change, but the final classification will never be 'Cell'. Default is `TRUE`.
`FDR`	FDR cutoff for droplets that are predicted as 'Cell' or 'Empty' from the pre-defined 'Uncertain' droplets. Note that the statistical tests and FDR control are performed only on the difference between the averaged `XGBoostScore` and 0.5. Default is 0.05.
`remove_outliers`	Whether to remove outliers among 'Cell' droplets according to the `dropSplitScore`. Default is `FALSE`.
`xgb_params`	The `list` of XGBoost parameters.
`xgb_nrounds`	Max number of boosting iterations.
`xgb_thread`	Number of threads used in `xgb.cv`.
`xgb_early_stopping_rounds`	If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the `cb.early.stop` callback.
`modelOpt`	Whether to optimize the model using `xgbOptimization`. This can take a long time for large datasets. If `TRUE`, the parameter list in `xgb_params` will be overwritten. The following parameters are only used in `xgbOptimization`.
`verbose`	If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the cb.print.evaluation(period=1) callback function.
`seed`	Random seed used in simulation. Default is 0.
`...`	Other arguments passed to `xgbOptimization`.
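
The interplay of `smooth_num` and `smooth_window` described above can be pictured as a moving average applied several times in a row. The following is a minimal sketch under that assumption; the helper `smooth_repeated` and the toy signal are hypothetical, and dropSplit's internal smoothing (for example its edge handling) may differ.

```r
# Apply a centered moving average of width `smooth_window`, `smooth_num` times.
# Hypothetical helper for illustration only, not part of dropSplit.
smooth_repeated <- function(x, smooth_num = 3, smooth_window = 100) {
  k <- min(smooth_window, length(x))
  for (i in seq_len(smooth_num)) {
    sm <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
    # stats::filter() returns NA near the edges; keep the previous values there.
    edge <- is.na(sm)
    sm[edge] <- x[edge]
    x <- sm
  }
  x
}

# Toy signal standing in for a noisy squared-error curve.
set.seed(0)
noisy <- sin(seq(0, pi, length.out = 1000))^2 + rnorm(1000, sd = 0.1)
smoothed <- smooth_repeated(noisy, smooth_num = 3, smooth_window = 100)

# Point-to-point variation drops after smoothing.
print(c(raw = var(diff(noisy)), smoothed = var(diff(smoothed))))
```

Larger windows and more passes give a flatter curve, at the cost of blurring sharp transitions between droplet populations.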

Value

A list of six objects:

meta_info

A DataFrame of the evaluation metrics used by dropSplit for the classification task. Important columns in meta_info:

preDefinedClass
CellGini
CellGiniScore
XGBoostScore
pvalue
FDR
dropSplitClass
dropSplitScore

train

The dataset used to train the final XGBoost model. It consists of two types of pre-defined droplets: Cell (raw + simulated) and Empty.

train_label

Labels for the train dataset. 0 represents 'Empty', 1 represents 'Cell'.

to_predict

The dataset to be predicted. It consists of all three types of pre-defined droplets: Cell (raw + simulated), Uncertain (simulated) and Empty.

model

The XGBoost model used in dropSplit for classification.

importance_matrix

A data.frame of feature importances in the classification model.
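
As a rough illustration of how the pvalue, FDR and dropSplitClass columns might be used downstream, the sketch below builds a small stand-in for meta_info, adjusts the p-values of the pre-defined 'Uncertain' droplets, and extracts the droplets called 'Cell'. Everything here is an assumption made for illustration: the toy values, the column `FDR_BH`, the use of Benjamini-Hochberg adjustment, and the idea that row names hold droplet barcodes are not confirmed by the documentation.

```r
# Toy stand-in for result$meta_info (hypothetical values).
toy_meta <- data.frame(
  preDefinedClass = c("Cell", "Uncertain", "Uncertain", "Uncertain", "Empty"),
  pvalue          = c(NA, 0.001, 0.30, 0.002, NA),
  dropSplitClass  = c("Cell", "Cell", "Uncertain", "Empty", "Empty"),
  row.names       = paste0("droplet", 1:5),
  stringsAsFactors = FALSE
)

# Adjust p-values only for droplets that started out as 'Uncertain'.
# Benjamini-Hochberg is an assumption; dropSplit may use another procedure.
uncertain <- toy_meta$preDefinedClass == "Uncertain"
toy_meta$FDR_BH <- NA_real_
toy_meta$FDR_BH[uncertain] <- p.adjust(toy_meta$pvalue[uncertain], method = "BH")

# Droplets passing a 0.05 cutoff among the initially 'Uncertain' ones.
passing <- rownames(toy_meta)[uncertain & toy_meta$FDR_BH < 0.05]

# Identifiers of droplets finally called 'Cell'.
cell_barcodes <- rownames(toy_meta)[toy_meta$dropSplitClass == "Cell"]

print(toy_meta)
print(cell_barcodes)
```

With a real result, the same logical index on `result$meta_info$dropSplitClass` could be used to subset the columns of the counts matrix, provided the rows of meta_info follow the column order of `counts`.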

Examples

library(dropSplit)
# Simulate a counts matrix including 20000 empty droplets, 2000 large cells and 200 small cells.
simple_counts <- simSimpleCounts()
true <- strsplit(colnames(simple_counts), "-")
true <- as.data.frame(Reduce(function(x, y) rbind(x, y), true))
colnames(true) <- c("label", "Type", "Cluster", "Cell")
rownames(true) <- colnames(simple_counts)
true_label <- true$label
table(true_label)

## DropSplit ---------------------------------------------------------------
result <- dropSplit(simple_counts)
dropSplitClass <- result$meta_info$dropSplitClass
table(true_label, dropSplitClass)

# compare with the true labels
result$meta_info$true_label <- true_label
qc_true <- QCplot(result$meta_info, colorBy = "true_label")
qc_true$CellEntropy$Merge

# QC plot using all metrics
qc <- QCplot(result$meta_info)
qc$RankMSE$Merge
qc$CellEntropy$Merge
qc$CellEfficiency$Merge

# Feature importance plot
fp <- ImportancePlot(result$meta_info, result$train, result$importance_matrix, top_n = 20)
fp$Importance
fp$preDefinedClassExp
fp$dropSplitClassExp
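
The call below sketches how the XGBoost-related arguments from the Usage section might be combined. The parameter names in `custom_params` are standard xgboost parameters, but the values are arbitrary, and it is an assumption that dropSplit accepts a partial list like this and does not set `objective` itself. The block only runs the classification when the package is installed.

```r
# Illustrative XGBoost settings; values are arbitrary examples,
# not recommendations from the package authors.
custom_params <- list(
  objective = "binary:logistic",  # assumed to suit a Cell vs Empty classifier
  eta       = 0.1,
  max_depth = 6,
  subsample = 0.8
)

if (requireNamespace("dropSplit", quietly = TRUE)) {
  library(dropSplit)
  simple_counts <- simSimpleCounts()
  result_custom <- dropSplit(
    simple_counts,
    do_plot = FALSE,
    xgb_params = custom_params,
    xgb_nrounds = 50,
    xgb_early_stopping_rounds = 5,
    seed = 0
  )
  print(table(result_custom$meta_info$dropSplitClass))
}
```

Note that, per the `modelOpt` argument description, setting `modelOpt = TRUE` would overwrite `xgb_params`, so a custom list like this only takes effect with `modelOpt = FALSE`.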

zh542370159/dropSplit documentation built on June 19, 2022, 2:49 p.m.
