dropSplit: Automatically identify cell-containing and empty droplets for...

View source: R/dropSplit.R

dropSplit R Documentation

Automatically identify cell-containing and empty droplets for droplet-based scRNAseq data using dropSplit.

Description

dropSplit is designed to identify true cells from droplet-based scRNAseq data. It consists of four main steps:

  1. Pre-define droplets as 'Cell', 'Uncertain', 'Empty' and 'Discarded' droplets according to the RankMSE curve.

  2. Simulate 'Cell' and 'Uncertain' droplets at the sequencing depth of 'Empty' droplets; the simulated droplets are used for model construction and prediction.

  3. Iteratively build the model, classify 'Uncertain' droplets, and update the training set using the newly predicted 'Empty' droplets.

  4. Make classification with the optimal model.

dropSplit provides droplet-specific QC metrics, such as CellEntropy and CellGini, that help with identification. In general, users can rely on the predefined XGBoost parameters and retrieve the important features that drive cell identification. It also provides an automatic XGBoost hyperparameter-tuning function to optimize the model.
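
To give a feel for what entropy-style and Gini-style droplet metrics measure, here is a minimal sketch in base R. The helper names `shannon_entropy` and `gini_index` are hypothetical, and the formulas are the textbook Shannon entropy and Gini index; the exact definitions dropSplit uses for CellEntropy and CellGini may differ.

```r
# Hypothetical helpers, not part of dropSplit: textbook definitions applied
# to the count vector of a single droplet.
shannon_entropy <- function(x) {
  p <- x[x > 0] / sum(x)          # proportions of the expressed features
  -sum(p * log2(p))               # entropy in bits
}

gini_index <- function(x) {
  x <- sort(x)                    # ascending order
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

even_droplet   <- c(5, 5, 5, 5)   # counts spread evenly over features
skewed_droplet <- c(0, 0, 0, 10)  # all counts in a single feature

shannon_entropy(even_droplet)     # high entropy, counts are spread out
shannon_entropy(skewed_droplet)   # zero entropy, one feature dominates
gini_index(even_droplet)          # zero inequality
gini_index(skewed_droplet)        # maximal inequality for four features
```

Droplets whose counts concentrate in a few features score low on entropy and high on Gini, which is the kind of contrast such metrics are meant to expose.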

Usage

dropSplit(
  counts,
  do_plot = TRUE,
  Cell_score = 0.9,
  Empty_score = 0.2,
  downsample_times = NULL,
  CE_ratio = 2,
  fill_RankMSE = FALSE,
  smooth_num = 3,
  smooth_window = 100,
  Cell_rank = NULL,
  Uncertain_rank = NULL,
  Empty_rank = NULL,
  Cell_min_nCount = 500,
  Empty_min_nCount = 10,
  Empty_max_num = 50000,
  Gini_control = TRUE,
  Gini_threshold = NULL,
  max_iter = 6,
  preCell_mask = FALSE,
  preEmpty_mask = TRUE,
  FDR = 0.05,
  remove_outliers = FALSE,
  xgb_params = NULL,
  xgb_nrounds = 20,
  xgb_thread = 8,
  xgb_early_stopping_rounds = NULL,
  modelOpt = FALSE,
  verbose = 1,
  seed = 0,
  ...
)

Arguments

counts

A matrix or dgCMatrix object whose columns represent droplets and whose rows represent features.
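
As a rough illustration of the expected input layout, the sketch below builds a small features-by-droplets dgCMatrix with the Matrix package. The dimensions, names, and Poisson counts are made up for the example and do not resemble real scRNA-seq data.

```r
library(Matrix)

set.seed(1)
n_features <- 50
n_droplets <- 20

# Toy dense counts: rows are features, columns are droplets.
dense <- matrix(
  rpois(n_features * n_droplets, lambda = 0.3),
  nrow = n_features,
  dimnames = list(
    paste0("Gene", seq_len(n_features)),
    paste0("Droplet", seq_len(n_droplets))
  )
)

# Store only the non-zero entries; numeric x yields a dgCMatrix.
idx <- which(dense > 0, arr.ind = TRUE)
counts <- sparseMatrix(
  i = idx[, "row"],
  j = idx[, "col"],
  x = as.numeric(dense[idx]),
  dims = dim(dense),
  dimnames = dimnames(dense)
)

class(counts)
dim(counts)
```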

do_plot

Whether to plot during cell calling. Default is TRUE.

Cell_score

A cutoff value of dropSplitScore to determine if a droplet is cell-containing. Range between 0.5 and 1. Default is 0.9.

Empty_score

A cutoff value of dropSplitScore to determine if a droplet is empty. Range between 0 and 0.5. Default is 0.2.

downsample_times

Number of repetitions of downsampling for 'Cell' and 'Uncertain' droplets. If NULL, will be determined by CE_ratio. Default is NULL.

CE_ratio

Ratio of down-sampled 'Cell' droplets to 'Empty' droplets. The actual ratio will be slightly higher than the value set. Default is 2.
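
The idea of down-sampling a droplet to a lower depth can be sketched as follows. This is a generic illustration that draws molecules without replacement; `downsample_counts` is a hypothetical helper, and dropSplit's internal simulation may work differently.

```r
# Hypothetical helper, not part of dropSplit: down-sample one droplet's
# count vector to a target depth by drawing molecules without replacement.
downsample_counts <- function(x, depth) {
  if (sum(x) <= depth) {
    return(x)                                  # already at or below the target
  }
  molecules <- rep(seq_along(x), times = x)    # one entry per counted molecule
  kept <- molecules[sample.int(length(molecules), depth)]
  tabulate(kept, nbins = length(x))            # back to per-feature counts
}

set.seed(0)
cell_counts <- c(10, 0, 5, 85)                 # a droplet with 100 counts
simulated <- downsample_counts(cell_counts, depth = 20)
sum(simulated)                                 # equals the target depth, 20
```

Repeating the draw with different seeds would give several simulated droplets per original droplet, which is roughly what a repetition count such as downsample_times controls.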

fill_RankMSE

Whether to fill the RankMSE by nCount. Default is FALSE.

smooth_num

Number of times to smooth the squared error (each pass takes the mean within a window of length smooth_window). Default is 3.

smooth_window

Window length used to smooth the squared error. Default is 100.
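
A repeated moving-average smoother of the kind described by smooth_num and smooth_window might look like the sketch below. `smooth_mean` is a hypothetical helper built on `stats::filter`; how dropSplit treats the edges of the curve is an assumption here (the original values are kept where the window does not fit).

```r
# Hypothetical helper, not part of dropSplit: apply a centred moving
# average `times` times with a window of length `window`.
smooth_mean <- function(x, window = 100, times = 3) {
  kernel <- rep(1 / window, window)
  for (i in seq_len(times)) {
    sm <- as.numeric(stats::filter(x, kernel, sides = 2))
    # Assumption: keep the current values at the edges, where the
    # window does not fit and filter() returns NA.
    x <- ifelse(is.na(sm), x, sm)
  }
  x
}

set.seed(1)
noisy <- rnorm(1000)
smoothed <- smooth_mean(noisy, window = 50, times = 3)
c(raw = var(noisy), smoothed = var(smoothed))
```

Larger windows or more passes flatten the curve further, at the cost of blurring sharp transitions between droplet groups.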

Cell_rank, Uncertain_rank, Empty_rank

Custom rank values used to label droplets as 'Cell', 'Uncertain' and 'Empty' in the training data. Default is automatic. Useful when the automatic values look wrong on the RankMSE plot.

Cell_min_nCount

Minimum nCount for 'Cell' droplets. Default is 500.

Empty_min_nCount

Minimum nCount for 'Empty' droplets. Default is 10.

Empty_max_num

Maximum number of pre-defined 'Empty' droplets. Default is 50000.

Gini_control

Whether to control cell quality by CellGini. Default is TRUE.

Gini_threshold

A cutoff for the CellGini metric. The higher the cutoff, the more conservative the result and the fewer cells are called. Default is automatic.

max_iter

An integer specifying the number of iterations used to rebuild the model with newly defined droplets. Default is 6.

preCell_mask

logical; whether to mask pre-defined 'Cell' droplets during prediction. If TRUE, XGBoostScore for all droplets pre-defined as 'Cell' will be set to 1. Default is FALSE.

preEmpty_mask

logical; whether to mask pre-defined 'Empty' droplets during prediction. This differs slightly from preCell_mask: if TRUE, XGBoostScore is not changed, but the final classification will never be 'Cell'. Default is TRUE.

FDR

FDR cutoff for droplets predicted as 'Cell' or 'Empty' from the pre-defined 'Uncertain' droplets. Note that the statistical test and FDR control are performed only on the difference between the averaged XGBoostScore and 0.5. Default is 0.05.
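
The sketch below illustrates one way such a test could work: each droplet has several scores, their mean is tested against 0.5, and the p-values are adjusted before applying the cutoff. The one-sample t-test, the Benjamini-Hochberg adjustment, and the simulated score matrix are all assumptions for illustration; the text above does not say which test dropSplit uses.

```r
set.seed(0)

# Assumed layout: one row per droplet, one column per repeated prediction.
scores <- rbind(
  matrix(rnorm(5 * 6, mean = 0.9, sd = 0.03), nrow = 5),  # cell-like
  matrix(rnorm(5 * 6, mean = 0.1, sd = 0.03), nrow = 5),  # empty-like
  matrix(rnorm(5 * 6, mean = 0.5, sd = 0.05), nrow = 5)   # ambiguous
)

# Assumption: one-sample t-test of each droplet's scores against 0.5.
pvalue <- apply(scores, 1, function(s) t.test(s, mu = 0.5)$p.value)

# Assumption: Benjamini-Hochberg adjustment across droplets.
fdr <- p.adjust(pvalue, method = "BH")

avg <- rowMeans(scores)
label <- ifelse(fdr < 0.05 & avg > 0.5, "Cell",
         ifelse(fdr < 0.05 & avg < 0.5, "Empty", "Uncertain"))
table(label)
```

Droplets whose averaged score is not significantly different from 0.5 stay 'Uncertain' in this sketch, whichever side of 0.5 the average falls on.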

remove_outliers

Whether to remove outliers among 'Cell' droplets according to the dropSplitScore. Default is FALSE.

xgb_params

The list of XGBoost parameters. Default is NULL.
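
A parameter list might be written as below. The names are standard XGBoost training parameters, but the specific values are illustrative only, and whether dropSplit merges such a list with its own defaults or replaces them is not stated here.

```r
# Illustrative values only; the names are standard XGBoost parameters.
xgb_params <- list(
  objective = "binary:logistic",  # binary classification: 'Cell' vs 'Empty'
  eval_metric = "error",          # classification error rate
  eta = 0.3,                      # learning rate
  max_depth = 6,                  # maximum tree depth
  subsample = 0.8,                # fraction of rows sampled per tree
  colsample_bytree = 0.8          # fraction of features sampled per tree
)

# Assumed usage, given a counts matrix as described under `counts`:
# result <- dropSplit(counts, xgb_params = xgb_params)
```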

xgb_nrounds

Maximum number of boosting iterations. Default is 20.

xgb_thread

Number of threads used in xgb.cv. Default is 8.

xgb_early_stopping_rounds

If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.

modelOpt

Whether to optimize the model using xgbOptimization. This can take a long time for large datasets. If TRUE, the parameter list in xgb_params is overwritten. The following parameters are only used in xgbOptimization. Default is FALSE.

verbose

If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the cb.print.evaluation(period=1) callback function.

seed

Random seed used in simulation. Default is 0.

...

Other arguments passed to xgbOptimization.

Value

A list of six objects:

meta_info

A DataFrame of evaluation metrics used by dropSplit and in the classification task. Important columns in meta_info:

  • preDefinedClass

  • CellGini

  • CellGiniScore

  • XGBoostScore

  • pvalue

  • FDR

  • dropSplitClass

  • dropSplitScore

train

The dataset used to train the final XGBoost model. It consists of two types of pre-defined droplets: Cell (raw + simulated) and Empty.

train_label

Labels for the train. 0 represents 'Empty', 1 represents 'Cell'.

to_predict

The dataset to be predicted. It consists of all three types of pre-defined droplets: Cell (raw + simulated), Uncertain (simulated) and Empty.

model

The XGBoost model used in dropSplit for classification.

importance_matrix

A data.frame of feature importances in the classification model.

Examples

library(dropSplit)
# Simulate a counts matrix including 20000 empty droplets, 2000 large cells and 200 small cells.
simple_counts <- simSimpleCounts()
true <- strsplit(colnames(simple_counts), "-")
true <- as.data.frame(do.call(rbind, true))
colnames(true) <- c("label", "Type", "Cluster", "Cell")
rownames(true) <- colnames(simple_counts)
true_label <- true$label
table(true_label)

## DropSplit ---------------------------------------------------------------
result <- dropSplit(simple_counts)
dropSplitClass <- result$meta_info$dropSplitClass
# cross-tabulate the true labels against the dropSplit classification
table(true_label, dropSplitClass)

# compare with the true labels
result$meta_info$true_label <- true_label
qc_true <- QCplot(result$meta_info, colorBy = "true_label")
qc_true$CellEntropy$Merge

# QC plot using all metrics
qc <- QCplot(result$meta_info)
qc$RankMSE$Merge
qc$CellEntropy$Merge
qc$CellEfficiency$Merge

# Feature importance plot
fp <- ImportancePlot(result$meta_info, result$train, result$importance_matrix, top_n = 20)
fp$Importance
fp$preDefinedClassExp
fp$dropSplitClassExp

zh542370159/dropSplit documentation built on June 19, 2022, 2:49 p.m.