dropSplit | R Documentation |
dropSplit is designed to identify true cells from droplet-based scRNAseq data. It consists of four main steps:
Pre-define droplets as 'Cell', 'Uncertain', 'Empty' and 'Discarded' droplets according to the RankMSE curve.
Simulate 'Cell' and 'Uncertain' droplets at the depth of 'Empty' droplets, for use in model construction and prediction.
Iteratively build the model, classify 'Uncertain' droplets, and update the training set using newly predicted 'Empty' droplets.
Make classification with the optimal model.
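The simulation step above can be illustrated with a minimal base R sketch that thins one droplet's counts down to a lower depth. This is not dropSplit's internal code: the helper name `downsample_counts`, the toy counts, and the choice of sampling UMIs without replacement are all assumptions made for illustration.

```r
# Hypothetical helper (not part of dropSplit): thin one droplet's gene counts
# down to a target depth by sampling individual UMIs without replacement.
downsample_counts <- function(x, target_depth) {
  if (sum(x) <= target_depth) {
    return(as.integer(x)) # already at or below the target depth
  }
  # One entry per UMI, holding the index of the gene it belongs to.
  umi_gene <- rep(seq_along(x), times = x)
  kept <- umi_gene[sample.int(length(umi_gene), size = target_depth)]
  # Count the surviving UMIs per gene.
  tabulate(kept, nbins = length(x))
}

set.seed(0)
cell <- c(120, 40, 0, 25, 15) # toy counts for one 'Cell' droplet (5 genes)
sim <- downsample_counts(cell, target_depth = 20)
sim
sum(sim) # 20
```

A simulated droplet keeps roughly the expression profile of the original while matching the depth of an 'Empty' droplet, so depth alone can no longer separate the two classes.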
dropSplit provides droplet QC metrics such as CellEntropy and CellGini that can aid identification. In general, users can rely on the predefined XGBoost parameters and obtain the features that are most important for cell identification. It also provides an automatic XGBoost hyperparameter-tuning function to optimize the model.
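The page names CellEntropy and CellGini without giving their formulas. As a rough guide to what such metrics measure, the sketch below computes a textbook Gini coefficient and Shannon entropy over one droplet's gene counts. The function names `gini` and `shannon_entropy` are hypothetical, and the exact definitions used inside dropSplit may differ.

```r
# Assumption: standard Gini coefficient over a droplet's gene counts.
gini <- function(x) {
  if (sum(x) == 0) {
    return(NA_real_) # undefined for an all-zero droplet
  }
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Assumption: Shannon entropy (in bits) of the per-gene count proportions.
shannon_entropy <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log2(p))
}

even <- c(5, 5, 5, 5)     # counts spread evenly over 4 genes
skewed <- c(0, 0, 0, 20)  # all counts on a single gene

gini(even)               # 0
gini(skewed)             # 0.75
shannon_entropy(even)    # 2
shannon_entropy(skewed)  # 0
```

Both metrics summarize how evenly counts are spread across genes, which is the kind of signal that can help separate droplets with similar total counts.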
dropSplit(
  counts,
  do_plot = TRUE,
  Cell_score = 0.9,
  Empty_score = 0.2,
  downsample_times = NULL,
  CE_ratio = 2,
  fill_RankMSE = FALSE,
  smooth_num = 3,
  smooth_window = 100,
  Cell_rank = NULL,
  Uncertain_rank = NULL,
  Empty_rank = NULL,
  Cell_min_nCount = 500,
  Empty_min_nCount = 10,
  Empty_max_num = 50000,
  Gini_control = TRUE,
  Gini_threshold = NULL,
  max_iter = 6,
  preCell_mask = FALSE,
  preEmpty_mask = TRUE,
  FDR = 0.05,
  remove_outliers = FALSE,
  xgb_params = NULL,
  xgb_nrounds = 20,
  xgb_thread = 8,
  xgb_early_stopping_rounds = NULL,
  modelOpt = FALSE,
  verbose = 1,
  seed = 0,
  ...
)
counts |
A |
do_plot |
Whether to plot during cell calling. Default is |
Cell_score |
A cutoff value of |
Empty_score |
A cutoff value of |
downsample_times |
Number of repetitions of downsampling for 'Cell' and 'Uncertain' droplets. If |
CE_ratio |
Ratio between down-sampled 'Cell' and 'Empty' droplets. The actual ratio will be slightly higher than the value set. Default is 2. |
fill_RankMSE |
Whether to fill the RankMSE by nCount. Default is |
smooth_num |
Number of times to smooth (take a mean value within a window length |
smooth_window |
Window length used to smooth the squared error. Default is 100. |
Cell_rank, Uncertain_rank, Empty_rank |
Custom rank values used to label droplets as 'Cell', 'Uncertain' and 'Empty' in the training data. Default is automatic. Useful when the RankMSE plot suggests that the automatic values are wrong. |
Cell_min_nCount |
Minimum nCount for 'Cell' droplets. Default is 500. |
Empty_min_nCount |
Minimum nCount for 'Empty' droplets. Default is 10. |
Empty_max_num |
Number of pre-defined 'Empty' droplets. Default is 50000. |
Gini_control |
Whether to control cell quality by CellGini. Default is |
Gini_threshold |
A cutoff for the CellGini metric. A higher value is more conservative and yields fewer cells. Default is automatic. |
max_iter |
An integer specifying the number of iterations to use to rebuild the model with new defined droplets. Default is 6. |
preCell_mask |
logical; Whether to mask pre-defined 'Cell' droplets during prediction. If |
preEmpty_mask |
logical; Whether to mask pre-defined 'Empty' droplets during prediction. This differs slightly from parameter |
FDR |
FDR cutoff for droplets that are predicted as 'Cell' or 'Empty' from pre-defined 'Uncertain'. Note that statistical tests and FDR control are only performed on the difference between averaged |
remove_outliers |
Whether to remove outliers among 'Cell' droplets according to the |
xgb_params |
The |
xgb_nrounds |
Max number of boosting iterations. |
xgb_thread |
Number of threads used in |
xgb_early_stopping_rounds |
If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the |
modelOpt |
Whether to optimize the model using |
verbose |
If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the cb.print.evaluation(period=1) callback function. |
seed |
Random seed used in simulation. Default is 0. |
... |
Other arguments passed to |
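The smooth_num and smooth_window arguments describe repeated mean smoothing within a window. A minimal base R sketch of that idea follows. It is not dropSplit's own implementation: the helper name `smooth_mean`, the centred window, and the choice to leave edge values unsmoothed are assumptions.

```r
# Hypothetical helper: apply a centred moving mean `times` times.
# Positions where the window does not fit (the edges) keep their current value.
smooth_mean <- function(x, window, times) {
  kernel <- rep(1 / window, window)
  for (i in seq_len(times)) {
    sm <- as.numeric(stats::filter(x, kernel, sides = 2))
    ok <- !is.na(sm)
    x[ok] <- sm[ok]
  }
  x
}

set.seed(0)
noisy <- sin(seq(0, 2 * pi, length.out = 100)) + rnorm(100, sd = 0.5)
smoothed <- smooth_mean(noisy, window = 5, times = 3)

# The smoothed curve has the same length but much smaller point-to-point jumps.
length(smoothed)
var(diff(smoothed)) < var(diff(noisy))
```

Repeating a short moving mean several times gives a smoother curve than a single pass with the same window, which is one plausible reason to expose both a count and a window length.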
A list of six objects:
A DataFrame object of evaluation metrics used in dropSplit and in the classification task. Important columns in meta_info:
preDefinedClass
CellGini
CellGiniScore
XGBoostScore
pvalue
FDR
dropSplitClass
dropSplitScore
The dataset used to train the final XGBoost model. It consists of two pre-defined droplet classes: Cell (raw + simulated) and Empty.
Labels for the train dataset. 0 represents 'Empty', 1 represents 'Cell'.
The dataset to be predicted. It consists of all three pre-defined droplet classes: Cell (raw + simulated), Uncertain (simulated) and Empty.
The XGBoost model used in dropSplit for classification.
A data.frame of feature importances in the classification model.
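The meta_info columns include pvalue and FDR, and the FDR argument sets a cutoff on the latter. The page does not say which multiple-testing correction is used, so the sketch below assumes the Benjamini-Hochberg method from base R's `p.adjust`. The p-values and the final labelling rule are made up for illustration.

```r
# Toy p-values for six pre-defined 'Uncertain' droplets (made-up numbers).
pvalue <- c(0.001, 0.01, 0.02, 0.04, 0.30, 0.80)

# Assumption: Benjamini-Hochberg adjustment; dropSplit may use another method.
fdr <- p.adjust(pvalue, method = "BH")
fdr

# Hypothetical decision rule: reclassify only droplets that pass the FDR cutoff.
label <- ifelse(fdr < 0.05, "Cell", "Uncertain")
table(label)
```

The fourth droplet has a raw p-value below 0.05 but an adjusted value of 0.06, so it stays 'Uncertain'. This shows how an FDR cutoff is stricter than the same cutoff applied to raw p-values.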
library(dropSplit)
# Simulate a counts matrix including 20000 empty droplets, 2000 large cells and 200 small cells.
simple_counts <- simSimpleCounts()
true <- strsplit(colnames(simple_counts), "-")
true <- as.data.frame(Reduce(function(x, y) rbind(x, y), true))
colnames(true) <- c("label", "Type", "Cluster", "Cell")
rownames(true) <- colnames(simple_counts)
true_label <- true$label
table(true_label)

## DropSplit ---------------------------------------------------------------
result <- dropSplit(simple_counts)
qc <- QCplot(result$meta_info)
dropSplitClass <- result$meta_info$dropSplitClass
table(true_label, result$meta_info$dropSplitClass)

# compare with the true labels
result$meta_info$true_label <- true_label
qc_true <- QCplot(result$meta_info, colorBy = "true_label")
qc_true$CellEntropy$Merge

# QC plot using all metrics
qc <- QCplot(result$meta_info)
qc$RankMSE$Merge
qc$CellEntropy$Merge
qc$CellEfficiency$Merge

# Feature importance plot
fp <- ImportancePlot(result$meta_info, result$train, result$importance_matrix, top_n = 20)
fp$Importance
fp$preDefinedClassExp
fp$dropSplitClassExp