scrub_doublets: Standard pipeline for preprocessing, doublet simulation, and...
In iaaaka/Rscrublet: A tool for identifying doublets in single-cell RNA-seq data.


View source: R/rscrublet.R

Automatically sets a threshold for calling doublets, but it is best to check the result by running plot_doublet_histogram afterwards and adjusting the threshold manually if needed.

scrub_doublets(
  E_obs,
  total_counts = NULL,
  sim_doublet_ratio = 2,
  n_neighbors = NULL,
  expected_doublet_rate = 0.1,
  stdev_doublet_rate = 0.02,
  random_state = 0,
  synthetic_doublet_umi_subsampling = 1,
  use_approx_neighbors = TRUE,
  distance_metric = "euclidean",
  get_doublet_neighbor_parents = FALSE,
  min_counts = 3,
  min_cells = 3,
  min_gene_variability_pctl = 85,
  log_transform = FALSE,
  mean_center = TRUE,
  normalize_variance = TRUE,
  n_prin_comps = 30,
  doublets_parents = NULL,
  log_pseudocount = 1,
  verbose = TRUE
)

`E_obs`	Sparse dgTMatrix (other inputs are automatically coerced to dgTMatrix) of dimension n_cells*n_genes containing raw (unnormalized) UMI-based transcript counts.
`total_counts`	numerical vector of total UMI counts per cell. If NULL (default), this is calculated as the row sums of `E_obs`.
`sim_doublet_ratio`	Number of doublets to simulate relative to the number of observed transcriptomes.
`n_neighbors`	Number of neighbors used to construct the KNN graph of observed transcriptomes and simulated doublets. If NULL (default), this is set to `round(0.5 * sqrt(n_cells))`.
`expected_doublet_rate`	The estimated doublet rate for the experiment.
`stdev_doublet_rate`	Uncertainty in the expected doublet rate.
`random_state`	Random seed for doublet simulation, approximate nearest neighbor search, and PCA/TruncatedSVD.
`synthetic_doublet_umi_subsampling`	Rate for sampling UMIs when creating synthetic doublets. If 1.0, each doublet is created by simply adding the UMIs from two randomly sampled observed transcriptomes. For values less than 1, the UMI counts are added and then randomly sampled at the specified rate.
`use_approx_neighbors`	logical (default TRUE). Use approximate nearest neighbor method (annoy) for the KNN classifier. Current implementation of exact KNN is much slower.
`distance_metric`	Distance metric used when finding nearest neighbors. One of 'euclidean', 'manhattan', 'hamming', 'Angular' if `use_approx_neighbors` is TRUE, or any method accepted by `dist` otherwise.
`get_doublet_neighbor_parents`	logical. If TRUE, return the parent transcriptomes that generated the doublet neighbors of each observed transcriptome. This information can be used to infer the cell states that generated a given doublet state.
`min_counts`	Used for gene filtering prior to PCA. Genes expressed at fewer than `min_counts` counts in fewer than `min_cells` cells (see below) are excluded.
`min_cells`	Used for gene filtering prior to PCA. Genes expressed at fewer than `min_counts` counts (see above) in fewer than `min_cells` cells are excluded.
`min_gene_variability_pctl`	Used for gene filtering prior to PCA. Keep the most highly variable genes (in the top min_gene_variability_pctl percentile), as measured by the v-statistic (Klein et al., Cell 2015).
`log_transform`	logical (default FALSE). If TRUE, log-transform the counts matrix (`log10(log_pseudocount+TPM)`).
`mean_center`	logical. If TRUE (default), center the data such that each gene has a mean of 0.
`normalize_variance`	logical. If TRUE (default), normalize the data such that each gene has a variance of 1.
`n_prin_comps`	Number of principal components used to embed the transcriptomes prior to k-nearest-neighbor graph construction (default 30).
`doublets_parents`	Integer matrix with two columns that provide the indexes of the doublet parents. This option can be used to simulate doublets from specific parents, for instance for comparison with the Python version.
`log_pseudocount`	pseudocount for log transformation (default 1)
`verbose`	If TRUE, print progress updates.
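
The defaults and the doublet simulation described in the arguments above can be sketched in a few lines of base R. This is an illustrative sketch only, not the package's implementation: the toy matrix `E_toy`, the variables `parents`, `E_sim`, `E_sub`, and `rate`, the use of sampling with replacement for the parents, and the binomial subsampling step are all assumptions made for the example. It relies only on the Matrix package and base R.

```r
library(Matrix)  # recommended package shipped with standard R installations

set.seed(0)

# Toy count matrix: 6 cells x 4 genes (cells in rows, as in E_obs)
E_toy <- Matrix(matrix(rpois(24, lambda = 5), nrow = 6), sparse = TRUE)

# total_counts default: row sums of the count matrix
total_counts <- Matrix::rowSums(E_toy)

# n_neighbors default: round(0.5 * sqrt(n_cells))
n_cells <- nrow(E_toy)
n_neighbors <- round(0.5 * sqrt(n_cells))

# Two-column matrix of parent indexes, the layout described for doublets_parents
# (assumption: parents are drawn uniformly with replacement)
sim_doublet_ratio <- 2
n_sim <- n_cells * sim_doublet_ratio
parents <- cbind(sample(n_cells, n_sim, replace = TRUE),
                 sample(n_cells, n_sim, replace = TRUE))

# Synthetic doublets: add the UMI counts of the two parent transcriptomes
E_sim <- as.matrix(E_toy[parents[, 1], , drop = FALSE] +
                   E_toy[parents[, 2], , drop = FALSE])

# Subsampling at a rate below 1: keep each UMI independently with probability
# `rate` (assumption: binomial thinning of the summed counts)
rate <- 0.8
E_sub <- matrix(rbinom(length(E_sim), size = as.vector(E_sim), prob = rate),
                nrow = nrow(E_sim))
```

With `sim_doublet_ratio = 2`, the sketch produces twice as many simulated doublets as observed cells, and each subsampled count can never exceed the summed count it was drawn from.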

A list with the following items:

doublet_scores_obs - Doublet scores for observed transcriptomes.
doublet_scores_sim - Doublet scores for simulated doublets.
doublet_errors_obs - Standard error in the doublet scores for observed transcriptomes.
doublet_errors_sim - Standard error in the doublet scores for simulated doublets.
doublet_neighbor_parent - Parent transcriptomes that generated the doublet neighbors of each observed transcriptome. This information can be used to infer the cell states that generated a given doublet state.
expected_doublet_rate - Equal to the `expected_doublet_rate` parameter.
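
Adjusting the threshold manually amounts to comparing the observed scores against a cutoff chosen by eye from the score histograms. The sketch below illustrates this on a hypothetical stand-in for the returned list: the object `res`, its simulated score values, and the cutoff `0.4` are invented for the example, and only the item names `doublet_scores_obs` and `doublet_scores_sim` come from the list above. It does not call any Rscrublet function.

```r
set.seed(1)

# Hypothetical stand-in for the list returned by scrub_doublets
# (scores are simulated here, not computed from real data)
res <- list(
  doublet_scores_obs = c(rbeta(95, 1, 12), rbeta(5, 8, 3)),
  doublet_scores_sim = rbeta(200, 6, 4)
)

# Hypothetical manual threshold, chosen after inspecting the score histograms
threshold <- 0.4

# Observed transcriptomes scoring above the threshold are called doublets
predicted <- res$doublet_scores_obs > threshold

# Fraction of observed cells called doublets
detected_rate <- mean(predicted)

# Fraction of simulated doublets that the same threshold would catch
detectable_fraction <- mean(res$doublet_scores_sim > threshold)
```

Comparing `detected_rate` with the expected doublet rate, and checking that `detectable_fraction` is reasonably high, is one way to judge whether a chosen cutoff is sensible.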

# run rscrublet on the 8k PBMC example dataset
scrr <- scrub_doublets(E_obs = pbmc8k, expected_doublet_rate = 0.06, min_counts = 2, min_cells = 3, min_gene_variability_pctl = 85, n_prin_comps = 30)
# set the threshold automatically
scrr <- call_doublets(scrr)
# examine the score distribution
plot_doublet_histogram(scrr)
# list the cells predicted to be doublets (cells are the rows of pbmc8k)
rownames(pbmc8k)[scrr$predicted_doublets]