scrub_doublets: Standard pipeline for preprocessing, doublet simulation, and...

View source: R/rscrublet.R

Description

Automatically sets a threshold for calling doublets. It is best to verify this threshold by running plot_doublet_histogram afterwards and adjusting it manually if needed.

Usage

scrub_doublets(
  E_obs,
  total_counts = NULL,
  sim_doublet_ratio = 2,
  n_neighbors = NULL,
  expected_doublet_rate = 0.1,
  stdev_doublet_rate = 0.02,
  random_state = 0,
  synthetic_doublet_umi_subsampling = 1,
  use_approx_neighbors = TRUE,
  distance_metric = "euclidean",
  get_doublet_neighbor_parents = FALSE,
  min_counts = 3,
  min_cells = 3,
  min_gene_variability_pctl = 85,
  log_transform = FALSE,
  mean_center = TRUE,
  normalize_variance = TRUE,
  n_prin_comps = 30,
  doublets_parents = NULL,
  log_pseudocount = 1,
  verbose = TRUE
)

Arguments

E_obs

Sparse matrix of class dgTMatrix (other inputs are automatically coerced to dgTMatrix), n_cells * n_genes, containing raw (unnormalized) UMI-based transcript counts.

total_counts

Numeric vector of total UMI counts per cell. If NULL (default), this is calculated as the row sums of E_obs.
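
As a minimal sketch of the default described above, the per-cell totals can be reproduced as row sums of a sparse counts matrix. The toy matrix is hypothetical, and the coercion via as(..., "TsparseMatrix") is only one way to obtain a dgTMatrix; it is not taken from the package source.

```r
library(Matrix)

# Hypothetical toy data: 3 cells (rows) x 4 genes (columns) of raw UMI counts
dense <- matrix(c(1, 0, 2, 0,
                  0, 3, 0, 1,
                  4, 0, 0, 0), nrow = 3, byrow = TRUE)

# Coerce to triplet (dgTMatrix) form, the class scrub_doublets expects
E_obs <- as(Matrix(dense, sparse = TRUE), "TsparseMatrix")

# Default total_counts as described: row sums of E_obs
total_counts <- Matrix::rowSums(E_obs)
```

Passing a precomputed total_counts may be useful when E_obs has already been subset to fewer genes than were originally measured.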

sim_doublet_ratio

Number of doublets to simulate relative to the number of observed transcriptomes.

n_neighbors

Number of neighbors used to construct the KNN graph of observed transcriptomes and simulated doublets. If NULL (default), this is set to round(0.5 * sqrt(n_cells)).
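
For a sense of scale, the default formula above can be evaluated directly. The cell count here is a hypothetical value chosen for illustration.

```r
# Hypothetical number of observed cells
n_cells <- 8000

# Default stated in the argument description
n_neighbors <- round(0.5 * sqrt(n_cells))
n_neighbors
```

With 8000 cells this evaluates to 45 neighbors.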

expected_doublet_rate

The estimated doublet rate for the experiment.

stdev_doublet_rate

Uncertainty in the expected doublet rate.

random_state

Random seed for doublet simulation, approximate nearest neighbor search, and PCA/TruncatedSVD.

synthetic_doublet_umi_subsampling

Rate for sampling UMIs when creating synthetic doublets. If 1.0, each doublet is created by simply adding the UMIs from two randomly sampled observed transcriptomes. For values less than 1, the UMI counts are added and then randomly sampled at the specified rate.
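
The subsampling step can be sketched as follows. This is an illustration only: it assumes that "randomly sampled at the specified rate" means independent binomial thinning of each summed count, which is not confirmed by this page. The vectors cell_a and cell_b are hypothetical parent transcriptomes.

```r
set.seed(0)

# Hypothetical UMI counts for two parent cells over the same 4 genes
cell_a <- c(5, 0, 2, 1)
cell_b <- c(0, 3, 1, 0)

# With a rate of 1 the doublet is simply the sum of the parents
summed <- cell_a + cell_b

# With a rate below 1, thin each summed count (assumed: binomial sampling)
rate <- 0.5
doublet <- if (rate < 1) {
  rbinom(length(summed), size = summed, prob = rate)
} else {
  summed
}
```

Under this assumption every gene in the synthetic doublet has a count between 0 and the sum of its two parents.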

use_approx_neighbors

logical (default TRUE). Use approximate nearest neighbor method (annoy) for the KNN classifier. Current implementation of exact KNN is much slower.

distance_metric

Distance metric used when finding nearest neighbors. One of 'euclidean', 'manhattan', 'hamming', 'Angular' if use_approx_neighbors is TRUE; otherwise any method accepted by dist.

get_doublet_neighbor_parents

logical. If TRUE, return the parent transcriptomes that generated the doublet neighbors of each observed transcriptome. This information can be used to infer the cell states that generated a given doublet state.

min_counts

Used for gene filtering prior to PCA. Genes with at least min_counts counts in fewer than min_cells cells (see below) are excluded.

min_cells

Used for gene filtering prior to PCA. Genes with at least min_counts counts (see above) in fewer than min_cells cells are excluded.
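
The interaction of min_counts and min_cells can be sketched on a small dense matrix. This is an illustration of the rule as read from the descriptions above (keep a gene only if at least min_cells cells have at least min_counts counts for it), not the package's own code; the counts are hypothetical.

```r
# Hypothetical counts: 4 cells (rows) x 3 genes (columns)
counts <- matrix(c(3, 4, 5, 0,    # gene 1
                   1, 1, 10, 0,   # gene 2
                   3, 3, 2, 2),   # gene 3
                 nrow = 4)

min_counts <- 3
min_cells <- 3

# For each gene, count the cells that reach min_counts, then compare to min_cells
keep <- colSums(counts >= min_counts) >= min_cells
keep
```

Here only gene 1 passes: it reaches 3 counts in three cells, while genes 2 and 3 do so in one and two cells respectively.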

min_gene_variability_pctl

Used for gene filtering prior to PCA. Keep only the most highly variable genes, that is, genes whose variability is above the min_gene_variability_pctl-th percentile, as measured by the v-statistic (Klein et al., Cell 2015).

log_transform

logical (default FALSE). If TRUE, log-transform the counts matrix (log10(log_pseudocount + TPM)).
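
The transform named above is easy to check on a few values. The vector tpm is hypothetical and stands in for already-normalized expression values; how the package computes that normalization is not shown here.

```r
# Hypothetical normalized expression values for one cell
tpm <- c(0, 9, 99)

log_pseudocount <- 1

# log10(log_pseudocount + TPM), as given in the argument description
logged <- log10(log_pseudocount + tpm)
logged
```

The pseudocount keeps zero counts finite: a value of 0 maps to 0 rather than -Inf.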

mean_center

logical. If TRUE (default), center the data such that each gene has a mean of 0.

normalize_variance

logical. If TRUE (default), normalize the data such that each gene has a variance of 1.
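
What mean_center and normalize_variance describe corresponds to per-gene standardization, sketched below with base R's scale(). This is an assumption-laden illustration: scale() divides by the sample standard deviation (n - 1 denominator), and the package may use a different estimator or a sparse-aware implementation. The matrix x is hypothetical.

```r
# Hypothetical expression values: 4 cells (rows) x 3 genes (columns)
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 50,
              0, 0, 1, 5),
            nrow = 4)

# center = TRUE mirrors mean_center; scale = TRUE mirrors normalize_variance
z <- scale(x, center = TRUE, scale = TRUE)

colMeans(z)       # approximately 0 for every gene
apply(z, 2, sd)   # approximately 1 for every gene
```

Standardizing this way gives each gene equal weight in the subsequent PCA, regardless of its expression level.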

n_prin_comps

Number of principal components used to embed the transcriptomes prior to k-nearest-neighbor graph construction (default 30).

doublets_parents

Integer matrix with two columns giving the indexes of the two parent cells of each doublet. This option can be used to simulate doublets from specific parents, for instance for comparison with the Python version.
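
A matrix of the shape described above could be built as in the following sketch. It assumes 1-based row indexes into E_obs (the usual R convention) and sampling with replacement; neither is confirmed by this page, and n_cells and n_sim are hypothetical values.

```r
set.seed(0)

# Hypothetical sizes: 100 observed cells, 200 simulated doublets
n_cells <- 100
n_sim <- 200

# One row per doublet, one column per parent (assumed 1-based cell indexes)
doublets_parents <- cbind(
  sample.int(n_cells, n_sim, replace = TRUE),
  sample.int(n_cells, n_sim, replace = TRUE)
)

dim(doublets_parents)
```

Fixing the parents this way makes two runs simulate the same doublets, which is what a comparison across implementations would need.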

log_pseudocount

pseudocount for log transformation (default 1)

verbose

If TRUE, print progress updates.

Value

list with following items:

Examples

# run rscrublet on the 8k PBMC example dataset
scrr <- scrub_doublets(
  E_obs = pbmc8k,
  expected_doublet_rate = 0.06,
  min_counts = 2,
  min_cells = 3,
  min_gene_variability_pctl = 85,
  n_prin_comps = 30
)
# set the threshold automatically
scrr <- call_doublets(scrr)
# examine the score distribution
plot_doublet_histogram(scrr)
# find predicted doublets
rownames(pbmc8k)[scrr$predicted_doublets]

iaaaka/Rscrublet documentation built on Dec. 20, 2021, 5:57 p.m.