qc_diagnostic: Run a bunch of default quality checks on scRNAseq data from...

View source: R/qc_diagnostics.R

qc_diagnostic    R Documentation

Run a set of default quality checks on scRNA-seq data from 10x Genomics (Cell Ranger)

Description

Starting from Cell Ranger's output, namely the (i) feature, (ii) barcode and (iii) matrix files within the respective folder(s) filtered_feature_bc_matrix and, optionally, raw_feature_bc_matrix, a Seurat object is generated with default settings and quality metrics are computed. These metrics are then used to cluster the cells. Groups of cells with low quality scores (e.g. a high mt-fraction combined with a low number of detected features) will likely cluster together, which makes them easy to filter.

If raw_feature_bc_matrix is provided, SoupX may be run. In that case the whole pipeline is run twice: once with the original count matrix and once with the corrected count matrix obtained from SoupX with default settings. If raw_feature_bc_matrix is not available, only decontX can be used to check for ambient RNA contamination. Each of these estimates of ambient RNA (SoupX and decontX) becomes part of the clustering by qc metrics if the respective argument is set to TRUE. scDblFinder is run to detect doublets, which provides another quality metric.

In addition to qc metrics, a number of principal components (PCs) from feature expression may be added to clustering and dimension reduction, because the feature composition of low quality transcriptomes may also be skewed. This will likely make these cells cluster separately even more clearly. Apart from the clustering on meta data, an analysis based on feature expression values only is also run. It may allow cluster-wise application of additional qc filters after clearly low quality transcriptomes have been eliminated in a first round based on qc metric clustering.

If data_dirs contains multiple samples, the samples are integrated with harmony. (i) Detection of soup (ambient RNA) by SoupX and decontX, (ii) detection of doublets and (iii) calculation of residuals from the linear model of nCount_RNA_log vs nFeature_RNA_log are done sample-wise when multiple data_dirs are detected/provided. The results are nevertheless written into the common Seurat object, whose merged and harmonized PCA space is used to cluster the cells based on feature expression (phenotypes).
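
The sample-wise residual calculation mentioned above can be illustrated with a small base R sketch. It is not the package's implementation: the data are simulated, the natural log is assumed, and the direction of the model (nFeature_RNA_log regressed on nCount_RNA_log) is an assumption, since the description only names the two variables.

```r
# Simulated per-cell metadata for two hypothetical samples (not real data).
set.seed(1)
n <- 200
meta <- data.frame(
  sample = rep(c("sampleA", "sampleB"), each = n / 2),
  nCount_RNA = round(exp(rnorm(n, mean = 8, sd = 0.6)))
)
# Detected features scale roughly with counts on the log scale, plus noise.
meta$nFeature_RNA <- round(exp(0.8 * log(meta$nCount_RNA) + rnorm(n, sd = 0.15)))

# Assumption: natural log of the raw values.
meta$nCount_RNA_log <- log(meta$nCount_RNA)
meta$nFeature_RNA_log <- log(meta$nFeature_RNA)

# Fit the linear model separately per sample and keep each cell's residual.
# Assumption: nFeature_RNA_log is the response, nCount_RNA_log the predictor.
meta$residuals <- NA_real_
for (s in unique(meta$sample)) {
  idx <- meta$sample == s
  fit <- lm(nFeature_RNA_log ~ nCount_RNA_log, data = meta[idx, ])
  meta$residuals[idx] <- residuals(fit)
}

# Cells with strongly negative residuals have fewer detected features
# than expected for their total count.
head(meta[order(meta$residuals), c("sample", "nCount_RNA", "nFeature_RNA", "residuals")])
```

Fitting per sample keeps differences in sequencing depth between samples from leaking into the residuals.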

Usage

qc_diagnostic(
  data_dirs,
  nhvf = 2000,
  npcs = 10,
  resolution = 0.8,
  resolution_SoupX = 0.6,
  resolution_meta = 0.8,
  n_PCs_to_meta_clustering = 2,
  scDblFinder = T,
  SoupX = F,
  decontX = F,
  return_SoupX = T,
  cells = NULL,
  invert_cells = F,
  feature_rm = NULL,
  feature_aggr = NULL,
  ...
)
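
A call might look like the following sketch. The directory path is a placeholder, the argument values are arbitrary choices for illustration, and it is assumed that the function is exported from the scexpr package; the call is skipped when the package or the directory is not available.

```r
# Hypothetical parent directory containing a filtered_feature_bc_matrix folder.
data_dir <- "path/to/cellranger_outs"

if (requireNamespace("scexpr", quietly = TRUE) && dir.exists(data_dir)) {
  res <- scexpr::qc_diagnostic(
    data_dirs = data_dir,
    npcs = 12,                            # more PCs for a diverse data set
    SoupX = FALSE,                        # no raw_feature_bc_matrix assumed
    decontX = TRUE,                       # estimate ambient RNA with decontX instead
    n_PCs_to_meta_clustering = c(0, 2)    # compare qc-only vs. mixed clustering
  )
  # Inspect the top-level structure of the returned list.
  str(res, max.level = 1)
}
```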

Arguments

data_dirs

list or vector of parent directory(ies) which will be searched for folders called "filtered_feature_bc_matrix"; on the same level where each of these folders is found, a "raw_feature_bc_matrix" folder may exist to enable SoupX; if a single "raw_feature_bc_matrix" is missing, SoupX is disabled for all samples
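
The search described for data_dirs can be pictured with the base R sketch below. It only illustrates the stated rule on a throwaway directory tree; the variable names are hypothetical and the function's actual search code may differ.

```r
# Build a temporary directory tree that mimics two Cell Ranger outputs.
root <- file.path(tempdir(), "qc_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "sample1", "filtered_feature_bc_matrix"), recursive = TRUE)
dir.create(file.path(root, "sample1", "raw_feature_bc_matrix"), recursive = TRUE)
dir.create(file.path(root, "sample2", "filtered_feature_bc_matrix"), recursive = TRUE)

# Find every folder named filtered_feature_bc_matrix below the parent directory.
all_dirs <- list.dirs(root, recursive = TRUE, full.names = TRUE)
filtered <- sort(all_dirs[basename(all_dirs) == "filtered_feature_bc_matrix"])

# Look for a raw_feature_bc_matrix folder on the same level as each hit.
raw <- file.path(dirname(filtered), "raw_feature_bc_matrix")
raw_found <- dir.exists(raw)

# SoupX is only possible if every sample has its raw matrix folder.
soupx_possible <- all(raw_found)
soupx_possible
```

Here sample2 lacks its raw folder, so SoupX would be switched off for both samples.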

nhvf

number of highly variable features for each of the procedures

npcs

number of principal components to calculate, e.g. 12 for diverse data sets and 8 for isolated subsets

resolution

resolution (Louvain algorithm) for clustering based on feature expression

resolution_SoupX

resolution (Louvain algorithm) for SoupX analysis

resolution_meta

resolution(s) (Louvain algorithm) for clustering based on qc meta data and, optionally, additional PC dimensions (n_PCs_to_meta_clustering)

n_PCs_to_meta_clustering

how many principal components (PCs) from phenotypic clustering to add to the qc meta data; this generates a mixed clustering (PCs from phenotypes (RNA) plus qc meta data like pct mt and nCount_RNA); the more PCs are added, the greater the phenotypic influence becomes; one or more integers can be supplied to explore the effect; pass 0 to have no PCs included in meta clustering; e.g. when n_PCs_to_meta_clustering = 3, PCs 1-3 are used, and when n_PCs_to_meta_clustering = 1, only PC 1 is used.
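
How the number of PCs changes the input to meta clustering can be sketched in base R as follows. The data are simulated, build_meta_input is a hypothetical helper, and scaling both the qc metrics and the PCs is an assumption made here for illustration rather than documented behaviour.

```r
set.seed(42)
n_cells <- 50

# Simulated qc metrics for 50 cells (not real data).
qc <- data.frame(
  nCount_RNA_log = rnorm(n_cells, 8, 0.5),
  nFeature_RNA_log = rnorm(n_cells, 7, 0.4),
  pct_mt = runif(n_cells, 0, 20)
)

# Simulated embedding of the same cells in 10 principal components.
pcs <- matrix(
  rnorm(n_cells * 10), nrow = n_cells,
  dimnames = list(NULL, paste0("PC_", 1:10))
)

# Hypothetical helper: combine scaled qc metrics with the first n_pcs PCs.
build_meta_input <- function(qc, pcs, n_pcs) {
  out <- scale(as.matrix(qc))
  if (n_pcs > 0) {
    out <- cbind(out, scale(pcs[, seq_len(n_pcs), drop = FALSE]))
  }
  out
}

# One input matrix per requested value: 0 PCs (qc only), PC 1, PCs 1-3.
inputs <- lapply(c(0, 1, 3), function(k) build_meta_input(qc, pcs, k))
sapply(inputs, ncol)
```

Each additional PC adds one column next to the three qc columns, which is why the phenotypic influence on the clustering grows with the number of PCs.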

scDblFinder

logical, whether to run doublet detection algorithm from scDblFinder

SoupX

logical, whether to run SoupX. If TRUE, raw_feature_bc_matrix is needed.

decontX

logical, whether to run celda::decontX to estimate RNA soup (contaminating ambient RNA molecules)

return_SoupX

logical, whether to return a full Seurat object and diagnostics from SoupX (TRUE) or to run SoupX without these returns and only include the soup metric as an additional quality-control metric along with pct_mt, nCount_RNA etc. (FALSE). Will be set to FALSE if more than one data_dir with filtered_feature_bc_matrix is supplied, so TRUE is only possible when data sets are provided one by one.

cells

vector of cell names to include; note the trailing '-1' in cell names

invert_cells

invert cell selection; if TRUE, cell names provided in 'cells' are excluded
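
The interplay of cells and invert_cells can be sketched as below. The barcodes are made up for illustration and select_cells is a hypothetical helper, not a function of the package.

```r
# Made-up barcodes carrying the trailing '-1' that Cell Ranger appends.
barcodes <- c("AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1", "AAACCTGAGAAACCTA-1")

# Hypothetical helper mirroring the documented behaviour of the two arguments.
select_cells <- function(barcodes, cells = NULL, invert_cells = FALSE) {
  if (is.null(cells)) {
    return(barcodes)  # no selection requested: keep everything
  }
  keep <- barcodes %in% cells
  if (invert_cells) {
    keep <- !keep     # exclude the named cells instead of including them
  }
  barcodes[keep]
}

included <- select_cells(barcodes, cells = "AAACCTGAGAAACCGC-1")
excluded <- select_cells(barcodes, cells = "AAACCTGAGAAACCGC-1", invert_cells = TRUE)

# A name without the '-1' suffix matches nothing.
no_match <- select_cells(barcodes, cells = "AAACCTGAGAAACCGC")
```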

feature_rm

character vector of features to remove from count matrices; removal is done after aggregation (if feature_aggr is provided)

feature_aggr

named list of character vectors of features to aggregate; names of the list entries become the names of the aggregated features; aggregation of counts is simply done by addition; aggregation is done before feature removal (if feature_rm is provided)
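
The order of operations for feature_aggr and feature_rm (aggregate by addition first, then remove) can be illustrated on a toy count matrix. The gene names are invented, and dropping the original member features after aggregation is an assumption of this sketch.

```r
# Toy count matrix: 4 features x 3 cells (invented names and values).
counts <- matrix(
  c(1, 0, 2,
    3, 1, 0,
    0, 4, 1,
    5, 5, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("GeneA1", "GeneA2", "GeneB", "GeneC"),
                  c("cell1", "cell2", "cell3"))
)

feature_aggr <- list(GeneA = c("GeneA1", "GeneA2"))
feature_rm <- c("GeneC")

# Step 1: aggregate. Sum the member rows and append them under the list name.
for (new_name in names(feature_aggr)) {
  members <- intersect(feature_aggr[[new_name]], rownames(counts))
  summed <- colSums(counts[members, , drop = FALSE])
  # Assumption: the original member features are dropped once aggregated.
  counts <- counts[!rownames(counts) %in% members, , drop = FALSE]
  counts <- rbind(
    counts,
    matrix(summed, nrow = 1, dimnames = list(new_name, colnames(counts)))
  )
}

# Step 2: remove unwanted features, after aggregation.
counts <- counts[!rownames(counts) %in% feature_rm, , drop = FALSE]
counts
```

Because removal happens last, an aggregated feature can itself be listed in feature_rm.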

...

additional arguments to SoupX::autoEstCont

Value

a list containing a Seurat object and a data frame with marker genes for the clusters based on feature expression


Close-your-eyes/scexpr documentation built on April 21, 2023, 10:27 a.m.