{.tabset}

Overview

An initial stage to read in scRNA-seq data, remove empty droplets (optionally), and perform quality control, and cell and gene filtering.

More information about scRNA-seq quality control can be found in OSCA.

r emoji::emoji("gear") Config file: config/single_sample/01_input_qc.yaml

r emoji::emoji("clipboard") HTML report target (in config/pipeline.yaml): DRAKE_TARGETS: ["report_input_qc"]

r emoji::emoji("scroll") Example report (used config)

r emoji::emoji("ladder") Structure

Config parameters

Config for this stage is stored in config/single_sample/01_input_qc.yaml file. Directory with this file is read from the SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR environment variable upon {scdrake} load or attach, and saved as scdrake_single_sample_config_dir option. This option is used as the default argument value in several {scdrake} functions.


Input files

INPUT_DATA:
  type: "cellranger"
  path: "/path/to/dir"
  delimiter: ","
  target_name: "target_name"

Type: list of named character scalars

This parameter is specifying input data type and path. There are four possible types of input data:


INPUT_QC_REPORT_RMD_FILE: "Rmd/single_sample/01_input_qc.Rmd"

Type: character scalar

A path to RMarkdown file used for HTML report of this pipeline stage. For spatial extension, the default RMarkdown file is 01_input_qc_spatial.Rmd


Subsetting of imported data

INPUT_DATA_SUBSET: null
## Example of simple subsetting by cluster numbers
INPUT_DATA_SUBSET:
  subset_by: "cluster_kmeans_k6"
  values: ["3", "4"]
  negate: false

Type: null (default) or named list

The imported data can be subsetted by a simple selection of values present in a column of cell metadata (colData()). In the example above, cells assigned to clusters "3" and "4" of the cluster_kmeans_k6 columns will be kept. You can also negate the selection by specifying negate: true.


Spatial extension

SPATIAL: False

Type: logical scalar

If True, pipeline enables spatial extension.

In the input_qc stage, the spatial extension consists only in enabling pseudo tissue visualization, and in adding coordinates (array_col and array_row from tissue_positions.csv) to the Cell Metadata.


SPATIAL_LOCKS: Null

Type: Null or character scalar

A path to tissue_position.csv from SpaceRanger spatial output folder used for adding coordinates (array_col and array_row) to Cell Metadata. Only void when SPATIAL is enabled.


Removal of empty droplets

See ?DropletUtils::emptyDrops for more details.


EMPTY_DROPLETS_ENABLED: True

Type: logical scalar

If False, skip calculation and removal of empty droplets.

You might consider turning off this procedure if the input is not a raw feature-barcode matrix from cellranger, but an already processed dataset (table or SingleCellExperiment object). Otherwise DropletUtils::emptyDrops() will fail as there are not enough empty droplets having the total UMI count < EMPTY_DROPLETS_LOWER (100 by default).


EMPTY_DROPLETS_LOWER: 100

Type: positive integer scalar

A lower bound on the total UMI count at or below which all barcodes are assumed to correspond to empty droplets.


EMPTY_DROPLETS_FDR_THRESHOLD: 0.01

Type: numeric scalar <0; 1>

A threshold for FDR adjusted p-values for null hypothesis that barcode comes from ambient environment. In other words, probability that a droplet was empty and contained ambient RNA.

Cell filtering

Two types of cell filtering are performed: dataset-sensitive (using MAD threshold), and custom (using custom thresholds).

In both filtering types, violation of only one metric threshold leads to removal of a cell.

The choice of cell filtering thresholds depends very much on your dataset. What works for an ordinary high-quality dataset might not work well for a low-quality or special dataset with rare cell types. The quality control chapter in OSCA gives an excellent background to the QC of scRNA-seq data.


ENABLE_CELL_FILTERING: True

Type: logical scalar

If False, both types of cell filtering will be disabled.

The filtering thresholds in both filtering types will be overridden such that every cell will pass the filtering (e.g. MAD_THRESHOLD: .inf or MIN_UMI_CF: -.inf). In the end, SAVE_DATASET_SENSITIVE_FILTERING will have no effect as both filtered SCE targets will be identical. All QC metrics and plots will still be calculated and created.


SAVE_DATASET_SENSITIVE_FILTERING: True

Type: logical scalar

If True, proceed to other stages with dataset filtered by dataset-sensitive filtering (target sce_qc_filter_genes), otherwise with dataset filtered by custom filtering (target sce_custom_filter_genes). Note that the selected SCE target will be referred to as sce_final_input_qc in subsequent stages.

Dataset-sensitive cell filtering
MAD_THRESHOLD: 3

Type: positive numeric scalar

A threshold for maximum MAD (median absolute deviation) for QC cell metrics:

MAD threshold of 3 will retain 99% of non-outlier values that follow a normal distribution.

"Lower tail" means cells having a metric value less than -MAD_THRESHOLD MAD will be discarded. For "upper tail", it is +MAD_THRESHOLD MAD.

Violation of only one metric threshold leads to removal of a cell.

To disable the dataset-sensitive filtering, set MAD_THRESHOLD: .inf. That will force passing of each cell as every QC metric will be always lower than positive infinity MAD.


DATASET_SENSITIVE_FILTERS_OPERATOR: "&"

Type: a character scalar ("&" | "|")

How to join the QC filters:

Custom cell filtering
MIN_UMI_CF: 1000

Type: positive integer scalar

A threshold for minimum number of UMI per cell, i.e. cells with UMI less than MIN_UMI_CF will be removed.

To disable this filter, set MIN_UMI_CF: -.inf


MAX_UMI_CF: 50000

Type: positive integer scalar

A threshold for maximum number of UMI per cell, i.e. cells with UMI greater than MAX_UMI_CF will be removed.

To disable this filter, set MAX_UMI_CF: .inf


MIN_FEATURES: 1000

Type: positive integer scalar

A threshold for minimum number of features (genes) detected per cell, i.e. cells with detected features less than MIN_FEATURES will be removed.

To disable this filter, set MIN_FEATURES: -.inf


MAX_MITO_RATIO: 0.2

Type: numeric scalar <0; 1>

A threshold for maximum ratio of expressed mitochondrial genes per cell, i.e. cells with mitochondrial genes detected in more than (MAX_MITO_RATIO * 100)% of all genes will be removed.

To disable this filter, set MAX_MITO_RATIO: .inf


CUSTOM_FILTERS_OPERATOR: "&"

Type: a character scalar ("&" | "|")

How to join the QC filters:

Gene filtering

Gene filtering thresholds are applied in both types of cell filtering (after it is performed). The filter is computed by the get_gene_filter() function as following:

num_cells <- min_ratio_cells * ncol(sce)
is_expressed <- rowSums(counts(sce) >= min_umi) >= num_cells

A gene is considered expressed when number of its UMIs across all cells is greater than min_umi and at the same time it is expressed in at least min_ratio_cells ratio of cells.


ENABLE_GENE_FILTERING: True

Type: logical scalar

If False, gene filtering will be disabled.

The filtering thresholds will be overridden such that every gene will pass the filtering (MIN_UMI: 0 and MIN_RATIO_CELLS: 0).


MITO_REGEX: "^MT-"
RIBO_REGEX: "^RP[SL]"

Type: character scalar

Regexes used to match and count occurences of mitochondrial and ribosomal features. The latter is currently not used for gene filtering.


MIN_UMI: 1

Type: zero or positive integer scalar

A threshold for minimum number of UMI per cell, i.e. genes with UMI < MIN_UMI will be removed.


MIN_RATIO_CELLS: 0.01

Type: numeric scalar <0; 1>

A minimum ratio of cells expressing a gene, i.e. genes expressed in less than (MIN_RATIO_CELLS * 100)% of cells will be removed.


Output files

INPUT_QC_BASE_OUT_DIR: "01_input_qc"

Type: character scalar

A path to base output directory for this stage. It will be created under BASE_OUT_DIR specified in 00_main.yaml config.


INPUT_QC_REPORT_HTML_FILE: "01_input_qc.html"

Type: character scalar

A name of HTML report file for this stage. Created in INPUT_QC_BASE_OUT_DIR. Subdirectories are not allowed.

HTML output parameters

INPUT_QC_KNITR_MESSAGE: False
INPUT_QC_KNITR_WARNING: False
INPUT_QC_KNITR_ECHO: False

Type: logical scalar

These are passed to knitr::opts_chunk() and used for rendering of stage's HTML report.

Outputs

Here you can find description of the most important targets for this stage. However, for a full overview, you have to inspect the source code of the get_input_qc_subplan() function.

Targets can be loaded from the {drake} cache (.drake directory by default) using the drake::loadd() or drake::readd() functions.

HTML report target name: report_input_qc

SingleCellExperiment objects

sce_raw: an untouched SCE object as loaded from:


sce_valid_cells: a SCE object with empty droplets removed


sce_unfiltered: sce_valid_cells with added columns:


sce_qc_filter, sce_qc_filter_genes: a SCE object with cells filtered by dataset-sensitive filtering. The latter with filtered genes.

sce_custom_filter, sce_custom_filter_genes: a SCE object with cells filtered by custom filtering. The latter with filtered genes.


sce_selected: a SCE object selected from either sce_qc_filter_genes or sce_custom_filter_genes - depends on the SAVE_DATASET_SENSITIVE_FILTERING parameter

sce_final_input_qc: sce_selected with added gene annotation (gene_annotation target) to rowData(). This is the final SCE object which will be used in the next stage norm_clustering.

Cell filters

cell_qc: a DataFrame with cell QC metrics


The following list of filters are computed by scater::isOutlier() from cell_qc columns, which are shown for individual filters (e.g. qc_lib: total column). In those, FALSE refers to cells not passing a filter.

qc_filters: a list of named lists with logical values for dataset-sensitive cell filtering:

custom_filters: a list of named lists with logical values for custom cell filtering:

qc_filter, custom_filter: a summarized filter (by "or" operation) of qc_filters and custom_filters, respectively

Gene filters

sce_qc_gene_filter, sce_custom_gene_filter: gene filters (logical vectors) obtained from get_gene_filter() for SCE objects with cells filtered by dataset-sensitive and custom filter, respectively. In those, FALSE refers to genes not passing the filter.

Plots

sce_unfiltered_plotlist: a list of various violin plots for cell QC metrics ({ggplot2} objects), generated using sce_unfiltered target. Cells are colored by discard_qc column - blue points pass the dataset-sensitive filter, while the orange ones do not.

For example, the following code is used to plot total UMI counts:

scdrake::plot_colData(
  sce_unfiltered,
  y = "total",
  colour_by = "discard_qc",
  title = "Total count",
  scale_y = ggplot2::scale_y_log10()
)

sce_qc_filter_genes_plotlist, sce_custom_filter_genes_plotlist: similar to sce_unfiltered_plotlist, but in each of them, cells are colored by the other filter. That is, cells in sce_qc_filter_genes_plotlist are colored by discard_custom column, and thus, showing which cells would be discarded in custom-filtered dataset. Vice versa for sce_custom_filter_genes_plotlist.

Other targets

config_input_qc: a list holding parameters for this stage


barcode_ranks: a DataFrame with barcode ranks for knee plot, computed by DropletUtils::barcodeRanks() on UMI counts from sce_raw target

empty_droplets: a DataFrame of statistics for empty droplets, computed DropletUtils::emptyDrops() on UMI counts from sce_raw target


sce_history: a tibble with cell and gene "history", showing their numbers in unfiltered and filtered data.

sce_history_plot: a {ggplot2} object summarizing information from sce_history target.


gene_annotation: a dataframe with gene annotation, processed such that:

gene_annotation can be also retrieved from rowData(sce_final_input_qc)



bioinfocz/scdrake documentation built on Sept. 19, 2024, 4:43 p.m.