An initial stage to read in scRNA-seq data, remove empty droplets (optionally), and perform quality control, and cell and gene filtering.
More information about scRNA-seq quality control can be found in OSCA.
r emoji::emoji("gear")
Config file: config/single_sample/01_input_qc.yaml
r emoji::emoji("clipboard")
HTML report target (in config/pipeline.yaml
): DRAKE_TARGETS: ["report_input_qc"]
r emoji::emoji("scroll")
Example report
(used config)
r emoji::emoji("ladder")
Structure
cellranger
outputSingleCellExperiment
object as a Rds file or as a target from an existing {drake}
cache (i.e. from a different
{scdrake}
project)SingleCellExperiment
object to cells of interest based on their metadata values,
e.g. a set of clusters from a particular clustering method (optional)DropletUtils::emptyDrops()
) (optional)scater::perCellQCMetrics()
,
theory):Config for this stage is stored in config/single_sample/01_input_qc.yaml
file.
Directory with this file is read from the SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR
environment variable upon {scdrake}
load or
attach, and saved as scdrake_single_sample_config_dir
option. This option is used as the default argument value in
several {scdrake}
functions.
INPUT_DATA: type: "cellranger" path: "/path/to/dir" delimiter: "," target_name: "target_name"
Type: list of named character scalars
This parameter is specifying input data type and path. There are four possible types of input data:
type: "cellranger"
: an output from
cellranger,
that is, files of a raw feature-barcode matrix - barcodes.tsv
, features.tsv
, and matrix.mtx
(can be also gzipped having a .gz
extension). Internally, BC matrix is imported via DropletUtils::read10xCounts()
download_pbmc1k()
and download_pbmc1k()
functions.type: "table"
: a delimited file (table) representing a feature-barcode matrix, with an additional column named
ENSEMBL
containing Ensembl IDs of features (rows).type: "sce"
: a saved SingleCellExperiment
in Rds format (i.e. saved by the saveRDS()
function).
The object must contain feature-barcode matrix with raw UMI counts in the counts
slot of assays
(such that this BC matrix can be retrieved by counts(sce)
). colData()
and use it later (e.g. cell grouping/clusters
to compute markers or to include batch effect during that).type: "sce_drake_cache"
: a saved SingleCellExperiment
in a drake
cache, e.g. from an another scdrake
project
path: "path/to/file/or/dir"
: a path to input file or directory (can be relative to project root directory).
type: "cellranger"
, this is a path to directory containing the barcodes.tsv
, features.tsv
, and matrix.mtx
files (can be also gzipped having a .gz
extension)type: "sce_drake_cache"
, a path to drake
cache directory (usually named as ".drake"
)For type: "sce"
or "table"
, a path to Rds or text file
delimiter: ","
: used when type: "table"
. Specifies the field delimiter in the table.
target_name: "target_name"
: used when type: "sce_drake_cache"
. Specifies a name of target in drake
cache to be
imported, e.g. "sce_final_input_qc"
INPUT_QC_REPORT_RMD_FILE: "Rmd/single_sample/01_input_qc.Rmd"
Type: character scalar
A path to RMarkdown file used for HTML report of this pipeline stage. For spatial extension, the default RMarkdown file is 01_input_qc_spatial.Rmd
INPUT_DATA_SUBSET: null ## Example of simple subsetting by cluster numbers INPUT_DATA_SUBSET: subset_by: "cluster_kmeans_k6" values: ["3", "4"] negate: false
Type: null
(default) or named list
The imported data can be subsetted by a simple selection of values present in a column of cell metadata (colData()
).
In the example above, cells assigned to clusters "3"
and "4"
of the cluster_kmeans_k6
columns will be kept.
You can also negate the selection by specifying negate: true
.
SPATIAL: False
Type: logical scalar
If True
, pipeline enables spatial extension.
In the input_qc stage, the spatial extension consists only in enabling pseudo tissue visualization, and in adding coordinates (array_col and array_row from tissue_positions.csv) to the Cell Metadata.
SPATIAL_LOCKS: Null
Type: Null or character scalar
A path to tissue_position.csv from SpaceRanger spatial output folder used for adding coordinates (array_col and array_row) to Cell Metadata. Only void when SPATIAL is enabled.
See ?DropletUtils::emptyDrops
for more details.
EMPTY_DROPLETS_ENABLED: True
Type: logical scalar
If False
, skip calculation and removal of empty droplets.
You might consider turning off this procedure if the input is not a raw feature-barcode matrix from cellranger
,
but an already processed dataset (table or SingleCellExperiment
object).
Otherwise DropletUtils::emptyDrops()
will fail as there are not enough empty droplets having the total UMI count
< EMPTY_DROPLETS_LOWER
(100 by default).
EMPTY_DROPLETS_LOWER: 100
Type: positive integer scalar
A lower bound on the total UMI count at or below which all barcodes are assumed to correspond to empty droplets.
EMPTY_DROPLETS_FDR_THRESHOLD: 0.01
Type: numeric scalar <0; 1>
A threshold for FDR adjusted p-values for null hypothesis that barcode comes from ambient environment. In other words, probability that a droplet was empty and contained ambient RNA.
Two types of cell filtering are performed: dataset-sensitive (using MAD threshold), and custom (using custom thresholds).
In both filtering types, violation of only one metric threshold leads to removal of a cell.
The choice of cell filtering thresholds depends very much on your dataset. What works for an ordinary high-quality dataset might not work well for a low-quality or special dataset with rare cell types. The quality control chapter in OSCA gives an excellent background to the QC of scRNA-seq data.
ENABLE_CELL_FILTERING: True
Type: logical scalar
If False
, both types of cell filtering will be disabled.
The filtering thresholds in both filtering types will be overridden such that every cell will pass the filtering
(e.g. MAD_THRESHOLD: .inf
or MIN_UMI_CF: -.inf
). In the end, SAVE_DATASET_SENSITIVE_FILTERING
will have no
effect as both filtered SCE targets will be identical. All QC metrics and plots will still be calculated and created.
SAVE_DATASET_SENSITIVE_FILTERING: True
Type: logical scalar
If True
, proceed to other stages with dataset filtered by dataset-sensitive filtering (target sce_qc_filter_genes
),
otherwise with dataset filtered by custom filtering (target sce_custom_filter_genes
).
Note that the selected SCE target will be referred to as sce_final_input_qc
in subsequent stages.
MAD_THRESHOLD: 3
Type: positive numeric scalar
A threshold for maximum MAD (median absolute deviation) for QC cell metrics:
MAD threshold of 3 will retain 99% of non-outlier values that follow a normal distribution.
"Lower tail" means cells having a metric value less than -MAD_THRESHOLD
MAD will be discarded.
For "upper tail", it is +MAD_THRESHOLD
MAD.
Violation of only one metric threshold leads to removal of a cell.
To disable the dataset-sensitive filtering, set MAD_THRESHOLD: .inf
.
That will force passing of each cell as every QC metric will be always lower than positive infinity MAD.
DATASET_SENSITIVE_FILTERS_OPERATOR: "&"
Type: a character scalar ("&" | "|"
)
How to join the QC filters:
&
operator), i.e., remove only cells that violate ALL filters (permissive)|
operator), i.e., remove cells that violate AT LEAST ONE filter (strict)MIN_UMI_CF: 1000
Type: positive integer scalar
A threshold for minimum number of UMI per cell, i.e. cells with UMI less than MIN_UMI_CF
will be removed.
To disable this filter, set MIN_UMI_CF: -.inf
MAX_UMI_CF: 50000
Type: positive integer scalar
A threshold for maximum number of UMI per cell, i.e. cells with UMI greater than MAX_UMI_CF
will be removed.
To disable this filter, set MAX_UMI_CF: .inf
MIN_FEATURES: 1000
Type: positive integer scalar
A threshold for minimum number of features (genes) detected per cell,
i.e. cells with detected features less than MIN_FEATURES
will be removed.
To disable this filter, set MIN_FEATURES: -.inf
MAX_MITO_RATIO: 0.2
Type: numeric scalar <0; 1>
A threshold for maximum ratio of expressed mitochondrial genes per cell,
i.e. cells with mitochondrial genes detected in more than (MAX_MITO_RATIO
* 100)% of all genes will be removed.
To disable this filter, set MAX_MITO_RATIO: .inf
CUSTOM_FILTERS_OPERATOR: "&"
Type: a character scalar ("&" | "|"
)
How to join the QC filters:
&
operator), i.e., remove only cells that violate ALL filters (permissive)|
operator), i.e., remove cells that violate AT LEAST ONE filter (strict)Gene filtering thresholds are applied in both types of cell filtering (after it is performed).
The filter is computed by the get_gene_filter()
function as following:
num_cells <- min_ratio_cells * ncol(sce) is_expressed <- rowSums(counts(sce) >= min_umi) >= num_cells
A gene is considered expressed when number of its UMIs across all cells is greater than min_umi
and
at the same time it is expressed in at least min_ratio_cells
ratio of cells.
ENABLE_GENE_FILTERING: True
Type: logical scalar
If False
, gene filtering will be disabled.
The filtering thresholds will be overridden such that every gene will pass the filtering
(MIN_UMI: 0
and MIN_RATIO_CELLS: 0
).
MITO_REGEX: "^MT-" RIBO_REGEX: "^RP[SL]"
Type: character scalar
Regexes used to match and count occurences of mitochondrial and ribosomal features. The latter is currently not used for gene filtering.
MIN_UMI: 1
Type: zero or positive integer scalar
A threshold for minimum number of UMI per cell, i.e. genes with UMI < MIN_UMI
will be removed.
MIN_RATIO_CELLS: 0.01
Type: numeric scalar <0; 1>
A minimum ratio of cells expressing a gene,
i.e. genes expressed in less than (MIN_RATIO_CELLS
* 100)% of cells will be removed.
INPUT_QC_BASE_OUT_DIR: "01_input_qc"
Type: character scalar
A path to base output directory for this stage. It will be created under BASE_OUT_DIR
specified in 00_main.yaml
config.
INPUT_QC_REPORT_HTML_FILE: "01_input_qc.html"
Type: character scalar
A name of HTML report file for this stage. Created in INPUT_QC_BASE_OUT_DIR
. Subdirectories are not allowed.
INPUT_QC_KNITR_MESSAGE: False INPUT_QC_KNITR_WARNING: False INPUT_QC_KNITR_ECHO: False
Type: logical scalar
These are passed to knitr::opts_chunk()
and used for rendering of stage's HTML report.
Here you can find description of the most important targets for this stage.
However, for a full overview, you have to inspect the
source code of the
get_input_qc_subplan()
function.
Targets can be loaded from the {drake}
cache (.drake
directory by default) using the drake::loadd()
or
drake::readd()
functions.
HTML report target name: report_input_qc
SingleCellExperiment
objectssce_raw
: an untouched SCE object as loaded from:
DropletUtils::read10xCounts()
(cellranger
input)readr::read_delim()
(delimited table input)SingleCellExperiment
object (Rds file or {drake}
cache)sce_valid_cells
: a SCE object with empty droplets removed
sce_unfiltered
: sce_valid_cells
with added columns:
cell_qc
target containing DataFrame with cell QC metricsdiscard_qc
: dataset-sensitive filterdiscard_custom
: custom filtersce_qc_filter
, sce_qc_filter_genes
: a SCE object with cells filtered by dataset-sensitive filtering.
The latter with filtered genes.
sce_custom_filter
, sce_custom_filter_genes
: a SCE object with cells filtered by custom filtering.
The latter with filtered genes.
sce_selected
: a SCE object selected from either
sce_qc_filter_genes
or sce_custom_filter_genes
- depends on the SAVE_DATASET_SENSITIVE_FILTERING
parameter
sce_final_input_qc
: sce_selected
with added gene annotation (gene_annotation
target) to rowData()
.
This is the final SCE object which will be used in the next stage norm_clustering
.
cell_qc
: a DataFrame with cell QC metrics
The following list of filters are computed by scater::isOutlier()
from cell_qc
columns,
which are shown for individual filters (e.g. qc_lib
: total
column).
In those, FALSE
refers to cells not passing a filter.
qc_filters
: a list of named lists with logical values for dataset-sensitive cell filtering:
qc_lib
(total
): low number of UMI (lower tail)qc_nexprs
(detected
): low number of detected genes = non-zero UMI count (lower tail)qc_mito
(subsets_mito_percent
): high percentage of expressed mitochondrial genes expression (upper tail)custom_filters
: a list of named lists with logical values for custom cell filtering:
low_count
(total
): low number of UMIs.high_count
(total
): high number of UMIs.low_expression
(detected
): low number of detected genes.high_mito
(subsets_mito_percent
): high percentage of expressed mitochondrial genes.qc_filter
, custom_filter
: a summarized filter (by "or" operation) of qc_filters
and custom_filters
, respectively
sce_qc_gene_filter
, sce_custom_gene_filter
: gene filters (logical vectors) obtained from get_gene_filter()
for
SCE objects with cells filtered by dataset-sensitive and custom filter, respectively.
In those, FALSE
refers to genes not passing the filter.
sce_unfiltered_plotlist
: a list of various violin plots for cell QC metrics ({ggplot2}
objects), generated using
sce_unfiltered
target. Cells are colored by discard_qc
column - blue points pass the dataset-sensitive filter,
while the orange ones do not.
For example, the following code is used to plot total UMI counts:
scdrake::plot_colData( sce_unfiltered, y = "total", colour_by = "discard_qc", title = "Total count", scale_y = ggplot2::scale_y_log10() )
sce_qc_filter_genes_plotlist
, sce_custom_filter_genes_plotlist
: similar to sce_unfiltered_plotlist
,
but in each of them, cells are colored by the other filter. That is, cells in sce_qc_filter_genes_plotlist
are
colored by discard_custom
column, and thus, showing which cells would be discarded in custom-filtered dataset.
Vice versa for sce_custom_filter_genes_plotlist
.
config_input_qc
: a list holding parameters for this stage
barcode_ranks
: a DataFrame with barcode ranks for
knee plot,
computed by DropletUtils::barcodeRanks()
on UMI counts from sce_raw
target
empty_droplets
: a DataFrame of statistics for empty droplets, computed DropletUtils::emptyDrops()
on UMI counts
from sce_raw
target
sce_history
: a tibble with cell and gene "history", showing their numbers in unfiltered and filtered data.
sce_history_plot
: a {ggplot2}
object summarizing information from sce_history
target.
gene_annotation
: a dataframe with gene annotation, processed such that:
,
).gene_annotation
can be also retrieved from rowData(sce_final_input_qc)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.