calculateQCMetrics: Calculate QC metrics


View source: R/calculateQCMetrics.R

Description

Compute quality control (QC) metrics for each feature and cell in a SingleCellExperiment object, accounting for specified control sets.

Usage

calculateQCMetrics(object, exprs_values = "counts", feature_controls = NULL,
  cell_controls = NULL, percent_top = c(50, 100, 200, 500),
  detection_limit = 0, use_spikes = TRUE, compact = FALSE)

Arguments

object

A SingleCellExperiment object containing expression values, usually counts.

exprs_values

A string indicating which assay in the object should be used to define expression.

feature_controls

A named list containing one or more vectors (a character vector of feature names, a logical vector, or a numeric vector of indices), used to identify feature controls such as ERCC spike-in sets or mitochondrial genes.

cell_controls

A named list containing one or more vectors (a character vector of cell (sample) names, a logical vector, or a numeric vector of indices), used to identify cell controls, e.g., blank wells or bulk controls.

percent_top

An integer vector. Each element is treated as a number of top genes, for which the percentage of library size occupied by the most highly expressed genes in each cell is computed. See pct_X_in_top_Y_features below for more details.

detection_limit

A numeric scalar to be passed to nexprs, specifying the lower detection limit for expression.

use_spikes

A logical scalar indicating whether existing spike-in sets in object should be automatically added to feature_controls (see ?isSpike).

compact

A logical scalar indicating whether the metrics should be returned in a compact format as a nested DataFrame.
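
The three accepted forms for a control set (names, logical vector, numeric indices) can describe the same features. The following base R sketch illustrates that equivalence; the feature names and the resolve_set helper are hypothetical and are not part of scater.

```r
# Toy feature names; the "ERCC-" and "mt-" prefixes are illustrative assumptions.
feature_names <- c("GeneA", "GeneB", "ERCC-001", "ERCC-002", "mt-Nd1")

# Three ways to describe the same ERCC set.
by_name    <- c("ERCC-001", "ERCC-002")        # character vector of feature names
by_logical <- grepl("^ERCC-", feature_names)   # logical vector, one entry per feature
by_index   <- c(3, 4)                          # numeric vector of row indices

# Hypothetical helper: resolve any of the three forms to integer row indices.
resolve_set <- function(set, all_names) {
    if (is.character(set)) {
        match(set, all_names)
    } else if (is.logical(set)) {
        which(set)
    } else {
        as.integer(set)
    }
}

# A named list in the shape expected by feature_controls.
feature_controls <- list(ERCC = by_logical, Mito = "mt-Nd1")
resolved <- lapply(feature_controls, resolve_set, all_names = feature_names)
```

The list names ("ERCC", "Mito") are what end up in the metric names, so short names without underscores are the safer choice.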

Details

This function calculates useful quality control metrics to help with pre-processing of data and identification of potentially problematic features and cells.

Underscores in assayNames(object) and in the names of feature_controls or cell_controls can theoretically cause ambiguities in the names of the output metrics. While problems are highly unlikely, users are advised to avoid underscores when naming their controls and assays.
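
To see how such an ambiguity could arise, consider the sketch below. The metric_name function is a hypothetical stand-in for the "total_X_F" naming pattern described on this page, not a scater function.

```r
# Hypothetical naming rule mirroring the "total_X_F" pattern.
metric_name <- function(assay, control) {
    paste("total", assay, control, sep = "_")
}

a <- metric_name("norm_counts", "ERCC")   # assay "norm_counts", control "ERCC"
b <- metric_name("norm", "counts_ERCC")   # assay "norm", control "counts_ERCC"

# Two different assay/control pairs produce the same metric name.
identical(a, b)
```

Both calls yield "total_norm_counts_ERCC", so the assay and control cannot be recovered from the metric name alone.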

Value

A SingleCellExperiment object containing QC metrics in the row and column metadata.

Cell-level QC metrics

Denote the value of exprs_values as X. Cell-level metrics are:

total_X:

Sum of expression values for each cell (i.e., the library size, when counts are the expression values).

log10_total_X:

Log10-transformed total_X after adding a pseudo-count of 1.

total_features_by_X:

The number of features that have expression values above the detection limit.

log10_total_features_by_X:

Log10-transformed total_features_by_X after adding a pseudo-count of 1.

pct_X_in_top_Y_features:

The percentage of the total that is contained within the top Y most highly expressed features in each cell. This is only reported when there are more than Y features. The top numbers are specified via percent_top.
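
As a rough illustration of these definitions, the base R sketch below computes the same quantities on a toy count matrix. It is not the scater implementation; the matrix, the percent_top values and the pct_in_top helper are made up for the example.

```r
# Toy count matrix: 6 features (rows) by 3 cells (columns).
counts <- matrix(
    c(10, 0, 5,
       0, 0, 1,
       3, 2, 0,
       7, 0, 0,
       0, 8, 4,
       1, 0, 0),
    nrow = 6, byrow = TRUE,
    dimnames = list(paste0("Gene", 1:6), paste0("Cell", 1:3))
)
detection_limit <- 0
percent_top <- c(2, 4, 50)

# Per-cell totals and detected feature counts.
total_counts <- colSums(counts)
log10_total_counts <- log10(total_counts + 1)
total_features_by_counts <- colSums(counts > detection_limit)
log10_total_features_by_counts <- log10(total_features_by_counts + 1)

# Hypothetical helper: percentage of each cell's total held by its top y features.
pct_in_top <- function(y) {
    apply(counts, 2, function(x) {
        100 * sum(sort(x, decreasing = TRUE)[seq_len(y)]) / sum(x)
    })
}

# Only report Y when there are more than Y features (50 is dropped here).
usable_top <- percent_top[percent_top < nrow(counts)]
pct_counts_in_top <- sapply(usable_top, pct_in_top)
colnames(pct_counts_in_top) <- paste0("pct_counts_in_top_", usable_top, "_features")
```

With only six features, the percent_top value of 50 produces no column, matching the rule that a metric is reported only when there are more than Y features.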

If any controls are specified in feature_controls, the above metrics will be recomputed using only the features in each control set. The name of the set is appended to the name of the recomputed metric, e.g., total_X_F. A pct_X_F metric is also calculated for each set, representing the percentage of expression values assigned to features in F.

In addition to the user-specified control sets, two other sets are automatically generated when feature_controls is non-empty. The first is the "feature_control" set, containing a union of all feature control sets; and the second is an "endogenous" set, containing all genes not in any control set. Metrics are also computed for these sets in the same manner described above, suffixed with _feature_control and _endogenous instead of _F.
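
The per-set metrics and the two automatically generated sets can be sketched in base R as follows. This is an illustration of the description above under toy data, not the scater code; the feature names and set definitions are assumptions.

```r
# Toy count matrix with one ERCC and one mitochondrial feature.
counts <- matrix(
    c(10, 0, 5,
       0, 0, 1,
       3, 2, 0,
       7, 0, 0,
       0, 8, 4,
       1, 0, 0),
    nrow = 6, byrow = TRUE,
    dimnames = list(c("Gene1", "Gene2", "Gene3", "Gene4", "ERCC1", "Mito1"),
                    paste0("Cell", 1:3))
)
feature_controls <- list(ERCC = 5, Mito = 6)

# Automatically derived sets: the union of all controls, and everything else.
in_any <- sort(unique(unlist(feature_controls)))
sets <- c(feature_controls,
          list(feature_control = in_any,
               endogenous = setdiff(seq_len(nrow(counts)), in_any)))

# Recompute the total per set and express it as a percentage of the cell total.
total_counts <- colSums(counts)
qc <- data.frame(total_counts = total_counts)
for (s in names(sets)) {
    sub_total <- colSums(counts[sets[[s]], , drop = FALSE])
    qc[[paste0("total_counts_", s)]] <- sub_total
    qc[[paste0("pct_counts_", s)]] <- 100 * sub_total / total_counts
}
```

Because feature_control and endogenous partition the features, their totals sum to total_counts for every cell.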

Finally, there is the is_cell_control field, which indicates whether each cell has been defined as a cell control by cell_controls. If multiple sets of cell controls are defined (e.g., blanks or bulk libraries), a metric is_cell_control_C is produced for each cell control set C. The union of all sets is stored in is_cell_control.
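
The relationship between the per-set flags and their union can be sketched in a few lines of base R. The cell names and the "Blank" and "Bulk" set names are invented for the example.

```r
cell_names <- paste0("Cell", 1:5)
cell_controls <- list(Blank = "Cell4", Bulk = "Cell5")

# One logical column per control set, named is_cell_control_C.
flags <- sapply(cell_controls, function(set) cell_names %in% set)
colnames(flags) <- paste0("is_cell_control_", names(cell_controls))

# A cell is a control if it belongs to any set (the union).
is_cell_control <- rowSums(flags) > 0
```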

All of these cell-level QC metrics are added as columns to the colData slot of the SingleCellExperiment object. This allows them to be inspected by the user and makes them readily available for other functions to use.

Feature-level QC metrics

Denote the value of exprs_values as X. Feature-level metrics are:

mean_X:

Mean expression value for each gene across all cells.

log10_mean_X:

Log10-mean expression value for each gene across all cells.

n_cells_by_X:

Number of cells with expression values above the detection limit for each gene.

pct_dropout_by_X:

Percentage of cells with expression values at or below the detection limit for each gene (i.e., the cells not counted in n_cells_by_X).

total_X:

Sum of expression values for each gene across all cells.

log10_total_X:

Log10-sum of expression values for each gene across all cells.
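
A base R sketch of the feature-level definitions on a toy matrix is given below. It is illustrative only; in particular, adding a pseudo-count of 1 before the log10 transformations is an assumption carried over from the cell-level metrics and is not stated for the feature-level ones.

```r
# Toy count matrix: 4 features (rows) by 3 cells (columns).
counts <- matrix(
    c(10, 0, 5,
       0, 0, 1,
       3, 2, 0,
       7, 0, 0),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("Gene", 1:4), paste0("Cell", 1:3))
)
detection_limit <- 0

mean_counts <- rowMeans(counts)
log10_mean_counts <- log10(mean_counts + 1)     # pseudo-count of 1 is an assumption

# Cells above the detection limit, and the complementary dropout percentage.
n_cells_by_counts <- rowSums(counts > detection_limit)
pct_dropout_by_counts <- 100 * (1 - n_cells_by_counts / ncol(counts))

total_counts <- rowSums(counts)
log10_total_counts <- log10(total_counts + 1)   # pseudo-count of 1 is an assumption
```

Note that n_cells_by_counts and pct_dropout_by_counts are complementary: every cell either exceeds the detection limit or counts as a dropout.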

If any controls are specified in cell_controls, the above metrics will be recomputed using only the cells in each control set. The name of the set is appended to the name of the recomputed metric, e.g., total_X_C. A pct_X_C metric is also calculated for each set, representing the percentage of expression values assigned to cells in C.

In addition to the user-specified control sets, two other sets are automatically generated when cell_controls is non-empty. The first is the "cell_control" set, containing a union of all cell control sets; and the second is a "non_control" set, containing all cells not in any control set. Metrics are computed for these sets in the same manner described above, suffixed with _cell_control and _non_control instead of _C.

Finally, there is the is_feature_control field, which indicates whether each feature has been defined as a control by feature_controls. If multiple sets of feature controls are defined (e.g., ERCCs, mitochondrial genes), a metric is_feature_control_F is produced for each feature control set F. The union of all sets is stored in is_feature_control.

These feature-level QC metrics are added as columns to the rowData slot of the SingleCellExperiment object. They can be inspected by the user and are readily available for other functions to use.

Compacted output

If compact=TRUE, the QC metrics are stored in the "scater_qc" field of the colData and rowData as a nested DataFrame. This avoids cluttering the metadata with QC metrics, especially if many results are to be stored in a single SingleCellExperiment object.

Assume we have a feature control set F and a cell control set C. The nesting structure in scater_qc in the colData is:

  scater_qc
  |-- is_cell_control
  |-- is_cell_control_C
  |-- all
  |   |-- total_counts
  |   |-- total_features_by_counts
  |   \-- ...
  |-- endogenous
  |   |-- total_counts
  |   |-- total_features_by_counts
  |   |-- pct_counts
  |   \-- ...
  |-- feature_control
  |   |-- total_counts
  |   |-- total_features_by_counts
  |   |-- pct_counts
  |   \-- ...
  \-- feature_control_F
      |-- total_counts
      |-- total_features_by_counts
      |-- pct_counts
      \-- ...

The nesting in scater_qc in the rowData is:

  scater_qc
  |-- is_feature_control
  |-- is_feature_control_F
  |-- all
  |   |-- total_counts
  |   |-- total_features_by_counts
  |   \-- ...
  |-- non_control
  |   |-- total_counts
  |   |-- total_features_by_counts
  |   |-- pct_counts
  |   \-- ...
  |-- cell_control
  |   |-- total_counts
  |   |-- total_features_by_counts
  |   |-- pct_counts
  |   \-- ...
  \-- cell_control_C
      |-- total_counts
      |-- total_features_by_counts
      |-- pct_counts
      \-- ...

No suffixing of the metric names by the control names is performed here. This is not necessary when each control set has its own nested DataFrame.
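
The difference between the two layouts can be sketched as follows. Plain data frames inside a named list stand in for the nested DataFrame here, which is an assumption made to keep the example in base R; the values and the "ERCC" set name are invented.

```r
# Toy per-cell values for two cells.
total <- c(Cell1 = 21, Cell2 = 10)
ercc  <- c(Cell1 = 0,  Cell2 = 8)

# compact = FALSE style: one flat table, control name appended to each metric.
flat <- data.frame(
    total_counts = total,
    total_counts_ERCC = ercc,
    pct_counts_ERCC = 100 * ercc / total
)

# compact = TRUE style: one sub-table per set, metric names left unsuffixed.
scater_qc <- list(
    all = data.frame(total_counts = total),
    feature_control_ERCC = data.frame(
        total_counts = ercc,
        pct_counts = 100 * ercc / total
    )
)

# The same values are reached by a different path in each layout.
flat$pct_counts_ERCC
scater_qc$feature_control_ERCC$pct_counts
```

In the nested layout the set name is carried by the path, which is why the metric names themselves need no suffix.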

Renamed metrics

Several metric names were changed in scater 1.7.5.

All of the old metric names will be kept alongside the new metric names when compact=FALSE. Otherwise, only the new metric names will be stored. The old metric names may be removed in future releases of scater.

Author(s)

Davis McCarthy, with (many!) modifications by Aaron Lun

Examples

library(scater)

## build a SingleCellExperiment from the example data shipped with scater
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)

## default call, no control sets
example_sce <- calculateQCMetrics(example_sce)

## with a set of feature controls defined
example_sce <- calculateQCMetrics(example_sce,
    feature_controls = list(set1 = 1:40))

## with the feature control set named after its contents
example_sce <- calculateQCMetrics(example_sce,
    feature_controls = list(ERCC = 1:40))

scater documentation built on May 2, 2018, 3:36 a.m.